Uncovering latent jet substructure

We apply techniques from Bayesian generative statistical modeling to uncover hidden features in jet substructure observables that discriminate between different a priori unknown underlying short-distance physical processes in multi-jet events. In particular, we use a mixed membership model known as Latent Dirichlet Allocation to build a data-driven unsupervised top-quark tagger and $t\bar t$ event classifier. We compare our proposal to existing traditional and machine learning approaches to top jet tagging. Finally, employing a toy vector-scalar boson model as a benchmark, we demonstrate the potential for discovering New Physics signatures in multi-jet events in a model-independent and unsupervised way.

Introduction. The use of jet substructure techniques in studying large-area jets has played an important role in identifying hadronic decays of Higgs and electroweak gauge bosons in runs 1 and 2 of the LHC [1][2][3][4]. These techniques have also been used efficiently to tag jets arising from top quarks [5][6][7][8][9][10][11][12][13][14][15]. In the last few years, machine learning (ML) tools have extended the application of jet substructure in tagging jets at the LHC [16][17][18][19][20][21][22][23][24][25][26][27][28][29][30][31][32] through the use of Neural Networks (NNs) to process and 'learn' from vast amounts of training data. Since these approaches rely on theoretical predictions for pure signal and background training data sets (typically through Monte Carlo (MC) generators), they (a) are exposed to MC mismodeling of realistic events as reconstructed from real data and detectors; and (b) require exact model knowledge of both the expected signal and backgrounds. This limits their use in searches for a priori unknown new phenomena in LHC jet events.
There have been recent advances in unsupervised or semi-supervised ML techniques, based on NNs designed to be able to separate signal and background events in mixed samples, which could therefore be run directly on experimental data without the need for pure MC training samples, see e.g. refs. [33][34][35][36][37][38] and [39][40][41][42][43]. They rely on categorizing and comparing datasets with different expected signal and background admixtures, or on identifying anomalous events inside large datasets. While these approaches ameliorate the model dependence of fully supervised ML, they are still potentially susceptible to correlated systematic (i.e. detector) effects and/or subject to large look-elsewhere effects. In addition, they generally work best when applied to very large datasets. Consequently, their performance may suffer when looking for effects in tails of distributions.
In this Letter, we outline a new technique to classify jets and events in situ within a single mixed event sample, using tools developed in a branch of ML called generative statistical modeling [44,45]. Developed primarily to identify emergent themes in collections of documents, these models infer the hidden (or latent) structure of a document corpus using posterior Bayesian inference based on word and theme co-occurrence [46]. Translated into the language of jet physics, one assumes that observable jet substructure histograms (words) in events (documents) are generated by drawing from pure distributions (themes) of varying proportions. This allows one to construct so-called statistical mixed membership models of jet substructure. Furthermore, assuming that each event is a mixture of only a few pure distributions, and that within each of these only a few histogram bins have high co-occurrence, such models can be solved using techniques of Latent Dirichlet Allocation (LDA) [48]. Finally, with a trained model at hand, one can define robust parametric jet and event classifiers by inferring the pure distribution proportions in tested events.
In the following we first present the main ingredients of our proposal in more detail. Then we discuss two proof-of-principle implementations based on benchmark examples: an unsupervised top quark jet tagger and $t\bar t$ event classifier, as well as an unsupervised new physics (NP) search strategy able to identify boosted neutral scalar bosons decaying to pairs of W's (previously studied in Refs. [37,38,49]). We compare them to existing conventional and ML approaches and also outline possible further improvements and future directions.
Generative Bayesian Models of Jet Substructure. We start by considering the formation of a jet stemming from an initial hard seed, as a sequential combination of QCD showering (followed by fragmentation and hadronization) and possibly massive particle decays. Next we assume that some relevant information on this intertwined sequence of processes can be recovered by looking at the clustering history of a jet-clustering algorithm. This is in fact the basis for many conventional taggers of massive jets [1,8,13].
Within this very simplified picture of jet formation and observation we can draw interesting parallels to so-called mixed membership models describing the generation of documents in the context of text analysis [48], or genotypes in population studies [50]. In particular, we assume that the observable distribution bins in a clustering profile are populated by drawing from a few 'pure' distributions (themes) corresponding to different contributing physical processes. The likelihood of populating a certain distribution bin o, given a theme t, can then be described by a multinomial distribution p(o|t, β) (a multi-category generalization of the binomial distribution, where the number of categories is given by the number of bins in the distribution, parametrized by a set of parameters β). In addition, we assume that the likelihood of a given theme contributing to any given event (and thus jet), p(t|ω), is also described by some multinomial distribution (parametrized by variables ω), where the number of categories now corresponds to the number of themes. The ω's themselves are drawn from a probability distribution p(ω|α), reflecting the theme proportions in the dataset and parametrized by the hyperparameter α. In this picture the themes (β) as well as the theme proportions (ω) are hidden variables reflecting the thematic structure of the studied event sample. With a given model, the probability that a certain event or jet is generated can be written as a compact expression in terms of the latent variables. For example, the likelihood of generating a jet j with clustering history observables $(o_1, o_2, \ldots, o_n)$ is just
$$p(j|\alpha,\beta) = \int \mathrm{d}\omega \; p(\omega|\alpha) \prod_{i=1}^{n} \sum_{t} p(o_i|t,\beta)\, p(t|\omega)\,. \qquad (1)$$
Statistical models defined in this way are generative in that, given the latent variables (themes and theme proportions), the best model will be the one that best reproduces a set of events, i.e.
has the best generative power. Therefore, the task of finding the latent variables from a set of training events is to invert the above expression and use the set of events to find the best fit for the latent variables. This can in fact be done using posterior Bayesian inference, i.e.
$$p(a|x) = \frac{p(x|a)\, p(a)}{p(x)}\,,$$
where p(x|a) is the likelihood of observing x given a latent variable a, while p(a) and p(a|x) are the prior and posterior distributions of the latent variable itself. The main insight here is that p(ω|α) in Eq. (1) is a conjugate prior to the multinomial likelihood p(t|ω), and thus takes the form of the multi-category generalization of the beta distribution: the Dirichlet distribution. The model is thus called LDA and can be solved (trained) approximately in an iterative manner using variational inference [45,48] or Gibbs sampling [51].
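To make the generative picture concrete, the sampling process underlying Eq. (1) can be sketched in a few lines of Python. The themes β, hyperparameters α, and vocabulary size below are toy values for illustration only, not the trained quantities of this work:

```python
import random

def sample_dirichlet(alpha, rng):
    # Draw theme proportions omega ~ Dirichlet(alpha) via normalized Gamma draws.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, beta, n_words, rng):
    """Generate one 'document' (a jet or event): each 'word' (histogram bin
    index) is drawn by first picking a theme t ~ Mult(omega), then a bin
    o ~ Mult(beta[t])."""
    omega = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        t = rng.choices(range(len(beta)), weights=omega)[0]
        o = rng.choices(range(len(beta[t])), weights=beta[t])[0]
        words.append(o)
    return omega, words

rng = random.Random(0)
# Two hypothetical themes over a toy 4-bin observable "vocabulary".
beta = [[0.70, 0.20, 0.05, 0.05],   # background-like theme
        [0.05, 0.05, 0.20, 0.70]]   # signal-like theme
omega, doc = generate_document([0.9, 0.1], beta, 50, rng)
```

Training inverts this process: given many such documents, LDA infers the β and ω that most plausibly generated them.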
A trained model consists of the latent variables inferred from the training data and the generative model used in constructing Eq. (1). Given a new event or jet we can use the trained model to determine the proportion $\omega_t(j)$ of a theme t present in the new event or jet j, by maximizing the likelihood function for the document being generated while keeping the theme distributions (β) fixed. In general we can do this for each theme t, but in the case that the model contains only two themes (e.g. t = 0, 1) it suffices to choose just one, since $\sum_t \omega_t(j) = 1$. In this case we can define a simple classifier, e.g. $h(j) = \omega_1(j)$, based on the proportion of one of the themes in the jet or event. Since the goal is to efficiently discriminate signal from background and not vice versa, one needs to choose the appropriate theme. This can be done by comparing the theme histograms with expectations for pure (e.g. MC generated) signal and background, or by directly inferring (running the classifier) on such pure samples. We demonstrate both approaches in the following examples.
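For two themes, the inference step above reduces to a one-dimensional likelihood maximization over $\omega_1$ with β held fixed. A minimal sketch (using a grid scan rather than the variational update a real LDA implementation would use, and toy theme distributions that are purely illustrative):

```python
import math

def infer_theme_proportion(words, beta, grid=200):
    """Maximum-likelihood estimate of the proportion omega_1 of theme 1 in a
    document, with the theme distributions beta held fixed.  For two themes
    the likelihood is L(omega) = prod_i [(1-omega)*beta[0][o_i] + omega*beta[1][o_i]],
    maximized here by a simple grid scan over omega in [0, 1]."""
    best_omega, best_ll = 0.0, -math.inf
    for k in range(grid + 1):
        omega = k / grid
        ll = sum(math.log((1 - omega) * beta[0][o] + omega * beta[1][o] + 1e-300)
                 for o in words)
        if ll > best_ll:
            best_omega, best_ll = omega, ll
    return best_omega  # the classifier h(j) = omega_1(j)

# Toy themes: theme 0 favors low bins, theme 1 favors high bins.
beta = [[0.70, 0.20, 0.05, 0.05],
        [0.05, 0.05, 0.20, 0.70]]
h_bkg = infer_theme_proportion([0, 0, 1, 0, 0], beta)  # background-like document
h_sig = infer_theme_proportion([3, 3, 2, 3, 3], beta)  # signal-like document
```

Sweeping a threshold on h(j) then yields the working points of the tagger.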
Unsupervised top tagger. Our first proof-of-principle example is a tagger discriminating between boosted hadronically decaying top quarks and QCD jets. Working with a single mixed ($t\bar t$ and QCD) multi-jet event sample, we first need to construct the relevant jet substructure observable histograms (o). We do this by clustering the jets in an event using the Cambridge-Aachen (CA) [52,53] algorithm with a large radius R. We then proceed to uncluster the jets by reversing each step in the clustering, iteratively separating each (sub)jet into two objects $j_0 \to j_1 j_2$. Ordering the subjets by their invariant mass $m_{j_1} > m_{j_2}$ (and following the standard approach of refs. [1,3]), we define the relevant clustering observables at each clustering step as
$$o_j = \left( m_{j_0},\; \frac{m_{j_1}}{m_{j_0}},\; \frac{\min(p_{T,1},\, p_{T,2})}{p_{T,1}+p_{T,2}},\; \Delta R_{12} \right),$$
where $p_{T,i}$ is the transverse momentum of a given object and $\Delta R_{12} = \sqrt{(\phi_1-\phi_2)^2 + (\eta_1-\eta_2)^2}$ is the so-called planar distance between $j_1$ and $j_2$ ($\phi_i$ and $\eta_i$ being the azimuthal angle and pseudorapidity of $j_i$, respectively). The declustering step is then iteratively repeated on both $j_{1,2}$. The procedure is terminated once $m_{j_0} < m_{\rm min}$, where $m_{\rm min}$ is an algorithm parameter, which we choose to lie below the lowest massive resonance state of interest. In the case of the top tagger, we fix $m_{\rm min} = 30$ GeV $< m_W$, but have checked that lowering this threshold by a factor of a few does not significantly affect the results. The output of such a procedure is a (typically rather sparse) four-dimensional histogram of $o_j$, which can be defined either per jet or even per event. After mapping individual histogram bins into words, we feed individual jets or events as documents into an LDA implementation using the software package Gensim [54,55], fixing the number of themes to two ($\omega_{0,1}$). Further technical details of the required binning and mapping of data onto (one-dimensional) text vocabularies compatible with Gensim, as well as a detailed analysis of the convergence of the algorithm when applied to sparse jet substructure data, will be presented elsewhere [56]. Here we only
focus on the consistency and stability of the resulting trained models. For this purpose we use the k-folding method with k = 10. This involves splitting the training data into k mutually exclusive blocks and then running the training k times on event samples built from k−1 blocks, with the combination changing on each training run. The performance of the tagger is tested on events or jets from the remaining block.
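The iterative declustering described above can be sketched as follows. `PseudoJet` is a hypothetical stand-in (in practice the clustering history would come from FastJet), and for brevity only the two mass observables $(m_{j_0}, m_{j_1}/m_{j_0})$ of the full four-dimensional set are recorded:

```python
import math

class PseudoJet:
    """Minimal stand-in for a clustered jet: a four-momentum plus the two
    daughters recorded by the (CA) clustering history, if any."""
    def __init__(self, px, py, pz, e, daughters=None):
        self.px, self.py, self.pz, self.e = px, py, pz, e
        self.daughters = daughters or []

    @property
    def mass(self):
        m2 = self.e**2 - self.px**2 - self.py**2 - self.pz**2
        return math.sqrt(max(m2, 0.0))

def decluster(jet, m_min, observables):
    """Undo the clustering j0 -> j1 j2, ordering daughters by invariant mass
    (m_j1 > m_j2), record (m_j0, m_j1/m_j0) at each step, and recurse on both
    daughters; a branch terminates once m_j0 < m_min or it has no history."""
    if jet.mass < m_min or len(jet.daughters) != 2:
        return
    j1, j2 = sorted(jet.daughters, key=lambda j: j.mass, reverse=True)
    observables.append((jet.mass, j1.mass / jet.mass))
    decluster(j1, m_min, observables)
    decluster(j2, m_min, observables)

# Toy clustering history: a W-like subjet (mass ~ 80 GeV) plus a light one.
w_like = PseudoJet(50.0, 0.0, 0.0, math.sqrt(50.0**2 + 80.0**2))
light = PseudoJet(-20.0, 30.0, 0.0, math.sqrt(20.0**2 + 30.0**2 + 5.0**2))
parent = PseudoJet(w_like.px + light.px, w_like.py + light.py,
                   w_like.pz + light.pz, w_like.e + light.e,
                   daughters=[w_like, light])
obs = []
decluster(parent, 30.0, obs)  # m_min = 30 GeV, as in the top tagger
```

Each recorded tuple is then binned and mapped to a "word" of the LDA vocabulary.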
In order to evaluate the performance of the tagger and compare it to existing methods, we construct a receiver operating characteristic (ROC) curve for our tagger. This is the only step where one needs to rely on access to pure samples (either MC generated or pre-tagged in some other way using observables orthogonal to $o_j$). In particular, we construct the ROC curve by performing the classification on such pure samples while continuously varying the threshold of the theme proportion defining the classifier h(j). This is done for all k sets of results, and we calculate the median mis-tag rate ($\varepsilon_b$) for each signal efficiency ($\varepsilon_s$), as well as the mean absolute deviation of the mis-tag rate, to evaluate the stability and consistency of the tagger.
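The threshold sweep that builds the ROC curve can be sketched as below; the scores are toy stand-ins for the inferred theme proportions h(j) evaluated on pure signal and background samples:

```python
def roc_curve(signal_scores, background_scores, n_thresholds=101):
    """Sweep the classifier threshold and record (eps_s, eps_b) pairs, where a
    jet/event with h(j) >= threshold is tagged as signal."""
    points = []
    for k in range(n_thresholds):
        thr = k / (n_thresholds - 1)
        eps_s = sum(s >= thr for s in signal_scores) / len(signal_scores)
        eps_b = sum(b >= thr for b in background_scores) / len(background_scores)
        points.append((eps_s, eps_b))
    return points

# Toy classifier outputs on pure samples.
sig = [0.9, 0.8, 0.7, 0.6, 0.4]
bkg = [0.5, 0.3, 0.2, 0.1, 0.05]
roc = roc_curve(sig, bkg)
```

With k-fold results one would compute, per $\varepsilon_s$, the median and mean absolute deviation of $\varepsilon_b$ across the k curves.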
Our training samples for the QCD di-jet background and the (hadronic) $t\bar t$ signal both consist of ∼84,000 13 TeV pp collision events, where the final state particles are clustered into R = 1.5 CA jets with $p_T$ in the range [350, 450] GeV. The samples are generated using aMC@NLO 2.6.1 [57] interfaced with Pythia 8.2 [58] for showering and hadronization, while jet clustering is performed using FastJet 3.2.0 [59]. Note that no grooming is performed on the jets. We have also checked explicitly that applying jet (sub)cluster energy smearing consistent with the parametric fast detector simulation of ATLAS implemented in Delphes 3.4.1 [60] has no significant effect on our results.
We train the top tagger on four test cases: supervised, and unsupervised mixed samples with S/B = 1, 1/9, and 1/99. In the supervised case we collapse the pure samples into single documents such that they are processed by the algorithm in a single block, essentially providing the labelling of the data required in supervised algorithms. For the different S/B ratios each jet or event is represented by a single document. However, we inform the tagger to search for certain S/B ratios by setting the hyperparameters of the Dirichlet distribution accordingly, i.e. α = [0.5, 0.5], [0.9, 0.1], and [0.99, 0.01]. Note that these may not be the optimal choices, but they are based on intuition from the values of S/B and give a useful parameterization to demonstrate the performance of the algorithm. We also stress that O(1) variations in α have only a small effect on the performance of the algorithm, provided that the hierarchy in the elements of α approximately reflects the S/B ratio and that the elements are smaller than one. More details on the dependence of the algorithm on these hyperparameters, and on how to determine their optimal values without prior knowledge of the S/B ratios, will be presented elsewhere [56]. In Fig.
1 (upper panel) we plot the ROC curves for our top jet taggers, where separate documents are represented by individual jets, and compare these to various supervised taggers in the literature [8,22,23]. We see that the taggers perform well and with relatively small variance, with the supervised tagger performing best. An interesting observation is that at high background rejection rates ($1/\varepsilon_b \sim \mathcal{O}(\mathrm{few})$) the taggers trained on smaller S/B perform slightly better than the tagger trained on the S/B = 1 sample, although the differences are comparable to the estimated uncertainties. This is essentially because the algorithm is designed to discern features in the jet substructure, which are subsequently used to tag jets and events. In the supervised and S/B = 1 cases the algorithm discovers features in top jets both near $m_{j_0} \sim m_t$ and $m_{j_0} \sim m_W$ (see the right plot in Fig. 2), while in the lower S/B cases the algorithm is only able to identify $m_{j_0} \sim m_t$ as relevant.
On the other hand, lower $m_{j_0}$ regions generically feature more prominently in QCD jets (see left plot in Fig. 2). Thus, while a very accurate determination of the features near $m_{j_0} \sim m_W$ in the supervised case helps the performance of the tagging algorithm, the worse resolution in the unsupervised S/B = 1 case leads to worse tagging performance compared to the lower S/B examples. We see that the performance of the unsupervised taggers is comparable to the original JH top tagger [8], although it falls short in comparison to the others. Note that the observables we use mostly match those used in the JH top tagger, hence the similar performance is indeed encouraging. In Fig. 1 (lower panel) we plot the ROC curves for our $t\bar t$ event classifiers, where a single document now contains all jets within the selected $p_T$ region in an event, and again compare these to the top jet taggers in the literature. To make the comparison with other taggers fair, we re-scale those results by defining an event tagging efficiency ($\varepsilon_e$) in terms of the jet tagging efficiency ($\varepsilon_j$) and the fractions of events in our pure samples with one ($f_1$) and two ($f_2$) jets passing the selection cuts,
$$\varepsilon_e = (2\varepsilon_j - \varepsilon_j^2) f_2 + \varepsilon_j f_1 \,.$$
(We have checked that the fractions of events with zero or more than two jets passing the selection cuts are negligible.) This means in practice that tagging an event as $t\bar t$ requires at least one jet in the event to be tagged as a top jet. The ROC curves do not change significantly under this re-scaling; instead the points move along a trajectory towards higher efficiencies, approximately equal to that of the ROC curve for jet tagging. We see again that the classifier performs very well in all cases, performing as well as the JH top tagger even for low S/B.
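The jet-to-event efficiency re-scaling above is a one-line arithmetic map, sketched here with toy fractions $f_1$, $f_2$:

```python
def event_efficiency(eps_j, f1, f2):
    """Event-level tagging efficiency when an event counts as tagged if at
    least one of its selected jets is tagged: for two-jet events the chance
    that at least one jet passes is 2*eps_j - eps_j**2."""
    return (2 * eps_j - eps_j**2) * f2 + eps_j * f1

# Toy example: 40% one-jet events, 60% two-jet events.
eps_e = event_efficiency(0.5, 0.4, 0.6)
```

For $\varepsilon_j = 0.5$, $f_1 = 0.4$, $f_2 = 0.6$ this gives $\varepsilon_e = 0.75 \times 0.6 + 0.5 \times 0.4 = 0.65$, illustrating the shift towards higher efficiencies.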
We observe that the LDA algorithm performs relatively better when characterizing and tagging events than jets, mainly due to the larger amount of substructure (words) in each document. With more data per document it is easier for the algorithm to identify co-occurrences between the different features shared by jets in the same event. For this reason it is also easier for the trained model to infer the correct thematic structure from events than from jets.
The themes discovered by the unsupervised training algorithm contain valuable information about the substructure of the events or jets. In Fig. 2 we plot the substructure probability distributions of the two themes discovered by the top jet tagger (with S/B = 1), projected onto the plane of $m_{j_0}$ and $m_{j_1}/m_{j_0}$. We observe that while the distribution in the left-hand plot (the "QCD" theme) is fairly unremarkable (mostly monotonic and smooth) and peaks towards ($m_{j_0} \to 0$, $m_{j_1}/m_{j_0} \to 1$), the theme in the right-hand plot (the "$t\bar t$" theme) clearly exhibits heavily weighted features at both $m_{j_0} \sim m_t$ and $m_{j_0} \sim m_W$, even identifying the W subjet arising from the decay of the top quark within the jet, resulting in a mass drop of $m_{j_1}/m_{j_0} \sim m_W/m_t \approx 0.45$. On the other hand, the broad $m_{j_1}/m_{j_0} \sim 0.2$ feature at $m_{j_0} \sim m_W$ is expected, due to the fact that the mass drop is defined with the heaviest daughter subjet in the numerator, thus skewing the $m_{j_1}/m_{j_0}$ distribution away from zero.
Unsupervised NP search. As a second example, we consider a NP model [61,62] containing a heavy $W'$ boson plus a heavy scalar φ. Signal events thus consist of resonant $W'$ production (at $m_{W'}$ = 3 TeV), followed by $W' \to W\phi$ decays (where we choose $m_\phi$ = 400 GeV $\ll m_{W'}$ such that both the W and the φ coming from $W'$ decays are boosted). Finally, the scalar further decays as $\phi \to W^+W^-$. Using the same event generation, jet clustering/de-clustering procedure, observable basis $o_j$, and the same LDA tagging algorithm as before, we apply our procedure to the all-hadronic final state of this NP process in a region dominated by QCD background. The same model has been previously studied using the unsupervised ML approach called classification without labels (CWoLa) [37,38], which is based on mixed-sample classification using phase space regions with vastly different S/B ratios, processed by deep NNs. In order to
quantitatively compare our results to CWoLa, our signal and background event samples mirror directly those in Ref. [38]. In particular, we consider just the signal region, 2730 ≤ $m_{jj}$ ≤ 3189 GeV, and cut jets with $p_T$ below 400 GeV. The 30 GeV cut on the subjet invariant mass is also applied, just as in the top tagger case. After the selection cuts we work with ∼60,000 events in both the signal and background samples. We train three different taggers: a supervised tagger, and two taggers with S/B = 1.1 × 10⁻² and 5.8 × 10⁻³. The Dirichlet hyperparameters α are chosen in the same way as in the previous section, i.e. α = [0.5, 0.5], [0.989, 0.011], and [0.942, 0.058]. To evaluate the robustness of the taggers we again employ the k-folding procedure with k = 10.
In the upper plot of Fig. 3 we show the ROC curves for our taggers and compare the results to those from CWoLa [38].We see that in most of the parameter space the LDA-based tagger outperforms the CWoLa tagger, most notably at high signal efficiencies.
In the lower plot of Fig. 3 we also show the probability distributions of the discovered themes in the plane of $m_{j_0}$ and $m_{j_1}/m_{j_0}$ for the LDA model trained on event samples with S/B = 1.1 × 10⁻². Features in the subjet mass at $m_{j_0} \sim m_W$ and at $m_{j_0} \sim m_\phi$ are clearly discernible in one of the themes (the "φW" theme), as are mass drops related to the decays of the heavy scalar and the W bosons.
Conclusions. We have demonstrated a new unsupervised ML technique for disentangling signal and background events in mixed samples by identifying features in jet substructure observables that differentiate between the two. To do so we have mapped jet substructure distributions onto an LDA model, a generative probabilistic (mixed membership) model widely used in Bayesian statistics approaches to unsupervised ML. Assuming that the kinematic observable distributions within jets or events are sampled from a fixed set of (latent) themes, LDA can learn the thematic structure that most likely generated the observed data (the latter being either in the form of reconstructed real LHC events or un-labeled MC-generated samples). Furthermore, we have shown that the learned structure from a two-theme LDA model can be used to build unsupervised jet taggers or event classifiers that efficiently discriminate between signal and background in previously unseen data.
As a first example we have trained a two-theme LDA model on MC-generated event samples consisting of different mixtures of pp → $t\bar t$ and QCD di-jet events. Our results show that the top-jet taggers and $t\bar t$ event classifiers constructed from the discovered themes have very good discrimination power when applied to previously unseen pure samples, even if trained on data with S/B ratios as low as 1%. Our results are in some cases comparable even with fully supervised taggers in the literature. In addition, we have explored the viability of LDA discovering NP phenomena in multi-jet events. Using a benchmark NP (vector $W'$-scalar φ) model we have studied $pp \to W' \to \phi W \to WWW$ with hadronically decaying W bosons and a (boosted) new scalar φ with mass $m_\phi \gg m_W$. The resulting LDA event classifiers from training samples with S/B as low as a few per-mille, when applied to pure samples, produce excellent signal efficiencies and QCD rejection rates that can outperform other existing approaches.
Besides being a fully unsupervised ML technique, one advantage of performing LDA on jet clustering history observables is the possibility of interpreting the thematic structure discovered by the model from the data. In both examples presented here, the features in the probability distributions over the kinematical observables of the two uncovered themes match to a high degree the expected features of the underlying hard processes, i.e. hadronic decays of top quarks (or $\phi \to W^+W^-$) and the QCD background, respectively, allowing for an intuitive and physical understanding of the high tagging performance demonstrated by the ROC curves.
The analysis presented here is a first exploration of what can be achieved when applying probabilistic mixed membership models to high-energy collider data. For example, with the addition of more jet substructure observables the discriminating power of the LDA classifiers could be further optimized and increased. Furthermore, relaxing the fixed number of themes of the LDA model applied to mixed event samples could allow one to classify multiple backgrounds together with the signal. In future work we will also detail how these techniques can be employed as part of a broad search strategy for new phenomena in multi-jet invariant mass spectra, with the aim of performing unsupervised data-driven searches for NP at high $p_T$.

Figure 2. 2D projected probability distributions (in the plane of $m_{j_0}$ and $m_{j_1}/m_{j_0}$) of the two latent themes discovered in mixed (S/B = 1) QCD and $t\bar t$ event samples, with fat jets satisfying $p_T \in [350, 450]$ GeV.

Figure 3. (Upper plot) ROC curves comparing the performance of the LDA event classifier to CWoLa [38]. (Lower plot) 2D projected probability distributions (in the plane of $m_{j_0}$ and $m_{j_1}/m_{j_0}$) of the two latent themes discovered in mixed (S/B = 1.1 × 10⁻²) QCD and $W'$ event samples with invariant mass 2730 ≤ $m_{jj}$ ≤ 3189 GeV, with fat jets satisfying $p_T$ > 400 GeV.