Training toward significance with the decorrelated event classifier transformer neural network

Experimental particle physics uses machine learning for many tasks, one of which is the classification of signal and background events. This classification can be used to bin an analysis region to enhance the expected significance for a mass resonance search. In natural language processing, one of the leading neural network architectures is the transformer. In this work, an event classifier transformer is proposed to bin an analysis region, where the network is trained with specially developed techniques. The techniques developed here can enhance the significance and reduce the correlation between the network's output and the reconstructed mass. This trained network is found to perform better than boosted decision trees and feed-forward networks.


I. INTRODUCTION
Experimental particle physics is often performed by colliding particles and observing their interaction with detectors. Particles generated from the collisions are investigated by reconstructing the detector data. A common method used in particle searches is to search for a resonance in the reconstructed mass distribution. An example of this resonance is shown in Fig. 1. Signal events create a peak that can be seen above background events. To estimate the number of signal and background events, the mass distribution is fitted with a peaking signal and a smooth background function. The fitted mass region is set to be larger than the signal peak width to estimate the background more precisely. This method is referred to as "bump hunting" and was used to search for the Higgs boson [1,2]. The sensitivity of the method can be quantified in terms of significance [3,4]. A larger significance corresponds to a smaller probability that the bump was created by random statistical fluctuations of the background.
To increase the expected significance of the bump hunting method, requirements are applied to create a search region to suppress the background, while preserving the signal. The sensitivity of the analysis can be further enhanced by "binning", where the search region is divided into multiple bins according to other variables. However, if the binning affects the reconstructed mass distribution of the background, estimating the number of signal events can become difficult. An extreme case would be one where the background peaks similarly to the signal. Therefore, a desirable characteristic of binning in the other variables is that it does not affect the shape of the background distribution. Binning can be performed using distinguishing features of the signal [2], or by using machine-learning techniques [1,5,6] that classify the signal and background events. A common machine-learning technique used for this task is the boosted decision tree (BDT).

II. EVENT CLASSIFIER TRANSFORMER NEURAL NETWORK
Transformer neural networks [7] are the leading neural network architecture in the natural language processing field. The architecture has been applied to particle physics in a network called the Particle Transformer [8]. The network identifies the type of particle that produced a jet, which is a cluster of hadrons whose momenta lie within a cone. In this work, a new transformer-based neural network called an event classifier transformer is proposed that classifies signal and background events to bin a bump hunt analysis.
The architecture of the event classifier transformer is shown in Fig. 2. The inputs are the features of the event, such as the momenta or angles of particles. To be able to use the transformer architecture, each input or each set of inputs is "embedded" into a token, which is a set of N_repr numbers, using separate feed-forward neural networks. In contrast to the transformer, there is no positional encoding because event features do not have sequential dependence. The tokens are normalized using layer normalization [9] and passed to a multiheaded attention layer (MHA) [7]. The output is summed to employ a residual connection [10] and normalized with layer normalization. A positionwise feed-forward neural network [7] is applied with a residual connection and layer normalization, where the same positionwise network is applied to each token. The output is a group of tokens, where each token is a contextual representation of the input feature. Following a simplified version of the CaiT [11] approach and the Particle Transformer, a trainable class token is passed as a query to a MHA, and the contextualized tokens concatenated with the class token are passed as the key and value. The output employs a residual connection and layer normalization and is passed to a linear layer. The linear layer output is a single value that is passed through a sigmoid to represent the probability that the event is from the signal process. To prevent overtraining, dropout layers [12] are used in the scores of the MHA and before the residual connections during training. Implementation of the network is publicly available at Ref. [13].
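Based on the description above and the hyperparameters given in Sec. IV C (N_repr = 16, 4 heads, per-feature embedding 1 → 4 → 16), a minimal PyTorch sketch of the architecture might look as follows. Details not stated in the text, such as the class-token initialization and the exact dropout placement, are assumptions.

```python
import torch
import torch.nn as nn

class EventClassifierTransformer(nn.Module):
    def __init__(self, n_features, n_repr=16, n_heads=4, dropout=0.1):
        super().__init__()
        # One feed-forward embedding network per scalar event feature.
        self.embedders = nn.ModuleList(
            nn.Sequential(nn.Linear(1, 4), nn.GELU(), nn.Linear(4, n_repr))
            for _ in range(n_features)
        )
        self.norm1 = nn.LayerNorm(n_repr)
        self.mha = nn.MultiheadAttention(n_repr, n_heads, dropout=dropout,
                                         batch_first=True)
        self.norm2 = nn.LayerNorm(n_repr)
        self.positionwise = nn.Sequential(
            nn.Linear(n_repr, 64), nn.GELU(), nn.Linear(64, n_repr)
        )
        self.norm3 = nn.LayerNorm(n_repr)
        # Trainable class token used as the query of the final attention layer
        # (zero initialization is an assumption).
        self.class_token = nn.Parameter(torch.zeros(1, 1, n_repr))
        self.class_mha = nn.MultiheadAttention(n_repr, n_heads, dropout=dropout,
                                               batch_first=True)
        self.norm4 = nn.LayerNorm(n_repr)
        self.head = nn.Linear(n_repr, 1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):  # x: (batch, n_features)
        tokens = torch.stack(
            [emb(x[:, i:i + 1]) for i, emb in enumerate(self.embedders)], dim=1
        )  # (batch, n_features, n_repr)
        t = self.norm1(tokens)
        attn, _ = self.mha(t, t, t)
        t = self.norm2(t + self.dropout(attn))  # residual + layer norm
        t = self.norm3(t + self.dropout(self.positionwise(t)))
        # Class attention: the trainable token queries the contextual tokens
        # concatenated with itself.
        cls = self.class_token.expand(x.size(0), -1, -1)
        kv = torch.cat([cls, t], dim=1)
        attn, _ = self.class_mha(cls, kv, kv)
        cls = self.norm4(cls + self.dropout(attn))
        return torch.sigmoid(self.head(cls.squeeze(1))).squeeze(-1)
```

The sigmoid output can then be interpreted as the signal probability used to bin the search region.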

III. TRAINING TECHNIQUES FOR ENHANCING SIGNIFICANCE
Special training techniques are developed to apply to the event classifier transformer and other neural networks, to increase the expected significance and reduce the correlation for a bump hunt analysis. The following new training techniques are investigated:
• Specialized loss function with mass decorrelation.
• Data scope training.
• Significance-based model selection.
These techniques are described in turn below, and implementation is publicly available at Ref. [13].

A. Specialized loss function with mass decorrelation
In a binary classification task, where the input x is provided to predict the label y, a neural network f(x) can be trained to output a prediction ŷ = f(x). In this work, signal is defined to be y = 1 and background is defined to be y = 0. The network is trained with a loss function that is used to minimize the difference between the network's output ŷ and the label y.
A common loss function is the binary cross-entropy (BCE) loss [14],

BCE(ŷ, y) = −[ y ln ŷ + (1 − y) ln(1 − ŷ) ],

where the minimum corresponds to the condition

∂BCE/∂ŷ = −y/ŷ + (1 − y)/(1 − ŷ) = 0,

and is achieved when y(1 − ŷ) = ŷ(1 − y), which implies ŷ = y.
In this work, to increase the penalty when ŷ and y differ, while keeping the property that the minimum is achieved at ŷ = y, an alternative loss inspired by BCE, called the extreme loss E(ŷ, y), is proposed. For a given value of ŷ − y, the extreme loss penalizes the neural network more than the BCE loss, as can be seen in Fig. 3. This extreme loss heavily suppresses backgrounds that have network predictions close to 1 and also signals that have network predictions close to 0. Because the expected significance when binning a bump hunt analysis is sensitive to backgrounds that have high network predictions, this loss can help to increase the significance.
To decorrelate the neural network output from the reconstructed mass, Distance Correlation (DisCo) regularization is used [15]. DisCo measures the dependence between ŷ and the mass, where the value of DisCo is zero if and only if ŷ and the mass are independent. DisCo is multiplied by a factor λ and added to the classifier loss function:

Loss = Loss_classifier(ŷ, y) + λ • DisCo(mass, ŷ).    (5)

The DisCo term penalizes the neural network when ŷ and the mass are correlated, where λ sets a balance between the neural network's performance and the degree of decorrelation.
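A minimal NumPy sketch of the regularized loss of Eq. (5), using the standard sample estimator of distance correlation (the implementation in Ref. [13] may differ, e.g., by supporting event weights):

```python
import numpy as np

def distance_correlation(a, b):
    """Sample distance correlation between 1-D arrays a and b.

    In the population limit it is zero if and only if a and b are
    independent, which is the property DisCo regularization relies on."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    A = np.abs(a[:, None] - a[None, :])  # pairwise distances
    B = np.abs(b[:, None] - b[None, :])
    # Double-center the pairwise-distance matrices.
    A = A - A.mean(axis=0) - A.mean(axis=1)[:, None] + A.mean()
    B = B - B.mean(axis=0) - B.mean(axis=1)[:, None] + B.mean()
    dcov2 = (A * B).mean()               # squared distance covariance
    dvar_a = (A * A).mean()
    dvar_b = (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_a * dvar_b))

def total_loss(classifier_loss, mass, y_hat, lam):
    """Eq. (5): classifier loss plus lambda times DisCo(mass, y_hat)."""
    return classifier_loss + lam * distance_correlation(mass, y_hat)
```

In training, the DisCo term would be computed per batch on a differentiable tensor; the NumPy form above only illustrates the estimator.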

B. Data scope training
To increase the significance and decorrelate the neural network output from the reconstructed mass for the background, the loss terms are calculated with different data scopes during training: The classifier loss, such as the BCE loss, is calculated in the scope of signal and background events that fall in the narrow mass window, where the majority of the signal lies. This narrow scope decorrelates the network's output from the mass and can improve the significance. The DisCo term is calculated in the scope of background events that fall in the wide mass window, where the mass window is used in estimating the amount of background in the bump hunt method.
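A sketch of the scoped loss computation, assuming BCE as the classifier loss and taking the decorrelation term as a generic callable (DisCo in the paper); the window edges follow Sec. IV C and are assumptions of this sketch:

```python
import numpy as np

def scoped_loss(y_hat, y, mass, lam, decor_term,
                narrow=(120.0, 130.0), wide=(100.0, 180.0), eps=1e-7):
    """Classifier loss on signal+background in the narrow mass window,
    decorrelation penalty on background only in the wide mass window.

    decor_term(mass, y_hat) is the decorrelation penalty, e.g. DisCo."""
    in_narrow = (mass >= narrow[0]) & (mass <= narrow[1])
    bkg_in_wide = (mass >= wide[0]) & (mass <= wide[1]) & (y == 0)
    # BCE on the narrow-window scope (signal and background).
    yh = np.clip(y_hat[in_narrow], eps, 1 - eps)
    yn = y[in_narrow]
    bce = -np.mean(yn * np.log(yh) + (1 - yn) * np.log(1 - yh))
    # Decorrelation on the wide-window background scope.
    return bce + lam * decor_term(mass[bkg_in_wide], y_hat[bkg_in_wide])
```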

C. Significance-based model selection
When training networks, the best-trained network among the training epochs can be chosen based on an evaluation metric. The loss is often used as the evaluation metric, but the network that has the minimum loss can be different from the network that has the best significance [3,16]. In this work, the expected significance is used as the metric to select the best network among the epochs.
The significance is calculated with the following steps:
1. Divide the dataset into bins based on the neural network's output. The bins are constructed to have an equal number of signal events.
2. Calculate the significance of each bin.
3. Combine the significances of the bins.
The significance [3] of each bin is calculated with

Z_i = √( 2 [ (N_S + N_B) ln(1 + N_S/N_B) − N_S ] ),

where N_S is the number of signal events and N_B is the number of background events within a mass window containing the 5th to 95th percentile of the signal. Note that the mass window width can change depending on the bin. The combination of the significances [17] over the bins is calculated with

Z = √( Σ_{i=1}^{n} Z_i² ),

where n is the number of bins and i is the index of the bin.
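Assuming the per-bin significance of Ref. [3] is the common Asimov approximation for a counting experiment and the combination of Ref. [17] is addition in quadrature, the calculation can be sketched as:

```python
import numpy as np

def bin_significance(n_s, n_b):
    # Asimov approximation for a single counting bin (assumed form of Ref. [3]).
    return np.sqrt(2.0 * ((n_s + n_b) * np.log(1.0 + n_s / n_b) - n_s))

def combined_significance(z_bins):
    # Combine per-bin significances in quadrature (assumed form of Ref. [17]).
    return np.sqrt(np.sum(np.square(z_bins)))

def equal_signal_bin_edges(signal_scores, n_bins):
    # Network-output bin edges chosen so each bin holds an equal number
    # of signal events (step 1 above), via signal-score quantiles.
    qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(signal_scores, qs)
```

For N_S much smaller than N_B, the per-bin formula reduces to the familiar N_S/√N_B.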

IV. EXAMPLE ANALYSIS
In this work, a search for the process H → Z (ℓ + ℓ − ) γ is considered, where ℓ + ℓ − represents an e + e − or µ + µ − pair. Such a study has been performed by CMS [18] and ATLAS [19] with the Run 2 LHC data using a luminosity of around 150 fb −1 , where the BDT technique was used in binning the search region. The expected significance for this standard model process of each analysis is 1.2σ [6,20]. When adding the expected Run 3 LHC data [21], where 250 fb −1 of data could be collected, and assuming similar analysis sensitivity, the expected significance is 2σ. To reach a 3σ significance, an additional 600 fb −1 of data is required. This will be achieved in the High Luminosity LHC [22] era that is planned to start in 2029, where around 300 fb −1 /year of data is expected. To reduce the amount of time to obtain evidence (3σ) of this decay, a neural network approach is explored.
The simplified H → Z (ℓ + ℓ − ) γ analysis in this work searches for a resonance in the reconstructed mass distribution of the Higgs boson candidates.To increase the significance, the search region of the analysis is binned with specially trained neural networks and the performance is compared with that of boosted decision trees.
The following sections describe the dataset, the input features for the machine-learning techniques, the machine-learning techniques themselves, the evaluation metrics, the experiment, and the results.

A. Dataset
The dataset is generated with the Monte Carlo event generator MADGRAPH5 aMC@NLO [23,24], the particle simulator PYTHIA8 [25], and the detector simulator DELPHES3 [26], where jet clustering is performed by the FastJet [27] package. For the event generation of the signal, the Higgs boson (pp → H) is generated with MADGRAPH5 aMC@NLO using the Higgs Effective Field Theory (HEFT) model [28] and decayed to H → Z (ℓ + ℓ − ) γ using PYTHIA8. The background pp → Z (ℓ + ℓ − ) γ is generated with MADGRAPH5 aMC@NLO at leading order. For detector simulation, the CMS detector settings of DELPHES3 are used. To have samples for training, validation, and testing, a dataset of 45 million events is generated for both the signal and the background, for a total of 90 million events. The total number of signal events for each of the training, validation, and testing datasets is scaled to have the standard model cross section of 7.52 × 10 −3 pb with a luminosity of 138 fb −1 , corresponding to the Run 2 luminosity of CMS. The background is scaled to have the standard model cross section of 55.5 pb with the same luminosity as the signal.
After applying the selection requirements, there are 9 million events for the signal and 5 million events for the background. The large sample helps to reduce overtraining, since the significance metric is sensitive to the region of the sample that has high classifier scores and a low number of background events.

B. Input features for machine-learning techniques
The following features are used as inputs for the machine-learning techniques that bin the analysis:
• η γ , η leading ℓ , η subleading ℓ : Pseudorapidity angle of the photon and leptons.
• Flavor of ℓ: Flavor of lepton used to reconstruct the Z boson, either being an electron or a muon.
• p ℓℓγ T /m ℓℓγ : p T of the reconstructed Higgs boson candidate divided by the mass.
• p ℓℓγ T t : Projection of the reconstructed Higgs boson p T to the dilepton thrust axis [29].
• σ ℓℓγ m : Mass reconstruction error of the ℓℓγ candidate estimated by binning signal events in η and p T for the photon and leptons, and measuring the signal's mass width for each bin.
• p γ T /m ℓℓγ , p leading ℓ T , p subleading ℓ T : Transverse momentum of the photon divided by m ℓℓγ , and transverse momenta of the leading and subleading leptons.
Many of the features are correlated with the reconstructed Higgs mass. Therefore, depending on which features are included in the machine-learning technique and which features the machine-learning technique prioritizes, the output of the machine-learning classifier can also be correlated with the mass. In particular, when the momenta of the photon and leptons are included in certain machine-learning techniques, such as BDTs, to bin the search region, the background mass distribution in certain bins tends to peak close to the Higgs boson mass, as can be seen in Fig. 4. This behavior introduces difficulties in estimating the number of signal events, so these features are typically excluded from the inputs of the machine-learning technique. However, these features can be used as inputs for neural networks that are trained to be decorrelated from the reconstructed mass.

C. Machine-learning techniques
The following machine-learning techniques are compared:
• Boosted decision trees using the TMVA framework [32], with 850 trees and a minimum node size of 2.5%.
• XGBoost [33] with 100 boosting rounds and a maximum depth of 3. The BDT was optimized by varying the number of boosting rounds, where BDTs with more than 100 rounds showed more overtraining, while the other performance metrics were similar.
• Feed-forward neural network that has an input layer with N nodes corresponding to the number of input features, a hidden layer with 4N nodes with a tanh activation function, and an output layer with one node with a sigmoid activation function. The network is implemented with PyTorch [34].
• Deep feed-forward neural network that has an input layer with N nodes corresponding to the number of input features and three hidden layers with 3N , 9N , and 3N nodes, each with a tanh activation function. The last hidden layer is connected to a dropout layer with a rate of 0.1. The output layer has a single node with a sigmoid activation function. The network is implemented with PyTorch. The network was optimized by varying the dropout rate, where the network with no dropout had significant overtraining and poorer performance, while the network with a dropout rate of 0.2 performed similarly to the network with a rate of 0.1.
• Event classifier transformer with N_repr = 16. Each event feature is embedded by a separate feed-forward network that has an input layer with one node, a hidden layer with 4 nodes with a GELU activation function, and an output layer with 16 nodes. The MHA has 4 heads, and the positionwise feed-forward network has an input layer with 16 nodes, a hidden layer with 64 nodes with a GELU activation function, and an output layer with 16 nodes. The dropout layers have a rate of 0.1. The network is implemented with PyTorch. The network was optimized by varying the dropout rate, where the network with no dropout showed overtraining and poorer performance, while the network with a dropout rate of 0.2 had slightly worse performance than the network with a rate of 0.1.
When training the machine-learning techniques, the signal and background samples are weighted to have the same number of events. All input features are normalized for the neural networks. For BDTs and networks that are trained without DisCo loss, 12 input features are used: η γ , η leading ℓ , η subleading ℓ , minimum ∆R (ℓ ± , γ), maximum ∆R (ℓ ± , γ), flavor of ℓ, p ℓℓγ T /m ℓℓγ , p ℓℓγ T t , σ ℓℓγ m , cos Θ, cos θ, and ϕ. These input features have minimal correlation with the reconstructed m ℓℓγ . The training is performed with events that have a reconstructed Higgs boson mass between 120 and 130 GeV.
When networks are trained with DisCo loss, three additional features are used: p γ T /m ℓℓγ , p leading ℓ T , and p subleading ℓ T . The data scope training method described in Sec. III B is used. The narrow mass window is from 120 to 130 GeV, and the wide mass window is from 100 to 180 GeV.
Different combinations of loss functions are explored:
1. BCE loss.
2. Extreme loss.
3. BCE + DisCo loss.
4. Extreme + DisCo loss.
To increase the stability of the calculation, the extreme loss implementation clamps the neural network output ŷ to the range from 0.001 to 0.999. The λ DisCo factor is 10 for the BCE + DisCo loss and 50 for the extreme + DisCo loss, so that the networks trained with DisCo have similar background correlations.
Each neural network is trained for 1200 epochs with the Adam optimization algorithm [35], a learning rate of 10 −3 , and a batch size of 8192. The best model among the epochs is selected by finding the model with the highest significance on the validation dataset. To reduce the training time, the model is evaluated on the validation dataset at five-epoch intervals for the first 50 epochs and at ten-epoch intervals for the remaining epochs. This schedule keeps the best-model search sensitive to the early epochs, where the evaluation metric can change rapidly.
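The evaluation schedule described above can be sketched as a small helper that lists the epochs at which the validation metric is computed:

```python
def evaluation_epochs(total=1200, early=50, early_step=5, late_step=10):
    """Epochs at which the model is evaluated on the validation dataset:
    every 5 epochs up to epoch 50, then every 10 epochs up to the end."""
    epochs = list(range(early_step, early + 1, early_step))
    epochs += list(range(early + late_step, total + 1, late_step))
    return epochs
```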

D. Machine-learning technique evaluation metrics
The machine-learning techniques are evaluated with the following metrics:
• Expected significance.
• Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve.
• Correlation of m ℓℓγ with the machine-learning technique output.
• Epoch of the best-trained model.
The significance is calculated using the method described in Sec. III C, where the sample is divided into eight machine-learning technique bins with an equal number of signal events in each bin. The AUC is calculated using the scikit-learn [36] package with negative weights set to zero, evaluated in the Higgs boson mass window from 120 to 130 GeV.
The correlation is measured with the following steps:
1. The search region is binned according to the machine-learning technique output.
2. Normalized m ℓℓγ histograms are created for each machine-learning technique bin.
3. The differences between the machine-learning technique bins are measured with the standard deviation of the normalized yield for each mass bin.
4. The mean of the standard deviations is calculated.
When calculating the correlation, the sample is divided into eight machine-learning technique bins with an equal number of signal events in each bin.
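The four steps above can be sketched as follows, where `score_edges` are the technique-output bin boundaries (chosen in the paper so that each bin holds an equal number of signal events):

```python
import numpy as np

def shape_correlation(scores, mass, score_edges, mass_bins):
    """Mean over mass bins of the standard deviation, across
    technique-output bins, of the normalized mass histograms."""
    bins = np.digitize(scores, score_edges)  # technique-output bin index
    hists = []
    for b in range(len(score_edges) + 1):
        h, _ = np.histogram(mass[bins == b], bins=mass_bins)
        hists.append(h / max(h.sum(), 1))    # normalized mass histogram
    # Std. dev. across technique bins per mass bin, then the mean.
    return float(np.mean(np.std(np.array(hists), axis=0)))
```

A value of zero means the mass shape is identical in every technique-output bin, i.e., the output is uncorrelated with the mass in this measure.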

E. Experiment and results
A repeated random subsampling validation procedure [37] is used to evaluate the machine-learning techniques, where three trials are performed. The number of trials is limited by the long training time of the neural networks. The validation procedure follows the steps below:
1. The events in the dataset are randomly shuffled and divided equally into training, validation, and test datasets.
2. The machine-learning technique is trained using the training dataset.For neural networks, the best model between the epochs is selected using the expected significance metric on the validation dataset.
3. The performance of the machine-learning technique using the evaluation metrics is evaluated on the test dataset.
4. The steps above are repeated N times, where in this work N = 3. The machine-learning technique is reinitialized for each trial.
5. After all the trials, the evaluation metrics are averaged to assess the performance of the machine-learning techniques.
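The splitting in step 1 can be sketched as follows (equal thirds for training, validation, and test; the seed handling is an assumption):

```python
import numpy as np

def subsample_splits(n_events, n_trials=3, seed=0):
    """Yield (train, validation, test) index arrays for each trial of the
    repeated random subsampling procedure."""
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        idx = rng.permutation(n_events)  # fresh shuffle per trial
        third = n_events // 3
        yield idx[:third], idx[third:2 * third], idx[2 * third:]
```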
The average evaluation metrics for the machine-learning techniques are shown in Table I. The experiment results show the following:
• The event classifier transformer trained with DisCo loss shows the highest significance with the lowest background correlation among the machine-learning techniques.
• The deep feed-forward network and event classifier transformer trained with BCE loss show the highest AUC and significance but have higher background correlations.
• The neural networks trained with the DisCo loss show the lowest background correlation.
• Networks trained with the extreme + DisCo loss train more quickly than networks trained with the BCE + DisCo loss, while showing similar performance.
There is an upper limit on the performance of a classifier due to the similarities between the signal and background, and machine-learning classifiers can be optimized to approach this upper limit in certain phase-space regions. Therefore, the goal for the machine-learning classifier is to approach the upper limit in the phase space that is most relevant for the problem. The event classifier transformer tends to approach the upper limit on the significance metric better than the other machine-learning techniques used in this paper. However, with more hyperparameter optimization of the other machine-learning techniques, or by using different machine-learning techniques, it could be possible to get closer to the upper limit; this is left for future studies.
The machine-learning technique metrics are shown for one of the subsampling trials. The correlation between m ℓℓγ and the network output is shown in Fig. 5 for the deep feed-forward network trained with BCE loss and the event classifier transformer trained with the extreme + DisCo loss. The network output distribution is shown in Fig. 6, where overtraining of the network is measured with the χ 2 test [38] implemented in ROOT [39] by comparing the network output distributions on the training sample and the validation sample. The residuals in the plot are the normalized differences between the output distributions defined in [38]. For both networks, the bump in the middle of the signal distribution is related to the networks producing different output distributions depending on p ℓℓγ T /m ℓℓγ . An effect from the DisCo term can be seen, where the minimum network output value is shifted upwards.
It is noted that, when increasing the number of input features, the increase in the number of trainable network weights for the event classifier transformer is small compared to that of a deep feed-forward network. The smaller increase can make the network easier to train when more input features are used. For example, when changing the number of input features from 12 to 15, the increase in the number of trainable weights for the event classifier transformer is 265 (a 5% increase), while for the deep feed-forward neural network it is 4,671 (a 55% increase).
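The deep feed-forward count can be checked with a short calculation; `dense_params` below counts the weights and biases of a fully connected stack, and the layer widths follow Sec. IV C. (Each 1 → 4 → 16 embedding stack of the transformer adds 88 parameters per feature, 264 for three features, in line with the quoted 265; the small difference presumably comes from a component not detailed in the text.)

```python
def dense_params(layers):
    # Trainable weights and biases of a fully connected stack of layers.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))

def dfnn_params(n):
    # Deep feed-forward network of Sec. IV C: N -> 3N -> 9N -> 3N -> 1.
    return dense_params([n, 3 * n, 9 * n, 3 * n, 1])
```

For 12 versus 15 input features, `dfnn_params(15) - dfnn_params(12)` reproduces the quoted increase of 4,671.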
During training of the feed-forward network with BCE loss, the BCE loss value and the significance can have different trends over the training epochs, which is shown in Fig. 7.This discrepancy demonstrates that when selecting the best model between the epochs, the expected significance should be used instead of the loss.
A study was also performed to compare the results when the data scope training method was not applied; instead, the wide mass window was used for both the classifier loss and the DisCo loss. The training became difficult: the correlation could be high, or the network outputs could converge to a single value. The networks that were successfully trained had a few-percent poorer significance at similar background correlations.

V. RELATED WORK
The Particle Transformer [8] is a transformer-based neural network targeted toward jet-flavor tagging. The structure is similar to that of the transformer [7], where each particle in a jet is treated as a token. Additionally, variables calculated from the features of each pair of particles are passed through a feed-forward network and added to the attention scores of the MHA. For the final prediction, class tokens are passed as a query to a MHA. The event classifier transformer in this work has a similar structure; the main differences are that each event feature is embedded into a separate token using separate feed-forward networks and that the neural network output is used to bin a bump hunt analysis.
There has been work to develop a loss function targeted toward significance, where a loss based on the inverse of the significance [40] has been studied. However, when a network is trained with this loss, the neural network outputs cluster around 0 or 1. This behavior makes the loss unusable for binning a bump hunt analysis. The proposed extreme loss can enhance the significance without the network outputs clustering around 0 or 1. It can also reduce the number of epochs needed to reach the optimal performance of the network.
DisCo [15] has been used for jet flavor tagging and for the ABCD analysis technique [41], where it has been shown to be effective in decorrelating a targeted feature of the jet with the neural network output and decorrelating two neural network outputs used for the ABCD method.In this work, DisCo is used to decorrelate the neural network output with the reconstructed mass to bin a bump hunting analysis.

VI. SUMMARY AND CONCLUSIONS
A transformer-based neural network is proposed to increase the expected significance of a search for a resonance in a reconstructed mass distribution.The significance is enhanced by binning events using a network that discriminates between signal and background events.To apply the transformer architecture for this task, each event feature is passed through a separate feed-forward network to create tokens for the transformer.This network is called the event classifier transformer.
Special training techniques are proposed to enhance the significance and reduce the correlation between the network's output and reconstructed mass.
• Extreme loss is proposed that can enhance the significance and reduce the number of training epochs compared to the commonly used binary cross-entropy loss.
• DisCo can be used to reduce the correlation.This allows the network to have input event features that are correlated with the reconstructed mass.
• Data scope training is proposed, where loss terms have different data scopes.This method can increase the significance and reduce the correlation.
• A significance selection metric is proposed for choosing the best model between the training epochs instead of loss.
In the context of a simplified H → Z (ℓ + ℓ − ) γ search, the new event classifier transformer trained with the special techniques shows higher significance and lower mass correlation when compared with boosted decision trees and feed-forward networks.This result demonstrates the potential of the event classifier transformer and the specialized training techniques targeted toward binning a search for a resonance in the reconstructed mass distribution.
FIG. 1. Left: mass of reconstructed Higgs boson candidates from the H → Z ℓ + ℓ − γ decay, where a bump can be seen due to the presence of the Higgs boson particle. The Higgs boson cross section was scaled up by 100 to make the bump visible. Right: mass of reconstructed Higgs boson candidates from the H → Z ℓ + ℓ − γ decay with the nominal Higgs boson cross section, where the bump cannot be seen due to the background.
Loss = Loss_classifier(ŷ, y) | y ∈ {0, 1}, mass ∈ narrow mass window
     + λ • DisCo(mass, ŷ) | y = 0, mass ∈ wide mass window.    (6)

FIG. 4. Top: reconstructed m ℓℓγ background distributions, where each histogram is a bin in the XGBoost output distribution with an equal number of signal events. Lower signal percentile (sig. p) values correspond to higher output values.

FIG. 5. m ℓℓγ distribution of the background, where each histogram is a bin in the machine-learning technique output distribution. Each bin has an equal number of signal events. Lower signal percentile (sig. p) values correspond to higher network output values. Top: deep feed-forward network trained with BCE loss. Bottom: event classifier transformer network trained with the extreme + DisCo loss. The correlation represents the magnitude of the difference in the shapes between the machine-learning bins. A lower correlation can be observed for the network trained with DisCo loss.

ACKNOWLEDGMENTS

FIG. 7. BCE loss value and expected significance over the training epochs for the feed-forward network trained with BCE loss.
FIG.6.Left: deep feed-forward network trained with BCE loss.Right: event classifier transformer network trained with the extreme + DisCo loss.Top: network output distribution on the training dataset and validation dataset.Bottom: overtraining of the network evaluated by comparing the network output distribution between the training and validation datasets for the signal and background using a χ 2 test, where residuals are the normalized differences between the output distributions.

TABLE I. Average evaluation metrics for the machine-learning techniques using the random subsampling evaluation procedure with three trials. Random is a random classifier. FNN is a feed-forward network. DFNN is a deep feed-forward network. ETN is an event classifier transformer network. N.A. means not available. Ext is extreme loss. Signi. is the expected significance. AUC is the area under the curve of the receiver operating characteristic curve. Bkg. Corr. is the correlation of m ℓℓγ with the machine-learning technique output for the background, calculated with the method described in Sec. IV D. Best epoch is the epoch that had the highest significance on the validation dataset. Bold numbers indicate the best values over the machine-learning techniques.