Searching for gluon quartic gauge couplings at muon colliders using the auto-encoder

One of the difficulties to be faced in future phenomenological studies of new physics (NP) is the need to deal with increasing amounts of data. It is therefore increasingly important to improve the efficiency of phenomenological studies of NP. The use of the Standard Model effective field theory (SMEFT), of machine learning (ML) algorithms, and of quantum computing are all means to this end. In this paper, we use an ML algorithm, the auto-encoder (AE), to study the dimension-8 operators in the SMEFT which contribute to the gluon quartic gauge couplings (gQGCs) at muon colliders. The AE is one of the ML algorithms with the potential to be accelerated by quantum computing. It is found that the AE-based anomaly detection algorithm can be used as an event selection strategy to study the gQGCs at muon colliders, and is effective compared with traditional event selection strategies.


I. INTRODUCTION
Supported by a large amount of experimental evidence, it can be concluded that the Standard Model (SM) is able to describe and explain the vast majority of phenomena in particle physics, with a few rare exceptions. These exceptions include experimental results such as the neutrino mass [1][2][3], the W boson mass problem [4,5], the muon g−2 problem [6][7][8], and more [9]. Besides, the SM cannot describe dark matter, gravity, etc. As a result, the existence of new physics (NP) beyond the SM has been widely believed, and the search for NP as well as precision measurements have been at the forefront of interest in the high energy physics (HEP) community [10].
Both the search for NP and precision measurements require dealing with a large number of events. With more and more data to be processed, more efficient ways to search for NP are called for. One of the reasons why the SM effective field theory (SMEFT) [11][12][13][14] has been widely used in the phenomenological study of NP in recent years is that the SMEFT has an outstanding advantage: it allows searching for NP signals with high efficiency. In the SMEFT, the NP particles are integrated out, and the NP effects become new interactions of known particles. Formally, the new interactions appear as higher dimension operators with Wilson coefficients suppressed by powers of an NP scale Λ. The operators that are most likely to be found are those with Wilson coefficients least suppressed by Λ. The high efficiency of searching for NP using the SMEFT is demonstrated by the fact that, instead of dealing with various NP models, the number of operators to be considered at a specific order of Λ in the SMEFT is finite. Not only that, if an operator is not found, then the multiple NP models contributing to this operator are also constrained. However, as the importance of the dimension-8 operators in the SMEFT has been realized [15][16][17][18], more and more phenomenological studies have been devoted to the dimension-8 operators in recent years. For one generation of fermions, there are 895 baryon number conserving dimension-8 operators [46,47], and there are even more operators if one considers operators with dimension higher than eight. A procedure to select events that does not rely on the operators to be searched for can further improve efficiency.
Machine learning (ML) algorithms are one of the ways to process data efficiently. ML is a multidisciplinary field studying how computers can mimic and implement human learning behaviors in order to acquire new knowledge, and it has already been used in HEP studies [48][49][50][51][52][53][54][55][56][57]. ML algorithms have an advantage in processing complex data, and one of their common applications is anomaly detection (AD). When using AD to search for NP models, its implementation is often independent of the NP model to be searched for.
While the hyperparameters in ML algorithms will often be NP model-dependent, the procedure of tuning the hyperparameters is usually common to different NP models. Based on this, an event selection strategy utilizing AD can be viewed as an automated event selection strategy.
Meanwhile, quantum computing is another effective way to deal with large amounts of data. Many ML algorithms can be implemented on quantum computers [68][69][70], an example of which is the auto-encoder (AE) algorithm [71,72]. The AE is an unsupervised dimensionality reduction algorithm using artificial neural networks (ANNs), which is at the same time capable of AD. Therefore, it can be expected that the AE can be used to detect NP, and the AE has already been used in phenomenological studies in HEP [65,66,73,74].
Analogous to the principal component analysis (PCA) algorithm, the most common scenario for the AE is data dimensionality reduction. While PCA performs a linear dimensionality reduction by solving for the eigenvectors, the AE algorithm performs a nonlinear dimensionality reduction. It has been verified that AD based on PCA is able to search for NP [64]; therefore it can be expected that a similar approach also works for the AE. Similar to PCA, the AE has the potential for quantum acceleration [75][76][77][78], and thus the AE holds great promise for processing large amounts of data.
To verify the feasibility of the AE algorithm in the search for high dimensional operators in the SMEFT, we use an AE-based AD algorithm (AEAD) to study the gluon quartic gauge couplings (gQGCs) [79,80] in this paper. In the SMEFT, operators contributing to gQGCs start from dimension-8; therefore we concentrate on those dimension-8 operators.
As an arena, the processes affected by gQGCs at muon colliders are considered. Muon colliders have been hotly discussed in recent years for searching for NP signals [81][82][83][84][85][86][87][88][89][90][91]. A muon collider has the advantage of being able to explore both the high luminosity and high energy frontiers, while at the same time, as a lepton collider, being less affected by the QCD background. On the one hand, higher collision energies are better if one is committed to studying dimension-8 or even higher dimension operators; on the other hand, higher luminosities require more efficient means of processing data. Thus, the processes at muon colliders that are affected by gQGCs are both worth studying and suitable for exploring the AEAD algorithm. For comparison with a conventional event selection strategy, in this paper the process µ+µ− → jjνν̄ is studied, which has been studied in Ref. [92]. It has been shown that this process is sensitive to the gQGCs.
The rest of the paper is organized as follows: in Section II, the dimension-8 operators contributing to the gQGCs are introduced; in Section III, the event selection strategy based on the AEAD algorithm is presented; in Section IV, the numerical results and expected constraints on the operator coefficients are presented; Section V is a summary.

II. DIMENSION-8 OPERATORS CONTRIBUTING TO THE GQGCS
The gQGCs arise from the Born-Infeld (BI) extension of the SM, which was originally motivated by the idea that there should be an upper limit on the strength of the electromagnetic field [93]. It has been shown that the BI model is also related to M-theory inspired models [79,80,94-96]. In the SMEFT, the operators contributing to gQGCs appear at dimension-8 and are built from products of the gluon field strength G^a_µν with the electroweak field strengths W^i_µν and B_µν, suppressed by mass scales M_i associated with NP particles. For convenience, we define f_i ≡ 1/(16M_i^4). The expected constraints on M_i at the Large Hadron Collider (LHC) with the center of mass (c.m.) energy √s = 13 TeV and luminosity L = 36.7 fb−1, obtained by using the process gg → γγ, are shown in Table I. The combined sensitivities of the Zγ and γγ channels at the LHC with √s = 13 TeV and L = 137 fb−1 [80] are about three times those shown in Table I.
Different from the case of hadron colliders, where the operators are classified into four pairs with the same Lorentz structures in phenomenological studies, at muon colliders the pairs can be decoupled. In particular, the process µ+µ− → jjνν̄ can be affected by the vector boson scattering (VBS) subprocess W+W− → gg, which is associated with only the O_gT,0,1,2,3 operators. The Feynman diagrams are shown in Fig. 1. Since the VBS contribution is logarithmically enhanced at large c.m. energies compared with the tri-boson process, we concentrate on the O_gT,0,1,2,3 operators. It has been shown that at muon colliders the process µ+µ− → jjνν̄ can achieve a competitive sensitivity compared with hadron colliders [92].
In the process µ+µ− → jjνν̄, there is no interference between the SM and the gQGCs, which simplifies the procedure to obtain the expected constraints. However, there are two (anti-)neutrinos in the final state. This usually results in some loss of information, which in turn affects the efficiency of the event selection strategy. This provides a good testing ground for whether the AEAD algorithm is effective.

FIG. 2: The graphical representation of the AE. The AE network can be decomposed into two parts: the encoder consists of the x^in, x^1, x^2 and latent layers, and the decoder consists of the latent, x^3, x^4 and output layers. In this paper, we use a densely connected network.
III. THE EVENT SELECTION STRATEGY BASED ON THE AEAD ALGORITHM

The AE is a type of ANN which can be used in various applications, including data compression, feature extraction, denoising, data generation, etc. The structure of an AE is shown in Fig. 2; it is primarily composed of two parts, the encoder and the decoder. Both the encoder and the decoder can consist of multiple layers, with the neurons in the input layer denoted as x^in_i and those in the output layer denoted as X^out_i. The data input to the input neurons and the data obtained from the output neurons are also denoted as x^in_i and X^out_i, respectively. The number of neurons in the input layer is the same as the number in the output layer, and both equal the dimension of the input vector (denoted as l). In this paper, l = 8. The goal of training an AE is to reconstruct the input data, i.e., the labels of the training data are just the input data; therefore the training of an AE is unsupervised learning. The training objective is to minimize the reconstruction error between ⃗x^in and ⃗X^out, aiming to obtain from the output layer a set of vectors as similar as possible to the vectors fed into the input layer.
A well-trained AE can reproduce the input by utilizing a few variables α_i (i = 1, ..., k, with k < l) along with the decoder network. Being able to reconstruct the input data indicates that the information in the α_i is enough to describe the data with the help of the decoder. It also means that the encoder is able to compress the information in the input data into k numbers α_i (which form what is often called the latent space), i.e., to perform data dimensionality reduction.
The reason an AE can achieve data dimensionality reduction lies in the fact that the features of the input data are not independent. The encoder learns the relationships among the features and compresses them into the α_i variables, which can then be used by the decoder to reconstruct ⃗X^out. This mechanism of the AE can be used for AD. It is well known that a challenge for ML methods is the ability to extrapolate into unknown phase space regions; AEAD turns this disadvantage in extrapolation into a feature, i.e., an artificial intelligence trained on the SM cannot recognize NP and will treat NP as anomalies.
At first, the AE is trained on SM events. Since the events are generated according to the physical laws of the SM, the relationships between the features learned by the AE are also related to the physical laws of the SM hidden behind them. At the same time, it can be expected that the reconstruction of events generated according to the physical laws of NP will be less accurate than that of the SM, because the AE has not learned the relationships between the features of the NP events. The mean squared error (mse) between ⃗x^in and ⃗X^out can be used to represent the reconstruction error; for a single event, we denote by d the mse averaged over the l components, while the training loss is the sum of d over the events divided by the number of events N. In AEAD, d can also be used as an anomaly score. It can be expected that d is larger for NP events than for SM events, and therefore d can be used to select the NP events.
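As a concrete illustration of the score, here is a minimal numpy sketch, assuming d is the per-event mse averaged over the l components as described above (the function name is ours, not from the paper's code):

```python
import numpy as np

def anomaly_score(x_in, x_out):
    """Per-event anomaly score d: the mean squared error between each
    input vector and its AE reconstruction, averaged over the l components."""
    x_in = np.asarray(x_in, dtype=float)
    x_out = np.asarray(x_out, dtype=float)
    return np.mean((x_in - x_out) ** 2, axis=-1)

# Two toy events with l = 4 components each (illustrative only):
d = anomaly_score([[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]],
                  [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 2.0]])
# The first event is reconstructed perfectly (d = 0); the second is not.
```

Averaging the scores over a data set then gives the mse training loss.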
Following the above idea, the AEAD event selection strategy can be summarized as follows:
• Generate the training and validation data sets using Monte Carlo (MC) simulation; they consist of events from the SM.
• Train the AE, and use the validation data set to avoid overfitting.
• For a test data set, which can be obtained either from experiment or from MC simulation, calculate d for each event.
• Use a threshold d_th to select events, i.e., select the events with d > d_th, where d_th can be tuned according to the signal significance or the expected constraints on the parameters of NP.
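The last step of the strategy amounts to a simple cut on d. A hedged sketch of that step (the helper names are illustrative):

```python
import numpy as np

def select_events(d, d_th):
    """Boolean mask of events passing the AEAD cut d > d_th."""
    return np.asarray(d, dtype=float) > d_th

def cut_efficiency(d, d_th):
    """Fraction of events passing the cut, i.e. the efficiency at d_th."""
    mask = select_events(d, d_th)
    return np.count_nonzero(mask) / mask.size
```

Scanning d_th and recomputing the efficiencies for the SM and NP samples is then enough to tune the threshold against the signal significance.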
Note that, although the AE is an unsupervised ML algorithm, in AEAD the SM data set is used in the training phase; therefore the AEAD is no longer fully unsupervised. However, the NP data sets are not used in the training phase. The test data set can come from experiment, and there is no need to know whether it contains NP or what kind of NP it might contain.
It can be expected that the signal of NP can be traced by the AEAD whenever the test data set contains events that differ from the laws of the SM.

A. Data preparation
To compare with the traditional event selection strategy, in this paper we use the same events generated as in Ref. [92]. The events are generated using MC simulation with the MadGraph5_aMC@NLO toolkit [97][98][99], where the standard cuts are set as the default. The parton shower is applied using Pythia8 [100] with default settings. A fast detector simulation is performed using Delphes [101] with the muon collider card. The data cleaning and preparation phase is carried out using MLAnalysis [102]. The ANN is constructed and trained using Keras with the TensorFlow backend [103]. The events for the NP are generated with one operator at a time. The operator coefficients are set the same as in Ref. [92]. The cross-sections of the SM contribution and the NP contributions, the operator coefficients f_i used in the simulation, the cross-sections after the N_j cut (denoted as σ̄), and the partial wave unitarity bounds f^U_i on the operator coefficients are listed in Table II.
As an EFT, the SMEFT is valid only below a certain energy scale. One of the signals that the SMEFT is no longer valid is the violation of unitarity [104][105][106], and partial wave unitarity is often used in phenomenological studies of the SMEFT to check whether the SMEFT is valid, which can set bounds on the operator coefficients [107][108][109][110][111][112][113]. The operator coefficients used in the MC simulation are within the constraints set by the partial wave unitarity bounds [92]. The partial wave unitarity bounds (denoted as f^U_i) are listed in Table II. Denoting by σ_SM the cross-section of the SM contribution, and by σ_gT,i the NP contributions with operator coefficients f_i, the cross-sections and operator coefficients f_i are listed in Table II.
In Delphes we use the fast jet finder with the anti-k_T algorithm, with R = 0.5 and p^j_T > 20 GeV, where R is the cone radius and p^j_T are the transverse momenta of the jets. After the events are generated, we require the final states to have at least two jets. This requirement is denoted as the N_j cut; the cross-sections after the N_j cut are denoted as σ̄_SM and σ̄_gT,i, and are also listed in Table II. Then, an eight dimensional vector is built to represent each event, ⃗v = (p^(1)_t, p^(1)_x, p^(1)_y, p^(1)_z, p^(2)_t, p^(2)_x, p^(2)_y, p^(2)_z), where p^(1) and p^(2) are the four-momenta of the hardest and second hardest jets. Note that p^(1)_t and p^(2)_t are the energies of the jets in this paper. In the following, we consider only the 8-dimensional vectors described above, ignoring the physical meaning behind them.
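A minimal sketch of building ⃗v from the two hardest jets; the (p_t, p_x, p_y, p_z) ordering follows the description above, and the helper name is ours:

```python
import numpy as np

def event_vector(jets):
    """jets: iterable of four-momenta (p_t, p_x, p_y, p_z), assumed sorted
    from hardest to softest; returns the 8-dim vector of the two hardest."""
    j1, j2 = jets[0], jets[1]
    return np.concatenate([np.asarray(j1, dtype=float),
                           np.asarray(j2, dtype=float)])
```

Any further jets beyond the two hardest are simply ignored in this representation.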
For the SM contribution, we generate 1000000 events for each c.m. energy. Taking the case of √s = 3 TeV as an example, 832056 events are left after the N_j cut, of which 400000 events constitute the training data set, 100000 events constitute the validation data set, and the rest constitute the test data set. For each operator, 300000 events are generated, and all the events left after the N_j cut are used as a test data set. The numbers of events after the N_j cut for the NP are generally more than 297000. Before the data sets are fed to the AE, a z-score standardization [114] is applied, i.e., the vectors ⃗v^n are replaced by ⃗x^in,n such that x^in,n_i = (v^n_i − v̄_i)/ϵ_i, where x^in,n_i and v^n_i are the i-th components of the ⃗x^in,n and ⃗v^n vectors, and v̄_i and ϵ_i are the mean value and standard deviation of the i-th component over the training data set. The v̄_i and ϵ_i used in this paper are listed in Table III.
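The z-score standardization can be sketched as follows, with v̄_i and ϵ_i computed only on the training data set (a sketch; the function names are ours):

```python
import numpy as np

def zscore_fit(v_train):
    """Mean and standard deviation of each component over the training set."""
    v_train = np.asarray(v_train, dtype=float)
    return v_train.mean(axis=0), v_train.std(axis=0)

def zscore_apply(v, vbar, eps):
    """x_i = (v_i - vbar_i) / eps_i, with the training-set statistics."""
    return (np.asarray(v, dtype=float) - vbar) / eps
```

The same training-set v̄_i and ϵ_i are applied to the validation and test sets, so that all data sets share one standardization.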
The AE is trained to reproduce the input vectors; therefore the labels of the training data sets are the same as the inputs of the data sets.

B. Structure of the AE network
Since the events are represented by eight dimensional vectors, the numbers of neurons in the input layer of the encoder and in the output layer of the decoder are both eight. The number of layers and the number of neurons in each layer are depicted in Fig. 2. In this paper, we use a densely connected ANN. Taking the hidden layer x^1_i as an example, the values at the neurons x^1_i can be calculated as ⃗x^1 = g^in,1(W^in,1 ⃗x^in + ⃗b^1), where W^i,j is the weight matrix stored in the links between the i-th and j-th layers, ⃗b^i is the bias vector stored in the neurons of the i-th layer, and g^i,j is the activation function between the i-th and j-th layers. In this paper, we use the "LeakyReLU" function [115] between the layers, with α = 0.01. Specifically, g^in,1, g^1,2, g^latent,3, and g^3,4 are LeakyReLU functions, while g^2,latent and g^4,out are linear activation functions. The loss function defines how well the input vectors can be reproduced by the AE. In this paper, we use the mse as the loss function.
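A numpy sketch of the forward pass just described; the hidden-layer widths below are illustrative assumptions, while the 8-dim input/output and the activation pattern (LeakyReLU everywhere except into the latent and output layers) follow the text:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """LeakyReLU with slope alpha = 0.01 for negative inputs."""
    return np.where(x > 0.0, x, alpha * x)

def dense(x, W, b, g=None):
    """One densely connected layer: x -> g(W x + b); g = None means linear."""
    z = W @ x + b
    return g(z) if g is not None else z

rng = np.random.default_rng(0)
l, k = 8, 2                      # input dimension and latent dimension
widths = [l, 6, 4, k, 4, 6, l]   # illustrative layer widths (not from Fig. 2)
params = [(rng.normal(size=(n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(widths[:-1], widths[1:])]
# LeakyReLU between layers, linear into the latent and output layers:
acts = [leaky_relu, leaky_relu, None, leaky_relu, leaky_relu, None]

def autoencode(x):
    for (W, b), g in zip(params, acts):
        x = dense(x, W, b, g)
    return x

x_out = autoencode(rng.normal(size=l))  # same shape as the input
```

In practice the weights and biases are of course learned by minimizing the mse loss rather than drawn at random; the sketch only fixes the architecture and activations.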
It is expected that the AE is able to distinguish between the SM and NP because it is able to find the patterns of the SM while being unable to learn the patterns of NP. Therefore we need the AE to have weak generalization properties and to reproduce only the events of the SM accurately, and the smaller k is, the better the performance of AEAD is expected to be. To investigate the effect of the dimension of the latent space, four cases are considered: k = 1, 2, 3 and 4.

C. Early stopping
Overfitting is the situation where the model performs well on the training set but relatively poorly on the test set. In that case, the model is weak in predicting unknown data. One of the methods to avoid overfitting is early stopping.
The process of training the model is the process of updating the model parameters (i.e., the weight matrices W^i,j and bias vectors ⃗b^i) through learning. We use the early stopping method to avoid overfitting in this paper. As an example, the mses for the training data set and the validation data set as functions of the number of epochs for k = 1 are shown in Fig. 3. We stop when the mean of the mse over the last 50 epochs of the validation set is larger than the one over the last 100 epochs, which is checked every 100 epochs.
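The stopping rule just stated can be sketched as follows (a sketch of the criterion only, not of the full Keras training loop; the function name is ours):

```python
import numpy as np

def should_stop(val_mse_history):
    """Check (every 100 epochs) whether the mean validation mse over the
    last 50 epochs exceeds the mean over the last 100 epochs."""
    h = np.asarray(val_mse_history, dtype=float)
    if h.size < 100 or h.size % 100 != 0:
        return False  # the criterion is only evaluated every 100 epochs
    return h[-50:].mean() > h[-100:].mean()
```

In a training loop (or a custom Keras callback) this would be called once per epoch with the validation-loss history accumulated so far.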
In the training, we train the networks up to 500 epochs while keeping only the network with the smallest validation mse. The phenomenon described above is due to the fact that there are events in the SM data set that are more anomalous relative to the rest of the SM events, and where they are partitioned determines the mse. Normally, this is a situation where the mean absolute error (mae) would be more appropriate as a loss function; however, since the mse is more sensitive to anomalous samples and our goal is not to reproduce the SM events but to find anomalies, we still use the mse as the loss function.

D. Distribution of the anomaly score
As described in the previous section, AEAD uses how well the AE can reproduce the input as the anomaly score; that is, one can use d as the anomaly score. Note that d is defined using the data after z-score standardization, and is therefore dimensionless.
In this subsection and in the following, we concentrate on the test data sets.

E. Latent space distribution
It is known that the AE can also be used as a data dimensionality reduction algorithm; a scheme that uses the AE for classification employs the AE as a data preparation stage. In combination with the AE, other classification algorithms or AD algorithms can be applied in the latent space, i.e., the space consisting of the α_i values in the middle layer. Since the AE is trained to approximately reproduce the events of the SM, the α_i contain the major information needed to reconstruct the events. This is possible because there are hidden relationships between the components of the vectors input to the AE; as a result, the components of the input vectors are not independent of each other. After dimensionality reduction, the events can be represented by a smaller number (k in our case) of variables.
Therefore, with the help of latent space, the features of the SM and NP are more easily presented visually.
The distributions of the events in the latent space for k = 1 are shown in Fig. 6, and the case of k = 2 is shown in Fig. 7. It can be seen that, in the latent spaces, the NP events already distribute differently from the SM events. Due to the nonlinearity of the activation functions (piecewise functions are used as the activation functions), different regions in the latent space often correspond to different functions imposed on the α_i in the decoder, and therefore represent different relationships among the components of the input/output vectors. The fact that the SM and NP events distribute differently in the latent space reflects the conjecture that, since the hidden relationships between the components are obtained by training the AE on SM events, the events of the NP are not described by these relationships. This explains why d can be used to search for the NP signals.

F. Cut efficiency
Since there is no interference between the SM and NP, the cross-section after the cut can be written as σ = ε_SM σ̄_SM + ε_O_gT,i σ̄_gT,i, where σ̄_SM and σ̄_gT,i are listed in Table II, ε_SM is the cut efficiency of selecting events with d > d_th for the SM, and ε_O_gT,i are the cut efficiencies for the NP signals. Note that ε does not include the effect of the N_j cut (denoted as ε_N_j), which is already included in σ̄ = ε_N_j σ. ε_SM and ε_O_gT,i are functions of d_th, and to facilitate the study of the expected coefficient constraints, we fit the cut efficiencies using rational functions. For the case of ε_SM with k = 1 and √s = 3 TeV, we use a rational function with six parameters, and for the other cases we use a rational function with four parameters. As an example, the results of the fits in the case of k = 1 are shown in Fig. 8. It can be seen that ε_SM is much smaller than ε_O_gT,i, indicating that the event selection strategy using d_th can be used to suppress the background.
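The explicit form of the rational functions is not reproduced here; as an illustration, a four-parameter ansatz ε(d_th) ≈ (a0 + a1 d_th)/(1 + b1 d_th + b2 d_th²) — an assumed functional form, not necessarily the paper's — can be fitted with scipy:

```python
import numpy as np
from scipy.optimize import curve_fit

def rational4(d, a0, a1, b1, b2):
    """Four-parameter rational function used as an illustrative ansatz."""
    return (a0 + a1 * d) / (1.0 + b1 * d + b2 * d ** 2)

# Synthetic efficiency curve standing in for the measured ε(d_th):
d_grid = np.linspace(0.0, 5.0, 50)
true_p = (1.0, 0.1, 2.0, 1.0)
eps_meas = rational4(d_grid, *true_p)

popt, _ = curve_fit(rational4, d_grid, eps_meas, p0=(1.0, 0.0, 1.0, 1.0))
```

A fitted closed form like this makes scanning d_th for the optimal constraints cheap, since ε no longer has to be recomputed from the event samples at every threshold.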

G. Expected constraints on the operator coefficients
Usually, when NP signals are not found, the task is to set constraints on the operator coefficients. This can be done by using the statistical signal significance S_stat defined in Refs. [116,117], where N_bg is the number of background events and N_s is the number of signal events. Since there is no interference between the SM and NP, the numbers of events after the cuts can be obtained as N_s = ε_O_gT,i × L × σ̄_gT,i and N_bg = ε_SM × L × σ̄_SM, where ε is the cut efficiency, σ̄ is the cross-section after the N_j cut, and L is the luminosity.
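The explicit formula for S_stat is the one of Refs. [116,117]; as a hedged sketch we use the common choice S_stat = N_s/√(N_s + N_bg) below, which may differ from the definition actually adopted there:

```python
import math

def expected_counts(eps_sig, sigma_sig, eps_bg, sigma_bg, lumi):
    """N_s and N_bg from the cut efficiencies, the post-N_j-cut
    cross-sections (e.g. in fb), and the integrated luminosity (fb^-1)."""
    return eps_sig * sigma_sig * lumi, eps_bg * sigma_bg * lumi

def significance(n_s, n_bg):
    """Assumed definition: S_stat = N_s / sqrt(N_s + N_bg)."""
    return n_s / math.sqrt(n_s + n_bg)
```

With the fitted ε(d_th), both counts become smooth functions of d_th and f_i, so the 2σ constraint is found by solving S_stat = 2.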
Using the fitted ε(d_th), the expected constraints on the operator coefficients can be directly obtained. The expected constraints for S_stat = 2 in the "conservative" case are shown in Fig. 9.
The expected constraints on f_i and M_i in both the "conservative" and "optimistic" cases [88] are calculated and listed in Tables V and VI. Taking the case of √s = 3 TeV as an example, in this paper the 2σ constraint on M_0 is about 27% of √s, which is smaller than √s. From the point of view of the partial wave unitarity bounds, there is no sign that the validity of the EFT has been violated. However, if we assume that f_i = c_i/Λ^4 and that Λ ≥ √s, then c_i ∼ O(4π) (for example, |f_0| s^2 < 12.1 for S_stat = 2 and √s = 3 TeV).
Thus the constraints are still in the strongly coupled scenarios, and one can expect that combined constraints from multiple processes or higher luminosities can further tighten the constraints. Compared with the HL-LHC, where the combined constraint on M_0 (combining the pp → γγ, pp → ℓ+ℓ−γ, pp → νν̄γ and pp → qq̄γ channels) is about 22% of √s [80], our result indicates that the muon collider is sensitive to gQGCs.
The uncertainties in this paper come from several aspects. In the MC simulation, higher order contributions are not included. The beam induced background is also important,

FIG. 1: Feynman diagrams for the process µ+µ− → jjνν̄, where (a) shows the Feynman diagrams of the gQGC contribution, and (b) shows typical Feynman diagrams of the SM background.

In the training phase, the data set is divided into a training data set and a validation data set, and only the training data set is used to update the parameters (the weight matrices W^i,j and bias vectors ⃗b^i). During the training process, the errors for the training set and the validation set gradually decrease; after reaching a critical point, the errors for the training set continue to decrease while those for the validation set start to increase. Early stopping prevents overfitting by stopping the training before the critical point, i.e., the number of iterations is truncated.

FIG. 3: The learning curves in the training phase for k = 1.The top-left panel corresponds to √ s = 3 TeV, the top-right panel corresponds to √ s = 10 TeV, the bottom-left panel corresponds to √ s = 14 TeV, and the bottom-right panel corresponds to √ s = 30 TeV.
As an example, the normalized distributions of d for k = 1 are shown in Fig. 5. It can be seen that the d for the SM background at different energies are generally small, thanks to the well trained AE. Meanwhile, the d for the NP signals are generally larger, and the larger √s is, the larger d is. From the distributions it can be concluded that d provides a good discriminating ability to select the NP signals, as expected.

FIG. 5: The normalized distributions of d when k = 1 for the SM events and for the NP events in the test data sets.The top-left panel corresponds to √ s = 3 TeV, the top-right panel corresponds to √ s = 10 TeV, the bottom-left panel corresponds to √ s = 14 TeV, and the bottom-right panel corresponds to √ s = 30 TeV.

FIG. 6: The distributions of the test data set events in the latent space when k = 1.The top-left panel corresponds to √ s = 3 TeV, the top-right panel corresponds to √ s = 10 TeV, the bottom-left panel corresponds to √ s = 14 TeV, and the bottom-right panel corresponds to √ s = 30 TeV.

FIG. 7: Same as Fig. 6 but for k = 2.The panels in the first row are for the SM test data set, and those in the second row are for the O gT,0 test data sets.The first, second, third and the fourth columns correspond to √ s = 3 TeV, 10 TeV, 14 TeV and 30 TeV, respectively.

FIG. 8: ε as functions of d th compared with the fitted ε(d th ).The top-left panel corresponds to √ s = 3 TeV, the top-right panel corresponds to 10 TeV, the bottom-left panel corresponds to 14 TeV, and the bottom-right panel corresponds to 30 TeV.

It can be seen from Fig. 9 that the cases of k = 1 and k = 2 perform better than the traditional event selection strategy. At √s = 3, 10 and 14 TeV, k = 1 works best, and at √s = 30 TeV, k = 2 works best. In the following, we choose the d_th which minimizes the expected constraints according to Fig. 9. The chosen d_th are listed in Table IV.

TABLE I: The constraints on the operator coefficients at 95% CL obtained by the process gg → γγ at the LHC with √s = 13 TeV and L = 36.7 fb−1.

TABLE II: The cross-sections of the SM contribution and the NP contributions [92], the operator coefficients f_i used in the simulation, the cross-sections σ̄ after the N_j cut, and the partial wave unitarity bounds f^U_i on the operator coefficients.

In summary, only the SM data set is used in the training phase. As an example, the process µ+µ− → νν̄jj at muon colliders is considered, which is sensitive to the dimension-8 operators contributing to gQGCs. The event selection strategy based on AEAD is studied, and the expected constraints on the operator coefficients are calculated. It can be shown that the constraints are generally tighter than those obtained by using a traditional event selection strategy. Therefore, it can be concluded that the AEAD is effective in the phenomenological study of the SMEFT. It is expected that the AE algorithm accelerated by quantum computers can be even more efficient in the future.
One operator is considered at a time [92]. There are possible contributions from other high dimensional operators. Note that if the jets were from quarks, there would be possible interference between the SM and NP, and the interference of a dimension-12 operator is at the same order as the gQGCs, assuming the Wilson coefficients are at the same order.

FIG. 9: The expected constraints obtained by the traditional event selection strategy [92] and by the AEAD event selection strategy. The top-left panel corresponds to √s = 3 TeV, the top-right panel to 10 TeV, the bottom-left panel to 14 TeV, and the bottom-right panel to 30 TeV.

TABLE III: The means v̄_i and standard deviations ϵ_i of the components of the vectors ⃗v over the training data sets.

TABLE IV: The chosen d_th; the columns correspond to √s = 3 TeV (L = 1 ab−1), 10 TeV, 14 TeV and 30 TeV (L = 10 ab−1).

TABLE VI: Same as Table V but for the "optimistic" case.

TABLE VII: The traditional event selection strategies at different energies [92]. p̸_T is the missing transverse momentum and m_jj is the invariant mass of the two hardest jets. It is also required that N_j ≥ 2 to calculate m_jj.