Anomaly detection in high-energy physics using a quantum autoencoder

The lack of evidence for new interactions and particles at the Large Hadron Collider has motivated the high-energy physics community to explore model-agnostic data-analysis approaches to search for new physics. Autoencoders are unsupervised machine learning models based on artificial neural networks, capable of learning background distributions. We study quantum autoencoders based on variational quantum circuits for the problem of anomaly detection at the LHC. For a QCD $t\bar{t}$ background and resonant heavy Higgs signals, we find that a simple quantum autoencoder outperforms classical autoencoders for the same inputs and trains very efficiently. Moreover, this performance is reproducible on present quantum devices. This shows that quantum autoencoders are good candidates for analysing high-energy physics data in future LHC runs.


I. INTRODUCTION
In the absence of a confirmed new physics signal and in the presence of a plethora of new physics scenarios that could hide in the copiously produced LHC collision events, unbiased event reconstruction and classification methods [1][2][3][4] have become a major research focus of the high-energy physics community. Unsupervised machine learning models [5][6][7][8], popularly used as anomaly-detection methods [9][10][11][12][13][14][15], are trained on Standard Model processes and should indicate if a collision event is irreconcilable with the kinematic features of events predicted by the Standard Model.
One of the most popular neural-network-based approaches is the autoencoder [16]. Autoencoders consist of an encoder step that compresses the input features into a latent representation of reduced dimensionality. Subsequently, the latent representation is decoded into an output of the same dimensionality as the input feature space. The entire network is then trained to minimise the reconstruction error. The latent space acts as an information bottleneck, and its dimension is a hyperparameter of the network. The assumption is that the minimal dimension of the latent space for which the input features can still be reconstructed corresponds to the intrinsic dimension of the input data, here Standard Model induced background processes. However, the trained autoencoder would poorly reconstruct any unknown new-physics process with a higher intrinsic dimension. If the signal is kinematically sufficiently different from the background samples, the loss or reconstruction error will be larger for signal than for background events. Such autoencoders can be augmented with convolutional neural networks [17,18], graph neural networks [19,20] or recurrent neural networks [21,22] at the input stage, making them a very flexible anomaly detection method for a vast number of use cases.
With the advent of widely available noisy intermediate-scale quantum (NISQ) computers [23], interest in quantum algorithms applied to high-energy physics problems has surged. Today's quantum computers have a respectable quantum volume and can perform highly non-trivial computations. This technical development has resulted in a community-wide effort [24,25] exploring the applications of quantum computers for studying quantum physics in general, and in particular for challenges in the theoretical description of particle physics. Some recent studies in the direction of LHC physics include evaluating Feynman loop integrals [26], simulating parton showers [27] and structure [28], a quantum algorithm for evaluating helicity amplitudes [29], and simulating quantum field theories [30][31][32][33][34][35]. An interesting application of quantum computers is the nascent field of quantum machine learning, which leverages the power of quantum devices for machine learning tasks. With the capability of classical machine learning algorithms for various applications at the LHC already recognised, it is only natural to explore whether quantum machine learning (QML) can improve on the classical algorithms [36][37][38][39][40][41][42][43].
This work explores the feasibility and potential advantages of using quantum autoencoders (QAE) for anomaly detection. Most quantum algorithms consist of a quantum state, encoded through qubits, which evolves through the application of a unitary operator. The necessary compression and expansion of data in the encoding and decoding steps are manifestly non-unitary; the QAE addresses this using entanglement operations and reference states that carry no information from the encoder to the decoder. With this construction, a QAE should, in principle, be able to perform tasks ordinarily accomplished by a classical autoencoder (CAE) based on deep neural networks (DNN). The performance of DNNs is known to scale with data [44], and large datasets are necessary to bring out their better performance over other machine-learning algorithms. Interestingly, we find that a quantum autoencoder, augmented with quantum gradient descent [39,45] for its training, is much less dependent on the number of training samples and reaches optimal reconstruction performance with minuscule training datasets. Since the use of quantum gradient descent is a relatively new way of improving the convergence speed and reliability of quantum network training, we provide a detailed introduction in Appendix A. Moreover, compared to CAEs that use the same input variables, QAEs have better anomaly detection capabilities for the two benchmark processes in our study. This better performance is particularly interesting as the CAE has $\mathcal{O}(1000)$ parameters compared to just $\mathcal{O}(10)$ for the QAE. The study indicates the possibility of studying quantum latent representations of high-energy collisions, in analogy to classical autoencoders [19,[46][47][48]]. Our results indicate that quantum autoencoders could be advantageous for anomaly detection tasks in the NISQ era.
The rest of the paper is organised as follows. In section II, we present an introduction to classical autoencoders based on deep neural networks. We then describe the basic ideas of quantum machine learning and a quantum autoencoder in section III. The details of the data simulation, network architecture, and training are described in section IV. We present the performance of a quantum autoencoder compared to a classical autoencoder in section V. We conclude in section VI.

FIG. 1: Schematic representation of a simple dense classical autoencoder (left) and a quantum autoencoder (right) for a four-dimensional input space and a two-dimensional latent space. To induce an information bottleneck in quantum unitary evolutions, we throw away states $|\beta'_i\rangle$ (trash states) at the encoder output (green lines), which are replaced by reference states $|\beta_i\rangle$ (shown as orange lines), containing no information of the input $|x_j\rangle$. The mechanism can be better understood by dividing the Hilbert space of the complete system into three parts: $\mathcal{H}_A$, the subspace formed by the qubits that are fed to the decoder; $\mathcal{H}_B$, the subspace of the qubits that are discarded after encoding; and $\mathcal{H}_{B'}$, the subspace where a fixed reference state (initialised as $|0\rangle^{\otimes \dim \mathcal{H}_{B'}}$), unacted on by the encoder, is fed to the decoder. SWAP gates can achieve the exchange of states denoted by black lines.
II. CLASSICAL AUTOENCODERS

Autoencoders are neural networks utilised in various applications of unsupervised learning. They learn to map input vectors $x$ to a compressed latent vector $z$ via an encoder. This latent vector feeds into a decoder that reconstructs the inputs. Denoting the encoder and decoder networks as $E(\Theta_E, x)$ and $D(\Theta_D, z)$, with $\Theta_E$ and $\Theta_D$ denoting the learnable parameters of the respective networks, we have

$$ z = E(\Theta_E, x)\ , \qquad \hat{x} = D(\Theta_D, z)\ , $$

where $\hat{x}$ denotes the reconstructed output vector. The whole network is trained via gradient descent to reduce a faithful distance $L$ between the reconstructed output $\hat{x}$ and the input vector $x$. For instance, $L$ can be the root-mean-square error (RMSE),

$$ L = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{x}_i - x_i \right)^2}\ , $$

where $\hat{x}_i$ and $x_i$ are the $i$-th components of the reconstructed and input vectors respectively, and $n$ is their dimension. A faithful encoding should have an optimal latent dimension $k < n$, with $k$ being the intrinsic dimension of the data set. This dimensionality reduction is crucial in many applications of autoencoders, which would otherwise learn trivial mappings to reconstruct the output vectors $\hat{x}$. Unsupervised learning deals with learning probability distributions, and properly trained autoencoders are excellent for many applications. A dense CAE for a four-feature input and a two-dimensional latent space is shown in figure 1, with the encoder and the decoder enclosed in red and blue boxes, respectively. One popular usage of autoencoders in collider physics is anomaly detection. In various scenarios at the LHC, the background processes' contributions are orders of magnitude larger than most viable signals. However, a plethora of possible signal scenarios could be realised in nature, making it unlikely that the signal-specific reconstruction techniques of supervised learning methods comprehensively cover all possible scenarios.
This motivates unsupervised anomaly detection techniques, wherein a statistical model learns the probability distribution of the background in order to classify any data not belonging to it as anomalous (signal) data. Using an autoencoder as an anomaly detector, we train it to reconstruct the background data faithfully. Many signals have a higher intrinsic dimension than the background data due to their increased complexity. Hence, they incur higher reconstruction losses. Thus, the loss function can be used as a discriminant to look for anomalous events.
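The intrinsic-dimension argument can be made concrete with a toy example (not from the paper): a *linear* autoencoder with a $k$-dimensional bottleneck is equivalent to projecting onto the top-$k$ principal components of the training data, so a "background" drawn from a two-dimensional manifold embedded in four dimensions is reconstructed almost perfectly, while a full-rank "signal" incurs a large loss. All data and dimensions here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Background lives on a 2-dimensional manifold embedded in 4 dimensions;
# the "signal" occupies all four dimensions (higher intrinsic dimension).
W = rng.normal(size=(2, 4))
background = rng.normal(size=(5000, 2)) @ W
signal = 2.0 * rng.normal(size=(5000, 4))

def fit_linear_autoencoder(x, k):
    """A linear autoencoder's optimum: the top-k principal directions."""
    _, _, vt = np.linalg.svd(x - x.mean(axis=0), full_matrices=False)
    return vt[:k]          # encoder matrix; the decoder is its transpose

def reconstruction_rmse(x, enc):
    """Per-event RMSE between x and its bottlenecked reconstruction."""
    x_hat = (x @ enc.T) @ enc
    return np.sqrt(np.mean((x - x_hat) ** 2, axis=1))

enc = fit_linear_autoencoder(background, k=2)
bkg_loss = reconstruction_rmse(background, enc)
sig_loss = reconstruction_rmse(signal, enc)
print(bkg_loss.mean(), sig_loss.mean())  # signal loss is much larger
```

Thresholding on this per-event loss is exactly the anomaly-detection strategy described above; deep autoencoders generalise the projection to nonlinear manifolds.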

III. QUANTUM AUTOENCODERS
Quantum machine learning broadly deals with extending classical machine learning problems to the quantum domain with variational quantum circuits [50]. We can divide these circuits into three blocks: a state preparation that encodes classical inputs into quantum states, a unitary evolution circuit that evolves the input states, and a measurement and post-processing part that measures the evolved state and processes the obtained observables further. For this discussion, we will always work in the computational basis, with the basis vectors $\{|0\rangle, |1\rangle\}$ denoting the eigenstates of the Pauli-$Z$ operator $\hat\sigma_z$ for each qubit.
There are many examples of state preparation in the literature [51], each with its own merits in various applications. We prepare the states using angle encoding, which encodes real-valued observables $\phi_j$ as rotation angles about the x-axis of the Bloch sphere,

$$ |\Phi\rangle = \bigotimes_{j=1}^{n} R_x(\phi_j)\, |0\rangle\ , \qquad R_x(\phi_j) = e^{-i \frac{\phi_j}{2} \hat\sigma_x}\ , $$

where $R_x$ denotes the rotation matrix. The number of qubits required, $n$, is the same as the dimension of the input vector. A parametrised unitary circuit $U(\Theta)$, with $\Theta$ denoting the set of parameters, evolves the prepared state $|\Phi\rangle$ to a final state $|\Psi\rangle = U(\Theta)\,|\Phi\rangle$. The final measurement step involves the measurement of an observable on the final state $|\Psi\rangle$. Since measurements in quantum mechanics are inherently probabilistic, we measure multiple times (so-called shots) to get an accurate result. In order to do that, we need quantum hardware that can prepare a large number of pure identical input states $|\Phi\rangle$ for each data point. After defining a cost function, the parameters $\Theta$ can be trained and updated using an optimisation method. To better capture the geometry of the underlying Hilbert space and to achieve faster training of the quantum network, we will use quantum gradient descent [45], where the direction of steepest descent is evaluated according to the Fubini-Study metric [52,53]. The general idea is to make the optimisation procedure aware of the weight space's underlying quantum geometry, which improves the speed and reliability of finding the global minimum of the loss function. A brief outline of quantum gradient descent is given in Appendix A.
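The angle encoding above can be checked with an explicit state-vector construction. This is an illustrative numpy sketch, not the PennyLane embedding used later in the paper; it builds the product state $\bigotimes_j R_x(\phi_j)|0\rangle$ directly:

```python
import numpy as np

def rx(phi):
    """R_x(phi) = exp(-i * phi/2 * sigma_x): rotation about the Bloch x-axis."""
    c, s = np.cos(phi / 2), np.sin(phi / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

def angle_encode(features):
    """Product state  (x)_j R_x(phi_j)|0>  for an n-dimensional input vector."""
    state = np.array([1.0 + 0j])
    for phi in features:
        state = np.kron(state, rx(phi) @ np.array([1.0, 0.0], dtype=complex))
    return state

psi = angle_encode([0.3, 1.2, 2.5, 0.7])
print(np.abs(psi) ** 2)  # probabilities over the 16 computational basis states
```

One qubit per feature, as stated above: a four-dimensional input yields a 16-dimensional (four-qubit) state vector of unit norm.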
While we have not yet discussed the specific form of the parametrised unitary operation $U(\Theta)$, it is important to note that one of the major advantages of quantum computation stems from its ability to produce entangled states, a phenomenon absent in devices based on classical bits. The prepared input state is separable into its component qubits, and a product of unitaries acting on single-qubit states will not entangle the subsystems. The CNOT gate is a standard two-qubit gate, which we use in our circuit to entangle the subsystems.

A. Quantum autoencoders on variational circuits
Quantum autoencoders based on variational circuit models have been proposed for quantum data compression [54]. In our work, we want to learn the parameters of such a network to compress the background data efficiently. Along the same principles as anomaly detection with classical autoencoders, we expect that the compression and subsequent reconstruction will work poorly on data with characteristics different from the background.
A quantum autoencoder, in analogy to the classical autoencoder, has an encoder circuit which evolves the input state $|\Phi\rangle$ to a latent state $|\chi\rangle$ via a unitary transformation $U(\Theta)$, and then reconstructs the input state via its hermitian conjugate, $|\Phi\rangle = U^\dagger(\Theta)\,|\chi\rangle$. However, note that since unitary transformations are probability conserving and act on spaces of identical dimensions, there is no data compression in such a setup. In order to have data compression, some qubits of the initial encoding $|\chi\rangle$ are discarded and replaced by freshly prepared reference states. Such a setup for a four-feature input and a two-dimensional latent space is shown in figure 1. The unitary operators output an identical number of qubits; however, at the encoder step, two of its outputs (shown by green lines) are replaced by freshly prepared reference states (shown by orange lines), devoid of any information of the input states. We describe the basics of quantum autoencoding in the following, mainly based on the discussion of quantum autoencoders for data compression in ref. [54]. Quantum anomaly detection of simulated quantum states has been investigated in ref. [55]. To the best of our knowledge, our study is the first to explore anomaly detection of classical inputs via a quantum autoencoder. The main difference between existing studies and ours is that the input states of the former are inherently quantum mechanical. In contrast, the choice of input embedding of the classical numbers in our case determines the nature of the quantum state. We use angle encoding, where the quantum states are separable into the constituent qubits. We will, however, be extensively using CNOT gates in the unitary evolution, which entangle the different qubits.
Let us denote the Hilbert space containing the input states by $\mathcal{H}$. For describing a quantum autoencoder, it is convenient to expand $\mathcal{H}$ as the product of three subspaces,

$$ \mathcal{H} = \mathcal{H}_A \otimes \mathcal{H}_B \otimes \mathcal{H}_{B'}\ , $$

with subspace $\mathcal{H}_A$ denoting the space of qubits fed into the decoder from the encoder, $\mathcal{H}_B$ denoting the space corresponding to the qubits that are re-initialised, and $\mathcal{H}_{B'}$ denoting the Hilbert space containing the reference state.
In the following, we will denote states belonging to any subspace with suffixes, while the full space will carry no suffix. For example, $|a\rangle_{AB} \in \mathcal{H}_A \otimes \mathcal{H}_B$, $|\kappa\rangle \in \mathcal{H}$, $|b\rangle_{B'} \in \mathcal{H}_{B'}$, etc. We will use the same convention for operators acting on the various subspaces.
Since we entangle the separable input qubits in the subspaces $\mathcal{H}_A \otimes \mathcal{H}_B$ via $U_{AB}(\Theta)$, the latent state $|\chi\rangle_{AB} \in \mathcal{H}_A \otimes \mathcal{H}_B$ is, in general, not separable. The input of the larger composite system including the reference state is $|\Phi\rangle_{AB} \otimes |\beta\rangle_{B'}$, with $|\beta\rangle_{B'}$ denoting a freshly prepared reference state (initialised as $|0\rangle^{\otimes \dim \mathcal{H}_{B'}}$) not acted on by the unitary $U_{AB}$. The process of encoding can therefore be written as

$$ |\chi\rangle_{AB} \otimes |\beta\rangle_{B'} = \left( U_{AB}(\Theta) \otimes I_{B'} \right) \left( |\Phi\rangle_{AB} \otimes |\beta\rangle_{B'} \right)\ , $$

where $I_{B'}$ denotes the identity operator on $\mathcal{H}_{B'}$. Explicitly, the dimensions of the subspaces $\mathcal{H}_A$, $\mathcal{H}_B$, and $\mathcal{H}_{B'}$ are $2^{N_{\rm lat}}$, $2^{N_{\rm trash}}$, and $2^{N_{\rm trash}}$, respectively, where $N_{\rm lat}$ is the number of qubits passed to the decoder directly from the encoder, while $N_{\rm trash}$ is the number of qubits that are discarded. Swapping the $B$ and $B'$ subsystems gives the input to the decoder as

$$ |\chi'\rangle = \left( I_A \otimes V_{BB'} \right) \left( |\chi\rangle_{AB} \otimes |\beta\rangle_{B'} \right)\ , $$

where $V_{BB'}$ indicates a unitary that performs the swap operation (for instance, swapping the states of two qubits in the basis $\{|00\rangle, |01\rangle, |10\rangle, |11\rangle\}$ can be implemented via the unitary matrix

$$ V = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \Big)\ , $$

and $I_A$ is the identity operator on $\mathcal{H}_A$. The output of the decoder can now be written as

$$ |\Phi_{\rm out}\rangle = \left( U^\dagger_{AB}(\Theta) \otimes I_{B'} \right) |\chi'\rangle\ , $$

with $I_{B'}$ being the identity operator on $\mathcal{H}_{B'}$. The decoding therefore takes the swapped latent state $|\chi'\rangle$, and the unitary $U^\dagger_{AB}$ evolves it with no information from the encoder in the subspace $\mathcal{H}_B$. The reconstruction efficiency of the autoencoder can be quantified in terms of the fidelity between the input and output states in the subspace $\mathcal{H}_A \otimes \mathcal{H}_B$, which quantifies their similarity. For two quantum states $|\psi\rangle$ and $|\phi\rangle$, it is defined as

$$ F(|\psi\rangle, |\phi\rangle) = |\langle \psi | \phi \rangle|^2\ . $$

For normalised states, we have $0 \leq F \leq 1$, with $F = 1$ only when $|\phi\rangle$ and $|\psi\rangle$ are exactly identical. We can write the fidelity of the complete system as

$$ F\left( |\Phi\rangle_{AB} \otimes |\beta\rangle_{B'}\ ,\ U^\dagger_{AB}\, V_{BB'}\, U_{AB}\, |\Phi\rangle_{AB} \otimes |\beta\rangle_{B'} \right)\ , $$

where we have implicitly assumed, for notational compactness, that the unitary operators are extended to the whole space via a direct product with the identity operator on the subspace they do not act on.
Noting that $U_{AB} |\Phi\rangle_{AB} = |\chi\rangle_{AB}$, we can write this as

$$ F\left( |\Phi\rangle_{AB} \otimes |\beta\rangle_{B'}\ ,\ U^\dagger_{AB}\, V_{BB'} \left( |\chi\rangle_{AB} \otimes |\beta\rangle_{B'} \right) \right)\ . $$

Writing the swapped state as

$$ |\chi'\rangle = V_{BB'} \left( |\chi\rangle_{AB} \otimes |\beta\rangle_{B'} \right)\ , $$

and since we are interested in the wave functions belonging to the subspace $\mathcal{H}_A \otimes \mathcal{H}_B$, we trace over $B'$ to get the required fidelity. However, a perfect fidelity between the input and output of the $AB$ system can be achieved when the complete information of the input state passes to the decoder, i.e.

$$ |\chi\rangle_{AB} = |\Phi^c\rangle_A \otimes |\beta'\rangle_B\ . $$
The state $|\Phi^c\rangle_A$ denotes a compressed form of $|\Phi\rangle_{AB}$, i.e. it should contain the information of the $AB$ system in the input, while $|\beta'\rangle_B$ is equivalent to the reference state, with no information of the input. If the $B$ and $B'$ systems are identical during the swap operation, the entire circuit reduces to the identity map. The output of the $B$ system, hereafter referred to as the trash state, is itself the determining factor of the output-state fidelity. The output of the $B$ system can be obtained after tracing over the $A$ system as $\hat\rho_B = \mathrm{Tr}_A \left\{ |\chi\rangle \langle\chi|_{AB} \right\}$, and the required fidelity of the $B$ system is $F(|\beta\rangle_B, \hat\rho_B)$.
A perfect reconstruction of the input is possible only when the trash-state fidelity $F(|\beta\rangle_B, \hat\rho_B) = 1$. Thus, a quantum autoencoder can be trained by maximising the trash-state fidelity instead of the output fidelity, which has the advantage of reducing the resource requirements during training. Although the output fidelity obtained by tracing over the $B'$ system is numerically not equal to the trash-state fidelity, we can use the latter in anomaly detection as well, since it is a faithful measure of the output fidelity. Thus, unlike vanilla classical autoencoders, we can reduce the execution and training of QAEs to the encoder circuit for anomaly detection.
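The trash-state fidelity $F(|\beta\rangle_B, \hat\rho_B) = \langle\beta|\hat\rho_B|\beta\rangle$ can be computed numerically with a partial trace. The sketch below is illustrative and assumes a qubit ordering in which the $N_{\rm lat}$ latent qubits are the leading qubits of the state vector and the trash qubits trail:

```python
import numpy as np

def trace_out_first(rho, dim_a, dim_b):
    """Partial trace over the first subsystem of a density matrix on H_A (x) H_B."""
    rho = rho.reshape(dim_a, dim_b, dim_a, dim_b)
    return np.einsum('ajak->jk', rho)  # sum over the repeated A index

def trash_fidelity(state_ab, n_lat, n_trash):
    """F(|0...0>, rho_B): overlap of the reduced trash state with the reference.

    Assumed convention: latent qubits leading, trash qubits trailing.
    """
    dim_a, dim_b = 2 ** n_lat, 2 ** n_trash
    rho = np.outer(state_ab, state_ab.conj())
    rho_b = trace_out_first(rho, dim_a, dim_b)
    return rho_b[0, 0].real  # <0...0| rho_B |0...0>
```

For a maximally entangled state such as a Bell pair, the reduced trash state is maximally mixed and the fidelity is 0.5; training drives it towards 1 on background events.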
The above discussion has focused on the underlying principles of the quantum autoencoding process for single input states. As stated before, we need to prepare identical input states for each data point and repeat the unitary evolution and measurement to get a useful estimate of the fidelity, evident also from the use of density operators to express the output state. Referring to the ensemble of input states as $\{p_i, |\Phi_i\rangle_{AB}\}$, we obtain for the cost function

$$ C(\Theta) = - \sum_i p_i\, F\left( |\beta\rangle_B\,,\ \hat\rho^{\,i}_B \right)\ , $$

where the negative sign converts the optimisation process into minimising the cost function. It is important to note that the ensemble should not be taken as analogous to batch training in classical neural networks, as it is required for the accurate prediction of the network output even when testing the autoencoder network. The fidelity between the two qubits at the encoder output and the reference states is measured via a SWAP test.

A. Data simulation
To show the prowess of the quantum autoencoders, we study two processes with distinctive features: a QCD continuum background of top-pair production, with possible signal signatures of a resonant heavy Higgs decaying to a pair of top quarks, and invisible $Z$ decays into neutrinos, with a likely signal of the 125 GeV Higgs decaying to two dark matter particles. As we shall see in the following sections, the relative performance of QAEs over CAEs shows parallels in these two different signatures, pointing towards an advantage of QAEs over CAEs not governed by the specific details of the final state.

Resonant Higgs signal over continuum tt background
The first background and signal samples used in our analysis consist of QCD $t\bar{t}$ continuum production, $pp \to t\bar{t}$, and scalar resonance production, $pp \to H \to t\bar{t}$, respectively. The background and signal events are generated at a centre-of-mass energy of 14 TeV, as expected during future LHC runs. Each top decays to a bottom quark and a $W$ boson, and we focus exclusively on the decay of the $W$'s into muons. We consider four different masses of the scalar resonance, $m_H = 1.0$, 1.5, 2.0, and 2.5 TeV. All events are generated with MadGraph5_aMC@NLO [56], while showering and hadronisation are performed by Pythia8 [57]. Delphes3 [58] is utilised for the detector simulation, where the jets are clustered using FastJet [59]. We generate about 30k events for the background sample, while for each signal sample we generate about 15k. The background events are divided into 10k training, 5k validation and 15k testing samples.
For the object reconstruction, a standard jet definition using the anti-$k_t$ algorithm [60] with jet radius $R = 0.5$ is used. For the bottom jets, the output from Delphes3 is used, and we require $p^b_T > 30$ GeV. For isolated leptons, we require $p^l_T > 30$ GeV and an isolation criterion with $R = 0.5$. We extract four variables, $\{p^{b_1}_T, p^{l_1}_T, p^{l_2}_T, \slashed{E}_T\}$, for our analysis, keeping in mind the limitations of current devices. To conserve the aperiodic topology of these variables in the angle embedding (given in eq. 6), we fix the range of each variable to [0, 1000] by adding two points, and map the whole dataset to the range $[0, \pi]$ via the MinMaxScaler implemented in scikit-learn [61]. The two added points are then removed from the dataset. This maps each feature's minimum and maximum to two distinct angles separated by a finite distance due to the selection criteria.
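The two auxiliary points serve only to pin the scaler's range to the fixed interval before being discarded. An equivalent direct mapping (an illustrative sketch, using the [0, 1000] range quoted above) is:

```python
import numpy as np

def scale_to_angles(x, lo=0.0, hi=1000.0):
    """Map features with a fixed physical range [lo, hi] to angles in [0, pi].

    Equivalent to fitting MinMaxScaler(feature_range=(0, pi)) on data that has
    been padded with the boundary points lo and hi.
    """
    x = np.clip(np.asarray(x, dtype=float), lo, hi)
    return np.pi * (x - lo) / (hi - lo)

angles = scale_to_angles([0.0, 250.0, 500.0, 1000.0])
```

Fixing the range explicitly (rather than using the per-sample minimum and maximum) keeps the embedding consistent between training and test sets and preserves the finite separation between each feature's minimum and maximum angles.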

Invisible Higgs signal over invisible Z background
To test the anomaly detection capabilities of QAEs in a different scenario, we study invisible decays of a $Z$ boson produced with two jets originating from QCD vertices. As a possible signal, we take the production of the 125 GeV Higgs boson with two jets originating from electroweak vertices, with the Higgs decaying to two scalar dark matter particles. The generation is carried out in the same manner as in the previous case, including the definition of jets. We demand at least two reconstructed jets with $p_T > 30$ GeV, and that the events have a missing transverse momentum $\slashed{E}_T > 30$ GeV. For the background, we have 30k events divided into 10k training, 5k validation, and 15k test events, while for the signal we have 15k test events. We extract six variables to train the QAE and the CAE: the absolute separation in pseudorapidity between the two jets, $|\Delta\eta_{jj}|$, the invariant mass of the dijet system, $m_{jj}$, and the sums of transverse energies within four ranges of pseudorapidity, $\eta_C \in \{1.0, 1.5, 2.0, 2.5\}$. The mapping that conserves the aperiodic topology of these variables in the angular embedding is done by increasing their range on the higher side.

B. Network architecture and training
The QAE was implemented and trained using PennyLane [62]. As stated before, we train and test the QAE model with only the encoder circuit. After the input features are embedded as rotation angles about the x-axis of the Bloch sphere, the unitary evolution $U(\Theta)$ consists of two stages. In the first stage, each qubit is rotated by an angle $\theta_i$ about the y-axis of the Bloch sphere. The values of these angles are optimised via gradient descent. After this, we apply the CNOT gate to all possible pairs of qubits, with the ordering determined by the explicit number of the qubit. This circuit is shown in figure 2 for a four-qubit-input QAE with a two-qubit latent representation. It is given by

$$ U(\Theta) = \prod_{i < j} C_{ij}\ \prod_i R^i_y(\theta_i)\ , $$

where $C_{ij}$ is the CNOT operation acting on the composite space of the two qubits $i$ and $j$, and $R^i_y(\theta_i)$ is the rotation of a single qubit $i$ about the y-axis of the Bloch sphere. Note that the expression does not contain the operations of the SWAP test, which will be explained in the following paragraphs. The training proceeds to find the optimal values of the $\theta_i$.
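The circuit structure can be sketched as a dense state-vector construction in numpy (illustrative only, not the paper's PennyLane implementation; the CNOT pair set is assumed to be all ordered pairs $(i, j)$ with $i < j$, as the fixed qubit ordering suggests):

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)

def ry(theta):
    """Rotation about the Bloch y-axis (real-valued matrix)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def op_on(gate, qubit, n):
    """Embed a single-qubit gate acting on `qubit` into an n-qubit operator."""
    return reduce(np.kron, [gate if k == qubit else I2 for k in range(n)])

def cnot(control, target, n):
    """CNOT between two qubits of an n-qubit register (qubit 0 most significant)."""
    dim = 2 ** n
    u = np.zeros((dim, dim))
    for basis in range(dim):
        bits = [(basis >> (n - 1 - q)) & 1 for q in range(n)]
        if bits[control]:
            bits[target] ^= 1
        u[sum(b << (n - 1 - q) for q, b in enumerate(bits)), basis] = 1.0
    return u

def encoder_unitary(thetas):
    """U(Theta): one RY rotation per qubit, then CNOTs on all ordered pairs."""
    n = len(thetas)
    u = reduce(np.matmul, [op_on(ry(t), q, n) for q, t in enumerate(thetas)])
    for i in range(n):
        for j in range(i + 1, n):
            u = cnot(i, j, n) @ u
    return u
```

With one trainable angle per qubit, a four-input QAE indeed has only $\mathcal{O}(10)$ parameters, in line with the parameter counts quoted in the introduction.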
The number of qubits discarded at the encoder, i.e. the size of the trash state, fixes the latent dimension via $N_{\rm lat} = N_{\rm in} - N_{\rm trash}$, with $N_{\rm lat}$ the latent dimension, $N_{\rm in}$ the size of the input state, and $N_{\rm trash}$ the number of discarded qubits. The reference state $|\beta\rangle_{B'}$ has the same number of qubits, $N_{\rm trash}$, and is initialised to

$$ |\beta\rangle_{B'} = |0\rangle^{\otimes N_{\rm trash}}\ . $$

We measure the fidelity between the trash state $\hat\rho_B$ and the reference state $|\beta\rangle_{B'}$ via a SWAP test [63], which is a way to measure the fidelity between two multi-qubit states. For any two states $|\phi\rangle$ and $|\psi\rangle$ of the same dimension, the fidelity $F(|\phi\rangle, |\psi\rangle)$ can be measured from the output of an ancillary qubit $|a\rangle_{\rm anc}$ after the operation

$$ \left( H_{\rm anc} \otimes I \right)\, \text{c-SWAP}\, \left( H_{\rm anc} \otimes I \right)\, |0\rangle_{\rm anc} \otimes |\phi\rangle \otimes |\psi\rangle\ , $$

where $H_{\rm anc}$ is the Hadamard gate acting on the ancillary qubit, and c-SWAP is the controlled swap operation between the states $|\phi\rangle$ and $|\psi\rangle$, controlled by the ancillary qubit. Thus, the total number of qubits required for fixed $N_{\rm in}$ and $N_{\rm trash}$ is $N_{\rm in} + N_{\rm trash} + 1$. Due to the limitations of current quantum devices, we limit the input features to four and scan over the possible latent dimensions.
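After the operation above, the ancilla is found in $|0\rangle$ with probability $P(0) = (1 + F)/2$, so $F = 2P(0) - 1$. Rather than simulating the full ancilla circuit, the sketch below (illustrative) uses these statistics directly, optionally with a finite number of shots as in the paper's training setup:

```python
import numpy as np

def swap_test_fidelity(phi, psi, shots=None, rng=None):
    """Estimate F = |<phi|psi>|^2 from SWAP-test ancilla statistics.

    The ancilla is measured in |0> with probability (1 + F) / 2, hence
    F = 2 * P(0) - 1. With `shots`, P(0) is estimated from finite sampling.
    """
    p0 = 0.5 * (1.0 + np.abs(np.vdot(phi, psi)) ** 2)
    if shots is None:
        return 2.0 * p0 - 1.0        # exact (infinite-shot) limit
    rng = rng or np.random.default_rng()
    return 2.0 * rng.binomial(shots, p0) / shots - 1.0
```

The finite-shot estimate fluctuates around the true fidelity; this measurement uncertainty is the same effect discussed later as a possible aid to convergence on small training samples.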
The quantum network is trained by minimising the cost function (cf. eq. 11) with quantum gradient descent for the one-, two- and three-dimensional latent spaces. We train these instances with different training sizes of 1, 10, 100, 1000, and 10000 events to study the dependence of the QAE's performance on the size of the training data. We update the weights for each data sample, with 5000 shots in all training scenarios. For training sizes greater than or equal to 100, we train the networks for 50 epochs, while for sample sizes 1 and 10, we train the QAE for 500 and 200 epochs, respectively. To benchmark the performance of a QAE on a quantum computer, we train a QAE with the two inputs $p^{l_1}_T$ and $p^{b_1}_T$ with quantum gradient descent on PennyLane, and compare the test performance between the simulation and the IBM-Q. For running on the IBM-Q, we build and implement the test circuits in Qiskit [64].
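The paper's optimiser is quantum gradient descent (Appendix A), which rescales the raw gradient by the inverse Fubini-Study metric. On hardware, the raw gradient entering any such update is commonly estimated with the parameter-shift rule, which is exact for rotation gates such as the RY angles used here. A sketch of that gradient step (illustrative, not the paper's code):

```python
import numpy as np

def parameter_shift_grad(cost, thetas, shift=np.pi / 2):
    """Gradient of a rotation-gate cost function via the parameter-shift rule.

    For costs built from Pauli-rotation gates, the rule
    dC/dtheta_k = (C(theta_k + pi/2) - C(theta_k - pi/2)) / 2  is exact,
    requiring two circuit evaluations per parameter.
    """
    grad = np.zeros_like(thetas, dtype=float)
    for k in range(len(thetas)):
        plus, minus = thetas.copy(), thetas.copy()
        plus[k] += shift
        minus[k] -= shift
        grad[k] = 0.5 * (cost(plus) - cost(minus))
    return grad
```

Quantum natural-gradient descent then updates $\Theta \leftarrow \Theta - \eta\, g^{-1}\nabla C$, with $g$ the Fubini-Study metric tensor, instead of using the bare gradient.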
We also train classical autoencoders using Keras-v2.4.0 [65] with Tensorflow-v2.4.1 [66] for the same input features, for comparison. The encoder is a dense network mapping the input space to a latent dimension of $N_{\rm lat} \in \{1, 2, 3\}$, and has three hidden layers with 20, 15, and 10 nodes. The hidden layers have ReLU activations, while the latent output has a linear activation. The decoder has a configuration symmetric to the encoder. The networks are trained with the Adam [67] optimiser with a learning rate of $10^{-3}$ to minimise the root-mean-square error between the input vector $x$ and the reconstructed vector $\hat{x}$. For the CAEs, we found that training with a single data point per update (technically, batch size = 1) gives a volatile validation loss per epoch, with slow convergence. Therefore, we choose a batch size of 64 to train the CAEs. We train a QAE with an analogous architecture for the six-dimensional input of the second scenario, with a two-dimensional latent space, in a similar fashion for all training sizes. For the CAE, keeping the number of nodes and layers identical to the previous case for six-dimensional input and output vectors, we perform a hyperparameter scan, the details of which are given in Appendix C. All results shown in the next section for this scenario are for the best-performing hyperparameters.

V. RESULTS
Results of the various training scenarios are presented in this section. We present a detailed investigation of the QAE and CAE's properties for the tt background scenario in Sections V A to V D. The lessons learnt from these analyses, particularly the training size independence and the relative performance, are then tested for the invisible Z background in Section V E.

A. Dependence of test reconstruction efficiency on the number of training samples
The distribution of the loss function of the independent background test samples for different training sizes of the CAE is shown in figure 3. Although training with a single data point is inherently inaccurate, we perform such an exercise as a sanity check for the CAE's comparison to a QAE. The test distribution shifts towards the left as one increases the training size, signifying increased reconstruction efficiency. For training sizes of up to $10^2$, the limited statistics will produce a very high statistical uncertainty; since this is not the main emphasis of our present work, we do not comment on it further. Looking at the distribution across different latent dimensions for $10^3$ and $10^4$ training samples, one can see the impact of the information bottleneck. For a one-dimensional latent space, the information passed through the bottleneck is already learnt with $10^3$ samples, and hence the loss distribution is very close to the one trained on $10^4$. This relative separation increases as we go to higher latent dimensions, reflecting the larger amount of information passed to the decoder to reconstruct the input, which is exploited with higher training samples. For an analogous comparison with the quantum fidelity, we define the cosine similarity between the input vector $x$ and the reconstructed vector $\hat{x}$ as

$$ \mathrm{CS}(x, \hat{x}) = \frac{x \cdot \hat{x}}{|x|\, |\hat{x}|}\ , $$

where the dot product is taken with a Euclidean signature. The distribution of the cosine similarity, shown in figure 4, shows similar features to the loss function's distribution, with efficient reconstruction possible only when the training size is at least $10^3$. We have seen that CAEs cannot be trained with limited statistics to reconstruct the statistically independent test dataset. From the distribution of the test samples' fidelity in figure 5, we see that QAEs are much more effective in learning from small data samples. Although training with a single data point does not reach the optimal reconstruction efficiency, it is obtained with ten sample events. Unlike CAEs, see figs.
3 and 4, the test fidelity distributions for all latent spaces are identical for training sizes greater than or equal to ten. This independence of the sample size is particularly important in LHC searches where the background cross-section is small. This particularly interesting feature may be due to the interplay of an enhancement of statistics via the uncertainty of quantum measurements and the relatively simple circuits employed in our QAE. For a single input point, and assuming that we have hardware capable of building exact copies, a finite number of measurements always introduces a non-zero uncertainty in the network output. This uncertainty can act as additional information in the quantum gradient minimisation, which is performed after the measurement process, increasing the convergence for smaller data samples. Moreover, existing studies [68,69] show the advantage of quantum machine learning over classical approaches. Additionally, the use of quantum gradient descent [39] makes the loss landscape more convex, thereby speeding up convergence.
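The cosine similarity used for the comparison above is straightforward to implement; a minimal sketch (illustrative):

```python
import numpy as np

def cosine_similarity(x, x_hat):
    """CS(x, x_hat) = x . x_hat / (|x| |x_hat|), with a Euclidean dot product.

    Works on single vectors or on batches stacked along the leading axis.
    """
    num = np.sum(x * x_hat, axis=-1)
    den = np.linalg.norm(x, axis=-1) * np.linalg.norm(x_hat, axis=-1)
    return num / den
```

Like the quantum fidelity, it is bounded above by 1 for a perfect (direction-preserving) reconstruction, which is what makes it the natural classical analogue for this comparison.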

B. Classification Performance
We compare the QAE's and the CAE's performance for the four-dimensional input feature space. The metrics used in this section are similar to those used in a supervised framework. They also assume that a randomly chosen event is equally likely to be either background or signal. This assumption is not sound in the context of LHC searches or of an anomaly detection technique, since the background's cross-section is orders of magnitude larger than that of the signal. Nevertheless, these metrics are handy when comparing different classifiers.
For each value of $m_H$, we plot the Receiver Operating Characteristic (ROC) curve between the signal acceptance and the background rejection in figure 6 for the networks trained with 10k samples. The ROC curve is obtained by evaluating the signal acceptance as a function of the background rejection, since both are functions of the threshold $T_0$ applied to the loss function. The signal acceptance $\epsilon_S \in [0, 1]$ quantifies the fraction of accepted signal events for a threshold $T_0$ on the variable $x$, while the background rejection $\bar{\epsilon}_B \in [0, 1]$ measures the fraction of rejected background events for the same threshold. An outline of how the ROC curve is obtained is given in Appendix B. The black dotted lines denote the performance of a random classifier with no knowledge of either the signal or the background; lines further away from it indicate better performance. The performance degrades with increasing latent dimension for both CAEs and QAEs, with the highest background rejection obtained for a one-dimensional latent space. Comparing the QAEs and the CAEs (dotted vs solid lines for each colour), we find that QAEs consistently perform better than CAEs for all latent dimensions and all values of $m_H$. This better performance may be a universal property of QAEs. However, as our analysis is a proof of concept, an in-depth exploration of the properties of QAEs in general, and of anomaly detection at colliders in particular, is needed to affirm this observation.
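The threshold scan behind a ROC curve can be sketched as follows. The function name is hypothetical, and the convention that events below the threshold are accepted as signal-like (the signal-rich region lying at low values of $x$, as assumed in Appendix B) is an illustrative choice:

```python
def roc_points(signal_vals, background_vals, thresholds):
    """For each threshold T0 on a variable x, compute the signal acceptance
    eps_S (fraction of signal events with x < T0) and the background
    rejection eps_B_bar (fraction of background events with x >= T0)."""
    n_s, n_b = len(signal_vals), len(background_vals)
    points = []
    for t0 in thresholds:
        eps_s = sum(1 for x in signal_vals if x < t0) / n_s
        eps_b_bar = sum(1 for x in background_vals if x >= t0) / n_b
        points.append((eps_b_bar, eps_s))
    return points
```

Plotting the resulting $(\bar{\epsilon}_B, \epsilon_S)$ pairs while varying the threshold traces out the ROC curve without any explicit reference to $T_0$.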

C. Anomaly detection
We now explore the performance of the autoencoders in a semi-realistic search scenario. When we scale the normalisation of the signal and the background by their respective probabilities of occurrence, i.e. their respective cross-sections, we are essentially in an anomaly detection scenario, since the background is orders of magnitude larger than the signal. The performance of the autoencoders can then be quantified in terms of statistical significance as a function of the threshold applied to the loss. For the background, we scale the cross-section obtained from MadGraph by a global $k$-factor of 1.8 [70], while for all signal masses we fix the cross-section to a reference value of 10 fb. The yield for a process $p$ is then calculated as
$$ N_p = \epsilon_p\, \sigma_p\, E_p(T_0)\, \mathcal{L}\,, $$
where $\epsilon_p$ is the baseline selection efficiency, $\sigma_p$ the cross-section, and $E_p(T_0)$ the efficiency at a threshold $T_0$ on the loss distribution, while $\mathcal{L}$ is the integrated luminosity, which we take to be 3000 fb$^{-1}$.
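The yield computation can be sketched as below; the function name and argument layout are illustrative, with the cross-section in fb and the integrated luminosity defaulting to 3000 fb$^{-1}$ as in the text:

```python
def expected_yield(eps_sel, sigma_fb, eff_threshold, lumi_fb=3000.0):
    """Yield N_p = eps_p * sigma_p * E_p(T0) * L for a process p, with the
    cross-section sigma_p in fb and the integrated luminosity L in fb^-1."""
    return eps_sel * sigma_fb * eff_threshold * lumi_fb
```

For a signal with a 10% baseline selection efficiency, a 10 fb reference cross-section, and a 50% efficiency at the chosen threshold, this gives 1500 expected events at 3000 fb$^{-1}$.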
Since it is natural to use the best classifier in a search, we evaluate the significance of the autoencoders with one latent dimension, trained on 10k samples. We apply the threshold for the QAE and the CAE on the quantum trash state fidelity and the RMSE loss, respectively, using $(1 - \text{Fidelity})$ for the QAE so that the signal-rich regions are the same in both scenarios. The RMSE loss is chosen over the cosine similarity since the former was found to perform better. The significance $N_S/\sqrt{N_B}$ for each of the signal masses as a function of the threshold $T_0$ is shown in figure 7. We fix the threshold range so that there are enough background test statistics in the least background-like bin. Looking at the peak of the significance, we note that QAEs outperform CAEs, as expected from the preceding discussions. An interesting feature, however, is the relative performance for the different masses: even though the ROC curves indicate higher discrimination with increasing mass, the significance increases from $m_H = 1.0$ TeV to 1.5 TeV and then decreases for higher masses. Since we have fixed a fiducial cross-section for each signal mass, the cross-section plays no role in this irregularity. The trend arises from the interplay between the higher discrimination of the autoencoder output and the decrease of the baseline selection efficiency with increasing mass $m_H$. The decreasing selection efficiency is due to the isolation criteria for the jets and leptons, which become more collimated as they are boosted at higher resonant masses $m_H$.
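A scan of the significance $N_S/\sqrt{N_B}$ over thresholds might look as follows. This is a schematic sketch: the helper name is hypothetical, and it assumes the signal-rich region is selected by losses above the threshold, as with the RMSE loss or $(1 - \text{Fidelity})$:

```python
import math

def significance_scan(n_sig_base, n_bkg_base, sig_losses, bkg_losses, thresholds):
    """Return the threshold T0 maximising N_S/sqrt(N_B), where events with
    loss > T0 are selected and n_*_base are the yields before the cut."""
    best_t0, best_z = None, 0.0
    for t0 in thresholds:
        e_s = sum(1 for x in sig_losses if x > t0) / len(sig_losses)
        e_b = sum(1 for x in bkg_losses if x > t0) / len(bkg_losses)
        if e_b == 0.0:
            continue  # require some background statistics in the selected region
        z = (n_sig_base * e_s) / math.sqrt(n_bkg_base * e_b)
        if z > best_z:
            best_t0, best_z = t0, z
    return best_t0, best_z
```

The guard against empty background bins mirrors the requirement in the text that the least background-like bin retain enough test statistics.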

D. Benchmarking on a quantum device
We now compare the performance of the quantum simulator and actual quantum hardware. Since the number of available qubits is limited, we restrict the feature space to the two dimensions $\{p_T^{b_1}, p_T^{\ell_1}\}$. For our QAE setup, in addition to the two qubits embedding the input features, one qubit for the reference state and another ancillary qubit for the SWAP test are needed. We use the simpler version of the quantum circuit shown in figure 2, which is implemented and trained using PennyLane. To compare the performance, we use the same circuit with the same optimised parameters both in PennyLane and on the IBM-Q belem backend. The IBM hardware was accessed through Qiskit.
In figure 8, we show the fidelity distributions for the background and signal samples for our QAE circuit with the optimised circuit parameters, computed by the simulator in PennyLane and on the actual quantum device of the IBM-Q belem backend. The plot shows the shape of the distribution (denoted by the width of the shaded region) along the y-axis for each bin of size 0.1 on the x-axis (plotted at each bin centre). The lines at each end denote the range of the data on the y-axis. Since IBM-Q does not have a shallow implementation of the CSWAP operation, the fidelity distributions are smeared towards 0.5, an effect that is especially pronounced near 1. One of the advantages of using the SWAP test is the reduced number of qubits needed for the evaluation of the fidelity during the optimisation process. To check the performance of the current circuit, however, directly measuring the fidelity between the reference state and the output of the second qubit is enough, which can be achieved by simple Pauli-$z$ measurements. The correlations of the fidelities obtained by PennyLane and by IBM-Q belem, based on the SWAP test and on the Pauli-$z$ measurement, are shown in the right panel as violin plots, in blue and orange, respectively. The correlation is better for the Pauli-$z$ measurements on the same circuit with identical input parameters, suggesting that the decoherence effects from a deeper circuit obscure the performance.
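The two fidelity extraction methods compared here follow standard relations: the SWAP test yields $P(\text{ancilla}=0) = (1+F)/2$, while for a single trash qubit the overlap with the $|0\rangle$ reference state is $F = (1+\langle Z\rangle)/2$. A minimal sketch of the inversions, with hypothetical helper names:

```python
def fidelity_from_swap_test(p_ancilla_0):
    """Invert the SWAP-test relation P(ancilla = 0) = (1 + F)/2."""
    return 2.0 * p_ancilla_0 - 1.0

def fidelity_from_pauli_z(expval_z):
    """Overlap of a single trash qubit with |0>: F = P(0) = (1 + <Z>)/2."""
    return 0.5 * (1.0 + expval_z)
```

The Pauli-$z$ route avoids the CSWAP gate entirely, which is why it suffers less from the decoherence of a deep circuit on present hardware.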
In figure 9, the left panel shows the ROC curves based on the fidelity distributions for the background and signal samples evaluated with the PennyLane simulator. The central panel shows the ROC curves based on the fidelities evaluated via the SWAP test, while the right panel shows those based on the Pauli-$z$ measurements of the second qubit, both on the same IBM-Q belem backend. As one can see, the performance based on the Pauli-$z$ measurements on the IBM-Q device follows that obtained with the PennyLane simulator, and their AUCs are essentially the same. Thus, the deficit in performance with the SWAP test is due to the deep circuit realisation of the CSWAP operation on the IBM-Q device, and a realisation of the CSWAP operation with a shallow circuit is therefore necessary.
To check the efficacy of quantum hardware for the four-input QAE, we evaluate the trash state fidelity of a QAE with four-dimensional input features. Due to the hardware limitations discussed above, we estimate it without the SWAP test for a single trash qubit, giving a three-dimensional latent representation. The correlation between the PennyLane-evaluated fidelity and the output from IBM-Q lagos, shown in figure 8, displays good agreement between the simulation and the hardware.
E. Comparative training efficiency and performance for $pp \to Z(\nu\nu)jj$ background
We have seen that a QAE trains efficiently and performs better than a CAE in a hypothetical resonant signal scenario. To gauge how these important behaviours carry over to a different process, we study the training-size dependence and performance of a QAE and a CAE for an invisible background (and signal), detailed in the last paragraph of Section IV A, for a two-dimensional latent space. Note that all results for the CAE are for the best model chosen after the hyperparameter scan described in Appendix C.
The loss distributions of the background test dataset for different training-data sizes, and the corresponding ROC curves for the case of 10k training samples, are shown in figure 10. The characteristics are similar to the previous scenario, giving further evidence that the training efficiency of the QAE is not limited to a specific kind of process. Moreover, from the ROC curve and the AUC value, we see that the QAE again performs better than the CAE. This superior performance is particularly noteworthy given that the CAE's hyperparameters have been chosen after a hyperparameter scan, albeit one restricted to a fixed width and depth.
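The AUC values quoted here can be obtained from the ROC points by the trapezoidal rule; a minimal sketch with a hypothetical helper, not the analysis code:

```python
def auc(points):
    """Area under a ROC curve given as (background rejection, signal
    acceptance) points, via the trapezoidal rule after sorting along x."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area
```

A random classifier gives an AUC of 0.5, while a perfect one gives 1.0.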

VI. CONCLUSION
The lack of evidence for new interactions and particles at the Large Hadron Collider has motivated the high-energy physics community to explore model-agnostic data-driven approaches to search for new physics. Machine-learning anomaly detection methods, such as autoencoders, have been shown to be powerful and flexible tools for searching for outliers in data. Autoencoders learn the kinematic features of the background data by training the network to minimise the reconstruction error between the input features and the neural network output. As the kinematic characteristics of the signal differ from those of the background, the reconstruction error for the signal is expected to be larger, allowing signal events to be identified as anomalous.
Although quantum architectures capable of processing huge volumes of data are not yet feasible, noisy intermediate-scale devices could have very real applications at the Large Hadron Collider in the near future. With the origin of the collisions being quantum-mechanical, a quantum autoencoder could, in principle, learn quantum correlations in the data that a classical, bit-based autoencoder fails to see. We have shown that quantum autoencoders based on variational quantum circuits have potential applications as anomaly detectors at the Large Hadron Collider. Our analysis shows that, for the scenario we consider, i.e. the same set of input variables, quantum autoencoders outperform dense classical autoencoders based on artificial neural networks, indicating that quantum autoencoders can indeed go beyond their classical counterparts. They are very economical with data and converge with very small training samples. This independence of the training-sample size opens up the possibility of training quantum autoencoders on small control samples, thereby extending data-driven approaches to inherently rare processes.
The signal acceptance and background rejection for a threshold $T_0$ on a variable $x$ are defined as
$$ \epsilon_S(T_0) = \int_{-\infty}^{T_0} p_S(x)\,\mathrm{d}x\,, \qquad \bar{\epsilon}_B(T_0) = \int_{T_0}^{\infty} p_B(x)\,\mathrm{d}x\,, $$
where $p_S(x)$ and $p_B(x)$ are the normalised distributions of $x$ for the signal and the background, and we have assumed that the signal-rich region is on the lower side of the variable $x$. The ROC curve is then obtained by expressing the signal acceptance as a function of the background rejection as
$$ \epsilon_S = \epsilon_S\big(T_0(\bar{\epsilon}_B)\big)\,. $$
The ROC curve therefore shows the function $\epsilon_S(\bar{\epsilon}_B)$ without any explicit reference to the threshold $T_0$, which is implicit in the evaluation of the dependent ($\epsilon_S$) and independent ($\bar{\epsilon}_B$) quantities. The variable $x$ can be any physical observable or the output of a neural network model. For the studies conducted here, it is the RMSE loss for the CAE and the fidelity for the QAE.
The details of the hyperparameter scan of the classical autoencoder with six-dimensional inputs and outputs are given in this appendix. We use the RandomSearch algorithm implemented in KerasTuner [74] for the scan. The number of nodes in the hidden layers of the encoder is kept fixed at 20, 15, and 10. With a (fixed) two-dimensional latent space, we use a symmetric decoder setup. Once the skeleton of the architecture is fixed, we scan over the activation function of the layers, the L1 and L2 regularisation of the weights, the dropout value between two successive layers, and the learning rate and batch size of the training. Their candidate values, along with the best ones chosen for the final training, are given in table I. The best hyperparameter values are obtained from one thousand trials, each trained for one hundred epochs, with training terminated if the validation loss does not improve for ten epochs (implemented via the EarlyStopping callback).
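The random-search procedure itself is conceptually simple; the pure-Python sketch below is purely illustrative (KerasTuner's RandomSearch handles this internally), with hypothetical names throughout:

```python
import random

def random_search(space, evaluate, n_trials=10, seed=0):
    """Random hyperparameter search: sample each hyperparameter uniformly
    from its candidate list and keep the configuration with the lowest
    validation loss returned by `evaluate`."""
    rng = random.Random(seed)
    best_cfg, best_loss = None, float("inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(choices) for name, choices in space.items()}
        loss = evaluate(cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss
```

A search space in the spirit of table I might be `{"activation": ["relu", "elu"], "l1": [0.0, 1e-4], "learning_rate": [1e-2, 1e-3], "batch_size": [32, 64]}`, with `evaluate` training the CAE for the fixed number of epochs and returning the validation loss.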
We do not vary the width or the depth, so that the capabilities of the CAE are compared with at least some degree of fairness to the simple QAE used in this study. Increasing the width and depth would undoubtedly increase the expressive power of a CAE, but that is not the objective of the current study. Networks such as convolutional or graph autoencoders acting on low-level, high-dimensional data will undoubtedly perform better than currently executable QAEs. However, existing quantum resources cannot process such high-dimensional data.