Quantum–probabilistic Hamiltonian learning for generative modelling & anomaly detection

The Hamiltonian of an isolated quantum mechanical system determines its dynamics and physical behaviour. This study investigates the possibility of learning and utilising a system’s Hamiltonian and its variational thermal state estimation for data analysis techniques. For this purpose, we employ the method of Quantum Hamiltonian-Based Models for the generative modelling of simulated Large Hadron Collider data and demonstrate the representability of such data as a mixed state. In a further step, we use the learned Hamiltonian for anomaly detection, showing that different sample types can form distinct dynamical behaviours once treated as a quantum many-body system. We exploit these characteristics to quantify the difference between sample types. Our findings show that the methodologies designed for field theory computations can be utilised in machine learning applications to employ theoretical approaches in data analysis techniques.

The Hamiltonian plays a crucial role in our theoretical understanding of a physical system. The dynamics of an isolated quantum system are governed by an effective Hamiltonian which indicates the interaction of the system's constituents. In many, arguably simple, cases, it is possible to determine the effective Hamiltonian through a set of theoretical considerations, such as observing its interactions and using the underlying symmetries of the system. Often, however, it is challenging to derive the algebraic form of a Hamiltonian from theoretical considerations only. Hence, several Hamiltonian learning methods have been proposed by employing thermal or eigenstates [1][2][3][4][5][6], short-time evolutions [7][8][9][10], and data-driven approaches [11]. With recent technological developments, it has become possible to simulate the effective Hamiltonian that governs a quantum many-body system in an actual quantum device. Widely used methods like the Variational Quantum Eigensolver (VQE) [12,13] and its generalisation, the Variational Quantum Thermaliser (VQT) [14], have become the most promising algorithms for noisy intermediate-scale quantum (NISQ) devices [15].
There is a substantial yet often underappreciated similarity between the computational methods used for data analysis, e.g. in quantum machine learning, and the theoretical description of quantum many-body systems. In both scenarios, one optimises the variational parameters of a given ansatz over an objective function. For the latter, this is naturally the expectation value of a Hamiltonian, and for the former, it is a loss function chosen for the nature of the problem. In Machine Learning (ML) applications, one usually chooses a one- or multi-qubit measurement with a Pauli operator and optimises the probability of the expectation value of this operator on a set of qubits. However, this operator is not necessarily optimal for the optimisation process, since it is only one of the possible combinations of different operators. In this study, we will investigate the possibility of learning an optimal effective operator for the optimisation process and the implications of this operator for the application.
In generative modelling, the aim is to learn the joint probability distribution between the target and the observed data, which enables the model to generate new data resembling the observed data. This requires representing the probability distribution of the data within a quantum device. Mixed states are an ideal surrogate for such a representation since they form probabilistic mixtures of pure states. Additionally, mixed states attain the properties of both quantum and classical correlations, enhancing the representability of a given probability distribution. The likeness of a given probability distribution can be captured within a Parametrised Quantum Circuit (PQC) as a thermal state of a modular Hamiltonian. Quantum Hamiltonian-Based Models (QHBM) [14] have been proposed as generative models which split the learning process into two distinct parts. The first part is responsible for learning a modular Hamiltonian with the aid of a classical neural network for capturing the classical correlations within the data. The second part consists of a PQC that constructs the learned Hamiltonian's thermal state. The aim is to approximate the probability distribution of the data by optimising the learned thermal state with respect to the mixed state based on the data.
Motivated by the close methodical relation between data analysis and the simulation of field theories, i.e. the use of optimisation methods on parametrised circuits, we propose to learn a Hamiltonian from data, according to the QHBM approach. Thus, we will demonstrate an end-to-end hybrid quantum-probabilistic optimisation procedure to simultaneously learn the probability distribution and modular Hamiltonian for the data. We then use the learned probability distribution to generate new data and, as a case study, apply it to top-quark production at the Large Hadron Collider (LHC). Furthermore, we will use the learned Hamiltonian for anomaly detection, as we will show that both the expectation value of the learned Hamiltonian and the Hamiltonian-based time evolution sequence discriminate between signal and background data samples.
Our findings in this study show that the optimisation methods developed to simulate quantum many-body systems are easily transferable to data-analysis applications and can be used to integrate the theoretical foundations of quantum mechanics into ML techniques and vice versa. To our knowledge, this approach has not been proposed or investigated for data-analysis applications¹.
Previous studies on Hamiltonian learning techniques are mainly based on learning the structure of a quantum state, e.g. with simulated data from a known Hamiltonian with certain noise [11]. The initial proposal of QHBM [14] is designed to provide a hybrid learning algorithm. However, due to the complexity of the quantum systems, it is challenging for a classical computer to generate samples efficiently enough. To alleviate that problem, various solutions have been proposed, such as approximated free energy techniques [23] and replacing classical sampling of the quantum data with a quantum circuit that represents the sample itself [24]. Simulating thermal states is especially challenging due to their circuit depth requirements, which can be remedied via noise-assisted thermal state preparation [25]. Additionally, various density matrix simulation algorithms have been proposed for error mitigation [26], as well as algorithms that are enhanced with classical post-processing techniques to simulate non-trivial Hamiltonians [27]. Despite the plethora of applications for quantum simulation, such techniques have not been employed in data analysis methodology, where the closest application was building a covariance matrix for principal component analysis through density matrices [28]. Our implementation goes beyond quantum simulations and discusses the usage of such techniques for data analysis by representing generic data as a mixed quantum state of a learned operator and, in doing so, generating an abstract representation of the entire dataset as a Hamiltonian.
This study is structured as follows: in section A we outline the methodology that is adopted. Section I introduces the dataset and preprocessing scheme, which is followed by a generative modelling exercise in section I A and anomaly detection in section I B. Finally, we offer conclusions in section II.
A. Quantum Hamiltonian-Based Models

This section will be subdivided into three distinct subsections. Firstly, we will delve into the intricacies of constructing the quantum variational ansatz, emphasizing the incorporation of data embedding. Secondly, we will elucidate the construction of the Hamiltonian, providing rationale for the specific approach chosen. Lastly, we will delineate the formulation of the objective function, drawing parallels to classical generative modeling for a clearer understanding.

Quantum variational ansatz & data representation
The objective of generative modeling is to encapsulate the entire feature space within a single probability distribution function, enabling efficient sampling to replicate the underlying distribution. Quantum mechanics provides a natural framework for this representation through the concept of mixed states. In quantum theory, a mixed state is a composite entity formed by combining pure states or other mixed states, serving as a faithful representation of the probability distribution encompassing its constituent states. A mixed state can be represented as

$$\sigma = \sum_i p_i\, |s_i\rangle\langle s_i| , \tag{1}$$

where $p_i$ is the probability of observing the state $|s_i\rangle$ in the mixed state $\sigma$. Hence $\sum_i p_i = 1$. If we aim to acquire a density matrix representation of the feature space for data regeneration, a challenge arises due to the inherent nature of quantum circuits as pure state simulators. Embedding a mixed state directly onto a quantum circuit is not straightforward. To circumvent this challenge, we can reinterpret the data samples as a probability distribution. Each feature within a data point possesses an associated occurrence probability, which can be effectively encoded onto the quantum circuit using binary values (0's and 1's) derived from sampling a Bernoulli distribution. By generating a sufficient number of such samples, the mean of the sample set corresponds to the occurrence probability of the feature. Through the learning of these samples, it becomes possible to reconstruct the correlation relationships between the features within the quantum circuit. By this method, a single data point is represented as a collection of pure states sampled from a Bernoulli distribution,

$$|p_n\rangle^{i}_{d} = |b^{i}_{1}\, b^{i}_{2} \cdots b^{i}_{n}\rangle , \qquad b^{i}_{j} \sim \mathrm{Bernoulli}(p_j) , \tag{2}$$

where the state for the $d$-th data point, with $n$ individual feature probabilities $p_j$, is represented by the $i$-th sample drawn from the Bernoulli distribution. This state can be embedded on a quantum circuit by applying Pauli-X gates wherever the Bernoulli sample results in 1. The combination of the entire data set in terms of sampled states can be represented as

$$|D\rangle = \sum_{d \in D} \alpha_d \sum_{i=1}^{N} |p_n\rangle^{i}_{d} , \tag{3}$$

where $\alpha_d$ is the weight of each data point within the data set $D$. Here $N$ represents the number of samples drawn from the Bernoulli distribution for each data point. Finally, the mixed state of the entire data set can be represented as $\sigma_D = |D\rangle\langle D|$ with appropriate normalization.
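The Bernoulli embedding described above can be illustrated with a short sketch. It only shows the classical sampling step, assuming four hypothetical feature probabilities; the helper names `sample_bitstrings` and `feature_means` are illustrative and not taken from the authors' implementation.

```python
import random

def sample_bitstrings(probs, n_samples, seed=0):
    """Draw n_samples bitstrings; bit j is 1 with probability probs[j]."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p else 0 for p in probs]
            for _ in range(n_samples)]

def feature_means(bitstrings):
    """Column-wise mean of the sampled bits, one value per feature."""
    n = len(bitstrings)
    return [sum(column) / n for column in zip(*bitstrings)]

# Hypothetical feature probabilities of a single data point (assumption).
probs = [0.1, 0.5, 0.9, 0.3]
samples = sample_bitstrings(probs, n_samples=5000)
means = feature_means(samples)

# Each bitstring would be loaded on the circuit by applying a Pauli-X gate
# on every qubit whose sampled bit equals 1.
x_gate_positions = [j for j, bit in enumerate(samples[0]) if bit == 1]
```

With enough samples the column means approach the input probabilities, which is the property the embedding relies on.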
Notice that a data point now does not correspond to a single circuit measurement but to a stack of circuits with different binary inputs drawn from a single Bernoulli distribution. Furthermore, one can learn the correlation structure of the data by means of a variational ansatz, $\hat{U}(\phi)$, where $\phi$ represents the trainable parameters of the ansatz.

Building the Hamiltonian
To facilitate the optimization process, it is imperative to establish a well-defined measurement protocol. This protocol must employ an operator or Hamiltonian capable of accurately encapsulating the entropic probability distribution associated with the targeted mixed state for acquisition. While one option involves the application of a pre-existing Hamiltonian, such as the Ising model, it is essential to acknowledge that this approach inherently assumes specific correlation structures among the features. Alternatively, a more ambitious endeavor entails simultaneously acquiring both the complete Hamiltonian representation and the density matrix characterizing the underlying data through an optimization process.
Given the resource-intensive nature of learning a Hamiltonian, we have opted for the utilization of a classical neural network, which offers a more flexible structural framework. However, it is worth noting that while any neural network ansatz can be employed for this purpose, the estimation of the partition function may pose computational challenges that exceed the allocated resources. The necessity of a partition function will become evident in the subsequent subsection.
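Energy-Based Models (EBMs) [29] present an ideal choice for our task, as they inherently encompass the partition function. An EBM establishes a mapping between a state configuration and a scalar energy measure, denoted as $E_\theta : v \in V \to \mathbb{R}$, where $v \in V$ represents a spin configuration within the set of all possible configurations. EBMs are designed to determine the optimal energy by minimizing the marginal probability distribution of the states in $V$, given by

$$p_\theta(v) = \frac{e^{-E_\theta(v)}}{Z_\theta} , \qquad Z_\theta = \sum_{v \in V} e^{-E_\theta(v)} .$$

The modular Hamiltonian is built from the sampled configurations as

$$K_\theta = \sum_{n} E_\theta(v_n)\, |v_n\rangle\langle v_n| , \tag{4}$$

where $|v_n\rangle$ represents normalized spin states determined by the MC algorithm, and $E_\theta(v_n)$ corresponds to their energy as measured by the chosen EBM ansatz. It is important to note that $K_\theta$ is defined as a Hermitian operator. Using the modular Hamiltonian definition in eq. (4), the expectation value of the $d$-th data point can be defined as

$$\langle K \rangle^{d}_{\theta,\phi} = \frac{1}{N} \sum_{i=1}^{N} {}^{i}_{d}\langle p_n|\, \hat{U}^{\dagger}(\phi)\, K_\theta\, \hat{U}(\phi)\, |p_n\rangle^{i}_{d} , \tag{5}$$

where the mean expectation value has been computed by taking the mean of $N$ samples taken from the Bernoulli distribution.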
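As a minimal sketch of this construction, the snippet below builds a diagonal modular Hamiltonian from a handful of spin configurations and evaluates the sample-averaged expectation value. Several pieces are assumptions made for illustration only: `toy_energy` is a hypothetical stand-in for the EBM, the configurations are listed by hand instead of being drawn by an MC algorithm, repeated configurations are counted once, and the circuit unitary is taken to be the identity.

```python
import numpy as np

def bits_to_index(bits):
    """Computational-basis index of a bitstring (first bit most significant)."""
    return int("".join(str(int(b)) for b in bits), 2)

def toy_energy(v):
    # Hypothetical stand-in for the EBM energy (assumption, not the RBM):
    # aligned neighbouring spins lower the energy.
    s = [2 * b - 1 for b in v]
    return -float(sum(s[i] * s[i + 1] for i in range(len(s) - 1)))

def modular_hamiltonian(configs, energy_fn):
    """K = sum_n E(v_n) |v_n><v_n| over the distinct sampled configurations."""
    n = len(configs[0])
    K = np.zeros((2 ** n, 2 ** n))
    for v in {tuple(c) for c in configs}:
        i = bits_to_index(v)
        K[i, i] = energy_fn(v)
    return K

def log_partition(configs, energy_fn):
    """log Z estimated from the distinct sampled configurations."""
    energies = np.array([energy_fn(v) for v in {tuple(c) for c in configs}])
    return float(np.log(np.sum(np.exp(-energies))))

def mean_expectation(K, U, bitstrings):
    """Average of <b| U^dagger K U |b> over the sampled bitstrings b."""
    values = []
    for b in bitstrings:
        psi = U[:, bits_to_index(b)]  # U|b> is the column of U labelled by b
        values.append(np.real(np.conj(psi) @ K @ psi))
    return float(np.mean(values))

configs = [(0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 0, 1)]  # stand-in for MC output
K = modular_hamiltonian(configs, toy_energy)
log_z = log_partition(configs, toy_energy)
U = np.eye(8)  # identity in place of the variational circuit (assumption)
expectation = mean_expectation(K, U, [(0, 0, 0), (1, 1, 1)])
```

In an actual run the identity would be replaced by the matrix of the trained variational circuit and the bitstrings by the Bernoulli samples of one data point.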

The objective function
The primary objective of this endeavor is to establish a dependable representation of $\sigma_D$ through the acquisition of a learned density matrix, denoted as $\rho_{\theta,\phi}$. In classical probabilistic learning, the optimization process revolves around minimizing the Kullback-Leibler divergence, $D_{KL}$, to diminish the disparity between two probability distributions (refer to eq. (B4)) [30]. In this context, the goal is to approximate $\rho_{\theta,\phi} \simeq \sigma_D$. Extending this principle to our scenario, where we work with a given Hamiltonian and temperature, the Gibbs-Delbrück-Molière variational principle [31] asserts that the most suitable objective function for this process is the free energy, which is bounded by the actual free energy of the system.
In the free energy, $S(\sigma_D)$ denotes the entropy of the data (refer to eq. (B3)), $\beta$ represents the inverse temperature, and $E$ signifies the expectation value of the given Hamiltonian as defined in eq. (5). Notably, both the Hamiltonian and the entropy are unknown prior to data analysis, posing a significant challenge. Recognizing that the free energy can also be expressed as the log-partition function, we can reformulate the entire expression as the inequality in eq. (7), where $k_\beta$ is a constant related to the true entropy of the data and the Boltzmann constant. This inequality offers a more suitable objective function for our purpose. Since temperature and the Boltzmann constant lack physical significance in the context of data analysis, they can be employed as regularizers of the objective function, and for this study, we consider them to be equal to one.
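For orientation, an inequality of this type follows from the non-negativity of the quantum relative entropy. The lines below are a textbook sketch, assuming a Gibbs-form model state $\rho_\theta = e^{-\beta K_\theta}/Z_\theta$ with the circuit unitary absorbed into $\sigma_D$; the symbol $\rho_\theta$ and the normalisation are assumptions and may differ from the conventions used in eq. (7).

```latex
\begin{align}
D(\sigma_D \,\|\, \rho_\theta)
  &= \mathrm{tr}\!\left[\sigma_D \left(\log \sigma_D - \log \rho_\theta\right)\right] \\
  &= -S(\sigma_D) + \beta\, \mathrm{tr}\!\left(\sigma_D K_\theta\right) + \log Z_\theta \;\geq\; 0 \\
\Rightarrow \quad
  \beta\, \mathrm{tr}\!\left(\sigma_D K_\theta\right) + \log Z_\theta
  &\geq S(\sigma_D)
\end{align}
```

Since the entropy of the data does not depend on the trainable parameters, minimising the left-hand side of the last line is equivalent to minimising the relative entropy between the data state and the model state.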
Looking at the left-hand side of eq. (7), it becomes evident that the entropy of the data essentially manifests as the negative log-probability distribution of a multivariate Gaussian distribution centered at zero, where the Hamiltonian can be interpreted as the covariance matrix ($\Sigma(\theta)$) among different features. Given that the determinant of a Hermitian matrix is equivalent to the product of its eigenvalues, Z 2π. Thus, our approach essentially models the data as a multivariate Gaussian distribution while simultaneously learning the covariance matrix through the optimization of $-\log \mathcal{L}(\theta, \phi)$. Similar analogies can be found in the context of simulating lattice field theories using flow-based algorithms [32,33].
In unsupervised learning, the neural network serves as a statistical model of the underlying data, and the objective is to minimize the negative log probability distribution. Drawing from the analogy presented earlier, it becomes evident that reformulating this problem as a thermal state effectively implies an assumption that the data can be suitably approximated by a Gaussian distribution. This insight establishes a direct connection between theoretical approaches and conventional machine learning techniques, highlighting the interplay between sophisticated modeling strategies and established methodologies in the field of statistical thermodynamics.

Combining it all together
Fig. 1 provides a schematic overview of the entire process, segmented into two primary panels. The upper panel illustrates the generation of the modular Hamiltonian, $K_\theta$, and the partition function, $Z_\theta$, through the utilization of an MC algorithm. In contrast, the lower panel depicts the variational circuit, incorporating sampled data points as specified in eq. (2) as input. Leveraging the modular Hamiltonian, we compute the expectation value for each data point, as detailed in eq. (5). Subsequently, we combine the partition function and the expectation value to formulate the loss function, as elucidated in eq. (7).
Once the mean loss function is computed for a batch of data points, we update the trainable parameters $\theta$ and $\phi$. Within the concept of batched learning, optimising the negative log-probability distribution has two significant computational bottlenecks. First, as mentioned above, one must estimate a proper modular Hamiltonian for the optimisation procedure. This has been done independently of the input data, where the MC algorithm chooses the most probable set of spin configurations and, by doing so, automatically minimises the energy of the EBM ansatz. This requires re-estimation of the modular Hamiltonian after each update. It is important to emphasise that since the MC algorithm proposes an input configuration completely independent of the input data, the optimisation procedure is the only connection between the data and the Hamiltonian. The second bottleneck involves the estimation of $\sigma_d$; since we do not have access to "quantum data", we need to sample through the probability distribution of each data point and compute the expectation value $\langle K \rangle_{\theta,\phi}$.
Such an optimisation process allows the quantum circuit to learn a non-linear distribution by minimising structed from a non-linear classical neural network, |σ D Û (ϕ)⟩ is being forced to approximate such non-linear behaviour to reduce the distance of the projection.
Although it is a powerful representation of the data, learning a completely free Hamiltonian is computationally challenging simply because the Hamiltonian has to be decomposed into Pauli operators at every step of the optimisation process. It is possible to avoid the EBM if we assume a certain structure for the modular Hamiltonian $K$. For instance, a generic Hamiltonian that captures the nearest-neighbour interactions can be suitable to capture near-term complexity of the data, where $\sigma^{\pm}$ are raising and lowering operators and $\theta$ are the trainable coupling strengths. Since the summation captures only nearest-neighbour interactions, this Hamiltonian may not be able to capture the full complexity of the data, but it will simplify the optimisation process significantly. Additionally, tensor network techniques can aid in the decomposition of large Hamiltonian matrices.
However, such simplifications are out of the scope of this paper, where we focus on the most generic application and see if it is indeed possible to create a useful operator through this procedure.

I. RESULTS
As a case study, we used the top tagging dataset [34,35], which includes over a million mixed collider events for semi-leptonic top and dijet production channels at $\sqrt{s} = 14$ TeV. Events are generated and showered in Pythia 8 [36], and the detector simulation has been achieved using the Delphes 3 package [37] with the default ATLAS configuration card. All jets are reconstructed via the anti-$k_T$ algorithm [38] with $R = 0.8$ within the FastJet [39] package. Furthermore, the central-boosted phase-space has been captured by requiring the jet transverse momentum, $p_T$, to be within [550, 650] GeV and the absolute pseudo-rapidity to be $|\eta| < 2$.
The jets are further processed to be represented as calorimeter images, potentially captured by the hadronic calorimeter (HCAL) at an LHC experiment. Following the procedure presented in refs. [40,41], leading jet constituents are centred around the jet axis on the pseudo-rapidity and azimuthal angle, $\eta - \phi$, plane within [−1.5, 1.5]. Each image is divided into four quadrants, and the most energetic quadrant has been moved to the top-right corner by horizontally and vertically flipping the image. Finally, all the training samples are standardised over 200,000 randomly chosen images by fitting $p_T$ within the [0, π] range. This standardisation procedure yields calorimeter images of 37×37 pixels; however, since it is not possible to process this within a quantum circuit, we simplified our data by cropping 12 pixels from each axis and down-sampling the resulting image by taking the mean of four adjacent pixels.
The two left panels of Fig. 2 show the mean of 5000 images for signal (on the left) and background (on the right), where $\eta'$ and $\phi'$ are the modified pseudo-rapidity and azimuthal angle axes after standardisation. The colour represents the value of the transverse momentum in each pixel, as indicated on the rightmost panel of the figure. Notice that this is before normalizing the $p_T$ distribution within the [0, π] range. The following two images capture only a single event within the signal (on the left) and background (on the right) samples. All the images are cropped to show only the 27 × 27 central image to focus on the main activity. Even though the averaged images are easily differentiable, single events usually look random and are not easily differentiable; hence various ML techniques have been developed to differentiate these samples.
Because the top decays into two light jets through a W decay and a b-jet, it creates a three-prong signature, as shown on the very left panel of Fig. 2. Such topological behaviour has been exploited by many analytic tagging algorithms (e.g. ref. [42]). The dijet signature, on the other hand, leaves a single-prong signature on the calorimeter, as shown in the second image on the left panel of Fig. 2. This is due to the fact that dijet events do not contain enough energy to produce two distinct jet signatures. Such a process is crucial to investigate at the LHC because the top quark's mass can further our understanding of the Higgs mechanism and its coupling to the top quark, since a large mass comes with a large coupling to the Higgs boson. With larger centre-of-mass energies, the production rate of top-quark pairs at the LHC has increased. These events are usually contaminated with dijet events, making it challenging to isolate top-quark events. Hence it is vital to separate top events from the dijet background to improve the experiment's sensitivity to its couplings.
In the following sections, we will use only a fraction of these images by cropping and downsampling them due to computational limitations. This mainly affects the span of information since it is highly dependent on the geometrical positions of the energy deposits on each pixel. We will start with the central pixels and increase the pixel count from there, but it is important to note that the central four pixels are maximally similar throughout both samples. Hence one includes more information regarding the nature of the event once one goes beyond the central four pixels.

A. Generative modelling
As the first set of applications for QHBM, we will aim to learn the probability distribution of the pixel intensity in calorimeter images. Each pixel of a standardised sample has a $p_T$ intensity within [0, π]. The pixel intensity can be interpreted as a probability distribution if it passes through a bijective function which outputs values between [0, 1], such as a sigmoid function. This will allow us to simultaneously interpret pixel intensities as probability distributions and convert them back to their original values. Due to the computational cost of the quantum simulation and the optimisation methodology, we choose to perform our investigation with only the central four pixels, which retain the necessary information to differentiate between top and dijet images, as presented in a previous study [43].
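The role of the bijective map can be illustrated with a small sketch. It assumes a plain logistic sigmoid with the logit as its inverse; the exact function used in the study is not specified beyond "such as a sigmoid function", and the pixel values below are hypothetical.

```python
import math

def sigmoid(x):
    """Map a real-valued pixel intensity to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Inverse of the sigmoid: map a probability back to a pixel intensity."""
    return math.log(p / (1.0 - p))

# Hypothetical standardised pixel intensities within [0, pi] (assumption).
pixels = [0.0, 0.7, 1.9, math.pi]
probabilities = [sigmoid(x) for x in pixels]
recovered = [logit(p) for p in probabilities]
```

Because the map is invertible, Bernoulli means estimated from generated samples can be converted back to pixel intensities with `logit`.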
The modular Hamiltonian has been determined via a Restricted Boltzmann Machine (RBM), where details have been presented in App. A. We re-estimate the modular Hamiltonian for each training batch by collecting a set of spin states via the MC algorithm presented above. The initial state for each training has been set to $|\uparrow \cdots \uparrow\rangle$; each following MC run has been initiated by the last state determined in the previous MC run. For each execution, the MC algorithm ran for 100 steps to converge on a stable Gibbs state without collecting any states; the number of collected states is analysed case by case below. Note that these states are entirely independent of the input data; hence the MC algorithm independently minimises the energy of the RBM by choosing the most probable set of states.
The expectation value for each image has been estimated via eq. (5). Since we are employing batched learning, the expectation value of the batch has been computed by taking the mean of each expectation estimation in the batch. Finally, the variational parameters of the network have been updated with respect to the mean objective function. Notice that the mean only entails the expectation value of the modular Hamiltonian. We divided our study into different benchmarks to study the effects of the $\sigma_D$ and $K_\theta$ estimations. The PennyLane package [44] has been employed for quantum circuit simulation; the RBM and optimisation have been handled within the TensorFlow [45,46] and TensorFlow-Probability [47] packages. Our implementation can be found in this GitLab repository². All the benchmarks are trained with 1000 training samples, and overtraining has been monitored with the same number of validation events³. The Adam optimisation algorithm [48] has been employed with a $10^{-2}$ initial learning rate, where the learning rate has been halved if the validation loss has not improved for over 25 epochs. Each benchmark has been trained for 100 epochs, and training has been terminated if the validation loss has not improved for over 50 epochs.
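The learning-rate and stopping rules described above can be sketched in a few lines. This is a simplified illustration, not the authors' training loop: the function name `learning_rate_history` is hypothetical, "over 25 epochs" is read as strictly more than 25 epochs without improvement, and the validation losses are supplied as a plain list.

```python
def learning_rate_history(val_losses, lr=1e-2, reduce_patience=25,
                          stop_patience=50, factor=0.5):
    """Return the learning rate used at each epoch until training stops."""
    best = float("inf")
    since_best = 0      # epochs since the validation loss last improved
    since_reduce = 0    # epochs since the last improvement or reduction
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            since_best = 0
            since_reduce = 0
        else:
            since_best += 1
            since_reduce += 1
        if since_reduce > reduce_patience:
            lr *= factor        # halve the learning rate
            since_reduce = 0
        history.append(lr)
        if since_best > stop_patience:
            break               # early stopping
    return history

# A validation loss that never improves after the first epoch.
flat = learning_rate_history([1.0] * 100)
# A validation loss that improves at every epoch.
improving = learning_rate_history([1.0 / (i + 1) for i in range(100)])
```

With a flat validation loss the rate is halved once and training stops after 52 epochs, whereas a steadily improving loss keeps the initial rate for all 100 epochs.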
For the quantum ansatz, we used a Matrix Product State (MPS) structure [49] where two-qubit operators have been applied to each adjacent qubit pair in a staircase-like architecture, which is depicted in Fig. 3. We will refer to each of these constructions from the first qubit to the last as a layer. Each two-qubit operator, $\hat{U}_i(\phi)$, includes two rotation gates around the Pauli-Y axis for each input qubit with an independent variational rotation angle, followed by a CNOT gate. For each benchmark, we used three layers. Note that the algorithm has also been tested with different architectures such as simplified two-design [50] and strongly entangling layers [51], which have been observed to improve the results.
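The staircase structure can be sketched with a small state-vector simulation. The snippet below uses plain NumPy instead of PennyLane, and it assumes one RY rotation per input qubit followed by a CNOT controlled on the first qubit of each pair; this is one reading of the description above, and the function names are illustrative only.

```python
import numpy as np

def ry(theta):
    """Single-qubit rotation about the Pauli-Y axis."""
    c, s = np.cos(theta / 2.0), np.sin(theta / 2.0)
    return np.array([[c, -s], [s, c]])

# CNOT with the first (more significant) qubit of the pair as control.
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

def two_qubit_block(theta_a, theta_b):
    """RY on each input qubit, followed by a CNOT."""
    return CNOT @ np.kron(ry(theta_a), ry(theta_b))

def apply_adjacent(state, gate, q, n_qubits):
    """Apply a 4x4 gate on qubits (q, q+1); qubit 0 is the most significant."""
    psi = state.reshape(2 ** q, 4, 2 ** (n_qubits - q - 2))
    return np.einsum("ab,ibj->iaj", gate, psi).reshape(-1)

def staircase_ansatz(state, params):
    """params has shape (layers, n_qubits - 1, 2)."""
    n_qubits = params.shape[1] + 1
    for layer in params:
        for q in range(n_qubits - 1):
            block = two_qubit_block(layer[q, 0], layer[q, 1])
            state = apply_adjacent(state, block, q, n_qubits)
    return state

def basis_state(bits):
    """Computational-basis state for a sampled bitstring."""
    state = np.zeros(2 ** len(bits))
    state[int("".join(str(b) for b in bits), 2)] = 1.0
    return state

rng = np.random.default_rng(0)
params = rng.uniform(0.0, 2.0 * np.pi, size=(3, 3, 2))  # three layers, four qubits
psi = staircase_ansatz(basis_state([1, 0, 0, 1]), params)
```

With all angles set to zero each block reduces to a CNOT, so a single layer propagates a flipped first qubit down the staircase.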
Fig. 4 shows the test metrics for each benchmark, where each point has been tested with 10,000 mixed test events and presented with one standard deviation, estimated by dividing the test sample into batches of 25. The left panel shows the benchmarks for the $\sigma_D$ estimation, where $N_{MC}$ for the $K_\theta$ estimation has been set to 200. On the other hand, the right panel shows the benchmarks for the $K_\theta$ estimation, where $N_{smp}$ for the $\sigma_D$ estimation has been set to 5000. It is also important to note that the samples to estimate $\sigma_D$ for the right panel are generated before the optimisation process to speed up the application; however, for the left panel, each sample is produced during the training, which effectively allowed those benchmarks to see different samples in each iteration. Each panel is divided into five sub-panels, where the top two panels present the Kullback-Leibler distance between input images (q) and the sampled output images (p), and between the batched mixed state of the data ($\sigma_D$) and the estimated mixed state ($\rho_\phi$)⁴. In the following panels, we present the trace distance and fidelity between the thermal state of the data and the estimated thermal state. Finally, we plotted the estimation of the von Neumann entropy separately for the signal (blue) and the background (red)⁵. Note that for the thermal state of the data, all images in the input have been assigned the same weight, i.e. $\alpha_i$ in eq. (3). For the details about these metrics, we refer the reader to App. B. During our tests, we observed that the performance of the generative model is mainly based on the quality of the estimation of the mixed state of the data, where for larger samples, we observed improved fidelity and trace distance (see the left panel of Fig. 4). Furthermore, we observed that the quality of the estimation also improves the Kullback-Leibler distance between the input and estimated states and exponentially reduces this metric's uncertainty. Notice that $S(\sigma_D)$ has been presented separately for signal and background. Although we haven't seen any significant difference in signal or background for any other metric, the entropy for different sample sets has been clearly separated. Note that all benchmarks are trained with mixed data, and none has been exposed to the information regarding the data type.

⁴ … at this stage, has been influenced both by the EBM and $\hat{U}(\phi)$. However, during the testing, the modular Hamiltonian hasn't been used; thus, the variational thermal state has only been influenced by $\hat{U}(\phi)$.
⁵ For grey-scale print: signal is represented in dark grey and background in light grey throughout the paper.
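For readers who want to reproduce the state-comparison metrics, a minimal NumPy sketch is given below. It uses the standard textbook definitions of the von Neumann entropy, trace distance and (squared) Uhlmann fidelity; App. B is not reproduced here, so the conventions, in particular the squared fidelity, are assumptions.

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -tr(rho log rho), computed from the eigenvalues."""
    w = np.linalg.eigvalsh(rho)
    w = w[w > 1e-12]  # drop numerically zero eigenvalues
    return float(-np.sum(w * np.log(w)))

def trace_distance(rho, sigma):
    """T(rho, sigma) = 1/2 * sum of |eigenvalues| of (rho - sigma)."""
    w = np.linalg.eigvalsh(rho - sigma)
    return float(0.5 * np.sum(np.abs(w)))

def _sqrtm_psd(a):
    """Matrix square root of a positive semi-definite Hermitian matrix."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)
    return (v * np.sqrt(w)) @ v.conj().T

def fidelity(rho, sigma):
    """F(rho, sigma) = (tr sqrt(sqrt(rho) sigma sqrt(rho)))^2."""
    s = _sqrtm_psd(rho)
    inner = _sqrtm_psd(s @ sigma @ s)
    return float(np.real(np.trace(inner)) ** 2)

rho = np.diag([0.5, 0.5])    # maximally mixed single-qubit state
sigma = np.diag([1.0, 0.0])  # pure state |0><0|
entropy = von_neumann_entropy(rho)
distance = trace_distance(rho, sigma)
overlap = fidelity(rho, sigma)
```

For this pair of single-qubit states the entropy of the mixed state is log 2, and both the trace distance and the fidelity equal one half.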
On the right panel, we present the effect of the $K_\theta$ estimation on the same test metrics. Although we haven't observed any significant improvement in fidelity, trace distance and Kullback-Leibler distance (except a minor refinement in $D_{KL}(q|p)$), we observed that the quality of the $K_\theta$ estimation improves the entropy estimation of the data and reduces the uncertainty. Hence the bottom right panel of Fig. 4 indicates that for good enough $K_\theta$ and $\sigma_D$ estimation, signal and background samples will produce unique entropy values. Thus this information can also be used to identify the nature of the data. However, $S(\sigma_D)$ has not been observed to be a powerful discriminator. We computed the receiver operating characteristic curve to quantify the difference between signal and background, and the highest area under the curve value we observed was around 0.7.
Note that we haven't discussed the advantage of learning an operator for the data. In the following section, we will discuss a possible usage of the modular Hamiltonian in the context of anomaly detection.

B. Anomaly detection
Anomaly detection is a methodology in which the network ansatz learns the structure of the known data and tries to detect the difference in new data, if any. For this purpose, we have used two test cases. For the first case, we used six qubits where, in addition to the central four pixels, we added the top two pixels into the collection. For the second case, we also included the bottom two pixels to test the algorithm for an eight-qubit scenario.
Using the same procedure outlined in sec. I A, we trained both scenarios using background-only samples for 100 epochs and 1000 events, where $\sigma_D$ is estimated by 5000 samples before the training. The only difference between the two test cases is that we used 500 MC samples for the six-qubit scenario and 1000 for the eight-qubit scenario. The difference is due to the size of the latent space, where we observed that a larger latent space requires more MC samples to estimate $K_\theta$ for the stability of the result, which we will discuss later in this section.
The network results have been tested with 10,000 background-only test samples. For the six-qubit scenario, we observed a fidelity of 0.81 and a trace distance of 0.3, whereas, for the eight-qubit scenario, we observed 0.79 and 0.3, respectively.
Although the von Neumann entropy, as shown in sec. I A, can serve as a significant observable to differentiate two types of samples, we propose a new observable based on the modular Hamiltonian. We will analyse two different cases; for the first, we will look into the effect of time evolution. We will define the time evolution operator of a modular Hamiltonian as

$$\hat{T}^{N} = \left(e^{-i K_\theta \Delta t}\right)^{N} = e^{-i K_\theta T} ,$$

where $T = N \Delta t$. For small $\Delta t$, this operator can be applied on the quantum circuit under the Trotter-Suzuki approximation. Using this relation, one can compute the fidelity of the time-evolved quantum state as

$$\mathcal{F}(t) = \left| \langle \psi(0) | \psi(t) \rangle \right|^{2} , \tag{8}$$

where $|\psi(0)\rangle = \hat{U}(\phi)|p_n\rangle^{i}_{d}$ and $|\psi(t)\rangle = \hat{T}^{N}|\psi(0)\rangle$. For the second case, since it is computationally less costly, we will analyse the expectation value of the modular Hamiltonian without time evolution.
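A minimal sketch of the time-evolution fidelity is given below. For simplicity it exponentiates the Hamiltonian exactly through its eigendecomposition instead of using a Trotter-Suzuki circuit, so it is a classical cross-check and not the circuit implementation; the two-level Hamiltonian in the usage lines is a hypothetical toy example.

```python
import numpy as np

def evolution_fidelity(K, psi0, dt, n_steps):
    """Return |<psi(0)|psi(t)>|^2 for t = dt, 2*dt, ..., n_steps*dt."""
    w, v = np.linalg.eigh(K)
    step = (v * np.exp(-1j * w * dt)) @ v.conj().T  # exp(-i K dt)
    psi = psi0.astype(complex)
    fidelities = []
    for _ in range(n_steps):
        psi = step @ psi
        fidelities.append(abs(np.vdot(psi0, psi)) ** 2)
    return np.array(fidelities)

# Toy two-level Hamiltonian and an equal superposition (assumptions).
K = np.diag([0.0, 1.0])
psi0 = np.array([1.0, 1.0]) / np.sqrt(2.0)
fid = evolution_fidelity(K, psi0, dt=0.1, n_steps=10)
```

For this toy case the fidelity follows cos²(t/2), which gives a quick sanity check before applying the same function to a learned Hamiltonian.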
We computed the time evolution up to T ≤ 500 for the estimated Kθ in each scenario with time steps of ∆t = 0.1. The left panel of Fig. 5 shows the fidelity, eq. (8), at each time step for signal (blue) and background (red) samples, where the six-qubit scenario is presented in the upper panel and the eight-qubit scenario in the lower panel. The thickness of each curve shows one standard deviation over the entire test sample. Note that, for the sake of visibility, the plot for the six-qubit scenario has been limited to T ≤ 200 whilst the computation has been done for T ≤ 500. In order to devise a quantitative measure, we computed the power-frequency curve from the fast Fourier transform of the time evolution sequence (see eq. (B5)). For the mean time-evolution sequence, we present the power-frequency distribution on the right panels of Fig. 5 for each respective time-evolution result. Although we have not observed any significant difference in low-frequency regions, the power of the two curves becomes significantly different in high-frequency regions. It is essential to note that the power-frequency curves become identical once the network is trained with mixed signal and background samples. Additionally, for the four-qubit scenario, the differentiability has been observed to be significantly low. We compute the receiver operating characteristic (ROC) curve from the power distribution for a frequency threshold to quantify the ability to differentiate between the two samples via the time evolution sequence. The true (false) positive rate, i.e. signal (background) efficiency, has been computed by counting the number of events in the binned power distribution between its maximum and minimum values for a given frequency. The left panel of Fig. 6 shows the ROC curve and corresponding area under the curve (AUC) values for the six-qubit (blue) and eight-qubit (red) scenarios. The dashed black line shows the random choice; the classification quality improves as the curves move further away from this line towards the upper left corner of the plot. The best minimum frequency value has been chosen for both distributions; hence we did not observe any improvement in the AUC value for larger frequencies. We observe that the eight-qubit scenario reaches saturation at a frequency of 0.056 with an AUC value of 0.85. In contrast, the six-qubit scenario requires a frequency of 0.2 to reach saturation at an AUC value of 0.82.
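As a rough illustration of the power-frequency curve, the sketch below applies a real FFT to a sampled sequence and returns the squared magnitude per frequency bin. The normalisation by the sequence length is an assumption and may differ from eq. (B5); the function name `power_spectrum` and the toy cosine input are hypothetical.

```python
import numpy as np

def power_spectrum(sequence, dt=0.1):
    """Frequencies and power |FFT|^2 / N of a real sequence sampled every dt."""
    seq = np.asarray(sequence, dtype=float)
    freqs = np.fft.rfftfreq(seq.size, d=dt)
    power = np.abs(np.fft.rfft(seq)) ** 2 / seq.size  # normalisation is an assumption
    return freqs, power

# Toy fidelity-like sequence (assumption): oscillates between 0 and 1 at frequency 1.0.
t = np.arange(500) * 0.1
seq = 0.5 + 0.5 * np.cos(2 * np.pi * 1.0 * t)

freqs, power = power_spectrum(seq)
# Skip the zero-frequency bin, which only carries the mean of the sequence.
peak = freqs[1:][np.argmax(power[1:])]
```

For per-event sequences, the power at a chosen frequency gives one number per event, and the distributions of that number for signal and background can then be compared with a threshold scan.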
For the second, less costly method, we compared the expectation value for signal and background without any time evolution step, T = 0. The right panel of Fig. 6 shows the ROC curve computed for 200 different thresholds chosen between the maximum and minimum expectation values. We tested the results on 10,000-event signal and background test samples where, as before, the red and blue curves show the results for the eight- and six-qubit scenarios, respectively. Even at the initial time step, we observe that the AUC values for both cases are above 0.9.
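The threshold scan can be sketched as follows, assuming one scalar score per event (for instance an expectation value) and assuming that signal events tend to have the larger score; if the ordering is reversed, the scores can be negated. The function name `roc_from_scores` and the Gaussian toy scores are hypothetical and are not the values from the study.

```python
import numpy as np

def roc_from_scores(signal, background, n_thresholds=200):
    """ROC points and AUC from a scan over equally spaced thresholds."""
    signal = np.asarray(signal, dtype=float)
    background = np.asarray(background, dtype=float)
    lo = min(signal.min(), background.min())
    hi = max(signal.max(), background.max())
    thresholds = np.linspace(lo, hi, n_thresholds)
    # Efficiency = fraction of events whose score passes the threshold.
    tpr = np.array([(signal >= t).mean() for t in thresholds])
    fpr = np.array([(background >= t).mean() for t in thresholds])
    # Both rates fall as the threshold rises, so reverse them and add the (0, 0) endpoint.
    fpr = np.concatenate(([0.0], fpr[::-1]))
    tpr = np.concatenate(([0.0], tpr[::-1]))
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))  # trapezoidal rule
    return fpr, tpr, auc

# Toy scores (assumption): two unit-width Gaussians separated by two standard deviations.
rng = np.random.default_rng(1)
sig_scores = rng.normal(2.0, 1.0, size=10_000)
bkg_scores = rng.normal(0.0, 1.0, size=10_000)
fpr, tpr, auc = roc_from_scores(sig_scores, bkg_scores)
```

With a finite number of thresholds the curve is a piecewise-linear approximation, so the AUC is slightly sensitive to `n_thresholds` when the score distributions are narrow.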
Utilising the time evolution sequence, we observe up to a 3% difference in AUC between the six- and eight-qubit scenarios, with the eight-qubit scenario reducing the required frequency by 72%. Notice that we are barely able to achieve an AUC of 50% using a four-qubit scenario with the largest frequency that we compute; thus, adding new information significantly affects the ability to differentiate two sequences. Using only the information from the expectation value provides significantly better differentiability: in the six (eight) qubit scenario, we observed a 9% (6%) improvement in AUC values.
As mentioned before, the stability of the results relies on a sufficient number of MC samples for the Kθ estimation. Due to the probabilistic nature of the EBM, each computation of Kθ leads to a slightly different modular Hamiltonian; hence the stability depends on increasing the number of samples, in other words, on reaching a stable Gibbs state. For a lower number of MC samples, we observed a larger standard deviation in each sample and lower differentiability between the two sets of samples, with a significantly lower AUC value.
This exercise shows that data from different sources can be interpreted as distinct quantum states; hence the corresponding Hamiltonian will produce different results when it acts on the different states produced by these data samples. Since the Hamiltonian should be able to capture the entropic probability density of the given data, we investigate the von Neumann entropy between each site at the ground state of the learned Hamiltonian. The reason for using von Neumann entropy is that it captures the information flow between reduced density matrices, and the change in the entropy value indicates statistically viable information for the optimisation process. This measure has also been utilised in ref. [41] to compress the feature space with an MPS ansatz. The von Neumann entropy has been computed by first finding the lowest eigenvector of the six-qubit learned Hamiltonian via direct diagonalisation. We then constructed the reduced density matrix between two sites corresponding to each pixel. Fig. 7 shows the pixel density averaged over the test set for signal (blue) and background (red) as bars measured on the left y-axis. This shows which pixels are statistically more active. The green hashed bars show the relative entropy between two sites, measured on the right y-axis. The x-axis shows the location of each pixel on the circuit, and the green bars are placed in between each pixel location. We observe that the entropy values remain high between the low-density pixels. However, we observe exponentially low entropy values between pixels three and four, where pixel three has the highest density among all background pixels. It is essential to emphasise here that the learned Hamiltonian does not have any access to the input data; it is constructed by generating a Gibbs state through an MC algorithm. Hence the only link between the data and the Hamiltonian is the optimisation algorithm, which enables the Hamiltonian to capture the statistical distribution of the input.
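One possible reading of the entropy computation is sketched below: diagonalise a dense Hamiltonian, take its lowest eigenvector, trace out all but two sites, and evaluate the von Neumann entropy of the resulting two-site reduced density matrix. Whether this matches the exact quantity plotted in Fig. 7 is an assumption, as are the natural logarithm, the qubit ordering, and the function names `ground_state` and `two_site_entropy`.

```python
import numpy as np

def ground_state(H):
    """Lowest eigenvector of a Hermitian matrix via direct diagonalisation."""
    evals, evecs = np.linalg.eigh(H)  # eigenvalues are returned in ascending order
    return evecs[:, 0]

def two_site_entropy(psi, n_qubits, i, j):
    """Von Neumann entropy (natural log) of the reduced state on sites i and j."""
    tensor = np.asarray(psi).reshape([2] * n_qubits)  # one axis per qubit
    rest = [q for q in range(n_qubits) if q not in (i, j)]
    mat = np.transpose(tensor, [i, j] + rest).reshape(4, -1)
    rho = mat @ mat.conj().T  # partial trace over the remaining sites
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]  # drop numerical zeros before taking the log
    return float(-np.sum(evals * np.log(evals)))

# Toy state (assumption): a Bell pair on sites 0 and 1, with site 2 left in |0>.
psi = np.zeros(8)
psi[0b000] = psi[0b110] = 1 / np.sqrt(2)
s01 = two_site_entropy(psi, 3, 0, 1)
s12 = two_site_entropy(psi, 3, 1, 2)
```

In the toy state, sites 0 and 1 together are in a pure state, while the pair of sites 1 and 2 is entangled with site 0 and therefore carries a non-zero entropy.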

II. DISCUSSION & CONCLUSION
Quantum Hamiltonian-Based Models are a family of ansätze that attempt to approximate the probability distribution of the data by representing it as a thermal state of a learned Hamiltonian. In this context, the computationally intensive Hamiltonian learning has been delegated to a classical network, and a variational quantum circuit has been optimised with respect to the expectation value of the learned Hamiltonian. This method is a generalisation of the Variational Quantum Thermaliser technique, where one generates the thermal state of a given Hamiltonian at a target temperature. However, using a specific Hamiltonian for an ML application would be highly constraining, since it is not always possible to know the correlation structure of given data a priori. Hence, the Hamiltonian has been learned during the optimisation process by utilising a classical Energy-Based Model. This enables us to create a unique Hamiltonian for the data, which can then be used to scrutinise the properties of the data further. Thus, this study demonstrates that the methods developed for quantum simulations are flexible and reusable for ML applications, and hence shows the strong link between theoretical approaches and statistical ML techniques. This can lead to a more interpretable and intuitive ansatz by virtue of our knowledge of quantum theory, and this study aims to take a step towards a fully theory-driven ML technique.
In this study, we demonstrate the usage of QHBM for generative learning and anomaly detection on LHC data. We showed that the calorimeter images could be embedded into quantum circuits as a mixed state, and that a variational thermal state of a learned Hamiltonian can represent their probability distribution. As a by-product of the optimisation process, the objective function converges to the entropy of the data, which has been observed to produce unique values for different types of data. Hence, this information can be further used to identify the generated data samples.
It is essential to ask if it is possible to use the learned Hamiltonian to understand the data structure further. We have presented two possible use cases of the learned Hamiltonian for anomaly detection. For the first case, we analysed the expectation value of the time evolution sequence for the learned Hamiltonian. We showed that by converting the sequence to the frequency domain, one could observe significantly different curves for two types of samples by computing the power distribution of the fast Fourier-transformed sequence. Secondly, we showed that even the expectation value of the learned Hamiltonian is significantly different for different data types, which we quantify by analysing the difference at various thresholds.
Our findings signify a fundamental property of the quantum many-body Hamiltonian. Once learned, the given Hamiltonian represents the dynamical properties of a specific quantum state. Since signal and background samples form significantly different state representations, a Hamiltonian designed for one type of sample reacts differently to a different system, because these systems have distinct dynamical properties. Hence we show that it is possible to treat a given data sample as a quantum many-body system and that, by using theory-driven optimisation techniques, one can learn this system's Hamiltonian and use it to understand its properties. Although we only show two possible use cases, generative modelling and anomaly detection, we hope that such approaches can be taken to devise more interpretable ML applications and build dedicated optimisation algorithms that can utilise the system's physical properties.
Although the usage of quantum theory comes with significant advantages, it is essential to admit that this method comes with undeniable computational costs and limitations. The elephant in the room is the ability to execute these quantum circuits within a quantum device. Although we used a relatively small number of qubits, generating the mixed state of each data point within the circuit requires many executions, so we could not reproduce these results within a current quantum device. However, this can be improved by storing the input mixed states within a quantum memory device, which alleviates the need to repeat such a computationally expensive process. As presented in the anomaly detection example, for this particular dataset, the geometrical position of the active pixels is crucial to characterise the dataset. Hence increasing the number of qubits will allow more information, and our experiments indicated that it would allow for the simulation of fewer time steps for discrimination. Increasing the number of qubits also requires the implementation of extensive correlations between features. An MPS ansatz was sufficient since our experiments were implemented with a few features. Still, we observed significant gains when more complex circuit architectures were implemented, which will be increasingly important with the implementation of larger feature spaces and makes the classical computation of the circuit increasingly challenging.
A further obstacle to the method is the completely free modular Hamiltonian, which significantly affects the algorithm's scalability. With an increasing number of qubits, the modular Hamiltonian grows exponentially as 2^Nq × 2^Nq, which makes it quite challenging to scale the algorithm to larger systems. As we discussed before, this can be avoided by imposing certain assumptions on the modular Hamiltonian to limit its shape.
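To make the scaling concrete, the short sketch below tabulates the dimension and number of entries of an unconstrained 2^Nq × 2^Nq matrix for a few qubit counts. The memory estimate assumes a dense complex128 representation at 16 bytes per entry, which is an illustrative assumption rather than a statement about the implementation used in this study; the function name is hypothetical.

```python
def modular_hamiltonian_size(n_qubits):
    """Return (dimension, number of matrix entries) for an unconstrained Hamiltonian."""
    dim = 2 ** n_qubits
    return dim, dim * dim

for nq in (4, 6, 8, 12):
    dim, entries = modular_hamiltonian_size(nq)
    mem_bytes = entries * 16  # assumption: dense complex128 storage
    print(f"Nq={nq:2d}  dim={dim:5d}  entries={entries:10d}  bytes={mem_bytes}")
```

Each additional qubit multiplies the number of entries by four, which is why restricting the form of the modular Hamiltonian becomes necessary for larger systems.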

FIG. 1. Schematic representation of the Quantum Modular Hamiltonian-based learning for data analysis. The two parts of the implementation are represented as two parallel layers stacked on top of each other, where the top layer is responsible for forming a Hamiltonian by generating a Gibbs state through an MC algorithm based on an EBM. The bottom layer is responsible for computing the expectation value of the Hamiltonian for a sampled set of initial states. Finally, the expectation value and the partition function are combined to form the cost function of the network.

FIG. 2. Signal and background images projected on the η′ − ϕ′ frame. From left to right, images represent signal and background represented with a mean of 5000 randomly chosen events and with a single event for each sample. For this representation, ten pixels have been cropped from each axis of the images from the original 37 × 37 pixels. Colour represents the magnitude of energy deposited in each pixel.

FIG. 4. Test metrics presented for various networks, where the left panel shows the network trained with a different number of samples for density matrix estimation and the right panel shows the network trained with a different number of MC samples for Kθ estimation. The top panels of both sides show the KL divergence between input and output samples, followed by KL divergence, trace distance and fidelity between the truth level density matrix and the network's density matrix estimation. The bottom panel shows the network's estimation for von Neumann entropy, where red and blue represent background and signal samples. The results on the left panel are prepared using 200 MC samples for the estimation of the Hamiltonian. Similarly, the right panel uses 5000 samples for σD estimation. Each result has been presented with one standard deviation, estimated by dividing 10,000 test samples into batches of 25 events.

FIG. 5. Time evolution of the modular Hamiltonian for the six-qubit (top panel) and eight-qubit (bottom panel) scenarios. The left panel shows the fidelity distribution, and the right panel shows the power spectrum of the FFT of the distributions. The signal and the background are represented with blue and red colours in each panel.

FIG. 7. Solid bars show the pixel density of each site over the entire test data, where the value is bound to the left y-axis. Hashed bars show the relative entropy between each site, computed from the lowest eigenvector of the learned Hamiltonian, where the value is bound to the right y-axis. The x-axis shows the site or pixel location.