Score-based Diffusion Models for Generating Liquid Argon Time Projection Chamber Images

For the first time, we show high-fidelity generation of LArTPC-like data using a generative neural network. This demonstrates that methods developed for natural images transfer to LArTPC-produced images, which, in contrast to natural images, are globally sparse but locally dense. We present the score-based diffusion method employed and evaluate the fidelity of the generated images using several quality metrics, including modified measures used to evaluate natural images, comparisons between high-dimensional distributions, and comparisons relevant to LArTPC experiments.


I. INTRODUCTION
The Liquid Argon Time Projection Chamber (LArTPC) [1,2] is a particle detector technology utilized in several current and future neutrino experiments [3][4][5][6][7]. Its wide use in experimental neutrino physics derives from its ability to scale to sizes with length dimensions of tens of meters while still resolving the three-dimensional location of charged-particle trajectories to the several-millimeter scale.
The active portion of a LArTPC consists of a time-projection chamber (TPC) wherein ionization electrons created by charged particles are drifted toward charge-sensitive devices via an electric field created across the TPC volume. The charge-sensitive devices include planes of sense wires or readouts capable of recording the 2D locations of ionization [8,9]. For physics analyses using LArTPCs, the production of Monte Carlo (MC) simulation is a critical challenge. One of the primary bottlenecks in creating simulated data is the time required to model the deposition and the resulting readout-electronics waveforms produced by a given pattern of ionization [10]. For example, one data-acquisition event for an approximately meters-cubed LArTPC requires simulating several-millisecond waveforms for 5-10k electronics channels. For such a LArTPC operating on the surface, modeling the signals from the energy depositions of O(10) charged-particle trajectories takes several minutes.
Such MC generation challenges are similar to those faced by particle physics experiments utilizing the Large Hadron Collider (LHC). There are now several examples of deep neural network-based algorithms which can balance fidelity to Geant4-produced data and high throughput [11][12][13][14][15]. In addition to serving as faster surrogates, generative models can potentially be used to calculate likelihoods of low-level data, which opens the door to novel approaches to various kinds of inference problems. As a result, there has been significant activity in developing generative models for LHC experiments, with Generative Adversarial Networks (GANs) [16][17][18][19][20][21][22][23][24][25] and Normalizing Flows [26][27][28][29][30][31][32] having found a wide range of applications. However, as far as we are aware, there has not been as much progress on generative models for LArTPC data. In our experience, many of these approaches struggled to generate images from our LArTPC dataset. Our previous work [33] using another generative modeling approach, Vector Quantized Variational Autoencoders (VQ-VAEs), did show some promise in producing images with trajectories that resembled those in LArTPCs, but, in our judgment, those images still lacked sufficient quality. In this paper, we describe results using a generative diffusion model. We find that it produces images of particle trajectories that are very similar to those from a Geant4-based simulation. As far as we are aware, this is the first demonstration of such quality for LArTPC data.
Currently, diffusion models (DMs) are attracting much attention due to their ability to generate novel images with high fidelity and large variance. These models are being used for a large variety of image types, including natural images [34], paintings [35], and drawings [36]. When asked to generate novel images, they produce samples that closely match the style and subjects of the images in the training set, and the samples vary over the different semantic content included within the training set. In high-energy physics, DMs have primarily been used in collider physics for jet unfolding [37][38][39] and calorimeter data generation [40][41][42].
In our work, we analyze the results of training a score-based diffusion model [43], which is one of several implementation approaches. We find that this model can generate images which reproduce the features, at both large and small distance scales, found within LArTPC-like images containing particle trajectories. Such images are globally sparse but locally dense, making them very different from the natural images for which DMs were developed. Furthermore, the way we measure the quality of the images produced by DMs for LArTPCs will differ from the measures used for natural images. Therefore, we also present methods of quantification we believe are relevant to LArTPC images. As a first proof-of-principle exploration, we only train on and generate images that are 64x64 pixels in size. In the context of current experiments, this would be considered small, as LArTPCs produce images on the scale of 1000x1000 pixels. Our aim is to show that DMs can generate images of high enough quality that DMs may one day serve as surrogate models for LArTPC detector simulations. However, this work does not yet address the desire to improve the speed of generating MC simulation. The challenges and potential directions toward that and other goals will be discussed in later sections.

II. SCORE-BASED GENERATIVE MODELS
The goal of generative models is to learn a probability distribution given samples (data) and to produce, i.e., generate, new samples from this learned distribution. In most cases, especially when using deep networks, the model learned is an implicit one, in that an explicit characterization of the distribution is not sought, only a means to efficiently sample. For a survey of various methods, we refer the reader to [44,45].
In our case, the data are (portions of) images that come from a LArTPC detector.In this context, we note the following points.
• Compared to popular image datasets, the LArTPC images are extremely sparse, and thus the latent dimensionality of the data manifold is potentially much smaller. Thus, the models we construct must be able to produce samples from this low-dimensional manifold where our data resides.
• Furthermore, the frequency at which different semantic content, i.e., different modes of the distribution, appears must mimic that in the reference dataset. For example, in the context of LArTPC images, one would want to sample images that depict a muon scattering hard enough to reverse the direction of its trajectory at a much lower rate than images of gentle, low-angle scattering, because the physical rate of such hard scatters is much lower.
• We might also expect that the generative model can reproduce the distribution of underlying particle momenta used to prepare the reference data set.
Motivated by the recent success of Score-based Generative Models (SGMs) [43], which are part of a wider family of so-called diffusion models [45], in this paper we employ them to learn a generative model for the LArTPC data and evaluate its performance. Beyond their empirical success, several recent works such as [46][47][48] have shown that, given an accurate score estimator and under relatively mild conditions, SGMs can provably learn to generate high-dimensional multi-modal distributions even when the data is supported on a low-dimensional manifold, a scenario that has plagued past efforts in generative modeling, which are usually prone to mode collapse. However, apart from the requirement to learn good score estimators in high dimensions, diffusion models come with additional computational complexity, most notably in generating samples from the learned model [49]. Below we provide a brief overview of SGMs, primarily taken from the seminal paper [43].
Most generative models involve learning a transformation, or a sequence of transformations, i.e., a process, between an easy-to-sample distribution, e.g., the normal distribution, and the data distribution. For diffusion models, one connects the data images to images sampled from the normal distribution through a diffusion process in which, at each time step, an increasing amount of noise is added to the data images. The SGM learns to undo this diffusion process, i.e., to reverse the forward noising process in a principled way. This reverse process can then be used to generate samples from the manifold of images implied by the reference data set.
To be precise, following [43], the forward diffusion process of noising the images is captured by the following Stochastic Differential Equation (SDE), starting at time t = 0 with initial condition ⃗X_0 ∼ p_data and run to time t = T:

d⃗X_t = f(⃗X_t, t) dt + g(t) d⃗W_t. (1)

Here ⃗X_t ∈ R^d is a vector-valued random variable containing the pixel values on the d pixels of the image at time t. f(⃗X_t, t) and g(t) are referred to as the drift and diffusion functions, respectively, of the SDE, and ⃗W_t is a standard Wiener process (aka Brownian motion) independent of ⃗X_t. The key idea behind SGMs is based on results showing that, under mild conditions, the reverse process defined by ⃗X_s = ⃗X_{T−s} is also a diffusion process [50,51]. Figure 1 provides a pictorial illustration of the forward (data-to-noise) and backward (noise-to-data) processes. The reverse SDE is given by the following equation for s ∈ [0, T] (which, following [46], we write in a more transparent forward way):

d⃗X_s = [−f(⃗X_s, T−s) + g(T−s)² ∇_⃗X log p_{T−s}(⃗X_s)] ds + g(T−s) d⃗W_s. (2)

Running the reverse SDE requires the score ∇_⃗X log p_t(⃗X_t) of the marginal distribution at each time, which is estimated with a neural network s_θ(⃗X_t, t) trained on the score-matching objective

θ* = argmin_θ E_t λ(t) E_{⃗X_0} E_{⃗X_t|⃗X_0} ∥ s_θ(⃗X_t, t) − ∇_{⃗X_t} log p(⃗X_t | ⃗X_0) ∥². (3)

The expectation with respect to ⃗X_0 in Eq. 3 can be approximated with the empirical distribution of ⃗X_0, which is precisely supported on the given dataset of images. For the inner expectation, for specific choices of f(⃗x, t) and g(t), one can explicitly write down the closed form of log p(⃗X_t | ⃗X_0). The outermost expectation in Eq. 3 is with respect to a distribution over the time window [0, T], and λ(t) is a positive weight function that can be used to normalize or modulate the relative importance of scores estimated at different times. Further details of the choice of the forward SDE, the parameterization of the score estimator s_θ, the time discretization of the forward and reverse SDEs, and the weighting are discussed in Sec. IV.
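The forward noising process above can be illustrated numerically. The following sketch is our own toy illustration, not the paper's code: it integrates an SDE of the form of Eq. 1 with Euler-Maruyama steps, using the variance-preserving drift and diffusion choice introduced later (f = −β(t)X/2, g = √β(t)) with hypothetical β bounds, and checks that samples are driven toward an uncorrelated standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)

def beta(t, beta_min=0.1, beta_max=20.0, T=1.0):
    # Linear noise schedule; the bound values here are illustrative assumptions.
    return beta_min + (t / T) * (beta_max - beta_min)

def forward_vpsde(x0, n_steps=1000, T=1.0):
    """Euler-Maruyama simulation of dX = -0.5*beta(t)*X dt + sqrt(beta(t)) dW."""
    dt = T / n_steps
    x = x0.copy()
    for i in range(n_steps):
        t = i * dt
        drift = -0.5 * beta(t) * x
        x = x + drift * dt + np.sqrt(beta(t) * dt) * rng.standard_normal(x.shape)
    return x

x0 = np.full((10000,), 5.0)   # toy "image" whose pixels all start at 5
xT = forward_vpsde(x0)        # after time T, approximately N(0, 1) per pixel
```

Whatever the starting values, the drift erases the mean while the injected noise builds the variance up to one, which is exactly the "data-to-noise" direction of Figure 1.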

III. DATA EMPLOYED
We construct 64x64 crops of particle trajectories from images found in the public PILArNet 2D dataset [53]. This is a public dataset of LArTPC-like 2D images and 3D voxelized data. The data is made using Geant4 to simulate the transport of one or more particles within a volume of liquid argon. The location of ionization created by the particles is recorded within the volume. It is important to note that the 2D images are not made using a full detector simulation that models signals on a readout wire plane such as those found in current LArTPC experiments. We are not aware of such a model that is available outside of one of the existing LArTPC collaborations. Instead, images are made by simply projecting the 3D locations of ionization deposited within a volume onto the XY, XZ, and YZ planes. However, a simple model of charge diffusion is applied as the ionization deposits are projected to the planes. This increases the longitudinal and transverse spatial distribution of the deposits. We use the XZ projections in our studies.
The particles transported to make the images are chosen at random from electrons, photons, muons, protons, and charged pions. The starting locations of the particles are chosen uniformly at random in the volume. The momenta of the generated particles are sampled from a uniform distribution whose bounds are particle dependent. Further details on this can be found in Ref. [10]. We do not make any cuts on the data set using particle types or momenta.
From the projected images, 64x64 crops are made. The locations of the crops are chosen to ensure a minimum number of pixels above a certain threshold in order to avoid empty images. The pixels are then whitened: a common bias value is subtracted from all pixels across the data set, and the values are scaled to lie between -1 and 1. Finally, the values are rescaled to between 0 and 255 and converted to PNG files for convenience. We prepared a total of 50,000 images for the training dataset. An additional 10,000 images were reserved as a validation sample. We have made our datasets publicly available on Zenodo for reproducibility and to encourage comparisons with alternate methods.
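The cropping and whitening pipeline described above can be sketched as follows. This is a minimal illustration under assumed values for the activity threshold, minimum pixel count, bias, and scale; none of these numbers are specified exactly in the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def select_crop(image, size=64, min_active=20, threshold=0.0, tries=100):
    """Randomly pick a size x size crop with at least min_active pixels above
    threshold (min_active and threshold are illustrative assumptions)."""
    H, W = image.shape
    for _ in range(tries):
        r, c = rng.integers(0, H - size), rng.integers(0, W - size)
        crop = image[r:r + size, c:c + size]
        if (crop > threshold).sum() >= min_active:
            return crop
    return None  # no sufficiently occupied crop found

def whiten_and_quantize(crop, bias, scale):
    """Subtract a dataset-wide bias, scale to [-1, 1], then map to [0, 255]."""
    x = np.clip((crop - bias) / scale, -1.0, 1.0)
    return np.round((x + 1.0) * 127.5).astype(np.uint8)

img = np.zeros((256, 256))
img[100:164, 40] = 50.0                      # a toy vertical "track"
crop = select_crop(img)
png = whiten_and_quantize(crop, bias=0.0, scale=50.0)
```

The rejection loop mirrors the requirement that crops not be empty; in a real pipeline the bias and scale would be computed once over the whole dataset.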

IV. TRAINING THE MODEL
The choice of forward SDE: Within the SGM framework, one can make different choices for the form of the drift and diffusion functions. In this work, we used what is called a variance preserving SDE (VPSDE) in [43], whose SDE is

d⃗X_t = −(1/2) β(t) ⃗X_t dt + √β(t) d⃗W_t. (4)

The function β(t) is a time-varying function used to control the amount of noise added at time t. We again follow [43] and use a linear function in t that varies between configurable bounds, β_min and β_max:

β(t) = β_min + (t/T)(β_max − β_min). (5)

For our choice of values for β_max and β_min (and other parameters), please refer to Table III. The function β(t), through its appearance in the drift term of Eq. 4, serves to modulate the rate of decay of the mean expected pixel values to zero. Given some total time interval T and sufficiently large β(t) values, this process transforms an initial sample ⃗X_0 ∼ p_data, drawn from the distribution of data images, into ⃗X_T ∼ N(0, σ²I). In other words, the pixels at time T are assumed to have become uncorrelated. Here, σ² is the variance of the normal distribution and is assumed to be common between all pixels in the images.
Training our model requires learning the score ∇_⃗X_t log p(⃗X_t) as a function of time. We use batch stochastic gradient descent to optimize s_θ according to Eq. 3. For each training iteration, we sample a batch of N (unperturbed) training images, ⃗X_0. For each ⃗X_0 in the batch, a random time is sampled from a uniform distribution over the interval [0, T]. Both ⃗X_0 and the time t are used to sample a training example, ⃗X_t, through the following transition kernel:

p(⃗X_t | ⃗X_0) = N(⃗X_t; ⃗μ_t, ⃗σ_t² I), with ⃗μ_t = ⃗X_0 e^{−(1/2)∫₀ᵗ β(s) ds} and ⃗σ_t² = 1 − e^{−∫₀ᵗ β(s) ds}. (6)

Note that the kernel's specific expression is dictated by the choice of a VPSDE as the diffusion process. By choosing a large β_max, the kernel is driven more quickly towards N(0, I).
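Because the linear schedule of Eq. 5 integrates in closed form, perturbed training examples can be drawn in one shot rather than by simulating the SDE. A small sketch (with hypothetical β bounds) that also checks the sampled moments against the kernel's mean and variance:

```python
import numpy as np

rng = np.random.default_rng(2)

def int_beta(t, beta_min=0.1, beta_max=20.0, T=1.0):
    # Closed-form integral of the linear schedule beta(s) from 0 to t.
    return beta_min * t + 0.5 * (beta_max - beta_min) * t**2 / T

def sample_kernel(x0, t):
    """Draw X_t ~ N(x0 * exp(-0.5*I(t)), (1 - exp(-I(t))) I), I(t) = int_beta(t)."""
    mu = x0 * np.exp(-0.5 * int_beta(t))
    var = 1.0 - np.exp(-int_beta(t))
    return mu + np.sqrt(var) * rng.standard_normal(x0.shape), mu, var

x0 = np.full((100000,), 2.0)        # toy pixels all equal to 2
xt, mu, var = sample_kernel(x0, t=0.5)
```

One direct draw per image and time replaces an entire forward SDE trajectory, which is what makes training on randomly sampled times cheap.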
The batch of perturbed images, along with the sampled batch of times, t, are passed into the score-function estimator, s_θ(⃗X_t, t), implemented as a convolutional neural network (CNN). We follow the work of Song et al. [43] for the CNN architecture and use their Noise Conditional Score Network (NCSN). We chose to keep many of the default configurations used in that work for modeling the CIFAR-10 dataset [54]. The full list of parameters can be found in Appendix A, Table III, with our modifications highlighted.
The implementation of the loss function we use to train s_θ(⃗X_t, t) follows directly from Eq. 3 and Eq. 6. Because the transition probability p(⃗X_t | ⃗X_0) is Gaussian, the gradient of its log has a closed form. Therefore, the empirical loss calculated for a training batch of images is

L(θ) = (1/N) Σᵢ ∥ s_θ(⃗X_{t_i}, t_i) + (⃗X_{t_i} − ⃗μ_{t_i}) / ⃗σ_{t_i}² ∥², (7)

where the sum is over the N samples, indexed by i, in the training batch. Each item in the batch uses a different t_i, sampled from a uniform distribution over the interval t ∈ (0, T]. The values of ⃗μ_t and ⃗σ_t² are the mean and variance of the normal distribution defined in Eq. 6. In the training configuration we used, λ(t) from Eq. 3 is chosen to be simply 1.
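A toy version of this training loss can be written down directly. The sketch below is our illustration, not the paper's implementation: it uses λ(t) = 1 and a simplified constant β(t) = 1 schedule (rather than Eq. 5), builds the closed-form target score from the Gaussian kernel, and confirms that the true marginal score achieves a lower loss than a trivial zero score.

```python
import numpy as np

def dsm_loss(score_fn, x0_batch, T=1.0, eps=1e-3, seed=0):
    """Empirical denoising score-matching loss with lambda(t)=1 for a toy VPSDE
    with beta(t)=1, so mu_t = x0*exp(-t/2) and sigma_t^2 = 1 - exp(-t)."""
    rng = np.random.default_rng(seed)   # fixed seed so losses are comparable
    total = 0.0
    for x0 in x0_batch:
        t = rng.uniform(eps, T)
        mu, var = x0 * np.exp(-0.5 * t), 1.0 - np.exp(-t)
        xt = mu + np.sqrt(var) * rng.standard_normal(x0.shape)
        target = -(xt - mu) / var       # closed-form score of the Gaussian kernel
        total += np.sum((score_fn(xt, t) - target) ** 2)
    return total / len(x0_batch)

# If the data are N(0, I), the marginal at every t is also N(0, I), so the true
# (unconditional) score is s(x) = -x; it should beat the trivial zero score.
x0 = np.random.default_rng(7).standard_normal((2000, 64))
loss_true = dsm_loss(lambda x, t: -x, x0)
loss_zero = dsm_loss(lambda x, t: np.zeros_like(x), x0)
```

Note that even the exact score does not drive this loss to zero; the loss is minimized, up to a constant offset, at the true marginal score, which is the property the comparison above checks.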
We train the model for a designated number of steps, periodically saving checkpoints for later use in generating images. We will refer to training epochs from here onwards, where one epoch is defined as the number of training iterations needed to process the entire training dataset. Since our batch size, N, is 128 images and our training dataset is 50,000 images, one epoch is equivalent to 391 training iterations. We trained the model for a maximum of 300 epochs.
The losses for the training and validation sets versus epoch for the final training run are shown in Figure 2. The similar loss between the training and validation sets as a function of epoch suggests that there is no significant over-fitting of s_θ. As will be discussed in Sec. VII, we observe through several metrics for evaluating the quality of generated images that the model does not improve appreciably past epoch 50. Training a model for 50 epochs took approximately 6 hours using two NVIDIA RTX 2080 Ti GPUs.

V. GENERATING IMAGES
After training the model, we were able to produce generated images from any of the pre-designated checkpoints. While this method can generate any number of images, we produced 50,000 to match the number of training images for easier comparison. Generating 50,000 images from any epoch took approximately 40 hours using two NVIDIA RTX 2080 Ti GPUs. This amounts to about 5.8 seconds per 64x64 image per GPU.
The generation process generally requires implementing a numerical approximation of the reverse diffusion SDE given in Eq. 2. We employed a version of the Euler-Maruyama method from the code repository of Song et al. [43]. Additional approaches were available within the repository, including alternate numerical solvers and the use of "corrector" functions, which aim to modify the sampled images at each iteration to be more like the training data. Exploration of more efficient and accurate SDE methods is still an active area of research for generative models. We leave studies optimizing LArTPC image generation for future work.
Generating a sample image under our choice of VPSDE and numerical SDE predictor involves iterating over M constant time steps of size ∆t = (T − ϵ)/(M − 1) from time T down to ϵ = 1.0×10⁻³. A non-zero ϵ is used to avoid numerical instabilities. During each iteration i, going in reverse order from i = M to i = 1, the sample image ⃗X_{t_i} and the predicted mean pixel values ⃗μ_{t_i} at t_i = T + (i − M)∆t are updated using

⃗μ_{t_{i−1}} = ⃗X_{t_i} + [ (1/2) β(t_i) ⃗X_{t_i} + β(t_i) s_θ(⃗X_{t_i}, t_i) ] ∆t,
⃗X_{t_{i−1}} = ⃗μ_{t_{i−1}} + √(β(t_i) ∆t) ⃗z, (8)

where ⃗z ∼ N(0, I), and β(t) is again defined by Eq. 5.
The pixel values of the prior (noise) image, ⃗X_{t_M} = ⃗X_T, are sampled from N(0, I).
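Putting the pieces together, the reverse-time Euler-Maruyama loop can be sketched as below. This is our own minimal illustration, not the repository code. As a sanity check, it uses a toy problem whose exact score is known: if the data are distributed as N(0, I), the marginal score at every time is s(x) = −x, and the sampler should reproduce that distribution starting from pure noise.

```python
import numpy as np

def reverse_sampler(score_fn, shape, M=1000, T=1.0, eps=1e-3,
                    beta_min=0.1, beta_max=20.0, seed=4):
    """Euler-Maruyama integration of the reverse-time VPSDE from t=T down to t=eps.
    The beta bounds here are illustrative assumptions, not the paper's values."""
    rng = np.random.default_rng(seed)
    beta = lambda t: beta_min + t * (beta_max - beta_min) / T
    dt = (T - eps) / (M - 1)
    x = rng.standard_normal(shape)            # prior sample X_T ~ N(0, I)
    for i in range(M - 1, 0, -1):
        t = eps + i * dt
        drift = 0.5 * beta(t) * x + beta(t) * score_fn(x, t)
        x = x + drift * dt + np.sqrt(beta(t) * dt) * rng.standard_normal(shape)
    return x

# With the exact score -x, the reverse SDE should land back on N(0, I).
samples = reverse_sampler(lambda x, t: -x, shape=(20000,))
```

Replacing the analytic score with a trained s_θ(⃗X_t, t) turns this same loop into the image generator described above.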
We perform a minimal amount of post-processing on the generated images. By default, the images are normalized to have pixel values ranging from 0 to 255. We apply a threshold to each image pixel such that all pixel values below 5 are set to 0. This ensures the constant background necessary for the pixel-level sensitivity of our quality metrics, namely the SSNet labels. A similar threshold is applied to the wire-plane waveforms from the LArTPC used in the MicroBooNE experiment [10]. We believe the need for this threshold is an artifact of the diffusion process struggling to achieve the large homogeneous regions of zero values needed for these sparse LArTPC images. The same post-processing is also applied to the training and validation datasets.
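The thresholding step is simple to state precisely; a short sketch using the cut value of 5 from the text (the function name is ours):

```python
import numpy as np

def apply_threshold(img, cut=5):
    """Zero out sub-threshold pixels to restore a constant background
    (pixel values are in the 0-255 range)."""
    out = img.copy()
    out[out < cut] = 0
    return out

img = np.array([[0, 3, 4], [5, 120, 255]], dtype=np.uint8)
clean = apply_threshold(img)
```

Applying the identical cut to generated, training, and validation images keeps the comparison between datasets fair.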
Per our loss curve in Figure 2, we can see that the network rapidly improves and then plateaus. We have selected epochs 50 and 150 at which to test the quality of generated images. After 50 epochs of training, our generated samples are of high fidelity according to the metrics in the results section. Extending training to 150 epochs and beyond produces a minimal improvement in the loss and is not worth the extra computational effort. As such, we make the case that training for 50 epochs is sufficient. Our various quality metrics will further support this choice of training time.

VI. EVALUATING QUALITY
We find qualitatively that the images generated by the network are very similar to those in the training set when compared by simple visual inspection. In order to quantify this a little further, we conducted a small-sample "Turing test" where participants were asked to differentiate true LArTPC (training) images from our generated images by eye. We had 11 participants of varying expertise look at 100 randomly selected images generated from epoch 50. Only a single participant achieved a statistically significant 64% accuracy; everyone else was within one standard deviation of random chance. All the accuracies and corresponding p-values can be found in Figure 4. This small-sample experiment suggests that the generated images are at least superficially very similar to the LArTPC images.
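The significance quoted above follows from the binomial distribution for guessing at chance. A small stdlib-only sketch of the exact two-sided p-value (the participant counts of 64 and 100 are from the text; the helper name is ours):

```python
from math import comb

def binom_p_two_sided(k, n=100, p=0.5):
    """Exact two-sided binomial p-value: total probability of outcomes whose
    likelihood under chance is no greater than that of the observed count k."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return sum(q for q in pmf if q <= pmf[k] * (1 + 1e-12))

p_best = binom_p_two_sided(64)   # the one participant at 64/100 correct
p_typ = binom_p_two_sided(52)    # a participant near chance
```

At n = 100 and p = 0.5, one standard deviation is √(0.25/100) = 5%, matching the ±5% band quoted in the figure caption.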
While such a qualitative study helps us confirm our belief that the model produces high-fidelity images, quantitative measures are desired. How to measure the quality of images is still an area of open research even for natural images. In addition to how well individual images match, we also want to know whether the variation in the types of images is similar. To measure image quality, we have conducted several studies that aim to measure how alike the images are at both large and small scales.
FIG. 4: Participant accuracies in the Turing test on track (67) and shower (33) images. This corresponds to a binomial distribution where 1σ = ±5% from random chance (50%). P-values for the total accuracy are boxed above each entry. Only one participant was statistically significant, suggesting our generated images are a close match but still have room for improvement. Note that some participants did not complete the full test; their accuracies have been extrapolated, but their p-values (marked with an asterisk) are less significant.

For small scales, we are interested in whether the generated images exhibit the same distribution of patterns of pixel values in a small patch. To measure this, we propose
to compare the output of a semantic segmentation network, referred to as SSNet [55,56], that has been used by LArTPC experiments as part of their reconstruction and analysis chain [57]. The pattern of ionization left behind by charged particles can be classified broadly into two classes: track and shower. "Tracks" are nearly straight lines resulting from more massive charged particles, such as protons, pions, and muons. Electrons and photons typically instigate an electromagnetic cascade, resulting in a pattern of deposited charge referred to as a "shower".
The semantic segmentation network we employ is trained to label individual pixels in an image as either having been created by a particle whose trajectory is track-like or shower-like. In order to judge the fidelity of generated images, we compare the distribution of shower and track scores, along with the frequency of track-like and shower-like pixels, in the generated and real data sets.
For larger scales, we use measures that come from the study of high-dimensional distributions, along with physics-inspired quantities reconstructed from the images. For high-dimensional probability distributions, we use the Wasserstein-1 distance, the Sinkhorn Divergence, and Maximum Mean Discrepancy (MMD). These measures attempt to quantify how different the images are at the distribution level. We were also able to approximate an FID distance metric (details in Sec. VII C). For the physics-inspired metrics, we measure quantities that reflect the type of information experiments will want to extract about the particles that ultimately make the images. This includes measures of the length and width of track-like trajectories and of the energy of shower trajectories. We use SSNet and simple clustering techniques to identify track-like and shower-like trajectories. The entire image is then classified as either track or shower based on the majority of pixel labels. We employ a simple principal-component analysis of the spatial pixel distribution of tracks to measure the length and width. Ultimately, the most relevant measure for asking whether the data produced by a generative network is sufficient is its impact on physics analyses. The data we worked with does not lend itself to such analyses, so we leave this question to future work.

VII. RESULTS
The method for determining the quality of LArTPC-like images is still an open question. Here, we report on four proposed approaches to quantifying generated LArTPC image quality: SSNet pixel-labeling, high-dimensional goodness of fit, a modified FID score, and physics-based analyses.

FIG. 5: Background pixels, where the pixel value falls below some threshold, are shown in purple. The output of SSNet on background pixels is not used in our analysis.

A. SSNet Results
Our first external metric for gauging the quality of generated images uses our pre-trained semantic segmentation network (SSNet). For every image, SSNet labels each pixel as either background, track, or shower, as shown in Figure 5. We then compare the distribution of these pixel labels and their associated certainty against our validation dataset. Per our loss curve (and the high-dimensional comparisons), we have selected two key epochs, 50 and 150, to analyze in Figure 7. The generated images produced using weights from epochs 50 and 150 have nearly identical distributions to each other and to the validation dataset. This is in agreement with our prediction that there is little, if any, improvement from training beyond 50 epochs.

B. Model Validation Via Comparing High-dimensional Distributions

Seeking to refine our model selection, we propose to compare the pixel-space distributions, i.e., the distributions in the 64²-dimensional space, of the training, validation, and generated data via three high-dimensional goodness-of-fit (GoF) measures: Maximum Mean Discrepancy (MMD) [58], Sinkhorn Divergence (SD) [59,60], and the Wasserstein-1 distance [61]. Wasserstein distances, also referred to as optimal transport or earth mover's distances, are based on the minimal cost of coupling the two given distributions via a joint distribution with those distributions as fixed marginals. In some cases, this coupling corresponds to a map, a special type of coupling, which in turn can be interpreted as a transport of mass reshaping one distribution into another. MMD distances belong to a family of distances called integral probability metrics, which are defined as the maximum absolute difference, i.e., discrepancy, between the expected value of a class of functions under the two distributions. Notably, the Wasserstein-1 distance is also an integral probability metric. The Sinkhorn divergence, while not strictly a distance, interpolates between the Wasserstein and MMD distances, is numerically faster to compute, and has faster rates of statistical estimation [60,62].
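As one concrete example of these GoF measures, an unbiased estimator of the squared MMD with a Gaussian kernel can be written in a few lines. This is a generic textbook estimator, not the paper's implementation, and the kernel bandwidth here is an arbitrary illustrative choice:

```python
import numpy as np

def mmd2(X, Y, sigma=1.0):
    """Unbiased squared MMD with Gaussian kernel k(a,b)=exp(-||a-b||^2/(2 sigma^2))."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma**2))
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    n, m = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0)       # unbiased estimator drops self-similarity terms
    np.fill_diagonal(Kyy, 0.0)
    return Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1)) - 2.0 * Kxy.mean()

rng = np.random.default_rng(5)
same = mmd2(rng.standard_normal((300, 8)), rng.standard_normal((300, 8)))
diff = mmd2(rng.standard_normal((300, 8)), rng.standard_normal((300, 8)) + 1.0)
```

For identically distributed samples the estimate hovers around zero, while a mean shift produces a clearly positive value, which is the behavior that makes MMD useful as a two-sample discrepancy.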

C. Via Fréchet Inception Distance and SSNet
A common metric for image generation is the Fréchet inception distance (FID) [63]. This is calculated by extracting the deepest-layer activations of a classifier model for each dataset, fitting the activations with a multivariate Gaussian, and then computing the 2-Wasserstein distance between the Gaussians for the training and generated datasets. Typically, the activations come from the deepest layer (pool3) of Google's Inception v3 [64]. However, we do not have a well-defined classifier for LArTPC-type images. Instead, we use SSNet, take the activations of one of its deepest convolution layers (double resnet [2]) as an 8,192-dimensional vector, and conduct the prescribed calculations. The architecture of SSNet features skip connections, and alternate layer choices quickly become computationally unwieldy. Despite this, we believe our SSNet-FID is analogous to the traditional FID. This SSNet-FID metric is shown in Figure 8.
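The Fréchet calculation itself is the standard closed form for the 2-Wasserstein distance between two Gaussians, d² = ∥μ₁ − μ₂∥² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^{1/2}), applied to whatever activations are supplied. A generic sketch, not tied to SSNet, using SciPy's matrix square root:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(act1, act2):
    """2-Wasserstein distance between Gaussians fitted to two activation sets."""
    mu1, mu2 = act1.mean(axis=0), act2.mean(axis=0)
    s1 = np.cov(act1, rowvar=False)
    s2 = np.cov(act2, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):     # discard tiny imaginary numerical noise
        covmean = covmean.real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2.0 * covmean))

rng = np.random.default_rng(6)
a = rng.standard_normal((2000, 4))                           # toy "activations"
b = rng.standard_normal((2000, 4)) + np.array([2.0, 0, 0, 0])
fd_same = frechet_distance(a, a.copy())   # ~0 for identical sets
fd_shift = frechet_distance(a, b)         # ~||shift||^2 = 4 for a pure mean shift
```

The same function applied to 8,192-dimensional SSNet activations yields the SSNet-FID described above; only the feature extractor changes.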

D. Physics-based Comparisons
An important characteristic of LArTPC images is the fact that they encode underlying physical processes. As such, we have devised a few simple tests to verify that the generated images follow the expected, realistic behavior of particle interactions. Perhaps the most relevant feature is the energy the particle deposits within the detector. We begin by separating the generated images into track or shower categories according to the dominant SSNet pixel labeling. The different physical processes of tracks and showers require separate methods of analysis.
For track-like events, energy is deposited as the particle moves through the detector medium (liquid argon), so we can simply find the length of the track to get an approximate measure of the energy. For this calculation, we used DBSCAN [65] (eps=3) to remove background noise and keep only the largest cluster of pixels. We then use a convex-hull algorithm to determine the distance between the two farthest points on the track. The majority of track images contain a single particle track. However, many events have more than one track, arising from interaction vertices producing multiple track-producing particles or from occasionally overlapping events; an example is shown in Figure 11. Note that the length assigned to these multiple-track events is the longest straight-line distance between two points, not a trace of the entire event. We apply this calculation across the images generated at our key epochs and the LArTPC validation dataset to provide a consistent metric for comparison. We then bin these values into a histogram in Figure 9 displaying the fraction of images for each length. It is difficult to determine whether epoch 50 or 150 more closely matches the validation distribution. To aid our differentiation, we have calculated a total chi-squared comparing each generated histogram to the validation histogram, recorded in Table I. We find that epoch 50 is a closer match. The histograms are not a perfect match, but we believe them to be close enough to support the viability of this method of generation.
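A minimal version of this track-length measurement might look as follows. The eps=3 value is from the text; the activity threshold, the min_samples setting, and the brute-force farthest-pair search (used here in place of a convex hull, which is equivalent for these small crops) are our assumptions:

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import DBSCAN

def track_length(img, threshold=5, eps=3):
    """Length of the largest DBSCAN cluster of active pixels, measured as the
    maximum pairwise distance between cluster pixels."""
    pts = np.argwhere(img >= threshold).astype(float)
    if len(pts) == 0:
        return 0.0
    labels = DBSCAN(eps=eps, min_samples=1).fit(pts).labels_
    main = pts[labels == np.bincount(labels[labels >= 0]).argmax()]
    if len(main) < 2:
        return 0.0
    return max(np.linalg.norm(p - q) for p, q in combinations(main, 2))

img = np.zeros((64, 64))
img[10, 10:50] = 200    # a horizontal 40-pixel toy track (length 39 between ends)
img[60, 60] = 200       # an isolated noise pixel, removed by the clustering step
```

The isolated pixel forms its own tiny cluster and is discarded, so only the dominant track contributes to the length, mirroring the noise-removal role DBSCAN plays above.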
We also measured the width of each track by finding the maximum extent along the secondary axis of a principal component analysis applied to the largest cluster of track pixels, using the same DBSCAN clustering as in the track-length analysis. We are interested in the track width because it correlates with several physical processes within the detector. First, in the context of our images, where track trajectories are made by muons, pions, and protons, the track width is sensitive to the frequency of secondary interactions that can occur when pions and protons scatter off nuclei in the detector. In theory, it is also sensitive to the shape and frequency of delta-ray production from muons. The delta rays manifest as shower features in the images and, in principle, should not be included in the track cluster pixels. However, in practice, the delta-ray showers are often small enough to be tagged by SSNet as track pixels and are then seen as track-pixel bumps along the track. Second, the width of tracks as seen in LArTPCs reflects detector physics. This includes diffusion of the ionization cloud as it drifts to the charge-sensitive electronics. It is also sensitive to the details of how the ionization induces a signal on the charge-sensitive devices, such as wires or pixel-based collection electrodes.
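The width measurement reduces to the extent of the pixel cloud along the second principal axis, which can be sketched with a plain SVD. This is our own toy illustration on a synthetic, three-pixel-thick diagonal track:

```python
import numpy as np

def track_width(points):
    """Extent of a pixel cluster along its second principal axis."""
    centered = points - points.mean(axis=0)
    # Right singular vectors of the centered cloud are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[1]          # projection onto the secondary axis
    return proj.max() - proj.min()

# A diagonal toy track 3 pixels thick: the perpendicular extent of offsets
# {0, 1, 2} along x is about 2/sqrt(2) ~ 1.41, up to a small axis tilt.
pts = np.array([(i + o, i) for i in range(30) for o in (0, 1, 2)], dtype=float)
```

Because the principal axis aligns with the track direction, the secondary-axis extent isolates the transverse spread that the detector-physics effects above would broaden.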
As before, we have plotted our track-width histogram in Figure 10. We see that the validation distribution is dominated by thin widths (53% of tracks are ≤ 5 pixels) and then decays exponentially with a long tail. These thin widths come from straight-line, single-track images. Wider tracks are caused by track curvature (from processes like multiple-Coulomb scattering) and by tracks with secondary interactions, all of which can vary slightly if background noise causes bumps or wiggles in the track. Figure 11 shows an example track image from the validation dataset that has multiple tracks and some minor background interference. Looking back at our track-width histogram, we can compare our key epochs to the validation distribution.
The width distributions for epochs 50 and 150 are a close fit to the validation dataset. It is interesting that the large-width tail of the distribution matches fairly well, suggesting that secondary interactions are reproduced fairly well. However, there are still noticeable differences in the first few bins. Disentangling the cause of this difference is reserved for future work, where one would want an image-generation apparatus capable of better manipulating the activation of physics processes such as delta-ray production, ionization diffusion, or the physics of the readout. Using the chi-squared values in Table I, we find that for the track-width distribution, epoch 150 is a significantly closer fit to our validation dataset. This suggests that longer training might benefit the learning of the more subtle physics that influences this image-quality measure.

The energy of shower-like events is proportional to the amount of charge they deposit through scattering processes. Since our shower images are dominated by these charged-particle scattering interactions, we find the energy by summing all the pixel intensities in the image. We have plotted a histogram showing the amount of charge deposited per image in Figure 12.
Epochs 50 and 150 both more closely resemble the validation dataset, but are not perfect. Visually, it is difficult to determine which charge distribution is more accurate, so we use the mean as a comparison shorthand.

FIG. 12: The total charge deposited in shower event images from epochs 50, 150, and the validation dataset.
Another metric is the makeup of our generated dataset, broken down into track and shower images. Both our training and validation datasets contain 69% track and 31% shower images. The breakdown of our generated epochs can be inferred from the number of events displayed in the legends of our track length and shower charge plots; for convenience, we have calculated the percentages in Table II. Epochs 50 and 150 are within a few percent of the expected rate, but both slightly overproduce shower events. Images generated from epoch 50 are closer to the true ratio of track and shower images than our other key epochs.

TABLE II: Breakdown of track and shower rates. Images are classified as "track-like" or "shower-like" based on the majority pixel label according to SSNet. The convergence of the track-versus-shower fraction of the generated image datasets at different epochs toward the validation set is a measure of the model's fidelity in reproducing the original dataset.

E. Checking for Mode Collapse
A major concern for any method of image generation is mode collapse: the created images must replicate the characteristics of the training data while still being distinct and unique. We want our generated images to sample from the entire space of valid LArTPC images according to the probability distribution p(X⃗_0) of our training data. For example, in our context, we would want the generator to produce tracks smoothly over a range of lengths and angles, not just lengths or angles in the close neighborhood of examples in the training set. A useful heuristic is to find the nearest neighbors of a generated image, i.e., the images from the training data that are most similar to it. For a single generated image, we calculate its Euclidean distance (L2 norm) from every image in the LArTPC training dataset. The five images with the smallest distances are defined to be the nearest neighbors. This distance provides a quantitative measure of uniqueness, while qualitative inspection tells us whether the generated image is unlike the training images. We repeat this procedure to compare a single generated image against the set of all generated images to verify that we are reasonably sampling from the space of images. The results of this process can be seen in Figure 13. Several additional examples of nearest neighbors can be found in Appendix C, Figure 14.
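The nearest-neighbor check can be sketched as follows, assuming the images are stored as a NumPy array; the toy data and the near-duplicate query are illustrative only. When comparing a generated image against the generated set itself, the image's own zero-distance entry would be excluded.

```python
import numpy as np

def nearest_neighbors(query, dataset, k=5):
    """Indices and distances of the k dataset images closest to `query`
    in Euclidean (L2) distance over the flattened pixels."""
    flat = dataset.reshape(len(dataset), -1)
    d = np.linalg.norm(flat - query.ravel(), axis=1)
    idx = np.argsort(d)[:k]
    return idx, d[idx]

# toy dataset of 100 random 64x64 "images"
rng = np.random.default_rng(1)
data = rng.random((100, 64, 64))
q = data[7] + 0.01 * rng.random((64, 64))   # near-duplicate of image 7
idx, dist = nearest_neighbors(q, data, k=5)
print(idx[0], round(dist[0], 3))
```

A minimum distance well above zero (here, for all entries other than the planted near-duplicate) is the signal that a generated image is not a memorized copy.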
The use of Euclidean distance is standard for typical image datasets such as CIFAR-10 and CelebA. However, our dataset of particle trajectories is unlike these standard datasets due to its extreme sparsity and its underlying semantic content. Recent research suggests that the Earth (or Energy) Mover's Distance (EMD) is a more meaningful metric for collider-based events [66]. As such, we have also found the nearest neighbors using EMD instead of Euclidean distance. Examples can be found in Appendix C, Figure 15.
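The exact EMD of Ref. [66] requires an optimal-transport solver. As a hedged illustration only, the sketch below uses a sliced-Wasserstein approximation: it treats each image as a 2D intensity distribution over pixel coordinates and averages SciPy's 1D Wasserstein distance over random projection directions. This is a proxy for, not a reimplementation of, the metric used in our study.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def sliced_emd(img_a, img_b, n_proj=32, seed=0):
    """Sliced-Wasserstein approximation to the EMD between two images,
    viewed as intensity-weighted distributions of pixel coordinates."""
    rng = np.random.default_rng(seed)
    ya, xa = np.nonzero(img_a)
    yb, xb = np.nonzero(img_b)
    wa = img_a[ya, xa].astype(float)
    wb = img_b[yb, xb].astype(float)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.uniform(0.0, np.pi)       # random projection direction
        ca, sa = np.cos(theta), np.sin(theta)
        pa = ca * xa + sa * ya                # 1D projected coordinates
        pb = ca * xb + sa * yb
        total += wasserstein_distance(pa, pb, u_weights=wa, v_weights=wb)
    return total / n_proj

# two single-pixel images whose deposits sit 10 pixels apart in x
a = np.zeros((64, 64)); a[10, 10] = 1.0
b = np.zeros((64, 64)); b[10, 20] = 1.0
print(round(sliced_emd(a, b), 2))
```

Unlike the L2 norm, this family of distances accounts for how far charge must be "moved" between images, which is why it is better matched to sparse trajectory data.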
In both measures of nearest neighbors, we find minimum distances significantly greater than zero, implying that the generated images are unique. Visually, we can tell that the nearest neighbors are different images. This is particularly evident in shower-like events, where the overall orientation and density of the shower are similar but the patterns within the shower envelope vary noticeably. For tracks, the trajectories are very similar, but we do see variation in the frequency and placement of high-energy deposits and delta-ray features. Interestingly, the measured distances between the training images and the generated images are similar. The distances between showers are also larger than those between tracks, as one might expect given the higher inherent variation in the shower process. One avenue of future research is to understand this metric in more detail, both to better quantify potential mode collapse and to better understand particle trajectories in LArTPCs. In summary, we believe this qualitative analysis supports the conclusion that the generated images are not mere reproductions of the training images. Combining this evidence with the reproduction of the various distributions above, we conclude that the model does not exhibit significant mode collapse, but instead generates examples with variations similar to those of the training sample.

F. Comparison to Another Model
As far as we know, the only past attempt at generative models for LArTPCs was our previous work using a Vector-Quantized Variational Autoencoder (VQ-VAE). At the time, the generated images from this approach began to resemble the patterns seen in LArTPCs, a first compared to past unpublished attempts with GANs and optimal transport (OT) approaches. However, despite the progress, the generated images were visibly different from the training dataset. The study using VQ-VAEs was presented in [33] and introduced the use of SSNet outputs to quantify the similarity of generated images to the training sample. The SSNet measures aligned with observations of a model approaching LArTPC features but not yet closely reproducing them. To facilitate a comparison with the performance of our diffusion model, we reproduce the SSNet measures and evaluate the quality of the VQ-VAE images using the approaches presented above. The full results of this analysis can be found in Appendix E.

G. Associated Code and Dataset Releases
To support efforts to reproduce and compare against our work, we will prepare a repository containing our code, the weights of our generative model, and copies of our training, validation, and generated datasets. We provide the weights for the final training run at 50 epochs. These materials will be stored on Zenodo. Scripts and code for training, generating, and calculating our evaluation metrics will be provided on GitHub.

VIII. DISCUSSION
How best to determine the quality of LArTPC-like images remains an open question, just as it does for natural images. We have proposed several approaches that take different views of how to compare the images: direct comparison of the images using statistical techniques developed for comparing high-dimensional probability distributions, feature analysis through a domain-relevant CNN (SSNet), and comparison of physical quantities extracted from the images. We find that the VPSDE diffusion model implemented here produces generated images very similar to the validation set. These measures also largely improve over training time.
One limitation of our quality measures is that they still do not address what we ultimately want to know: would an analysis that attempted to make a physical statement be significantly biased by the use of data created by a generative diffusion model? Our ability to answer this question is limited by our choice of dataset, which used cropped images that made no attempt to capture a full particle trajectory. This reflects the goal of this work, which was to explore whether diffusion models could mimic the collection of features found in LArTPC images. This study provides encouraging evidence for this narrower question, opening the path toward developing generative models that can potentially assist physics analyses in various ways, and thus making the larger question relevant.
Future directions for our work aim at addressing the impact on physics analyses, but will also need to explore the large space of implementation choices for diffusion models. One direction would be to implement diffusion models for various particle types conditioned on momentum. This would make it possible to generate a collection of images that could be translated and combined to form images representing the final-state particles from neutrino interactions. Particle-level or interaction-level images could then be passed into existing reconstruction and selection algorithms to quantify the amount of bias introduced into those analyses. An additional technical challenge, somewhat unique to LArTPCs, is the need to correlate the generation of several objects. The most common form of LArTPC in operation uses three sets of wire planes in its readout system; to fully simulate this, three tomographic projections of a 3D particle trajectory must be generated. LArTPCs also capture not only the ionization left behind by particles but also scintillation light recorded in optical sensors, making the problem multi-modal.
Another key direction for development is improving generation time, which will be critical if these models are to assist in MC production. Currently, a full event image from the MicroBooNE experiment used in physics analyses has a size of 1008x3456. If we assume linear scaling in complexity from the 64x64 images in this work, a naive implementation of our current approach would be computationally infeasible, requiring over 68 thousand hours on a single RTX 2080 GPU to generate 50,000 large images, or 81 minutes per image per GPU. This time is likely an upper bound, however, as one can imagine more sophisticated procedures such as generating smaller sub-images per particle trajectory and combining them to form a whole event image. Furthermore, we plan to explore methods used for natural images, such as latent space diffusion [34], which has increased the efficiency of generating higher-resolution images. There are also possible speed improvements from a more optimal implementation of the reverse SDE process. For example, one can reduce overall generation time by taking larger time steps while integrating the reverse SDE; in this strategy, auxiliary networks can help correct errors in the state that might reduce the quality of the images [43]. Other work centers on the choice of drift and diffusion functions, f and g [67]. Another area of active research concerns the theoretical connections of diffusion models to optimal transport [68] and Schrödinger bridge formalisms [69], which may one day guide the choice of drift and diffusion functions for specific goals of the model. Interestingly, recent work has explored extending generative models past SDEs describing diffusion to SDEs implementing other physical processes [70].
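The quoted cost estimate can be checked with back-of-envelope arithmetic, assuming the linear pixel-count scaling stated above; the 81-minutes-per-image figure is taken from the text, and it implies roughly 67,500 GPU-hours for 50,000 full-size images, consistent with the quoted order of 68 thousand hours.

```python
# Back-of-envelope check of the generation cost, under the
# linear-in-pixel-count scaling assumption stated in the text.
small_px = 64 * 64                  # pixels in one image from this work
large_px = 1008 * 3456              # pixels in one MicroBooNE event image
scale = large_px / small_px         # relative cost per image (~850x)

minutes_per_large_image = 81        # quoted per-image, per-GPU time
n_images = 50_000
total_hours = n_images * minutes_per_large_image / 60

print(round(scale, 1), round(total_hours))  # → 850.5 67500
```

This is why sub-image generation or latent-space diffusion is attractive: reducing the effective pixel count per generated object reduces the cost roughly proportionally under this scaling assumption.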
In future work, we plan to expand this score-based diffusion model to larger 512x512 images and to condition the generated image on particle type and momentum. This will entail a significant computational challenge; however, we believe the success of large-scale natural image generation will be reproducible for our LArTPC data. These additions might aid existing physics analyses by providing an on-the-fly event generator. In the near term, there are potential applications for these models in tasks like generating additional background events or boosting the performance of current reconstruction algorithms by better conditioning low-level data, for example by removing noise or filling detector gaps. We will continue to explore the applicability of score-based DMs in the space of neutrino detector experiments.

IX. CONCLUSIONS
In this work, we have shown that a particular implementation of a score-matching diffusion model is able to learn and generate novel images containing features that would be observed in data from LArTPCs. We have proposed metrics for future work to use in model selection and to quantify how well generated images match the original dataset. According to these metrics, this is the first example for LArTPC experiments in which images produced by a generative model are shown to be very similar to the training set. The slight but significant deviation in the generated images is seen in the distribution of narrow track widths, which indicates that further work is needed to fully capture the physical processes that influence that aspect of the trajectories. The above focuses on the quality of the generated images and leaves developments in generation efficiency to future work. However, we believe our results are an important proof-of-principle demonstration of quality that establishes the groundwork for further pursuing score-matching diffusion models in LArTPC experiments.

FIG. 1: Visualization of the diffusion processes used to train and sample a generative model for LArTPC-like images. The generative process takes noise images (middle columns of the figure) to images that contain particle trajectories similar to what might be seen in a LArTPC. This generative process can be seen as the reverse of the forward process defined by the stochastic differential equation (SDE) on the left, which transforms LArTPC images into noise images. Implementing the reverse-process SDE (on the right) from noise to data requires estimating the score function, a quantity related to the underlying probability distribution of the LArTPC images. The score function is learned through training with data. For simplicity, we have defined a backward-time variable s = T − t.

FIG. 2: Loss per epoch compared against the training set and validation set. The loss function measures the mean-squared error between the predicted and true score functions for a training batch of images. The lighter curve in the background shows the loss for one training batch sampled every 100 iterations. The darker curves in the foreground result from smoothing with a running average over 5 epochs.

FIG. 3: Samples of randomly selected training (left) and generated (right) images. See more samples in Appendix C.

FIG. 5: Visualization of the output of the semantic segmentation network (SSNet) used to label track and shower pixels. The top row shows four LArTPC images from the training set. The bottom row shows the corresponding pixel labels as part of a track-like trajectory (yellow) or shower-like trajectory (cyan). Track-like trajectories are those produced by muons, charged pions, and protons. Shower-like trajectories are due to electromagnetic cascades initiated by either an electron or the interaction of a photon with the argon. Background pixels, where the pixel value falls below some threshold, are shown in purple. The output of SSNet on background pixels is not used in our analysis.

FIG. 6: High-dimensional goodness-of-fit tests comparing images generated at various epochs against the full training and validation datasets. Explanations of these measures can be found in the text.

FIG. 7: SSNet pixel value histograms. (a) shows the number of shower pixels per image, and (b) the corresponding certainties of these labels. Similarly, (c) is the number of track pixels per image, and (d) the certainty of the track labels. Across all distributions, we compare our generated images at key epochs against our LArTPC validation dataset. We see that epochs 50 and 150 are nearly indistinguishable from each other and from the LArTPC validation dataset.

FIG. 8: Our SSNet equivalent of the Fréchet inception distance (FID), comparing generated images from each epoch against the training set (blue) and validation set (orange). As with previous metrics, we see a rapid improvement and then a plateau in performance. All epochs in the plateau are within a narrow range, with the minimum at epoch 150 and all epochs after 50 being very close. For the convenience of those who might wish to compare future work with our results, the values for each epoch are given in Appendix B, Table IV.

FIG. 10: The distribution of track widths found in images from the validation dataset and images generated from epochs 50 and 150. Only perfectly vertical or horizontal tracks have a width of 1. The majority of tracks are 1-6 pixels wide. Track widths ≥ 28 have been collected into an overflow bin.

FIG. 11: Example tracks from the validation dataset. The track is highlighted in blue, and the excluded background noise is in white. Intensities are set to their maximum for visualization purposes. Track (a) has some minor curvature. Track (b) is composed of three connected tracks and a few pixels of intersecting background noise.

FIG. 13: Visual check for mode collapse. In the left-most column, the large images are an example of a shower-like (top) and a track-like (bottom) trajectory generated by our model. For each large image, we use the Euclidean distance (L2 norm) to find the five nearest images in the training dataset (top small row) and the generated dataset (bottom small row). The distances are labeled above each comparison image. We have excluded the image itself from the generated dataset. More examples can be found in Appendix C, including examples using the EMD distance.

FIG. 14: Checking for mode collapse by finding the nearest neighbors using Euclidean distance (L2 norm). The five large images are generated events from 50 epochs of training. For each generated image, we have found the five closest matches from the training dataset (top row) and the generated dataset (bottom row). We have excluded the image itself (distance zero) from the generated-image search. The number above each comparison image is its distance from the generated image. There are an equal number of images (50,000) in the training and generated datasets. We see that the images are distinct, with reasonably large distances, suggesting that we are generating unique images.
The shower charge distribution for the LArTPC validation dataset has a mean of 15134, whereas epoch 50 has a mean of 14861 and epoch 150 a mean of 15577. This makes epoch 50 the closer match, off by only 273 compared to epoch 150's difference of 443. The mean of a distribution is a simple metric, so we also used a chi-squared test, as with the track distributions. Per Table I, epoch 50 has a slightly better chi-squared value. Both of these heuristics agree that epoch 50 produces at least marginally better shower images than our other key epochs.

TABLE I: Total chi-squared values comparing binned histograms of our generated key epochs against the LArTPC validation set.

TABLE III: List of all parameters used in our score model. Bolded values differ from the CIFAR-10 defaults. The number of training iterations (n_iters) and the frequency of checkpoint saving (snapshot_freq) should be modified to suit the desired output. Running in training mode with the shown values (and our 50k LArTPC image dataset) will result in 150 epochs of training, with checkpoints for generation saved every 50 epochs. Generating will then produce 50048 images from the first checkpoint. Our config files, along with our code, can be found on our GitHub.

SSNet-FID values per epoch (Table IV):
Training: 121.63 18.96 10.2 9.15 8.94 8.99 8.95 8.75 8.79
Validation: 121.38 18.31 10.2 9.21 9.09 8.87 8.94 8.71 8.84

TABLE IV: SSNet-FID values for all generated epochs.