Point Proposal Network for Reconstructing 3D Particle Endpoints with Sub-Pixel Precision in Liquid Argon Time Projection Chambers

Liquid Argon Time Projection Chambers (LArTPC) are particle imaging detectors recording 2D or 3D images of trajectories of charged particles. Identifying points of interest in these images, namely the initial and terminal points of track-like particle trajectories such as muons and protons, and the initial points of electromagnetic shower-like particle trajectories such as electrons and gamma rays, is a crucial step of identifying and analyzing these particles and impacts the inference of physics signals such as neutrino interaction. The Point Proposal Network is designed to discover these specific points of interest. The algorithm predicts with a sub-voxel precision their spatial location, and also determines the category of the identified points of interest. Using as a benchmark the PILArNet public LArTPC data sample in which the voxel resolution is 3mm/voxel, our algorithm successfully predicted 96.8% and 97.8% of 3D points within a distance of 3 and 10~voxels from the provided true point locations respectively. For the predicted 3D points within 3 voxels of the closest true point locations, the median distance is found to be 0.25 voxels, achieving the sub-voxel level precision. In addition, we report our analysis of the mistakes where our algorithm prediction differs from the provided true point positions by more than 10~voxels. Among 50 mistakes visually scanned, 25 were due to the definition of true position location, 15 were legitimate mistakes where a physicist cannot visually disagree with the algorithm's prediction, and 10 were genuine mistakes that we wish to improve in the future. Further, using these predicted points, we demonstrate a simple algorithm to cluster 3D voxels into individual track-like particle trajectories with a clustering efficiency, purity, and Adjusted Rand Index of 96%, 93%, and 91% respectively.


I. INTRODUCTION
Accelerator based neutrino oscillation experiments have successfully deployed deep convolutional neural networks (CNN) in their data analysis pipeline [1][2][3]. Many of the present and future experiments utilize a liquid argon time projection chamber (LArTPC), a class of particle imaging detectors which produces 2D or 3D images over many meters of detected charged particle trajectories, with a resolution of the order of mm/pixel. Examples of such experiments along with their respective active volumes include MicroBooNE (90 tons) [4], the Short Baseline Near Detector (SBND, 112 tons) [5], ICARUS (600 tons) [6] and the Deep Underground Neutrino Experiment (DUNE, 40,000 tons) [7].
The particle trajectories recorded in LArTPC images often appear as 1D lines in a 2D or 3D space. Their topological features can be diverse, ranging from straight linelike tracks to branching tree-like electromagnetic showers. In the process of analyzing an image from the pixellevel energy deposits to build a larger picture of particle trajectories with their respective kinematic properties, detecting points of interest such as the initial and terminal points of particle trajectories in the early stage of a data reconstruction chain is critical. For example, in * ldomine@stanford.edu clustering tasks on electromagnetic (EM) showers, the initial point can help to define a general direction for the whole shower that includes dozens to hundreds of EM secondaries. This is especially useful for separating neutral pions, a source of major background to ν e signal for neutrino oscillation analysis as well as an important sample for detector energy calibration, from cosmic rays and neutrino-nucleus interactions. Finding these points can also be a crucial step in determining a neutrino interaction vertex. If each particle trajectory is associated with these points of interests, the predicted points naturally include candidates for the neutrino interaction vertex.
Localizing an arbitrary number of such points in an image is analogous to a task called object detection in the field of Computer Vision. Many object detection algorithms based on CNNs have been proposed [8][9][10][11] including Faster Region Convolutional Neural Network (Faster R-CNN) which has been one of the most popular choices for object detection applications and also successfully applied in LArTPC image data [1]. Faster R-CNN consists of a feature extractor CNN and an attention mechanism called Region Proposal Network (RPN). The feature extractor consists of convolution layers and pooling layers, and generates a data tensor with low spatial resolution compared to the input. The RPN takes this data tensor and generates region proposals, typically rectangular shaped bounding boxes, that are likely to contain a target object in the original image resolution. The insight of RPN is to act on a spatially contracted data tensor which arXiv:2006.14745v3 [hep-ex] 10 Jul 2020 contains fewer pixels compared to the original input, thus addressing the challenge of long compute time. R-CNN is a family of algorithms that employ the RPN concept. One of the most recent of these is Mask R-CNN, [8] which is undeniably the most popular object instance detection algorithm to date.
Inspired by the concept behind RPN, we have designed a Point Proposal Network (PPN) to identify points of interest in a LArTPC image, namely the initial point of EM particles, referred to as shower-like particles in this paper, as well as the initial and terminal points, collectively referred to as endpoints, of track-like particles, which include all but shower-like particles. While RPN is responsible for predicting both the location and size of a bounding box for an object detection, PPN is simplified to propose only the location as the target is a point, not an object. Our goal is to integrate PPN into a generic, full 3D data reconstruction chain for LArTPCs, which consists of multi-task deep neural networks, such that the whole chain can be optimized end-to-end. Building on the previous effort, we use U-ResNet [12] as the feature extractor and implement PPN to predict the position and semantic type of an arbitrary number of points in an image with voxel-level precision.
The contributions of this paper are two-fold: • Introduce PPN for reconstructing the 3D endpoints of track-like particles and the initial point of shower-like particle with sub-voxel precision.
• Provide a performance benchmark on a public LArTPC simulation dataset (PILArNet) for future reference and comparison against other methods.
While, in this paper, our target is 3D LArTPC images, the concept of PPN is applicable to both 2D and 3D images [13]. Section II introduces the architecture of the UResNet network that we use as the backbone for PPN, as well as the details on PPN architecture, the loss definitions and post-processing methods. Section III outlines our experiments setup, including details on the public LArTPC data sample that we use. Section IV shows a first benchmark of the PPN performance on this sample. The study presented in this paper is fully reproducible using a Singularity [14] software container 1 , implementations available in the lartpc mlreco3d 2 repository and public simulation samples [15] made available by the DeepLearnPhysics collaboration.

II. NETWORK ARCHITECTURE
The network architecture consists of two parts: U-ResNet [12] and PPN. Both blocks include many CNN layers. In order to make our algorithm scalable to a largescale LArTPC detector analysis, we designed the whole chain using Sparse Submanifold Convolutional Network (SSCN) [16,17].
A. UResNet: feature extractor U-ResNet is designed for a voxel-level classification task, called semantic segmentation, for 3D LArTPC images [12]. The architecture of U-ResNet can be divided into two parts, namely encoder and decoder. The encoder consists of repeated blocks of convolution and strided convolution layers which down-samples the image resolution while increasing the features dimension, thus learning from key features in an image at different scales. We refer to the number of down-sampling operations as depth. The decoder takes the low-spatial size, highly compressed data tensor from the encoder and upsamples them back to the original image resolution. After each up-sampling operation, the data tensor of matching spatial size is taken from the encoder output and concatenated to the up-sampled tensor before the combined tensor is further processed by convolution layers in the decoder. The key concept behind the concatenation operation, introduced by the original U-Net [18] authors, is to recover the lost spatial resolution information in the encoder block due to strided convolution layers and effectively combine with the abstract features contained in the up-sampled tensor. Convolution layers in the decoder block are trained to best combine high spatial resolution information and abstract feature information. As a result, they learn how to best spatially interpolate abstract features extracted by the encoder back to the original image resolution. Figure 1 shows the architecture of U-ResNet. For the study carried out in this paper, we set the depth of UResNet to 6 and used 16 filters at the first convolution layer. The number of filters increases linearly with the depth, and is 96 at the deepest layer.

B. PPN layers
Within the U-shaped network architecture (see Figure 1), we implement PPN by introducing additional convolution layers at different spatial resolutions, starting with the most contracted data tensor at the lowest spatial resolution. While these PPN layers could be attached to either the encoder or the decoder of U-ResNet, it is more powerful to attach them to the decoder block as data tensors generated by the decoder should be more information rich. At the deepest level and coarsest spatial resolution, the so-called PPN1 produces a softmax score of a value between 0 and 1 for each voxel, which indicates whether or not the voxel contains the location coordinate of any of the true points, i.e. the target 3D points to be detected. We call this detection score in the following. We call the voxels positive if the detection  score is above the set threshold value. We call other voxels negative. Positive voxels yield an attention mask that we can use at the next step. At an intermediate level and medium spatial resolution, we up-sample the mask predicted by PPN1 and use it as an attention mask to pre-select candidate positive voxels. The so-called PPN2 layer then similarly predicts a subset of positive voxels among these pre-selected voxels in the attention mask at this spatial resolution. Finally, we up-sample the result of PPN2 to the original image resolution and use it as another attention mask. The final layer, so-called PPN3, is made of 3x3 convolutions which predict the following quantities for each voxel that has been selected through these successive attention masks: • a detection score (of being a voxel within some neighborhood of a ground truth point, for which we choose a distance threshold of 5 voxels), • a 3D position (offset with respect to the voxel cen-ter), • and a type score (for the point to belong to a semantic type).
We note that the 3D position prediction made from a particular voxel may be located in its neighbor voxel. This implies that multiple neighbor voxels of a target voxel, which contains one or more of true points, can all propose positions that are within the target voxel, which gives more information to identify 3D points and can improve the performance.

C. Loss definitions
Among all voxels x i we define true voxels A as voxels within a certain distance threshold d positive from the true FIG. 2. Simulated LArTPC event from our dataset. The left picture (network input) shows energy deposits from charged particle trajectories, whose color corresponds to an energy scale. In the right image (labels), each voxel is assigned one of five colors: heavily ionizing particles (HIP) in purple, minimum ionizing particles (MIP) in dark blue, electromagnetic showers in light blue, delta rays in green and Michel electrons in orange.
points q j , and all other voxels as negative: where N is the number of voxels in the input data tensor at a certain PPN layer. We define several losses. First, for all input voxels, we compute a cross-entropy loss for positive/negative classification task at each PPN-i layer and then average over all voxels. For i = 1, 2, 3, if y k ∈ {0; 1} indicates whether the voxel is positive or negative in the labels and p k is the predicted softmax score for this voxel to be positive, Secondly, only on true voxels, we define a linear distance loss on the predicted positions. We consider the distance to the closest ground truth point q. The raw predictions p of the network are actually shifts with respect to the center of the subject voxels (0.5 + x): Thirdly, only on true voxels, we compute a crossentropy loss for a point type prediction. The predicted point type is compared with the semantic type of the closest true point. If N c is the total number of semantic types for a point, y is a one-hot encoded vector indicating to which type the point belongs, and p c is the predicted probability that the point belongs to a semantic type c, Finally, the sum of all losses is minimized: The architecture that we proposed so far will yield a prediction of a position, detection score, and semantic type score for each voxel that has been selected in the last layer at the original image resolution. The number of such positive voxels whose predictions are considered, is related to the attention mask predicted by PPN2 and the spatial size ratio between PPN2 layer and the original image size. This will dictate for each voxel predicted as positive at PPN2 level how many voxels are selected at the last layer. Hence we might have many proposals whose positions are clustered near a true point, with different scores and type predictions. We need a strategy to combine overlapping proposals to deduce the candidate of distinct 3D points, and we want this strategy to value both accurate positions and type predictions. In this paper, we adopt the following simple post-processing scheme.
1. Thresholding on the detection scores, for example we require a score value of 0.5 or higher to be considered positive.
2. We then run the DBSCAN [19] clustering algorithm on the positive point positions. The hyperparameters of DBSCAN are set to =1.99 in voxel unit for the maximum Cartesian distance between two points to be clustered together, and min samples= 1. must be small enough to avoid merging together predicted endpoints of short tracks.
3. Pooling operation on the points that belong to the same cluster in order to deduce a single score, type predictions, and 3D position. We use average-pooling for the 3D coordinate locations, and maximum-pooling for the scores including the positiveness prediction and individual semantic type.
4. Finally, we enforce that a point detected by PPN as a type among c i (set of types, with type score > 0.5 for each type c i ) needs to be within 2 voxels of a voxel predicted by U-ResNet to have one of the c i types.

A. Dataset
We use 3D LArTPC particle images from the PILAr-Net dataset [15], an open dataset made available by the DeepLearnPhysics collaboration 3 . We use the largest 3D image in the dataset, a cubic volume with each side 768 voxels (453 million voxels) at 3 mm/voxel spatial resolution. Figure 2 shows an example image from this dataset. The dataset contains 80,000 images for the training set and 20,000 images for the test set where each image contains several particles traversing the LAr volume. The PILArNet provides five types for the voxel-level semantic category. These include heavily ionizing particles (HIP, e.g. protons), minimum ionizing particles (MIP, e.g. muons and pions), electromagnetic (EM) showers, delta rays, and Michel electrons from muon decays. Further, the dataset provides particle-level metadata including endpoints of HIP and MIP particles as well as the initial point of other particle types including EM showers, delta rays and Michel electrons. These 3D points are provided with a floating-point precision in the unit of voxels, and used as true points for training PPN. More details can be found in the PILArNet reference [15].

B. Training details
We drop the point labels for particles with less than 10 MeV in total energy deposit or a total voxel count of less than 7 voxels, which corresponds to a trajectory of a few voxels in length as a typical trajectory width is a few voxels or more. The PPN1 and PPN2 layers have a spatial size of 24 voxels and 96 voxels respectively. During training, we add true voxels to the attention masks generated by the PPN1 and PPN2 layers so that the subsequent layers, namely PPN2 and PPN3, can be trained with some mixtures of true and false voxels. This allows all PPN layers to train simultaneously from the beginning.
The batch size is 64 and we used an Adam optimizer with learning rate 0.001 to train the network. Training the U-ResNet alone first for 20k iterations, then U-ResNet+PPN for another 20k iterations, took 184 hours on a Nvidia V-100 GPU for the total of 32 epochs. The whole network (i.e. U-ResNet+PPN) can be trained from scratch without having to separate into two stages, in which case 40k iterations took 231 hours. Unless stipulated otherwise, the default configuration for the rest of this paper is the two-stage training. FIG. 5. Correlation between the distance from a predicted point to the closest true point, and the distance from a predicted point to the voxel center of the corresponding true point. We selected predicted points that are within 3 voxels of a true point. About 3.6 % of these predicted points are associated with true points that PILArNet locates exactly at the center of a voxel. They are not pictured here.

A. Position Precision
our algorithm successfully predicts 96.8 % and 97.8 % of 3D points within the voxel distance of 3 and 10 from the true points respectively. If we look at semantic type-wise results, we find that the fraction of true points which are more than 3 voxels away from any predicted point is 7 %, 2.1 %, 8.2 % and 1.6 % for the HIP, MIP, EM shower and Michel electron types respectively. For this analysis and Figure 4, we excluded delta ray type true and predicted points since the true point coordinates for delta rays provided in PILArNet are less precise than those of MIP, HIP, EM shower and Michel electron types. This is likely because the initial points of delta rays often overlap with a muon trajectory, which typically has a width of a few voxels or more. We considered the predicted points to be delta ray type if the point has the delta ray type score of 0.5 or higher.
For those predicted points within 3 voxels of the closest true point, the median distance between the positions of a predicted point and the closest true point is measured to be 0.25 voxels. If PPN is only sensitive to identify a true voxel in which a true point is present, and if it is not capable of regressing the position at the subpixel level, we expect this resolution to be 0.66, which is the median distance between two random point positions in a voxel. We note that 17 % of the true points provided by the PILArNet are located exactly at the center of a voxel, which is typically observed as a result of an endpoint approximation within the recorded volume when a particle is exiting or entering the volume. For other true points whose position is not fixed exactly at the voxel center, the correlation of the distance between a predicted point to the closest true point and the displacement of that predicted point from the true voxel center is shown in Figure 5. When PPN predicts a point within the correct true voxel, geometrically the distance from the voxel center to the predicted point must be between 0 and 0.866. The two distances show almost no correlation in this range, which shows that PPN position resolution is uniform and independent of a true point location within the true voxel. This demonstrates that our algorithm achieved a sub-voxel level precision for this reconstruction task. Purity for a class X is the fraction of predicted points with a type score > 0.5 for this class X, and within 5 voxels of a true point whose type matches one of the predicted point types (i.e. type score > 0.5). Efficiency for a class X is the fraction of true points of type X within 5 voxels of a predicted point with type score > 0.5 for this class X.

B. Type Prediction Accuracy
Before evaluating the point type prediction performance, it is useful to remind ourselves of the performance of the U-ResNet. In this particular training, the confusion matrix that we obtain is shown in Figure 6. We can then look at the distance from predicted points to true voxels of a certain semantic type. For example we expect predicted points with high Michel or Delta type score to be close to MIP voxels in the labels, and Figure  7 confirms this. Figure 8 shows that imposing the maximal distance threshold of 2 voxels between a predicted point with a high type score for a set of types and a voxel whose predicted type matches one of them is reasonable.
We define purity and efficiency metrics for the point type prediction as follows: for a given predicted point, we consider all semantic types for which it has a score > 0.5 and we refer to them as predicted types. We count a predicted type as matched if there exist a true point of the same type within 5 voxels. We note that one point may be associated with multiple types, and one point may contribute as many times as its associated type counts under this scheme. The fraction of matched predicted types is our purity metric. Similarly, for a given true label type, we say that it is matched if there is a predicted point within 5 voxels which has a score > 0.5 for the same semantic type. The fraction of matched true types is our efficiency metric. Under these definitions, we find a purity of 96.3 % and an efficiency of 89.2 %. The Table I breaks down these purity and efficiency metrics for each semantic class. The purity is significantly higher than the efficiency, which indicates that while currently predicted types are highly accurate, there is a space for PPN to improve to predict all possible types associated with a predicted point. This may be related to the architecture where PPN is coupled to U-ResNet which ultimately predicts a single type per voxel.

C. Mistakes analysis
About 2.2 % of predicted points, excluding points predicted as delta ray type, are more than 10 voxels away from any true point. Let us call them far mistakes. Among them, 25.8 % have a high (> 0.5) type score of be- ing HIP, 21.9 % for MIP and 53.8 % for shower. We have visually scanned event displays of these mistakes and report their nature in this section. In summary, we found that a large fraction of far mistakes were due to issues with true points or legitimate mistakes where authors cannot visually distinguish from correct predictions.

Fragmented EM Showers
An EM shower is initiated by an EM particle including an electron, a positron, or a gamma ray, and develops a cascade of them through radiations of gamma rays. In physics analysis, typically a whole cascade is conveniently treated as one EM shower instance instead of identifying dozens to hundreds of secondaries. This is shared in the PILArNet dataset where EM shower information, such as the initial point, is provided for the whole cascade. In LArTPC, however, given the average radiation length of 14 cm [20], which corresponds to 47 voxels, we expect that some radiated gamma rays in the cascade may be separated by significant gaps. This results in cases where a single EM shower may appear indistinguishable from two or more separate, overlapping EM showers.
In those cases, PPN may place multiple initial points within a single shower, as shown for example in Figure 9. While this may visually appear reasonable, they can be the cause of far mistakes as the PILArNet provides only one initial point for the whole shower. Among the far mistakes, more than half (53.2 %) have a high (> 0.5) type score for being EM shower and are within 2 voxels distance from voxels of true shower type in semantic segmentation labels.

Mistakes due to tracks (HIP/MIP)
49.4 % of the far mistakes have a HIP or MIP point type score > 0.5. We randomly sampled 20 cases with one far mistake with high HIP score. 15 of them (75 %) were due to very small HIP trajectories for which PPN made good predictions but true points were missing. This is caused by the fact that these trajectories fall below the threshold that we impose to define true points (> 7 voxels, > 10 MeV total energy deposit), leading to missing true points as shown in Figure 10. 2 out of the remaining 5 cases were found to be "legitimate mistakes" due to a kink in a trajectory as shown in Figure 11. The last 3 cases were genuine mistakes (e.g. a point predicted in the middle of a trajectory without any obvious kink).
On the other hand we also sampled 20 events with far mistakes with high MIP score. One case was due to a short trajectory missing true points, and one was a rare case where PPN made an extra, faulty prediction at the crossing point of two MIP trajectories that accidentally overlapped in the 3D space. The majority (12 cases) were legitimate mistakes due to a kink in a trajectory. The rest (6 cases) were genuinely bad mistakes.

Trajectories affected by the boundaries
10.1 % of far mistakes are within 5 voxels of an image boundary, indicating they may come from a particle trajectory crossing the image volume boundary. MIP trajectories are more likely to cross a volume boundary due to their length. Hence they are more affected by boundaries. Among the far mistakes that are more than 5 voxels away from the boundary, only 18 % have a high MIP type score. This fraction increases to 54 % in the region within 5 voxels from the boundary while negligible statistical change was observed for predicted points of other types.
We have visually scanned randomly selected 10 far mis- takes of a high MIP type score in this region next to the boundary. One of them was a legitimate mistake due to a kink in a trajectory, similar to the dominant case of MIP far mistakes found and described previously. The rest (9) of the MIP mistakes near the boundary were all due to issues related to true points. These issues include: exiting shower trajectory which gets classified as MIP by UResNet (Figure 12), and results in a too short trajectory and loss of true points as previously described for HIP cases, a MIP trajectory that exited and re-entered the image volume ( Figure 13), for which the true points provided in PILArNet appear unreasonable, and also what appears as a genuine mistake of true point location on the boundary provided by PILArNet dataset (Figure 14). We conclude therefore that the majority of far mistakes made by PPN are due to either issues related to true points or legitimate mistakes that visually appear reasonable.

D. Others
We also compared the PPN performance in two training scenarios: a single-stage training, where we start training from scratch both U-ResNet and PPN at the same time for 40k iterations, and a two-stage training where we train U-ResNet for 20k iterations first, before adding the PPN layer and continue training of U-ResNet+PPN for 20k more iterations. Everything else is  identical between the two schemes. The fraction of true points that are within 10 voxels of a predicted point is 98.2 % and 97.8 % respectively. The fraction of predicted points that are within 10 voxels of a true point is 97.8 % in both cases. The fraction of true points which are more than 3 voxels away from any predicted point is 5.2 % and 5.4 % respectively. There is no significant difference between the two training schemes, which confirms that the PPN learning is conditioned by the UResNet performance. Table II shows that PPN layers have a very little impact on the memory usage (about 1GB at train time, negligible at inference time). However they are responsible for about 30 % of the total computation time, if compared with the UResNet-only resources usage.

E. Track clustering
Lastly we report a simple application of U-ResNet and PPN for clustering voxels to identify individual track-like particles. This clustering task belongs to the next important step in the LArTPC data reconstruction pipeline. Using the output of UResNet and PPN, a very simple clustering algorithm can be designed: first, for each predicted semantic type, run a density-based clustering algorithm such as DBSCAN [19] on the voxels predicted to belong to track-like particles (i.e. HIP and MIP types). We use here the parameters of = 4 and min samples= 7 for DBSCAN. This will cluster together particle trajectories that are spatially adjacent, such as tracks coming out of the same interaction vertex. To mitigate this issue we can use the points predicted by PPN to "break" A radius of 10 voxels is used to make the gaps visible in this figure, but the clustering algorithms use a radius of 7 voxels. Bottom left: true track particle clusters. Bottom right: predicted track particle clusters. the predicted clusters: for each predicted cluster from the first step and associated closest predicted points, we mask a sphere of 7 voxels around each predicted point, run DBSCAN again to reconstruct the main trunk of individual track-like particles, and assign the remaining voxels in the masked regions to the closest track-like cluster to complete individual trajectories. Figure 15 illustrates this simple algorithm. 97.8 % of 3D points within the voxel distance of 3 and 10 from the true points respectively. For the predicted points and true points that reside within 3 voxels within each other, PPN achieves the sub-voxel level precision with a median distance of 0.25 voxels. PPN is also the first benchmark algorithm for PILArNet for reconstructing particle positions. Using the output of U-ResNet and PPN, we demonstrated a simple set of algorithms to cluster 3D voxels into individual track-like particles. We reported that our algorithms achieved a voxel clustering efficiency/purity/ARI of 0.96/0.93/0.91. U-ResNet and PPN are part of a scalable, deep-learning based data reconstruction chain for LArTPC detectors.

VI. ACKNOWLEDGEMENT
This work is supported by the U.S. Department of Energy, Office of Science, Office of High Energy Physics, and Early Career Research Program under Contract DE-AC02-76SF00515.