Deep neural networks for classifying complex features in diffraction images

Intense short-wavelength pulses from free-electron lasers and high-harmonic-generation sources enable diffractive imaging of individual nano-sized objects with a single x-ray laser shot. The enormous data sets with up to several million diffraction patterns represent a severe problem for data analysis, due to the high dimensionality of imaging data. Feature recognition and selection is a crucial step to reduce the dimensionality. Usually, custom-made algorithms are developed at a considerable effort to approximate the particular features connected to an individual specimen, but facing different experimental conditions, these approaches do not generalize well. On the other hand, deep neural networks are the principal instrument for today's revolution in automated image recognition, a development that has not been adapted to its full potential for data analysis in science. We recently published in Langbehn et al. (Phys. Rev. Lett. 121, 255301 (2018)) the first application of a deep neural network as a feature extractor for wide-angle diffraction images of helium nanodroplets. Here we present the setup, our modifications and the training process of the deep neural network for diffraction image classification and its systematic benchmarking. We find that deep neural networks significantly outperform previous attempts for sorting and classifying complex diffraction patterns and are a significant improvement for the much-needed assistance during post-processing of large amounts of experimental coherent diffraction imaging data.


INTRODUCTION
Coherent diffraction imaging (CDI) experiments of single particles in free flight have proven to be a significant asset in the pursuit of understanding the structural composition of nano-scaled matter [2][3][4][5][6][7]. While traditional microscopy methods are able to image fixated, substrate-grown or deposited individual particles [8][9][10][11][12], only CDI can combine high-resolution images with single particles in free flight in one experiment [13][14][15]. CDI became possible due to the recent advent of short-wavelength free-electron lasers (FELs), which produce coherent, high-intensity x-ray pulses of femtosecond duration and thus allow imaging with a single x-ray laser shot [16]. However, CDI also comes with its own set of new challenges.
One of the growing problems of CDI experiments is the sheer amount of recorded data that has to be analyzed. The LINAC Coherent Light Source (LCLS), for instance, has a repetition rate of 120 Hz and a typical hit-ratio of 20 % [16,17]. The newly opened European XFEL will have an even higher maximum repetition rate of 27 000 Hz [18], which may add up to several million diffraction patterns in a single 12-hour shift. The idea of using neural networks for the classification of large numbers of scattering patterns was born out of the significant difficulties of analyzing large data sets of clusters [19], in particular metal clusters [20]. Moreover, the ability to analyze such data sets is wanted by the community in general [21]. For example, for the successful determination of 3D structures from a CDI data set using the expansion-maximization-compression algorithm [21][22][23], it is necessary to sample the 3D Fourier space as densely as possible, and to do so for all sub-species contained in the target under study. The achievable resolution, as well as the chance for successful convergence of the algorithm, correlates directly with the number of diffraction patterns with a high signal-to-noise ratio [22]. Thus, huge data sets are taken and, in consequence, it is getting increasingly complicated to obtain a high-quality data subset that is suitable for subsequent analysis steps.
The enormous success of neural networks in the regime of image processing and classification provides a unique way of facing the imminent data-analysis bottleneck and reduces the impending problem to a mere domain adaptation from datasets used throughout the industry to ones that are used in CDI research. This work aims to be a stepping-stone towards this adaptation by providing an introduction to the theory of deep neural networks and analyzing how to best transfer and optimize these algorithms to the domain of scattering images. As a new baseline, we train a widely used deep neural network architecture, a residual convolutional deep neural network [24], in a supervised manner with a training set of manually labeled data. We then adapt the neural network to the domain of diffraction images and improve on the baseline performance by addressing the following issues:
1. Modification of the architecture to account for the specificities of diffraction images and thus optimize the prediction capabilities.
2. Determination of the appropriate size of the training dataset in order to keep the manual work of a researcher to a moderate level.
3. Mitigation of experimental artifacts, in particular noisy diffraction images.
Experience has shown that a researcher is able to relate diffraction patterns produced by similarly shaped particles of different sizes and orientations in context with each other. However, a programmatic description for a classification and sorting of these mostly similar patterns is almost impossible to achieve.
Figure 1 makes a case for two diffraction patterns captured from almost identical particles but under different orientations. Both patterns clearly show an elongated and bent streak, but the bending is differently pronounced and directed. If we wanted to handcraft an algorithm that detects this feature, we would need to describe it via some appropriate metric that must take into account the various grades of inflection, direction, brightness, and completeness of this feature within every image. Furthermore, we would need to redo it for every characteristic feature in a diffraction image of which we want to find similar ones.
In addition, poor signal-to-noise ratios, straylight, a beam stop or the central hole of multichannel plates or pnCCDs, and overall poor image quality can further increase the difficulty of classifying all images consistently in an automated way [25][26][27].
Therefore, we need a robust classification routine that is insusceptible to the described artifacts, just as a researcher is, to tackle the upcoming data volume. Deep neural networks provide a way out of this situation, and we show in this paper that they outperform the current state-of-the-art classification and sorting routines.
Figure 1: a) and b) show tic-tac-shaped particles whose orientation and size differ. The scattering images are calculated using a multi-slice Fourier transform (MSFT) algorithm that simulates a wide-angle x-ray scattering experiment which includes 3D information about the particle [7,20]. Both incoming beams (indicated by the arrow on the left-hand side) produce very different scattering images, yet the dominant feature, an elongated bent streak, is distinctly visible in both calculations. A handcrafted algorithm is typically not able to identify the similarity between both scattering patterns and would classify these two images in two distinct classes, although they belong to the same tic-tac shape class. A deep neural network can learn these complicated similarities on its own when we provide a few manually selected diffraction patterns that contain this feature.
Current state-of-the-art automatic classification routines for diffraction experiments employ so-called kernel methods [26,28]. Bobkov et al. [26] trained a support-vector machine on a public small-angle x-ray scattering dataset with an Accuracy of 87 %, but only on selected images (we will use this approach as a reference in section 4). Yoon et al. [28] were able to achieve an Accuracy of up to 90 % using unsupervised spectral clustering on a non-public small-angle x-ray scattering dataset.
Deep neural networks, on the other hand, have already been applied to a broad range of physics-related problems, ranging from predicting topological ground states [29], distinguishing different topological phases of topological band insulators [30], enhancing the signal-to-noise ratio at hadron colliders [31], and differentiating between so-called known-physics background and new-physics signals at the Large Hadron Collider [32], to helping solve the Schrödinger equation [33,34]. Their ability to classify images has also been utilized in cryo-electron microscopy [35], medical imaging [36] and even for hit-finding in serial x-ray crystallography [37]. However, to our knowledge, this paper is the first application of deep neural networks for classifying complex features within diffraction patterns. We show that deep neural networks outperform the current state-of-the-art classification and sorting routines, while being insusceptible to typical artifact features of diffraction measurements. Furthermore, a deeper analysis of the trained network shows that it can understand complex concepts of what constitutes a characteristic feature in a diffraction pattern.
The paper is organized as follows: In section 2, the data set is presented and a few experimental details are discussed. Section 3 provides the fundamental theory to understand the basics of neural networks; it has two subsections. Subsection 3.1 covers the theory and algorithmic underpinnings of deep neural networks and how to train these models, and subsection 3.2 presents three common metrics to evaluate the quality of the neural network's predictions.
Section 4 establishes our starting point, while the full benchmark report on the baseline neural network can be found in the Supplemental Material [REF]. We introduce the chosen network architecture and provide baseline results on the data presented in section 2 but also on a reference dataset for which classification results are already published [38].
In section 5, we discuss solutions for the above-stated issues of applying neural networks to diffraction data. In subsection 5.1 we discuss the choice of the activation function for the neural network and present a novel logarithmic activation function that enhances the prediction performance with diffraction image data. Subsection 5.2 benchmarks the dependence of neural networks on training data size, asking essentially how much manually labeled data is needed for the neural network to give acceptable results, and subsection 5.3 presents an approach to harden the neural network against very noisy data using a custom two-point cross-correlation map.
In section 6 we then provide more profound insights into the output of the neural network by showing and discussing calculated heatmaps that visualize the gradient flow within the neural network. These images directly correlate with what the neural network sees; they are created using an advanced visualization algorithm called GradCam++ [39].
Finally, we give a summary of the principal results and unique propositions of this paper and conclude with an outlook on further modifications as well as future directions.

THE DATA
Helium nanodroplets [1] were imaged using extreme ultraviolet (XUV) photon energies between 19 eV and 35 eV using the experimental setup of the LDM endstation at FERMI, Trieste [40][41][42]. Scattering images were recorded with a multi-channel-plate (MCP) detector combined with a phosphor screen which was placed 65 mm downstream from the interaction region; this defines the maximum scattering angle of 30°. Single-shot diffraction images in the XUV regime are in some respect a special case, as they cover large scattering angles and can contain 3D structural information [20], manifesting in complex and pronounced characteristic features, such as the bent streaks in Figure 1. Out of 2 × 10^5 laser shots, about 38 000 images were obtained. The images were corrected for straylight background and the flat detector (see also Langbehn et al. [1]).
For the neural network training dataset, we selected 7264 diffraction images uniformly out of all recorded patterns. The size of the subset was chosen to be the maximum a researcher could classify manually within one week. From this subset we manually identified 11 distinct but non-exclusive classes (see Figure 2 for examples as well as a description and Table 1 for statistics about every class). We chose each of the diffraction patterns shown in Figure 2 for being a strong candidate for its class, but it is important to note that almost all diffraction patterns belong to multiple classes since this is a multi-class labeling scenario. These patterns are therefore not always clearly distinguishable from each other and can exhibit multiple characteristics from different classes. For example, the Newton rings in Figure 2d) are superimposed on a concentric ring pattern that falls into the category Spherical/Oblate, but Newton rings can also occur in other classes, e.g. streak patterns. Furthermore, labeling all images is itself prone to systematic errors because the researcher has to learn to label [43]. This means that the labeling process itself is to some extent ill-posed, as the researcher does not know the characteristics of a feature a priori, which results in a changing perception of features and classes along the labeling process and thus a systematically decreased consistency for every class.
We uploaded all available data alongside our assigned labels to the public CXI database (CXIDB, [44]) under the public domain CC0 waiver.
These classes describe general features within the image which are to some extent independent of the particle shape. We derived the superordinate classes from these general features. These complicated inter-class relationships demonstrate the capabilities of a researcher to interconnect mostly distinctive appearing features into a consistent description and ultimately arrive at a valid physical interpretation. A hand-crafted algorithm could not normally account for these relationships, but now these interconnections can serve as an additional evaluation metric for the neural network. Since there is no diffraction pattern which belongs to the Spherical/Oblate and Prolate class simultaneously, we can check if the neural network mislabeled a diffraction pattern according to these rules. We can then interpret this as a reliable indicator for a failed generalization of the network. The physics behind these patterns is quite complicated as well; for a rigorous interpretation and analysis of these patterns, please see Langbehn et al. [1].
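The consistency check described above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used for this paper: only the exclusion of Spherical/Oblate and Prolate is taken from the text, while the function names and the other label strings are placeholders chosen for the example.

```python
# Pairs of classes that can never occur together. Only this one pair is
# stated in the text; further rules would be added by the researcher.
EXCLUSIVE_PAIRS = [("Spherical/Oblate", "Prolate")]


def violated_rules(predicted_labels, exclusive_pairs=EXCLUSIVE_PAIRS):
    """Return every exclusive pair that is fully contained in one prediction."""
    labels = set(predicted_labels)
    return [pair for pair in exclusive_pairs
            if pair[0] in labels and pair[1] in labels]


def inconsistency_rate(all_predictions):
    """Fraction of images whose predicted label set breaks at least one rule."""
    if not all_predictions:
        return 0.0
    broken = sum(1 for labels in all_predictions if violated_rules(labels))
    return broken / len(all_predictions)


# Placeholder predictions for four images (label names are illustrative).
predictions = [
    {"Spherical/Oblate", "Newton rings"},
    {"Prolate", "Streak"},
    {"Spherical/Oblate", "Prolate"},  # contradicts the exclusion rule
    set(),
]
print(inconsistency_rate(predictions))  # 0.25
```

A rate well above zero on the evaluation set would indicate that the network has not picked up the inter-class relationships.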

What is a deep neural network
We concentrate in this paper solely on deep feedforward neural networks. They are a classification model consisting of a directed acyclic graph that defines a set of hierarchically structured non-linear functions.
A fundamental example can be constructed by arranging n non-linear functions $(z_1, z_2, \ldots, z_n)$ in a chain-like manner: $z_{\text{output}} = z_n(z_{n-1}(\ldots(z_2(z_1(x)))\ldots))$, where x is the input, which is in our case a diffraction image. The first function, $z_1(x)$, is called the input layer. We then pass the output of $z_1$ to $z_2$ and so on; this goes on until the last layer ($z_n$), which is called the output layer. The nomenclature is that all layers except the output layer ($z_n$) and the input layer ($z_1$) are called hidden layers.
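The chain-like arrangement can be illustrated with a short Python sketch. The three scalar toy layers below are arbitrary assumptions made for the example; real layers act on images and carry trainable weights.

```python
from functools import reduce


def compose(*layers):
    """Chain layers so that compose(z1, ..., zn)(x) equals zn(...z2(z1(x)))."""
    def network(x):
        return reduce(lambda value, layer: layer(value), layers, x)
    return network


# Toy scalar "layers", chosen only to make the chaining visible.
def z1(x):
    return max(0.0, 2.0 * x - 1.0)   # input layer


def z2(x):
    return max(0.0, -x + 3.0)        # hidden layer


def z3(x):
    return 0.5 * x                   # output layer


net = compose(z1, z2, z3)
print(net(1.0))  # 1.0, identical to z3(z2(z1(1.0)))
```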
For illustrative purposes, Figure 3 shows a convolutional neural network. There, we schematically show the layer functions $z_1, \ldots, z_n$, where every layer consists of two stages: a linear layer-specific operation on its inputs, followed by a so-called activation function, which is always non-linear. We address the choice of layer-specific operations in section 3.1.1 and then introduce the activation functions in section 3.1.2. In general, the layer-specific operation is always the name-giving component for the layer. For example, if we compute a 2D convolution as the layer-specific operation on the input and then apply an activation function, we call the set of these two stages a convolutional layer. Figure 3 shows a neural network whose first layers are convolutional layers followed by a fully connected layer that produces the predictions.

Affine transformations
All common choices for layer-specific operations are affine transformations. They all introduce trainable weights: free parameters that are adjustable during the training process and are sometimes called neurons, due to the intuition that in a fully connected layer they share some similarity to the dendrites, soma, and axon of a biological neuron [45]. These trainable weights are the name-giving components in a neural network. The goal of training a neural network is to optimize all these weights for all layers so that the predictions for all images in the training data match their accompanying original labels. The original labels are called ground truth and define the upper limit of how well a network can fit a domain. No neural network is better than its training data. In this section, we briefly illustrate the affine transformations of the fully connected layer and the convolutional layer and then explain in the next section the role of the activation function.
a. Fully connected layer. The name-giving operation for the fully connected layer is a matrix multiplication performed on a flattened input; for example, an m × n sized input image would be flattened into an m · n sized vector. Mathematically this is a matrix multiplication between a matrix and a vector:
$$a_j = \sum_{k=0}^{m \cdot n} w_{kj}\, x_k \tag{1}$$
where x is the flattened input and w is the weight matrix of a fully connected layer. Here, all input vector elements (e.g., the pixels of an image, now arranged in one large row $x_k$) contribute to all output elements and are therefore connected. Furthermore, by convention $x_0$ is defined as 1 and $w_{0j} = b_j$, where $b_j$ is a free and trainable bias parameter.
b. Convolutional layer. In a convolutional layer, the trainable weights are parameters of a kernel that slides over the inputs; this is visualized in Figure 3. The general idea of a convolutional layer is to preserve the spatial correlations in the input image when going to a lower-dimensional representation (the next layer). This is achieved by using a kernel with a spatial extent larger than 1 px. The kernel size is then also the extent to which one kernel can correlate different areas of an input and is called its local receptive field. Each kernel produces one output which is called a feature map or filter. Multiple feature maps from multiple kernels are grouped within one convolutional layer. For example, the first convolutional layer in Figure 3 produces 9 feature maps out of the input diffraction image and hence has 9 kernels that get optimized during training. Usually, only the input layer receives a single 2-dimensional diffraction image, while every subsequent convolutional layer receives a high number of feature maps as its input. We therefore define the output of a convolutional layer with a 4-dimensional kernel K that produces i feature maps of size j × k:
$$a_{i,j,k} = \sum_{l,m,n} x_{l,\,j+m-1,\,k+n-1}\, K_{i,l,m,n} \tag{2}$$
Here the input x has l dimensions of size j × k and we slide a kernel of size m × n across all these l dimensions.
In the given example for the input layer, l is simply 1 and the summation is just across one input image, as shown in Figure 3.
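Both affine operations can be written out in plain Python for small inputs. The following is a minimal sketch under simplifying assumptions that are not taken from the text: the convolution handles a single input channel (l = 1) and a single kernel, uses stride 1 and no padding, and all function names are illustrative.

```python
def fully_connected(x, w):
    """a_j = sum_k w[k][j] * x_ext[k], with x_ext[0] = 1 so w[0][j] is the bias b_j."""
    x_ext = [1.0] + list(x)
    n_out = len(w[0])
    return [sum(w[k][j] * x_ext[k] for k in range(len(x_ext)))
            for j in range(n_out)]


def conv2d_valid(image, kernel):
    """Slide one m x n kernel over one 2D input (stride 1, no padding)."""
    rows, cols = len(image), len(image[0])
    m, n = len(kernel), len(kernel[0])
    return [
        [sum(image[j + dm][k + dn] * kernel[dm][dn]
             for dm in range(m) for dn in range(n))
         for k in range(cols - n + 1)]
        for j in range(rows - m + 1)
    ]


# Fully connected layer: 2 inputs, 2 outputs; the first weight row holds the biases.
weights = [[0.5, -1.0],
           [1.0, 0.0],
           [0.0, 2.0]]
print(fully_connected([2.0, 3.0], weights))   # [2.5, 5.0]

# Convolution: a 3 x 3 input and a 2 x 2 kernel give a 2 x 2 feature map.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
print(conv2d_valid(image, kernel))            # [[-4, -4], [-4, -4]]
```

The example also shows why the convolution yields a smaller representation: each output pixel summarizes an m × n neighbourhood, the local receptive field of the kernel.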

Activation functions
Regardless of the affine transformation that is used, all layer-specific operations produce outputs that depend on the trainable weights and are passed through an activation function. This function is always non-linear. We only address two activation functions here, as they are the ones most commonly used by the community and the only ones we use: the sigmoid and the LeakyRelu function. The first one is a logistic function used mostly at the outputs of neural networks, and the second one is a piecewise linear activation function used between layers for numerical reasons [46,47]. The sigmoid function is given as:
$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{3}$$
and the LeakyRelu function is given as:
$$f(x) = \begin{cases} x & \text{if } x \geq 0 \\ \gamma x & \text{otherwise} \end{cases} \tag{4}$$
where in both functions $x \in a$ are the outputs of the affine transformation (the convolutional or the fully connected layer operation, i.e., the output of Equation 1 or 2) and γ is the slope for the negative part in the LeakyRelu function and is called leakage.
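As a small illustration, the two activation functions can be implemented as follows. The leakage value γ = 0.2 is an assumption made for this sketch (the text does not state the value used), and the branch in the sigmoid only avoids numerical overflow for large negative inputs.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def leaky_relu(x, gamma=0.2):
    """Identity for x >= 0, slope gamma (the leakage) for x < 0."""
    return x if x >= 0 else gamma * x


print(sigmoid(0.0))           # 0.5
print(leaky_relu(3.0))        # 3.0
print(leaky_relu(-2.0, 0.1))  # -0.2
```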
In Figure 3 the last activation function of the neural network, denoted by Logistic function, is a sigmoid function, because its output can be interpreted as a probability in a Bernoulli distribution, yielding a probability for how likely it is that a given event (an image in our case) is part of a class (in our case, the pre-defined classes from Table 1). Sigmoid functions always give an output between 0 and 1. In our case, we have 11 distinct classes which are mutually non-exclusive, which means every image has a probability of being part of every class. Using a sigmoid function at the end of the neural network therefore yields 11 distinct Bernoulli distributions. The generalization from the single-case Bernoulli distribution to its multi-case n-class equivalent is called the categorical distribution.
Interpreting the output of the neural network, as well as the original labels, as a categorical distribution is key to training the neural network, because only then can we use statistical measures to evaluate the quality of the neural network's prediction, which allows us to optimize it iteratively.
However, due to the non-linearity of all activation functions, optimizing a neural network is a non-convex problem where no global extrema can be found with certainty. The general procedure is that of a forward pass followed by a backward correction. In the forward pass, we feed the neural network several images, take the network's prediction and compare this prediction to the ground truth. Then we calculate a loss function, which is a metric for how bad or good the predictions were (see the next section), and correct the weights of the network in such a way that it would be better equipped to predict the labels for the images it just saw. This correction step starts at the end of the network and uses an algorithm called backpropagation, hence the name backward correction; see section 3.1.4.

The forward pass: Assess the network's predictions
Optimizing a neural network always starts by feeding it multiple images and evaluating what the neural network made of them. For assessing the quality of the network's prediction a so-called loss function is used. It is the defining metric that we seek to minimize during the training of the neural network. In every training step, we compare the output of the neural network to the real labels provided by the researcher and calculate the so-called loss. Lower loss values correspond to a higher prediction quality of the neural net.
Therefore, the goal during the training process is to adjust all weights and biases within the network so that the loss is minimal for all input training images. There are various possible loss functions which often serve a specific purpose. For classification tasks, such as the present case, primarily the cross-entropy is used [48][49][50][51]. Cross-entropy is a concept from information theory giving an estimate about the statistical distance between a true distribution p and an unnatural distribution q. In our case, p is the categorical distribution over the ground truth labels, and q is the output of the neural network.
Cross-entropy is calculated as the sum of the Shannon entropy [52] for the true distribution p and the Kullback-Leibler divergence [53] between p and q.The former is a measure of the total amount of information of p, and the latter is a typical distance measure between two probability distributions.
If the Kullback-Leibler divergence is zero, then the cross-entropy is just the Shannon entropy of p, and we have p = q. Then, the predictions of the neural network are not distinguishable from the labels of all training images.
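The decomposition of the cross-entropy into Shannon entropy and Kullback-Leibler divergence can be verified numerically. The sketch below uses two arbitrary example distributions p and q that are not taken from the data set; natural logarithms are assumed.

```python
import math


def shannon_entropy(p):
    """H(p) = -sum_i p_i log p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


# Arbitrary example distributions (assumed values for illustration only).
p = [0.7, 0.2, 0.1]   # "true" distribution
q = [0.5, 0.3, 0.2]   # model distribution

lhs = cross_entropy(p, q)
rhs = shannon_entropy(p) + kl_divergence(p, q)
print(abs(lhs - rhs) < 1e-12)   # True: H(p, q) = H(p) + D_KL(p || q)
print(kl_divergence(p, p))      # 0.0: identical distributions
```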
Cross-entropy can be formally written as:
$$H(p, q) = H(p) + D_{KL}(p \,\|\, q) \tag{5}$$
where H(p) is the Shannon entropy of p, and $D_{KL}(p \,\|\, q)$ is the Kullback-Leibler divergence of p and q [54]. When using a sigmoid function as activation function on the output layer, the final loss function can be defined as:
$$H(x^{\text{out}}, x) = -\frac{1}{M} \sum_{i=1}^{M} \left[ x_i \log x_i^{\text{out}} + (1 - x_i) \log\left(1 - x_i^{\text{out}}\right) \right] \tag{6}$$
where M is the number of all images in the training data, $x_i^{\text{out}}$ is the prediction for one image from the deep neural network and $x_i$ is the original label of the image, assigned by the researcher. Please see the Supplemental Material [REF] for a complete derivation.
Using Equation 6 as it is would require us to pass all images through the network for one training step, as the sum runs over all images. This is computationally intractable. Therefore, we use a variant of Equation 6 where the sum runs only over a stochastically chosen subset of size bs, called a batch. The size of that batch is called batch size and is an important hyperparameter that needs to be chosen prior to training, see section 3.1.5. One iteration step now involves only bs images from the dataset, and we define an epoch as the number of iteration steps it takes the network during the training to see all images one time.
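The batch-wise evaluation of the loss can be sketched as follows. This is an illustration under stated assumptions rather than the training code of this work: the loss is averaged over all images and classes in a batch, predictions are clipped by a small `eps` to avoid log(0), and the training-set size of 6174 images is derived from the 85 % split of 7264 labeled images described in the training setup.

```python
import math
import random


def binary_cross_entropy(predictions, labels, eps=1e-12):
    """Mean sigmoid cross-entropy; each entry is a list of per-class values."""
    total, count = 0.0, 0
    for x_out, x in zip(predictions, labels):
        for q, p in zip(x_out, x):
            q = min(max(q, eps), 1.0 - eps)   # guard against log(0)
            total -= p * math.log(q) + (1 - p) * math.log(1.0 - q)
            count += 1
    return total / count


def batches(n_images, batch_size, rng):
    """Yield index lists that together cover every image exactly once (one epoch)."""
    indices = list(range(n_images))
    rng.shuffle(indices)
    for start in range(0, n_images, batch_size):
        yield indices[start:start + batch_size]


# Iteration steps per epoch for 6174 training images and a batch size of 48.
steps_per_epoch = math.ceil(6174 / 48)
print(steps_per_epoch)  # 129

# Loss for one image with two classes: the first is predicted well, the second poorly.
print(binary_cross_entropy([[0.9, 0.8]], [[1, 0]]))
```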
To summarize, minimizing the cross-entropy is the goal during the training process in a neural network. The network learns to link the user-defined labels to the provided images. All that is left to understand the basic training process of a neural network is a way to adjust the weights in all layers.

The backward correction: Gradient descent and backpropagation
Optimizing the weights within the neural network so that they give minimal loss for all training images is done using two distinct algorithms: gradient descent and backpropagation. In principle, gradient descent works by evaluating the gradient at some point and then moving a certain step-size in the opposite direction. This is done iteratively until the gradient is smaller than some predefined threshold, which is the numerical equivalent of calculating the extrema of a function analytically.
The basic gradient descent step is given by:
$$w_{\tau+1} = w_\tau - \eta \nabla_{w_\tau} H(x^{\text{out}}, x) \tag{7}$$
where η is the aforementioned step-size, called learning rate, $\nabla_{w_\tau}$ is the gradient w.r.t. the weights at step τ and $H(x^{\text{out}}, x)$ is the loss function from Equation 6. With Equation 7 we could already update the weights within the output layer of the neural network ($z_n(\cdot)$), since for the output layer we can calculate the numerical gradients. However, we cannot do this for the layers that come before the output layer, since we lack a way to include them. In order to propagate the gradient descent correction throughout the network, an algorithm called backpropagation is used [55]. First, we define the gradient of $H(x^{\text{out}}, x)$ w.r.t. the weights at the output of the deep neural network, using the chain rule:
$$\frac{\partial H}{\partial w^{N}_{ij}} = \frac{\partial H}{\partial h^{N}(a^{N}_{j})} \, \frac{\partial h^{N}(a^{N}_{j})}{\partial a^{N}_{j}} \, \frac{\partial a^{N}_{j}}{\partial w^{N}_{ij}} \tag{8}$$
where N denotes the layer depth of the output layer, $h^{N}(\cdot)$ is the activation function used in that layer and $a^{N}_{j}$ are the outputs of the layer-specific operation, as in Equations 1 and 2. Starting from there we include the layer preceding the output layer ($z_n(z_{n-1}(\cdot))$) by making use of the chain rule again:
$$\frac{\partial H}{\partial w^{N-1}_{ki}} = \sum_{j} \frac{\partial H}{\partial h^{N}(a^{N}_{j})} \, \frac{\partial h^{N}(a^{N}_{j})}{\partial a^{N}_{j}} \, \frac{\partial a^{N}_{j}}{\partial h^{N-1}(a^{N-1}_{i})} \, \frac{\partial h^{N-1}(a^{N-1}_{i})}{\partial a^{N-1}_{i}} \, \frac{\partial a^{N-1}_{i}}{\partial w^{N-1}_{ki}} \tag{9}$$
This can be iteratively repeated until the input layer ($z_1(\cdot)$) is included in the calculation. By making use of the chain rule until we reach the input layer, we can include all trainable weights of all layers in the correction term of the gradient descent algorithm. With this, we summarize the full optimization routine in Table 2.
Table 2: The iterative optimization routine for a deep feed-forward neural network.
1. Forward pass: Propagate bs images through the network.
2. Evaluate the predictions: At the output layer calculate the loss between the ground truth and the output of the deep neural network (Equation 6).
3. Construct the backpropagation rule: Include all gradients w.r.t. the weights of all layers according to Equation 9.
4. Backward correction: Update all weights in the network using gradient descent, see Equation 7.
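The four steps of the routine can be traced in a deliberately tiny example. The sketch below is not the network used in this work: it assumes a scalar input, one hidden unit with a LeakyRelu activation (leakage 0.2, an assumed value), one sigmoid output, a toy batch of four labeled points, and a fixed learning rate of 0.1.

```python
import math

GAMMA = 0.2  # assumed leakage of the LeakyRelu activation


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(params, x):
    """Step 1: forward pass through hidden unit (a1, h1) and output unit (y)."""
    w1, b1, w2, b2 = params
    a1 = w1 * x + b1
    h1 = a1 if a1 >= 0 else GAMMA * a1
    a2 = w2 * h1 + b2
    return a1, h1, sigmoid(a2)


def loss(params, batch):
    """Step 2: mean cross-entropy between labels t and predictions y."""
    total = 0.0
    for x, t in batch:
        y = forward(params, x)[2]
        total -= t * math.log(y) + (1 - t) * math.log(1 - y)
    return total / len(batch)


def gradients(params, batch):
    """Step 3: backpropagation, applying the chain rule from the output backwards."""
    w2 = params[2]
    grads = [0.0, 0.0, 0.0, 0.0]
    for x, t in batch:
        a1, h1, y = forward(params, x)
        d_a2 = y - t                                # dH/da2 for sigmoid + cross-entropy
        d_h1 = d_a2 * w2                            # through a2 = w2 * h1 + b2
        d_a1 = d_h1 * (1.0 if a1 >= 0 else GAMMA)   # through the LeakyRelu
        for i, g in enumerate((d_a1 * x, d_a1, d_a2 * h1, d_a2)):
            grads[i] += g / len(batch)
    return grads


def train(params, batch, eta=0.1, steps=200):
    """Step 4: gradient descent update, repeated for a fixed number of steps."""
    for _ in range(steps):
        grads = gradients(params, batch)
        params = [w - eta * g for w, g in zip(params, grads)]
    return params


batch = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]   # toy (input, label) pairs
start = [0.5, 0.1, 0.5, -0.1]                        # w1, b1, w2, b2
trained = train(start, batch)
print(loss(start, batch), ">", loss(trained, batch))
```

Comparing the analytic gradients with finite differences, as done in the accompanying test, is a common way to check a hand-written backpropagation rule.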

Training setup
Of significant importance is the way the network is constructed: How deep should the network be and of what should it consist? For nomenclature, the combination of all used layers, the depth of the network and the used activation functions is called an architecture.
We benchmarked the performance of various architectural choices when used with diffraction images as input and provide the results in the Supplemental Material [REF] and not in the main paper, due to their rather technical character. In short, all architectures are established through extensive empirical research. So far, not only the leading A.I. research institutes, like the Massachusetts Institute of Technology (MIT) or the University of Toronto, but also large companies like Google, Facebook and Microsoft have invested significant amounts of resources to establish well-working out-of-the-box solutions [48][49][50].
Building on this and after extensively benchmarking the most common architectures on our own, we settled on an architecture called pre-activated wide residual convolutional neural network in its 18-layer configuration, called ResNet18 [24,56,57]. In essence, it is a convolutional neural network much like the example in Figure 3, but it employs so-called residual skip connections which increase Accuracy while decreasing training time; see the Supplemental Material [REF] for further details as well as comparisons with other architectures.
After settling on an architecture, training a neural network requires fine-tuning of multiple free parameters. Four of them are critical: the learning rate η, the batch size bs and so-called regularization parameters, of which we have two (which will be introduced at the end of this section).
We set the initial learning rate for the gradient descent algorithm to η = 0.1, see also Equation 7. Throughout the training we multiply η by 0.1 every 50 epochs, which increases the chance for the gradient descent algorithm to get numerically closer to a minimum in the loss function [56]. Furthermore, we use a batch size of 48 for all training procedures, see also the explanations for Equation 6.
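The step-wise learning rate schedule can be written as a one-line function. The values are those stated above; counting epochs from zero is an assumption of this sketch.

```python
def learning_rate(epoch, initial=0.1, factor=0.1, step=50):
    """Learning rate after multiplying by `factor` every `step` epochs."""
    return initial * factor ** (epoch // step)


# Learning rate at the start of each 50-epoch block of a 200-epoch training run.
for epoch in (0, 50, 100, 150):
    print(epoch, learning_rate(epoch))
```

Over 200 epochs the learning rate therefore passes through four plateaus, from 0.1 down to roughly 1e-4.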
We split the manually classified part of the helium dataset into a training and an evaluation subset, where we shuffle the order of all images and then select 85 % for the training set while the rest serves as an evaluation set.
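A shuffled 85 % / 15 % split can be sketched as follows. The fixed random seed and rounding the training-set size down are assumptions of this example; image indices stand in for the actual diffraction images.

```python
import random


def split_dataset(items, train_fraction=0.85, seed=0):
    """Shuffle the items, then cut them into a training and an evaluation part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


train_ids, eval_ids = split_dataset(range(7264))
print(len(train_ids), len(eval_ids))  # 6174 1090
```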
We rescale all diffraction images to 224 px × 224 px, which is necessary to fit the deep neural net on two Nvidia 1080Ti GPUs, each having 11 GB memory. The image dimensions are chosen to be a compromise between file size and resolution. All features we are training the neural network on are still clearly visible and distinguishable after the rescaling.
Furthermore, we face the problem of having a comparatively small training set, consisting of only ≈ 6000 classified images, which could result in a phenomenon called over-fitting: the network memorizes the training set without learning to make any meaningful prediction from it. Therefore, we employ two additional techniques called regularization and data augmentation:
1. Regularization means adding a so-called penalty term to the loss function. There are two regularizations we use, L1 and L2 [58]. These penalty terms are dependent on the weights themselves and not on the labels, making the loss function explicitly dependent on the weights of the neural network. This dependency encourages the neural network to reduce the values of all weights according to the two penalty terms and ultimately find a sparser solution, which in turn helps to prevent over-fitting. Formally we add these two terms to the loss in Equation 6:
$$H_{\text{reg}}(x^{\text{out}}, x) = H(x^{\text{out}}, x) + \alpha \|w\|_1 + \beta \|w\|_2 \tag{10}$$
where $H(x^{\text{out}}, x)$ is the cross-entropy loss function, $\|w\|_1$ and $\|w\|_2$ are the L1- and the L2-norm applied on the sum of all trainable weight parameters, and α and β are so-called regularization coefficients. In our experiments we set α and β to 1 × 10^−5 during training. Using L1 and L2 regularization in combination is commonly referred to as elastic net regularization [58].
2. Data augmentation means creating artificial input images by randomly applying image transformations to the original image, like flipping the vertical or the horizontal axes and adjusting contrast or brightness values randomly. This greatly increases the robustness to over-fitting and is used as a standard procedure when facing small training datasets [59,60].
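Both techniques can be sketched in plain Python. The following is an illustration under stated assumptions, not the implementation used for training: the penalty takes the text's wording literally (the L2-norm itself, not its square), images are nested lists with pixel values in [0, 1], each flip is applied with probability 0.5, and the brightness gain is drawn from the assumed range 0.9 to 1.1.

```python
import math
import random


def elastic_net_penalty(weights, alpha=1e-5, beta=1e-5):
    """alpha * ||w||_1 + beta * ||w||_2 over a flat list of weights."""
    l1 = sum(abs(w) for w in weights)
    l2 = math.sqrt(sum(w * w for w in weights))
    return alpha * l1 + beta * l2


def augment(image, rng):
    """Randomly flip the image and rescale its brightness (assumed ranges)."""
    out = [row[:] for row in image]
    if rng.random() < 0.5:
        out = [row[::-1] for row in out]   # flip along the horizontal axis
    if rng.random() < 0.5:
        out = out[::-1]                    # flip along the vertical axis
    gain = rng.uniform(0.9, 1.1)           # random brightness factor
    return [[min(1.0, max(0.0, gain * px)) for px in row] for row in out]


print(elastic_net_penalty([3.0, -4.0], alpha=1.0, beta=1.0))  # 12.0 = 7 + 5

rng = random.Random(0)
image = [[0.0, 0.5],
         [1.0, 0.25]]
print(augment(image, rng))
```

The penalty would be added to the cross-entropy loss of each batch, while `augment` would be applied to every training image each time it is drawn.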
We were able to train deep neural networks with a depth of up to 101 layers without over-fitting using regularization and data augmentation, see the Supplemental Material [REF]. In all experiments reported here we choose a depth of 18 layers for the neural network, for numerical, memory and time reasons. We trained all deep neural network variants for 200 epochs.

Evaluating a deep neural network
We use three metrics to assess the quality of the predictions from the neural network: Accuracy, Precision, and Recall. We calculated these metrics every 2500 training iteration steps (≈ 52 epochs) using the evaluation dataset. Accuracy is formally defined as:
$$ \mathrm{Accuracy} = \frac{\text{true positives} + \text{true negatives}}{\text{condition positives} + \text{condition negatives}} $$
where condition positives/negatives is the real number of positives/negatives in the data and true positives/negatives is the correct overlap of the prediction from the model and the condition positives/negatives. An Accuracy of 1 corresponds to a model that was able to predict all classes of all images correctly. Therefore, Accuracy is a good measure for evaluating the prediction capabilities of a model when both true positives and true negatives are of importance. Predicting negative labels correctly is of particular interest in the case of the helium dataset because we want to estimate whether the neural network was able to understand the complex inter-class relationships imposed by the researcher. The network should realize that if, for example, one prediction is Spherical/Oblate, it cannot simultaneously be Prolate. Therefore, the network has to produce a true negative for either one of these predictions. However, using only Accuracy as a metric has several downsides. The most important one is the decreased expressiveness of Accuracy when working in a multi-class scenario. In order to understand this, we first introduce Precision and Recall, and then provide an example:
$$ \mathrm{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} $$
$$ \mathrm{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} $$
Precision, also called positive predictive value, is a measure for how reasonable the estimates of the model were when it labeled a class positive, and Recall is a measure for how complete the model's positive estimates were. For example, if the model predicted all training images in the helium dataset to be Spherical/Oblate and nothing else (out of 7264 images, 6589 are indeed Spherical/Oblate), then Accuracy would be 0.767, which translates to 77 % of all labels correctly assigned. However, if the model estimated all images to be part of no class (setting every label to negative), then Accuracy would be 0.801, because out of 79 904 possible labels (11 independent classes for 7264 images), 64 339 are negative. Therefore, we would have a useless model that was still able to predict 80 % of all labels correctly.
Using Precision in both examples would give 0.907 for the Spherical/Oblate example and 0.000 for the all-negative example. Precision is, therefore, a metric that quantifies how well the positive predictions were assigned. Since 91 % of all images are indeed Spherical/Oblate, setting all labels positive in the Spherical/Oblate class can make sense, and Precision also provides insight when the model makes no positive prediction at all, which would be a useless model for our purpose. However, Precision alone is not sufficient as a metric. At this point we do not know whether our model predicted almost every possible positive label correctly or whether only a small fraction of all positive labels were assigned correctly. We therefore need an additional measure for the generalization capabilities of our model. For that reason, Precision is always used in combination with Recall. The Recall for our first example is 0.423 and for the second one 0.000. Recall relies on false negatives instead of the false positives used by Precision, which provides a measure of the completeness of all positive predictions compared to all positive labels within our data. Recall states that our model only captured 42 % of all possible positive labels in the Spherical/Oblate example, showing that the generalization of the model would not be sufficient for a real-world application.
Therefore, a balanced interpretation of these three metrics is necessary to estimate the quality of the models tested here.
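The interplay of the three metrics can be reproduced on a toy example. The sketch below assumes that true/false positives and negatives are pooled over all labels of all images (one possible averaging convention, which may differ from the one used for the tables) and that Precision and Recall are set to 0 when their denominator vanishes; the label matrix is made up for illustration.

```python
import numpy as np


def multilabel_metrics(y_true, y_pred):
    """Accuracy, Precision and Recall pooled over all labels of all images."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    accuracy = (tp + tn) / y_true.size
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return float(accuracy), float(precision), float(recall)


# Toy ground truth: 4 images, 3 classes (rows = images, columns = classes).
y_true = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])

# A model that assigns every image to class 0 and nothing else.
only_class0 = np.zeros_like(y_true)
only_class0[:, 0] = 1

# A model that never predicts a positive label.
all_negative = np.zeros_like(y_true)

print(multilabel_metrics(y_true, only_class0))
print(multilabel_metrics(y_true, all_negative))
```

In this toy case the all-negative model still gets 7 of the 12 labels right, while its Precision and Recall are both zero, which mirrors the argument made above for the helium dataset.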

BASELINE PERFORMANCE OF NEURAL NETWORKS WITH CDI DATA
In this chapter we briefly report on what we call baseline results. We used the previously described ResNet [24] neural network architecture in its basic configuration with a depth of 18 layers, termed vanilla configuration or ResNet18 (see chapter 3.1.5), and trained it with the helium diffraction data set as described in section 2 as well as with a reference data set from the literature [38]. This reference data set was made freely available on the CXIDB by Kassemeyer et al. [38]. It contains diffraction patterns of a number of prototypical diffraction imaging targets, namely the Paramecium bursaria Chlorella virus (PBCV-1), bacteriophage T4, magnetosomes and nanorice. For further experimental details see Kassemeyer et al. [38].
We selected this dataset because a previous publication dealing with it [26] describes, to our knowledge, the current state-of-the-art method for classification and sorting of diffraction images. Bobkov et al. [26] trained a support-vector machine on the CXIDB dataset and inferred the particle type directly from the diffraction images. Overall, they achieved an Accuracy of up to 0.87, but only on selected high-quality images for which the confidence score of the support-vector machine was above 0.75.
Table 3 shows the overall evaluation metrics as well as the training wall time. Train Time$_{\mathrm{Max}}$ is the time when the neural network achieved the highest Accuracy score on the evaluation dataset, and Train Time$_{\mathrm{Full}}$ is the time for training 200 epochs. In practice, we achieved optimal convergence after training for 70 to 100 epochs.
We achieved an Accuracy of 0.967 not only on a high-quality subset of the CXIDB data, as in [26], but on all available data (see table 3), using a vanilla ResNet18 architecture. This shows that a neural network significantly outperforms the current state-of-the-art approach in [26]. In the case of the helium dataset we face a much more complicated multi-class learning problem (one image can belong to multiple classes, whereas in the CXIDB data one image belongs to exactly one class). Nevertheless, we reach a comparable Accuracy score of 0.955. Even more promising, Precision and Recall are very high for both the helium and the CXIDB dataset, showing that the neural network not only predicted the true positives with high confidence and reliability (high Precision), but did so for almost all true positive labels in the evaluation dataset (high Recall).
In the next chapter we show how to further improve on the baseline performance of neural networks with diffraction images as input data.

ADAPTING NEURAL NETWORKS FOR CDI DATA
Here, we describe our contribution for using neural networks in combination with diffraction images.
First, we show in section 5.1 that the performance of a neural network can be enhanced when using a special activation function after the input layer.
Second, in section 5.2 we benchmark the performance of the neural network when using a smaller amount of training data. The idea is to provide an intuition for how much the prediction capabilities deteriorate when a smaller training dataset is used. This is useful because so far a researcher still has to invest a lot of time in preparing the training dataset and, more generally, minimizing the time spent looking through the raw data is the ultimate goal of using a neural network in the first place.
Third, in section 5.3 we propose a novel data augmentation in the form of a custom two-point cross-correlation map that hardens the network against very noisy data. We show that with this augmentation the network is more robust to noise from a uniform distribution added on top of the original diffraction image. This simulates the experimental scenario in which a very low signal-to-noise ratio is unavoidable, e.g., during CDI experiments with very limited photon flux [7] or very small scattering cross sections, as is the case with upcoming CDI experiments on single biomolecules [61,62].

The logarithmic activation function
One of the key additions of this paper is the proposed activation function, formally stated in Equation 11. It is designed to account for the inherent property of diffraction images of scaling exponentially. More generally, the intensity distribution of scattered light on a flat detector follows two laws, depending on the scattering angle that is recorded. For very small angles (SAXS and USAXS experiments) the Guinier approximation is the dominant contribution to the recorded intensity, while for larger scattering angles (SAXS and WAXS experiments) Porod's law becomes dominant [63,64]. Whereas the scattering intensity in the Guinier approximation is proportional to $\exp(-q^2)$, in Porod's law the intensity scales with $q^{-d}$. Here, q is the scattering vector, which describes the scattering angle as well as the wavelength used, and d is the so-called Porod coefficient, which can vary significantly depending on the object from which the light was scattered [63].
In any case, the recorded detector intensity for diffraction images scales exponentially. For this reason we propose a logarithmic activation function of the form given in Equation 11, where α > 0 is a tunable scaling parameter, $c_0 = \exp(-1)$, $c_1 = 1$ and x is the input. We define $c_0$ and $c_1$ so that the activation function is anti-symmetric around 0, which helps speed up training and avoids a bias shift for succeeding layers [65,66].
Since we are using a gradient-based optimization technique, we need to take care that the gradient can propagate throughout the whole network. Otherwise, so-called gradient flow problems arise, which commonly affect deep architectures [46,67]. There are two possibilities for insufficient gradient flow: the gradients become either too small (vanishing gradient) or too large (exploding gradient) when propagating throughout the network. Both scenarios lead to numerical instabilities during training, making convergence for large architectures very hard or even impossible. The reason for this is the backpropagation algorithm, which invokes the chain rule for calculating the gradients. Every gradient is therefore also a multiplicative factor for the gradient of a succeeding layer. For our case, the derivative of Equation 11 with respect to x is given in Equation 12. It shows that the gradient scales with $x^{-1}$, with a discontinuity of size $\alpha c_0^{-1}$ at 0. If we used this activation function for all activations throughout the network, the gradient would have an increased probability to vanish, or explode, the deeper the architecture gets. In addition, the discontinuity at x = 0 could lead to gradient jumps, which would further decrease numerical stability. Therefore, we use the logarithmic activation function only for the first convolutional layer and use a LeakyReLU activation with a leakage of 0.2 on all hidden layers. This compromise still captures the exponential scale of the diffraction images, but without losing numerical stability.
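To make the discussion concrete, the sketch below implements one functional form that satisfies the constraints stated above (a logarithm with $c_0 = \exp(-1)$, $c_1 = 1$, anti-symmetry around 0 and a gradient that falls off as $x^{-1}$), namely $h(x) = \mathrm{sign}(x)\,\alpha\,(\log(|x| + c_0) + c_1)$. This form is an assumption made for illustration and may differ in detail from Equation 11 and Equation 12; the function names are likewise hypothetical.

```python
import numpy as np

C0 = np.exp(-1.0)  # c_0 as stated in the text
C1 = 1.0           # c_1 as stated in the text


def log_activation(x, alpha=0.2):
    """Assumed form: sign(x) * alpha * (log(|x| + c0) + c1)."""
    x = np.asarray(x, dtype=np.float64)
    return np.sign(x) * alpha * (np.log(np.abs(x) + C0) + C1)


def log_activation_grad(x, alpha=0.2):
    """Derivative of the assumed form: alpha / (|x| + c0)."""
    x = np.asarray(x, dtype=np.float64)
    return alpha / (np.abs(x) + C0)


def leaky_relu(x, leak=0.2):
    """LeakyReLU with the leakage of 0.2 used on the hidden layers."""
    x = np.asarray(x, dtype=np.float64)
    return np.where(x >= 0, x, leak * x)
```

With this choice of constants, $\log(c_0) + c_1 = 0$, so the activation passes through the origin, and its gradient takes its largest value $\alpha c_0^{-1}$ at x = 0 and decays as $1/|x|$ for large inputs, which is why a smaller α keeps the gradient magnitude near the origin smaller.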
Since α is a tunable hyperparameter, we conduct experiments with three values, α ∈ {0.2, 0.5, 1.0}, and evaluate its impact on the performance of the neural network.
In Table 4 we provide the evaluation metrics for ResNet18 used with the logarithmic activation function, trained with three different values for α. For comparison, we also provide the results of the unmodified ResNet18, labeled unmodified. The best performing configuration is the one with an α value of 0.2, reaching a maximum Accuracy of 0.965 and thus providing a boost in Accuracy of a full percentage point compared to the unmodified ResNet18. The lowest value for the maximum Accuracy, 0.955, was reached without the logarithmic activation function. Precision and Recall both increase with the addition of the logarithmic activation function. These improvements all come without increasing the training time or the complexity of the model. The maximum achieved Accuracy seems to be anti-correlated with α, with the ResNet18 α=1.0 variant performing worst among the logarithmic variants. We suspect that this is related to the smaller size of the discontinuity of the derivative of $h(a_j)$ when choosing a small value for α, see Equation 12. However, choosing even smaller values for α did not improve the Accuracy further, either because the benefit from the activation function plateaus there or because we reached the classification capacity of this ResNet layout.
These results show convincingly that the addition of the logarithmic activation function improves the overall performance and generalization of the deep neural network. This is to be expected insofar as we imposed a form of feature engineering on the network by exploiting a known characteristic of the dataset. Therefore, without increasing the complexity, the depth or the training time, we showed that using the logarithmic activation improves all relevant evaluation metrics. For this reason, we use the logarithmic activation function with an α value of 0.2 as default for all following experiments.

Size of the training set
In this section, we evaluate the impact of the training set size on the evaluation metrics. To this end, we trained the ResNet18 α=0.2 with a varying number of labeled images. The aim is to provide an intuition for how many images need to be classified manually before the employment of a neural network is useful. We uniformly selected images from the training set but kept the same evaluation dataset described in section 3.1.5. We decreased the size of the training set in three stages (25 % ≡ 1544 images, 50 % ≡ 3088 images, 75 % ≡ 4631 images).
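The uniform selection of a reduced training set can be sketched as follows. The function name, the fixed seed and the rounding of the subset size are assumptions for illustration; the evaluation set is not touched by this step.

```python
import numpy as np


def subsample_indices(n_images, fraction, seed=0):
    """Pick a uniformly random subset of image indices without replacement."""
    rng = np.random.default_rng(seed)
    n_keep = int(round(n_images * fraction))
    return np.sort(rng.choice(n_images, size=n_keep, replace=False))


# Toy example with 1000 training images and the three reduced stages.
subsets = {f: subsample_indices(1000, f) for f in (0.25, 0.50, 0.75)}
```

Each entry of `subsets` holds the indices of the training images that would be kept for that stage.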
Table 5 shows the performance of ResNet18 α=0.2 when trained with datasets of different sizes. For the helium dataset, the maximum achieved Accuracy drops from 0.965 to 0.797 when using only 1544 images instead of the full 6174 images. Even more pronounced is the decline in Precision and Recall from 0.922 and 0.870 to 0.673 and 0.593 for the smallest training set size. The steeper decline of Precision and Recall, compared to Accuracy, can be understood from the fact that the helium dataset predominantly consists of negative ground truth labels (64 339 out of 79 904 labels), to which the neural network resorts in the absence of sufficient training data. Precision and Recall, on the other hand, only provide information about the positive prediction capabilities and their completeness and therefore decrease faster when a smaller training set size is used.
This shows that the number of images is critical for the prediction capabilities of the neural network. The drastic decrease in training set size results in a much worse generalization of the model, which detects only those images that are very close to the ones from the training set and misses most from the evaluation set. The network has not learned the characteristics of a particular class to a point where it can transfer the gained knowledge to other images, which is the one critical property for which we employed a neural network in the first place.
Therefore, if time is limited, one may be well advised to concentrate efforts on preparing a sufficiently large, high-quality training dataset while using, e.g., the neural network approach presented here in its standard configuration.

Using two-point cross-correlation maps to be more robust to noise
This section introduces an image augmentation based on the two-point cross-correlation function, which increases the resistance to noise. We prepare four training sets, each with an increasing amount of noise sampled from a uniform distribution, and analyze the noise dependence of the neural network.
One of the principal problems in CDI experiments, or imaging experiments in general, is recorded noise. Noise often leads to computational problems because noise resistance is a known weak point of a significant fraction of predictive algorithms [27]. In particular, deep neural networks are known to be easily fooled by noise. When noise is added to an image, even noise that may be invisible to the human eye, a neural network can come to an entirely different conclusion, and do so with high confidence, seeing a panda where there was a wolf [68,69]. Therefore, we propose an additional pre-processing step for the input images to increase the noise resistance of the neural network.
To quantify the quality of an image, the signal-to-noise ratio is often used. It is a measure for how much noise is present compared to some information content, where low values indicate that information might be indistinguishable from noise. It has been shown that higher orders of the two-point cross-correlation function (CCF) can be used to increase the signal-to-noise ratio of a diffraction image [70]. Furthermore, since the CCF can be interpreted as an image, see Figure 4 e) to h), we employ this method in a slightly tweaked manner to optimize it for use with a convolutional deep neural network.
In general, the CCF is defined as given in Equation 13, where ∆ is the angular separation, φ is the angular coordinate, and (i, j) denotes the indices of the two scattering vectors $q_i$ and $q_j$. For discrete φ and written as a Fourier decomposition, Equation 13 yields Equation 14 [70], where n denotes the order of the CCF and $I_i^n$ is defined in the expression following Equation 14. Since $C_{i,j} = C_{j,i}$, we can split the final correlation map into an upper and a lower triangular matrix. To maximize information, and to optimally use the local receptive fields of the convolutional layers, we merge the lower triangle from the full CCF calculation, Equation 13, and the upper triangle of order n = 8 from Equation 14. In this way, we combine the plain correlation map with a higher-order map that is more resistant to noise, see Figure 4 e) to h) for a full example.
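As an illustration of the ingredients described above, the sketch below computes a generic angular cross-correlation between all pairs of radial rings of an image that is already resampled to polar coordinates (rows = $q_i$, columns = φ), a single Fourier order of it, and a merged triangle map. The normalizations, the use of the real part and the function names are assumptions; this is not claimed to reproduce Equations 13 and 14 or the released script exactly.

```python
import numpy as np


def angular_ccf(polar):
    """C[i, j, d] = mean over phi of I(q_i, phi) * I(q_j, phi + d)."""
    n_phi = polar.shape[1]
    spec = np.fft.fft(polar, axis=1)
    # Cross-correlation theorem: conj(F_i) * F_j, transformed back along phi.
    cross = np.conj(spec)[:, None, :] * spec[None, :, :]
    return np.fft.ifft(cross, axis=2).real / n_phi


def ccf_order_map(polar, order):
    """Real part of conj(I_i^n) * I_j^n for one Fourier order n (assumed form)."""
    spec = np.fft.fft(polar, axis=1) / polar.shape[1]
    comp = spec[:, order]
    return np.real(np.conj(comp)[:, None] * comp[None, :])


def merge_triangles(lower_src, upper_src):
    """Lower triangle (with diagonal) from one map, upper triangle from another."""
    return np.tril(lower_src) + np.triu(upper_src, k=1)
```

How the ∆-dependence of the full CCF is reduced to a two-dimensional map is not specified in the text. As a placeholder choice, one could call `merge_triangles(angular_ccf(polar)[:, :, 0], ccf_order_map(polar, 8))`, which uses the ∆ = 0 slice for the lower triangle and the order n = 8 map for the upper one.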
To test the robustness of this method, we use the ResNet18 α=0.2 and train it with various pre-processed datasets.
From our original dataset we derive three additional datasets that only differ in the amount of noise added. We do this as follows: First, we calculate the mean, the standard deviation (std) and the maximum intensity value of each image in the original dataset. From these per-image values we calculate the median over the dataset, instead of the mean (due to its increased robustness against outliers), ending up with three statistical characteristics describing the intensity distribution throughout all diffraction images. With these, we define three continuous uniform distributions to sample noise from. A continuous uniform distribution is fully defined by a lower and an upper boundary, a and b, respectively. The probability density for a value to be drawn within these boundaries is constant and non-zero everywhere. For our three noise distributions we always use a lower boundary of 0 and vary the upper boundary so that b is either the mean, the mean + std, or the maximum of the intensity distribution on the images (the three statistical characteristics described above).
For example, to create the maximum noise dataset, we looped through every diffraction image and added noise sampled from the maximum noise distribution. We do this for all three noise distributions. From these three noise-embedded datasets, as well as from our original dataset, we calculate the CCF maps proposed here. This leads to a total of eight data sets; for each of them we train a ResNet18 α=0.2. An example of one image in all eight datasets is shown in Figure 4.
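The construction of the noisy datasets described above can be sketched as follows. The function names and the dictionary keys are hypothetical, and the images are assumed to be stacked in a single array of shape (number of images, height, width).

```python
import numpy as np


def noise_upper_bounds(images):
    """Median over the dataset of the per-image mean, std and max intensity."""
    flat = images.reshape(len(images), -1)
    mean = np.median(flat.mean(axis=1))
    std = np.median(flat.std(axis=1))
    peak = np.median(flat.max(axis=1))
    return {"mean": mean, "mean+std": mean + std, "max": peak}


def add_uniform_noise(images, upper, rng):
    """Add noise drawn from a continuous uniform distribution on [0, upper)."""
    return images + rng.uniform(0.0, upper, size=images.shape)
```

Calling `add_uniform_noise` once per entry of `noise_upper_bounds(images)` yields the three noisy datasets; together with the original images and the four corresponding CCF maps this gives the eight inputs compared below.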
The results for these eight data sets are given in Table 6. Without added noise, the performance of the neural network is much stronger when using the original diffraction images instead of the CCF maps. However, as soon as noise is added, the performance of the neural network trained on diffraction images deteriorates much faster than the performance with CCF maps as input. When the upper boundary of the added noise exceeds the median value of mean + std, the neural network performs better with the CCF maps than with the original diffraction images. Especially with the noisiest dataset, the differences in performance are significant. Precision is increased by 4 percentage points when using the CCF maps as input, showing that our data augmentation may serve as a helpful asset when dealing with very noisy data.
In general, using the CCF maps as input to the convolutional deep neural network is a viable alternative, which should be considered in the case of very noisy data, where it provides a boost to classification results. The downside is that calculating the CCF for every image comes at an additional computational cost. It took us three full days to calculate the CCF maps for all 39 879 images of both datasets on an Intel 6700K quad-core machine using a multi-threaded Python script (also released on GitHub).

WHAT THE NEURAL NETWORK SAW
Neural networks are often considered a black-box approach. We usually do not impose a-priori knowledge on our model; the network learns this on its own. Although this is part of the reason why they are so successful, it also gives rise to doubts about the interpretability of their predictions. Some ways to interpret the decision-finding processes within a trained neural network have been presented in the literature [39,[71][72][73]. In order to get a better understanding of why our deep neural network assigned images to certain classes, we calculated heatmaps using the GradCam++ algorithm [39]. These heatmaps make visible where the network has looked for a particular class, which we do by tracing back the gradient flow from the output layer to the last convolutional layer. The network's class-specific interest directly correlates with this gradient signal because, in essence, we simulate a training step using backpropagation and interpolate the feature maps from the last convolutional layer. A full description of this process is given in the Supplemental Material [REF]. The output of the GradCam++ algorithm provides contour maps whose amplitude is a normalized measure for how much the gradient would impose corrections on the weights if used during training. This gradient flow directly corresponds to what the network deemed the most relevant regions.

[Figure 5, caption (partial):] We chose these classes due to their distinct and distinguishable characteristic shapes, which can easily be identified using the contour maps provided by the GradCam++ algorithm. For each class, we also plot the schematic from Figure 2 at the beginning of each row. GradCam++ contour levels are plotted as dashed lines and used as transparency value for the images from which we calculated them. This way, regions with strong gradients are also brighter.
Figure 5 shows the GradCam++ results for the Streak and Bent classes using our best performing network, ResNet18 α=0.2. We present results from these classes because their distinct spatial characteristics are obvious to the human eye. Therefore, they are ideal candidates to test whether the neural network understood these characteristics. In each row of Figure 2, a schematic sketch of the key feature is depicted together with five randomly selected images from this class.
The GradCam++ contour maps are overlaid on the image. In addition, the contour levels are also used as an α mask for the diffraction image so that the brightest areas in each plot correspond to the ones with the highest gradient flow. In the case of the Streak class, Figure 5 clearly shows that the neural network was able to identify the dominant streak feature regardless of its orientation or size. Results on the Bent class also show a strong correlation between the shape of the contour maps and the bent shape of the diffraction pattern. Therefore, combining the evaluation metrics and the GradCam++ images, we think that the Streak class feature identified by the neural network indeed corresponds to the one seen by the researcher. The Bent class contour maps from the network also show a clear resemblance to the feature intended by the researcher, albeit not so strongly pronounced. Although the deep neural network learned these representations on its own, they align with the intentions of the researcher. This demonstrates that neural networks are capable of learning these complicated patterns on their own.

SUMMARY AND OUTLOOK
In this paper, we give a general introduction to the capabilities of neural networks and provide results on the first domain adaptation of neural networks for the use case of diffraction images as input data. The main additions of this paper are (i) a novel activation function that incorporates the intrinsic logarithmic intensity scaling of diffraction images, (ii) an evaluation of the impact of different training set sizes on the performance of a trained network and (iii) the use of the two-point cross-correlation function to improve the resistance against very noisy data. In addition, we provide a large benchmarking routine, utilizing multiple neural network architectures and layouts, in the Supplemental Material [REF].
We have shown that even in the most basic configuration, convolutional deep neural networks outperform previously established sorting algorithms by a significant margin. More importantly, we improved on these baseline results by modifying the activation function for the first layer. For the case of very noisy data, often a problem in diffraction imaging experiments, we showed that using two-point cross-correlation maps as input data instead of the original diffraction images improves the robustness of the classification capabilities of the network. Our results set the stage for using deep learning techniques as feature extractors for diffraction imaging datasets. The ultimate goal will be to establish an unsupervised routine that can categorize and extract essential pieces of information from a large set of diffraction images on its own. For the near future, we envision merging these advances into an unsupervised approach that connects recent research on Generative Adversarial Network theory [74][75][76][77] and mutual information maximization [78] with the results of this paper. All of the code, written in Python 3.6+ and using the Tensorflow framework, is available on GitHub, free to use under the MIT License. We hope the community uses and improves the code provided in this repository.

ARCHITECTURAL DESIGN CHOICES
In this section, we describe and explain our choices of neural network architecture for establishing a baseline performance when working with diffraction patterns, before the inclusion of our diffraction-specific activation function, see section 5.1 in the main manuscript. We present the theory and background on available architectures and provide results on two architectures with five depth layouts.
There are different layer styles from which we can build a neural network. By convention, a full arrangement of all layers is called the architecture, or configuration, of the network.
For our tests, we use two different neural network architectures, a ResNet and a VGG-Net, both with multiple depth layouts.For the ResNet, we train and evaluate three depth variations (18, 50 and 101 layers), and for the VGG-Net we train two variants (16 and 19 layers).
The structure of this section is as follows: First, we explain how a convolutional layer works in general. Second, we motivate the derivation of the VGG-Net from preceding architectures, and third, we show how the ResNet architecture can be explained by expanding the core ideas used in the VGG-Net. In the following section, we then present the results for all configurations trained here.
Almost every architectural design is empirically derived [1][2][3] and consists of multiple combinations of only a few basic layer styles, namely the fully connected layer, the convolutional layer, a pooling operation and a batch normalization operation. We discuss the pooling and batch normalization layers only in section 3, because of their minor role within the neural network. The reader is also referred to the exhaustive overviews in Schmidhuber [2] and LeCun et al. [3]. Since the convolutional layer serves as a fundamental basis for image analysis with neural networks, we explain it here in more detail.

* julian.zimmermann@mbi-berlin.de
The very basic idea of a convolutional layer is that nearby pixels in an input image are more strongly correlated than more distant pixels. The region of nearby pixels that a filter sees is called its local receptive field. Therefore, by calculating a convolution over an input image with a trainable filter of size > 1 × 1, we can approximate these correlations.
In a convolutional layer, N filters of size M × M slide over an input image and produce N convolved maps, called feature maps. One filter uses the same weights on all parts of the input image for producing one feature map; this is called weight sharing. Weight sharing not only reduces the complexity of the model but also provides a bridge towards the convolution function in mathematics. With weight sharing, we can identify the filter within the convolutional layer with the kernel function of the mathematical convolution. Figure 1 a) shows a schematic of a convolutional layer with one filter.
This exemplary filter of size 3 × 3 slides over an image of size 7 × 7 and produces a feature map of size 3 × 3. The feature map is smaller than the input image because the filter moves two pixels for each step. This step size is called stride.
Hereafter we use the notation conv(a, b, c) for a convolutional layer with filter size a × a, number of filters b and stride c. The example from figure 1 a) could, therefore, be written as conv(3, 1, 2) and would result in 9 trainable weight parameters plus 1 bias parameter (not shown in the figure).
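The numbers of the conv(3, 1, 2) example can be checked with a short sketch. It assumes a square, single-channel image and no padding; the function names are hypothetical and the explicit loops are written for readability, not speed.

```python
import numpy as np


def conv_output_size(input_size, filter_size, stride):
    """Side length of the feature map for a convolution without padding."""
    return (input_size - filter_size) // stride + 1


def conv_parameter_count(filter_size, n_filters, in_channels=1):
    """Trainable weights plus one bias per filter."""
    weights = filter_size * filter_size * in_channels * n_filters
    return weights + n_filters


def conv2d_single(image, kernel, stride):
    """Slide one square filter over a square image (weight sharing)."""
    k = kernel.shape[0]
    out = conv_output_size(image.shape[0], k, stride)
    fmap = np.empty((out, out))
    for r in range(out):
        for c in range(out):
            patch = image[r * stride:r * stride + k, c * stride:c * stride + k]
            fmap[r, c] = np.sum(patch * kernel)  # same weights at every position
    return fmap


print(conv_output_size(7, 3, 2))   # feature map side length
print(conv_parameter_count(3, 1))  # weights + bias of conv(3, 1, 2)
```

For the 7 × 7 input this gives a 3 × 3 feature map and 9 + 1 trainable parameters, in line with the figures quoted above.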
This concept was introduced with the LeNet architecture by LeCun et al. [4], which is considered the seminal work in the field and the first deep convolutional neural network. After Yann LeCun proposed the LeNet architecture, further research [5] led to the now de-facto standard for plain convolutional networks, the VGG-Net. Simonyan and Zisserman [6] proposed the original architecture, which consists of up to 19 weight layers, of which 16 are convolutional layers and 3 are fully connected ones.
It is easy to build, easy to train and provides good results in general [2,3]. For these reasons, we include two variations of it in our tests, namely versions D and E (nomenclature from [6]). Table 1 shows the details of the architecture, using the naming convention we introduced with the convolutional layer.
Simonyan and Zisserman [6] derived the VGG-Net directly from the LeNet by arguing that three convolutional layers with filter size 3 and stride 1 (VGG-Net) achieve better results than only one filter with size 7 and stride 2 (LeNet), which amounts to the same effective local receptive field size [6]. Three layers perform better than one because they have 2 additional non-linear activation functions and reduced complexity (fewer weight parameters because of the smaller filter sizes), which forces the neural network not only to be more discriminative but also to find sparser solutions [6].
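The receptive field and parameter argument can be verified numerically. The sketch below assumes convolutions without dilation and, for the weight count, the same number of input and output channels in every layer with biases ignored; the channel number of 64 is an arbitrary example value.

```python
def receptive_field(layers):
    """Receptive field of stacked convolutions given (filter_size, stride) pairs."""
    field, jump = 1, 1
    for filter_size, stride in layers:
        field += (filter_size - 1) * jump
        jump *= stride
    return field


def stacked_weight_count(filter_size, n_layers, channels):
    """Weights of n_layers convolutions with `channels` inputs and outputs each."""
    return n_layers * filter_size * filter_size * channels * channels


three_small = receptive_field([(3, 1), (3, 1), (3, 1)])
one_large = receptive_field([(7, 2)])
weights_small = stacked_weight_count(3, 3, 64)
weights_large = stacked_weight_count(7, 1, 64)
```

Both stacks cover a 7 × 7 region of the input, while the three small filters need 27 C² weights instead of 49 C² for C channels.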
Building on the results achieved by the VGG-Net, it was shown that the depth of a deep neural network directly relates to its classification capabilities [7][8][9]. This led to the introduction of the so-called residual skip connections, which further exploit this depth-matters concept [8,9]. These residual skip connections are the name-giving components of the ResNet architecture.
In principle, a ResNet still uses the VGG architectural layout but replaces the convolutional blocks 1 to 4 with residual skip connections (compare tables 1 and 2). This exchange drastically reduces the complexity of the whole network while increasing the number of layers.
The VGG architecture can be broken down into six blocks: one input block, one output block, and four convolutional blocks (see table 1). Block 2 is the first block in which VGG variants D and E differ. The VGG-Net architecture showed that increasing the depth and decreasing the number and size of the filters increases the accuracy, which ultimately gave rise to the plain skip connections: blocks of a few convolutional layers designed to replace the large number of filters in one layer with multiple layers of fewer and smaller filters. Two types exist, a classical and a bottleneck skip connection, which differ only in how much the depth is increased and the complexity decreased.
This addition has so far only modified the depth and complexity of the network, and the result is called a plain network, see He et al. [8]. It performs reasonably well but not significantly better than the VGG-net. A residual skip connection differs from a plain skip connection only in adding the identity of its inputs to its outputs. This way, all the convolutional layers in a skip connection learn only a residual of their input. This simple technique enables a ResNet to outperform all other convolutional deep neural network architectures [1,10]. Figure 1 b) exemplifies a classical residual skip connection. There is still an ongoing debate about why a residual neural network performs so well [8,9,11]. Research has shown that ResNets find sparser solutions faster due to their layout, and that they behave like ensembles of shallower networks, with information flow only activated on 10 to 34 layers even when the neural network has a depth of 101 layers [8,9,11].
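The difference between a plain and a residual skip connection can be illustrated with a small NumPy sketch. It is a simplification under stated assumptions: dense matrix products stand in for the convolutional layers, and the function names and toy weights are hypothetical, not taken from our implementation.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def plain_block(x, w1, w2):
    """Two stacked layers: the block has to learn the full mapping."""
    return relu(relu(x @ w1) @ w2)


def residual_block(x, w1, w2):
    """Same two layers, but the input is added back to their output,
    so the layers only have to learn the residual."""
    residual = relu(x @ w1) @ w2
    return relu(residual + x)


x = np.array([[0.5, 1.0, 2.0]])   # non-negative toy input
zero_weights = np.zeros((3, 3))   # layers that have learned nothing yet

plain_out = plain_block(x, zero_weights, zero_weights)
residual_out = residual_block(x, zero_weights, zero_weights)
print(plain_out, residual_out)
```

With all weights at zero the plain block discards its input, whereas the residual block passes it through unchanged. This is one intuitive reading of why adding depth does not make a residual network harder to train.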
However, besides empirical success, one of the critical advantages of ResNets is that reaching training convergence does not become significantly harder when the depth of the neural network is increased, which is usually the case with other architectures. Therefore, training very deep residual neural networks is no more difficult than training shallow plain neural networks [1,12].
For these reasons, we train three variants, with 18, 50 and 101 layers, of a further optimized version of the classical ResNet, called pre-activated ResNet [10] (see table 2 for implementation details).
Table 3 shows the overall evaluation metrics on the helium and the CXIDB dataset. Table 4 shows the per-class evaluation metrics for the helium dataset, which are not needed for the CXIDB dataset because predictions on the helium dataset are a multi-class problem whereas predictions on the CXIDB data are single-class. Single-class (or one-hot) problems have identical overall and per-class evaluation metrics. We trained all models as described in section 3.1.5 of the main manuscript.
Table 3 shows the overall evaluation metrics as well as the training wall time. Train Time_Max is the training time after which the neural network achieved the highest accuracy score on the evaluation dataset, and Train Time_Full is the time for training 200 epochs. In practice, however, we achieved optimal convergence after training for 70 to 100 epochs; beyond that, the network showed overfitting.
Both VGG models took significantly longer to train than the ResNet variants, needing between 6.5 h and 6.7 h for 200 epochs on both datasets, whereas training ResNet101 took only 2.8 h and 2.6 h, respectively. Furthermore, the maximum accuracy reached by both VGG networks is at least half a percentage point below the maximum of ResNet101 (0.959 compared to 0.964 for the helium data and 0.970 compared to 0.978 for the CXIDB data). Also, accuracy did not change much when increasing the depth from 16 to 19 layers; precision even decreased slightly and recall remained unchanged.
On the other hand, increasing complexity within the ResNet architecture helped to boost the accuracy from 0.955 (CXIDB data: 0.967) with ResNet18 to 0.964 (CXIDB data: 0.978) with ResNet101.
For the multi-class results in Table 4, we chose the ResNet101 layout, as it is our best performing configuration.For classes Oblate, Spherical, Streak and Empty a precision of 0.9247 to 0.9770 and a recall of at least 0.9763 show that the majority of all predictions in these classes were correct and virtually no image was missed.
For the classes Prolate, Bent, Double Rings and Layered, the ResNet reached a good precision, but a recall score of ≈ 0.65 shows that it missed almost a third of all available images, indicating we failed to generalize the network for these classes.
For Elliptical, Newton Rings and Asymmetric images, the recall of 0.2207 to 0.4836 shows that these images were a lot harder to find, which is also reflected in the relatively low precision scores for those classes. Elliptical is the only class of these three where precision is high enough for using the neural network as a predictor. For the Newton Rings and Asymmetric classes, with precision scores around 0.6, the neural network is effectively guessing.
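For readers who want to reproduce this kind of per-class reading, the following minimal sketch computes precision and recall per class from binary label matrices. The function name and the toy labels are hypothetical and not taken from our datasets.

```python
import numpy as np


def per_class_precision_recall(y_true, y_pred):
    """Per-class precision and recall for labels of shape (n_images, n_classes)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    true_pos = np.sum(y_true & y_pred, axis=0)
    false_pos = np.sum(~y_true & y_pred, axis=0)
    false_neg = np.sum(y_true & ~y_pred, axis=0)
    # Guard against empty classes by dividing by at least 1.
    precision = true_pos / np.maximum(true_pos + false_pos, 1)
    recall = true_pos / np.maximum(true_pos + false_neg, 1)
    return precision, recall


# Toy example: 4 images, 2 classes; one image of class 1 is missed.
y_true = [[1, 0], [1, 1], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 1], [1, 0], [0, 0]]

precision, recall = per_class_precision_recall(y_true, y_pred)
print(precision, recall)
```

In the toy data the second class has perfect precision but a recall of only 0.5: every prediction made for it is correct, yet half of its images are missed. This is the same pattern as described above for the classes with good precision and low recall.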
The performance of all variants clearly shows the generally good classification capabilities of convolutional deep neural networks for diffraction patterns. Even the lowest performing neural network outperforms previous classification approaches by a large margin (compare with [14]). In particular, the results of ResNet18 are compelling; it is small, easy to train and has relatively low complexity. Although it has only a fraction of the trainable parameters, it performed almost always on par with the much more complex VGG architectures, while taking only 0.2 h to reach the maximum accuracy during training.
Therefore, we chose the ResNet18 layout as the default configuration for all the following experiments; it is an ideal compromise between complexity, training time and classification accuracy.

DERIVATION OF THE BINARY CROSS-ENTROPY
Here, we give a derivation of the binary cross-entropy (Equation 6 in the main manuscript). We start with the most general form of the cross-entropy, given by:

$$H(p, q) = H(p) + D_{KL}(p \,\|\, q),$$

where $H(p)$ is the Shannon entropy of $p$, and $D_{KL}(p \,\|\, q)$ is the Kullback-Leibler divergence of $p$ and $q$ [15]. This is equivalent to:

$$H(p, q) = -\sum_i p_i \log q_i,$$

where $p_i$ and $q_i$ are two probability distributions over the same set of events. $p_i$ is the "correct" distribution, and $q_i$ is the approximation of $p_i$ from the deep neural network. Since we are using a Bernoulli distribution as our probabilistic model, there are only two outcomes that one event ($k$) can have: $k \in \{0, 1\}$. The probability for both outcomes of one event and of both distributions can be written as:

$$p(k = 1 \mid x) = y, \quad p(k = 0 \mid x) = 1 - y, \quad q(k = 1 \mid x) = \hat{y}, \quad q(k = 0 \mid x) = 1 - \hat{y},$$

where $x$ is some event (e.g. the activation in the output layer of the deep neural network), $y$ is the real (ground truth) label of this event and $\hat{y}$ is the approximate probability assigned by the deep neural network. Inserting these probabilities into the sum over both outcomes yields the binary cross-entropy:

$$H(p, q) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y}).$$
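The equivalence between the general cross-entropy and its binary form can also be checked numerically. The sketch below is only an illustration; the function names, the clipping constant `eps` and the example values are assumptions of this sketch. A soft label y = 0.8 is used so that all logarithms stay finite.

```python
import numpy as np


def binary_cross_entropy(y, y_hat, eps=1e-12):
    """-y log(y_hat) - (1 - y) log(1 - y_hat); eps only guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))


def entropy_plus_kl(p, q):
    """General form H(p) + D_KL(p || q) for two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    entropy = -np.sum(p * np.log(p))
    kl_divergence = np.sum(p * np.log(p / q))
    return entropy + kl_divergence


y, y_hat = 0.8, 0.6  # illustrative label and network output
bce = binary_cross_entropy(y, y_hat)
general = entropy_plus_kl([y, 1.0 - y], [y_hat, 1.0 - y_hat])
print(bce, general)
```

Both expressions agree up to floating-point error, as expected for a Bernoulli model with the two outcomes k = 1 and k = 0.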

FURTHER BUILDING BLOCKS OF DEEP NEURAL NETWORKS
This section describes the pooling layer and the batch normalization layer in more detail. Since these components are not critical for the neural network, they are explained only here in the supplemental material.

Pooling
There are two commonly used variants of pooling layers: the max pool and the average pool. The idea is to reduce the dimensionality of the output from a preceding layer (dim x = (N × X)) by letting a filter of size a × a slide over the image with step size b, called the stride, and perform a down-sampling operation.
A max pool filter takes only the maximum value, and an average pool filter averages over all values, within its receptive field [4,5]. This process is analogous to a convolutional operation, but instead of a matrix multiplication with a convolutional kernel, the pooling operation is carried out.
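Both pooling variants can be written as one sliding-window loop that differs only in the reduction applied to each window. This is a minimal, unoptimized sketch; the function name `pool2d` and the 4 × 4 toy image are illustrative assumptions.

```python
import numpy as np


def pool2d(image, size, stride, reduce=np.max):
    """Slide a size x size window over a 2D image and reduce each window."""
    height, width = image.shape
    out_h = (height - size) // stride + 1
    out_w = (width - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * stride:i * stride + size,
                           j * stride:j * stride + size]
            out[i, j] = reduce(window)
    return out


image = np.arange(16, dtype=float).reshape(4, 4)
max_pooled = pool2d(image, size=2, stride=2, reduce=np.max)
avg_pooled = pool2d(image, size=2, stride=2, reduce=np.mean)
print(max_pooled)
print(avg_pooled)
```

With a 2 × 2 filter and stride 2, the 4 × 4 input is reduced to a 2 × 2 output in both cases.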

Batch Normalization
Every layer within a deep neural network is, to some extent, modeling the probability distribution given to it by its preceding layer. It is a hierarchical regression problem, which becomes harder if one layer changes key characteristics of the modeled probability distribution (e.g. the mean, variance or the kurtosis). This shift is then further amplified in every succeeding layer and therefore depends on the depth of the network. This phenomenon is called a covariate shift [16]. Although this problem is solved in a deep neural network via domain adaptation, the costs of a covariate shift are usually much longer training times and reduced accuracy [17].
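For this reason, a batch normalization layer (bn) is used to shift the mean of the mini-batch input to zero and to set the variance to one. This significantly reduces the training time and increases accuracy [18]. bn consists of 4 steps, after which a normalized mini-batch is returned:
1. Calculate the mini-batch mean: $\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$
2. Calculate the mini-batch variance: $\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$
3. Normalize: $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
4. Scale and shift according to adjustable parameters: $y_i = \gamma \hat{x}_i + \beta$
where $m$ is the mini-batch size, $\epsilon$ is a small constant for numerical stability, $y_i$ is the normalized output of input $x_i$, and $\gamma$ and $\beta$ are adjustable parameters.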
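The normalization of a mini-batch can be sketched in a few lines of NumPy. This is an illustration of the training-time forward pass only. The function name, the stabilizing constant `eps` and the random toy batch are assumptions of this sketch, and the running statistics used at inference time are omitted.

```python
import numpy as np


def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch of shape (batch, features), then scale and shift."""
    mean = x.mean(axis=0)                    # mini-batch mean per feature
    var = x.var(axis=0)                      # mini-batch variance per feature
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta              # adjustable scale and shift


rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # shifted, scaled toy input

normalized = batch_norm(batch, gamma=np.ones(4), beta=np.zeros(4))
print(normalized.mean(axis=0), normalized.var(axis=0))
```

With γ = 1 and β = 0 the output has approximately zero mean and unit variance per feature. Other values of γ and β let the network recover a different scale and offset if that is beneficial during training.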

GRADCAM++
In chapter 6 of the main manuscript we show what the neural network deemed the most relevant areas within an input image.We calculated these so-called heatmaps with an algorithm called GradCam++.The main idea is based on Cam [19] and Gradcam [20] and allows for a very intuitive explanation for the decisions made by a convolutional deep neural network [21].
The core principle is that the output of a convolutional deep neural network can be expressed as a linear combination of the globally average pooled feature maps of the last convolutional layer:

$$Y^c = \sum_k w^c_k \sum_i \sum_j A^k_{ij},$$

where $A^k_{ij}$ is one feature map of all $k$ maps from the last convolutional layer and $w^c_k$ are the weights for a particular class prediction $c$ of feature map $k$. $Y^c$ is the predicted probability that the input image belongs to this class $c$. In the GradCam++ formalism the weights can be calculated as:

$$w^c_k = \sum_i \sum_j a^{kc}_{ij} \, \mathrm{LeakyReLu}\left( \frac{\partial Y^c}{\partial A^k_{ij}} \right),$$

where $a^{kc}_{ij}$ are the gradient weights and $\mathrm{LeakyReLu}(\cdot)$ is a rectified linear unit activation function, very similar to the one we used throughout the main manuscript. $a^{kc}_{ij}$ depends only on $A^k_{ij}$ and $Y^c$ via:

$$a^{kc}_{ij} = \frac{\dfrac{\partial^2 Y^c}{(\partial A^k_{ij})^2}}{2 \dfrac{\partial^2 Y^c}{(\partial A^k_{ij})^2} + \sum_a \sum_b A^k_{ab} \dfrac{\partial^3 Y^c}{(\partial A^k_{ij})^3}}.$$

The final heatmap, often called saliency map, can then be obtained:

$$L^c_{ij} = \mathrm{LeakyReLu}\left( \sum_k w^c_k A^k_{ij} \right).$$

So the algorithm propagates an image forward through the network, then calculates the gradients back to the last convolutional layer and, using the gradient weights and the heatmap expression above, obtains a heatmap of the areas within the input image that shows the gradient flow from the convolutional layer.
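The heatmap calculation can be sketched as follows. This is a hedged illustration and not our implementation. It assumes that the first, second and third derivatives of the class score with respect to the feature maps are already available as arrays, it follows the published Grad-CAM++ expressions [21] for the gradient weights, and the function names, the `negative_slope` parameter and the random toy inputs are hypothetical.

```python
import numpy as np


def leaky_relu(x, negative_slope):
    return np.where(x > 0.0, x, negative_slope * x)


def gradcam_pp_heatmap(feature_maps, grad1, grad2, grad3, negative_slope=0.01):
    """Heatmap from feature maps A and derivatives of the class score Y^c.

    All inputs have shape (K, H, W): K feature maps of size H x W from the
    last convolutional layer and the first, second and third derivatives
    of Y^c with respect to each feature map entry.
    """
    map_sums = feature_maps.sum(axis=(1, 2), keepdims=True)
    denominator = 2.0 * grad2 + map_sums * grad3
    safe = np.where(denominator != 0.0, denominator, 1.0)  # avoid division by zero
    grad_weights = np.where(denominator != 0.0, grad2 / safe, 0.0)
    # One weight per feature map: sum of weighted, rectified gradients.
    weights = (grad_weights * leaky_relu(grad1, negative_slope)).sum(axis=(1, 2))
    # Weighted sum of the feature maps, rectified once more.
    combined = np.tensordot(weights, feature_maps, axes=1)
    return leaky_relu(combined, negative_slope)


rng = np.random.default_rng(0)
feature_maps = rng.random((4, 5, 5))
grad1 = rng.normal(size=(4, 5, 5))
grad2, grad3 = grad1 ** 2, grad1 ** 3  # placeholder values, for illustration only

heatmap = gradcam_pp_heatmap(feature_maps, grad1, grad2, grad3, negative_slope=0.0)
print(heatmap.shape)
```

In practice the derivatives would come from the automatic differentiation of the deep learning framework, and the heatmap would be upsampled to the size of the input image before plotting.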
FIG. 1. a) and b) show tic-tac shaped particles whose orientation and size differ. The scattering images are calculated using a multi-slice Fourier transform (MSFT) algorithm that simulates a wide-angle x-ray scattering experiment, which includes 3D information about the particle [7, 20]. Both incoming beams (indicated by the arrow on the left-hand side) produce very different scattering images, yet the dominant feature, an elongated bent streak, is distinctly visible in both calculations. A handcrafted algorithm is typically not able to identify the similarity between both scattering patterns and would classify these two images in two distinct classes, although they belong to the same tic-tac shape class. A deep neural network can learn these complicated similarities on its own when we provide a few manually selected diffraction patterns that contain this feature.

FIG. 2. Characteristic examples for all the classes assigned to the 7264 images by a researcher, except for the Empty class. The top row of every class shows a representative diffraction pattern and the bottom row in b) - d) shows a stylized drawing of the characteristic feature of this class. The bottom row in a) shows an illustration of the name-giving particle shape for the Spherical/Oblate and Prolate class. The shapes are derived from the analysis of the data in Langbehn et al. [1], and they serve as a form of superordinate classes. They are mutually exclusive, and all diffraction patterns are part of one of these two classes. Also, both superordinate classes have subclasses. For example, b) shows the Spherical/Oblate subclasses Round, Elliptical and Double Rings. While a diffraction pattern can be part of the Round and the Double Rings class, it cannot be part of the Round and the Elliptical class. For the Prolate superordinate class, we find analogous subclass rules, although there is no exclusivity rule as there was for the Round and Elliptical classes. Therefore, an image belonging to Bent can also be in the Streaks class. Furthermore, all Spherical/Oblate and Prolate patterns can not only be part of their respective subclass but can also be part of one or more of the classes in the non-exclusive other subclass categories, shown in d). These classes describe general features within the image which are to some extent independent of the particle shape. We derived the superordinate classes from these general features. These complicated inter-class relationships demonstrate the capabilities of a researcher to interconnect mostly distinctive appearing features into a consistent description, ultimately leading to a valid physical interpretation. A hand-crafted algorithm could normally not account for these relationships, but now these interconnections can serve as an additional evaluation metric for the neural network. Since there is no diffraction pattern which belongs to the Spherical/Oblate and Prolate class simultaneously, we can check if the neural network mislabeled a diffraction pattern according to these rules. We can then interpret this as a reliable indicator for a failed generalization of the network. The physics behind these patterns is quite complicated as well; for a rigorous interpretation and analysis of these patterns, please see Langbehn et al. [1].

FIG. 3. Schematic visualization of a convolutional neural network. It shows the hierarchical structure of the network with the function hierarchy z1, . . . zn above each layer. Depicted as input is a diffraction image, which is expanded by 9 trainable convolutional kernels into 9 feature maps. Note that only 1 kernel, producing the last feature map, is shown. The output of the first layer is then passed through multiple convolutional layers; this is the feature extraction part of the neural network. Ultimately, a fully connected layer with a logistic function as activation function produces the predictions. Every layer consists of 2 stages, also indicated by the brackets underneath z1, . . . zn. The first stage is an affine transformation and the second one is a non-linear function, called an activation function. The operation that is used as affine transformation is then the name-giving component for the layer, e.g., a convolutional layer uses a convolution as affine transformation. The choice of the activation function is subject to empirical optimization with various choices possible. Section 3.1.1 describes the affine transformations in more detail and section 3.1.2 covers the basics of activation functions.

FIG. 4. a) to d) show the various stages of added noise to a standard scattering image. e) to h) are the calculated correlation maps with the upper triangle of order n = 8 and the lower triangle from the full CCF calculation.

FIG. 5. GradCam++ results for two distinct classes from the helium dataset. a) shows five randomly selected images from the Streak class and b) shows five images from the Bent class. We chose these classes due to their distinct and distinguishable characteristic shapes, which can easily be identified using the contour maps provided by the GradCam++ algorithm. For each class, we also plot the schematic from Figure 2 at the beginning of each row. GradCam++ contour levels are plotted as dashed lines and used as transparency value for the images from which we calculated them. This way, regions with strong gradients are also brighter.

FIG. 1. Schematic for a convolutional operation inside a convolutional layer in a), and for a classic skip connection found in the ResNet architecture in b). a) illustrates the local receptive fields and shared weights concept. The convolutional filter has size 3 × 3 and stride 2 and slides over the input image of size 7 × 7, which produces an output, called feature map, of size 3 × 3. The stride is the distance the filter moves in each step, which is indicated by the gray shading every 2 pixels in the input image. Using a local receptive field describes the inclusion of nearby pixels, and weight sharing means using the same filter weights for the whole input image. The calculation at the bottom is for the second entry in the feature map. b) A classical skip connection is shown with two convolutional layers that approximate a sparse residue which gets added to the identity at the output.

4. Scale and shift according to adjustable parameters.

TABLE 1. Statistics of the helium nanodroplets dataset. Non-exclusive labels assigned by a researcher. One image can be in multiple classes. Total dataset size is 7264. Note that Spherical/Oblate as a class also contains Round patterns; only Prolate shapes are excluded from this class (see also caption of Figure 2).

TABLE 3. Overall evaluation metrics for the ResNet18 architecture (vanilla configuration) and both datasets. The table gives the max values during training for Accuracy, Precision, and Recall. The training time after which the neural network achieved the highest Accuracy score on the evaluation dataset is labeled Train Time_Max and the time for training the full 200 epochs is labeled Train Time_Full. See also the Supplemental Material [REF] for further details.

TABLE 4. Evaluation metrics for the ResNet18 network with and without the logarithmic activation function. We benchmark three values for α. Results are shown for both datasets and are the maximum value recorded during training. Bold numbers indicate the best scores across their respective category.

TABLE 5. Evaluation metrics of the ResNet18_{α=0.2} network with the logarithmic activation function and an α value of 0.2. Results are shown for the helium dataset and reflect the maximum value reached throughout the training process, assessed on the evaluation dataset. Bold numbers indicate the best scores across their respective category.

TABLE 6. Evaluation results when training a ResNet18_{α=0.2} on the original diffraction images and on CCF maps calculated from them. The results reflect the maximum value achieved throughout the training process, assessed on the evaluation dataset. Bold numbers indicate the best scores across their respective category. Input data columns: CCF Maps and Diff. Imgs.

TABLE 1. The deep neural network architecture of the VGG variants D and E. conv(a, b, c) is a convolutional layer with filter size a × a, number of filters b and stride c. max pooling(d, e) is a max pooling layer with filter size d × d and stride e. Note that we changed the fully connected layer of the original architecture to a convolutional layer.

TABLE 2. Used ResNet variants, see also the 18, 50 and 101 layer layouts in [8]. Note that we added the pre-activated layer layout from [13]. conv(a, b, c) is a convolutional layer with filter size a × a, number of filters b and stride c. max pooling(d, e) is a max pooling layer with filter size d × d and stride e. avg pooling is a global average pooling layer, and fc(f) is a fully connected layer with output size f. Layers in bold emphasis have a stride of 2 during their first iteration, therefore reducing the dimension by a factor of 2.

TABLE 3. Overall evaluation metrics for all architectures and both datasets. The training time after which the neural network scored the highest accuracy score on the evaluation dataset is labeled Train Time_Max and Train Time_Full is the time for training the full 200 epochs. The table gives the max values during training for accuracy, precision, and recall. Bold scores are the best results in their respective category.

Dataset  Metric               ResNet18  ResNet50  ResNet101  VGG16 (D)  VGG19 (E)
Helium   Accuracy             0.955     0.958     0.964      0.958      0.959
Helium   Precision            0.918     0.917     0.925      0.923      0.920
Helium   Recall               0.866     0.864     0.878      0.867      0.867
Helium   Train Time_Max [h]   0.231     0.605     0.940      3.271      6.615
Helium   Train Time_Full [h]  0.694     1.814     2.820      6.541      6.726
CXIDB    Accuracy             0.967     0.973     0.978      0.970      0.970
CXIDB    Precision            0.932     0.937     0.949      0.944      0.943
CXIDB    Recall               0.933     0.937     0.941      0.904      0.904
CXIDB    Train Time_Max [h]   0.278     1.205     1.093      4.374      4.480
CXIDB    Train Time_Full [h]  0.668     1.807     2.623      6.562      6.720

TABLE 4. Per-class accuracy, precision and recall values for the best performing ResNet configuration with 101 layers. Samples are the number of images whose ground truth label is positive in the evaluation dataset. Results are shown for both datasets and reflect the maximum value reached throughout the training process, assessed on the evaluation dataset.