Identification of Anomalous Diffusion Sources by Unsupervised Learning

Fractional Brownian motion (fBm) is a ubiquitous diffusion process in which the memory effects of the stochastic transport result in the mean squared particle displacement following a power law, $\langle {\Delta r}^2 \rangle \sim t^{\alpha}$, where the diffusion exponent $\alpha$ characterizes whether the transport is subdiffusive, ($\alpha<1$), diffusive ($\alpha = 1$), or superdiffusive, ($\alpha>1$). Due to the abundance of fBm processes in nature, significant efforts have been devoted to the identification and characterization of fBm sources in various phenomena. In practice, the identification of the fBm sources often relies on solving a complex and ill-posed inverse problem based on limited observed data. In the general case, the detected signals are formed by an unknown number of release sources, located at different locations and with different strengths, that act simultaneously. This means that the observed data is composed of mixtures of releases from an unknown number of sources, which makes the traditional inverse modeling approaches unreliable. Here, we report an unsupervised learning method, based on Nonnegative Matrix Factorization, that enables the identification of the unknown number of release sources as well the anomalous diffusion characteristics based on limited observed data and the general form of the corresponding fBm Green's function. We show that our method performs accurately for different types of sources and configurations with a predetermined number of sources with specific characteristics and introduced noise.


Introduction
Anomalous diffusion has been observed in numerous systems, and a variety of underlying mechanisms have been discussed [7,28]. Anomalous diffusion is associated with the nonlinear dependence of the mean square displacement on time, ∆r(t) ∼ t α with α = 1. Different mechanisms can lead to the same asymptotic dependence of the mean square displacement on time, or alternatively, to the same diffusion exponent. However, such processes can differ in their propagator, probability distribution function [7,28], aging [4], and ergodic properties [5]. Fractional Brownian motion (fBm) is a common model for anomalous diffusion which stems from long range correlations, stationarity and scaling of the increments [27]. The fBm has applications in many fields including finance, climate, solar activity, hydrology, turbulence and many others [6]. The fBm diffusion exponent, α, determines the diffusion regime. When α < 1, the process is subdiffusive; when α > 1, the process is superdiffusive; and when α = 1, we have the normal Brownian diffusion [7]. Examples of subdiffusion are the kinetics of passive molecular tracers in lipid bilayers [20], hopping transport in disordered systems [25], propagation of nonlinear waves in quasiperiodic potentials [22], evolution of index prices in financial systems [36], and many others. Superdiffusion is typically associated with active processes, and has been observed in living cells [9], in radiative transport [33], in intracellular particle motion [34], in transport processes in porous media [31,30,14], and in many other cases.
Due to the non-Markovian nature of fBm, much effort has been devoted to the development of inference methods for the diffusion characteristics from observed data. Most of this research has focused on parameter inference using inverse modeling [13,10,39,40,17,19,32,18,43,8]. Inverse modeling typically requires identification of the characteristics of the sources as well as of the medium. For fractional diffusion, there is the additional complexity of the propagator describing the process, which has focused the inverse modeling efforts on one-dimensional problems [24,10,39,17,19,18,43], while less attention has been devoted to two-and three-dimensional problems [42,32]. Furthermore, most research in this field assumes that the measurements exist at high spatial and temporal resolutions, a situation that is rarely encountered. Usually, the identification of release sources relies on solving a complex ill-posed inverse model against limited amounts of observed data. Importantly, in most of the naturally occurring fBm phenomena, the number of the release sources (and their locations and strengths) is unknown, which severely limits the reliability of the classical inverse modeling.
Recently, a hybrid method based on Matrix Factorization (NMF) combined with Brownian diffusion Green's function, called hNMF, was proposed for identifying the properties of unknown number of emission sources releasing simultaneously [38].
Here, we report a generalization of hNMF that enables identification of fBm diffusion processes (sub, super or normal diffusion) based on limited observed data and fBm Green's functions. To demonstrate the performance of our generalized method, we generate examples that include anisotropic two-dimensional fBm with drift and a predetermined number of release sources. We show that our method determines accurately the unknown number, locations, and properties of the release sources used to generate the data (including point sources as well as spatially and temporally extended sources). We also demonstrate that the generalized hNMF correctly determines the generalized diffusion coefficients, the advection velocity, and the diffusion exponent, α, and that it accurately estimates the spatial and temporal extension of the emission in the presence of noise.

Problem Formulation
We focus on the problem of identifying the anomalous diffusion process characteristics based on a limited number of concentration time series recorded by spatially fixed detectors. This is fundamentally an inverse problem, but unlike many previous works, our method is based on the recording of the concentration at specific locations rather than on single particle tracking. The general problem is complex due to the large number of parameters and the need to identify several different characteristics. The fundamental step is to be able to identify the unknown number of sources and their locations and strengths. This is not a simple task due to the fact that the concentrations, recorded by each detector, potentially contain contributions from all sources. Further, each source may have a finite extent in space and may have a temporally varying amplitude of emission. In addition to the number of sources,Ñ s , their locations, x s , y s , and strengths, q s (where s = 1, 2, ...,Ñ s ) the medium characteristics must also be determined. The medium, in which the fBm process takes place, has the following characteristics: 1) the type of diffusion, i.e., super, sub or normal diffusion, which is estimated by the value of the diffusion exponent, α (see subsection 3.1); 2) the generalized diffusion coefficients, D x and D y , which for the general anisotropic case, can be different in the two (x and y) directions; and 3) the drift velocity, u, which for transport in porous media, affects the diffusion.
To illustrate the complexity of the problem, we present in Fig. 1 temporal snapshots of concentration profiles for subdiffusive (α = 0.2), normal diffusive (α = 1), and superdiffusive (α = 1.8) transport (to generate the synthetic data, we used parameters that are typical of the transport and dispersion of contaminants in an aquifer [38]). In each case, emission originates instantaneously at t 0 = −5 years from three point sources, S 1 , S 2 , and S 3 , marked by red diamonds. In all three cases, the flow is subjected to a drift velocity u = (u x , 0), where u x = 0.05 km/year in the x-direction. In the subdiffusive case, the contribution from each source remains relatively distinct throughout the illustrated time period of 20 years. However, for the normal and superdiffusive cases, the contributions from the three sources increasingly mix and become indistinguishable. In reality, complete spatio-temporal concentration profiles as shown in Fig. 1 are unavailable. Concentration information is only available at a limited number of locations, where the detectors are positioned, shown as black points in Fig. 1. Examples of concentration times series from a single detector located at (x d , y d ) = (0.4 km, −0.2 km) (marked by a red circle in the upper right panel of Fig. 1) are given in Fig. 2 for the three different types of diffusion discussed in Fig. 1. Again, as a result of the slow diffusion in the subdiffusive case, α = 0.2, the signal arrives at the detector in the form of a narrow and concentrated peak. However, for normal diffusion, α = 1, and super diffusion, α = 1.8, the signal is less concentrated and spread over a much longer time period as a result of the faster dispersion processes and the mixing of the source contributions. The extension of hNMF, we report here, correctly estimates the unknown number of point sources and evaluates the sources' locations and strengths, as well as the medium properties (diffusion exponent, anisotropic generalized diffusion coefficients, and drift), from a limited number of time series similar to those shown in Fig. 2. More complicated scenarios, such as temporally varying emission rates and spatially extended sources, were also considered.  3 Methods

Fractional Brownian motion Green's function
There are various mechanisms that lead to anomalous diffusion [7,28]. The simplest mechanism is fractional Brownian motion (fBm). fBm relies on a memory effect that can be introduced into the Langevin equation describing the dynamics of a particle [26]. The mechanism was suggested to be relevant to many processes [6]. Despite the memory effects, fBm is a Gaussian process, in which, unlike classical Brownian motion, the increments of the fBm are not independent. The 2D Fokker-Planck equation describing Galilei invariant fBm is: where C is the concentration, α is the diffusion exponent (0 < α < 2), D x , D y are the generalized diffusion coefficients in the x and y directions, respectively, u x is the drift velocity in the x direction (the coordinate system is chosen such that u y ≡ 0), and Q( r, t) is the source function.
The Green's function (propagator) of the above Fokker-Planck equation, with a physical boundary condition requesting the concentration to vanish at an infinite distance from the source, is given by (see [29] for the 1D propagator): , the concentration is given by the sum of the contributions of the different sources (the contribution of each point source is give by Eq. (2) multiplied by the source strength, q s ): For release sources whose emission is finite in duration, the source function, Q T ( r, t), takes the form, where Θ(t) = 0, if t < 0, and Θ(t) = 1, if t ≥ 0, while t on s , t f s are the beginning and end times of emission by the s'th source, respectively. In this case, for t > t f s > t on s for s ∈ [1,Ñ s ], the concentration is given by: Here the limits of integration in Eq. (5) arise from the finite emission duration of the sources.
For spatially extended rectangular sources, the source function, Q S ( r, t), takes the form, Here, x l s , x r s are the left/right boundaries of the s'th source and similarly for the y coordinate to define rectangular sources. t s is time at which the s'th source emitted. The concentration in this case is given by:

The Hybrid Nonnegative Matrix Factorization (hNMF) method
The Hybrid Nonnegative Matrix Factorization (hNMF) method reported in Ref. [38] combines: (i) Green's function of the Fokker-Planck equation (FPe) that describes normal diffusion transport with (ii) a nonlinear iterative minimization procedure and (iii) a customized clustering, introduced previously as a method for estimating the number of the latent features in NMF [2]. It was shown that the hNMF can successfully identify the unknown number of release sources, as well as the parameters of the FPe. To do this, the hNMF's algorithm explores the space of plausible solutions and narrows the set of possibilities by estimating the optimal number of release sources needed to reconstruct the observed data in a robust manner. Here, we generalize the hNMF to be adequate for fBm processes and show that the extended method can accurately determine the number of release sources and their locations and strengths, as well as the diffusion exponent and the transport properties of the medium (D x , D y , u x ). We also demonstrate the importance of this extension for the correct identification of the unknown number of release sources. In what follows, we assume point sources with the Green's function given by Eq. (2); for the other source types, it has to be replaced by the appropriate Green's function.

Nonnegative Matrix Factorization (NMF)
The usual interpretation of NMF is as a method for a low-rank matrix approximation of the observed data-matrix, C, whose size is T × n, by two unknown matrices W and H, C ≈ W H, both containing one small dimension, N s (which is the estimate for the actual number of sources,Ñ s ). This approximation (aka matrix decomposition) is performed through non-convex minimization with a given distance metric, ||...|| dist : min||C ij − Ns s=1 W is H sj || dist constrained by the non-negativity of W and H, W is ≥ 0; H sj ≥ 0. NMF has proven very useful for face recognition, text recognition, dimension reduction, unsupervised learning, anomaly detection, Blind Source Separation (BSS), etc. [12]. NMF is underpinned by a well-defined statistical model of superimposed components that, when the distance metrics ||...|| dist is the Euclidean distance, can be treated as a Gaussian mixture model [16]. In this probabilistic interpretation, NMF is equivalent to the expectation-maximization (EM) algorithm [15]. EM is developed to find the maximum likelihood estimates of parameters in statistical models, when the model depends on unobserved, i.e., latent or hidden, variables. In this probabilistic interpretation of NMF, the observables, c 1 , c 2 , ..., c n (c i is a column vector with T elements), which are the columns of the data, C, are generated by N s latent variables h 1 , h 2 , ..., h Ns . Specifically, each observable c i is generated from a probability distribution with mean, c i = Ns s=1 W is h s , where N s is the (unknown) number of latent variables. The influence of h s on c i is through the basis patterns/features of the considered phenomenon w :s represented by the columns of W [23]. It is known that the probabilistic interpretation of NMF is particularly valuable when dealing with stochastic signals [16].
In our case, the observed data, C, are formed by the mixing ofÑ s signals, at the locations of each one of the n detectors, r d ≡ (x d , y d ); d = 1, ..., n at the times of the records t m ; m = 1, 2, ..., T . Each one of the n detectors is situated at a different location, r d = (x d , y d ), and measures a mixture of the signals originating fromÑ s point sources located at the positions r s = (x s , y s ), s = 1, 2, ...,Ñ s with respective strengths q s . The fractional diffusion, characterized by the diffusion exponent α, occurs in a medium with generalized diffusion coefficient (D x , D y ) and drift u x . The transient signal recorded by the d th detector is assumed to be generated from a normal probability distribution with mean where d marks the detectors, and m marks the time points of the records.
The minimization of the objective function, O, assumes that each measurement at a given space-time point ( r d , t m ) is an independent Gaussian-distributed random variable. If each detector has its own distinct (possibly time-dependent) measurement error, the objective function in (9) should be replaced by a weighted sum of least squares, where each deviation from the corresponding observation is weighted by the inverse square of its measurement error.

Custom clustering and NMFk
NMF is sufficient to carry a constrained optimization problem to extract desired parameters when the number of sources, N s , is known; however, we rely on the signals measured by the detectors, which record mixtures arising from an unknown number of sources. To determine this unknown number of sources (that is, the number of the latent features), we utilize an approach called NMFk, introduced in earlier works [2,1,38].
NMFk explores the possible number of sources, N s , starting from N s = 1, 2, ..., P (P is less than min(n, T )). For each explored number of sources, N s , a set U Ns (we call it a run) of M ∼ 200 minimizations are computed, each minimization generated from random initial guesses for the unknown parameters (within specific physical bounds). The set of M solutions in the set U Ns , for N s sources, is given by where (X, Y, Q) i denotes the unknown coordinates and strengths of the N s sources; [(x 1 , y 1 , q 1 ), ..., (x Ns , y Ns , q Ns )] i in the i th NMF minimization with advection velocity, generalized diffusion coefficients and diffusion exponent This condition has to be enforced since each NMF minimization (producing a given [(X, Y, Q) i , (u x , D x , D y , α) i ] tuple) contributes only one solution, and accordingly has to supply exactly one element to each cluster. During the clustering, similarity between the elements is measured by cosine similarity, which calculates the cosine of the angle between two vectors and is a natural choice for similarity between non-negative vectors [21].
Further, to identify the optimum number of sources, NMFk calculates the clusters' stability for each explored number of sources, N s . The optimum number of sources, N s , which is used to estimate the trueÑ s , is the number of sources for which the corresponding clusters are relatively stable and separable and whose centroids result in a small reconstruction error (see below).
To quantify the stability and separability of the clustering for a given number of sources, NMFk utilizes the Silhouette statistics, S [35], which is developed to measure the similarity between an element and the elements of its own cluster compared to the centroids of other clusters. The S values are between [−1, 1], and S measures how well each element has been classified by the clustering. The main idea of NMFk is to use the cohesion of the clusters (how compact they are) and the separation between them as a measure of the stability of the solutions of the minimization with different random initial guesses and hence the quality of a particular choice of N s . In previous works, NMFk even "shuffled" randomly the observed data, C, to increase the effect of robustness of the average solutions [3,37] and for the large scale variations [11].
The intuitive reasoning behind the idea of stability is as follows. In the case of underfitting, i.e., for solutions with a number of sources less than the actual number of sources, the clustering could be good; for example, several of the sources could be combined to produce a "super cluster." However, the clustering will break down significantly in the case of overfitting, when the estimated number of sources exceeds the true number of sources. Indeed, in this case, we do not expect the solutions to be well clustered, since there is no unique way to reconstruct the solutions with number of clusters >Ñ s , and at least some of the clusters will be artificial, rather than real entities.
The other metrics that NMFk utilizes is the relative reconstruction error, R = ||C − W H||/||C||, which measures the relative deviation of the obtained solution, W H, from the original data, C. The reconstruction error, R, evaluates the accuracy with which the average solutions, that is, the solution constructed with the parameters taken from the centroids of the clusters, reproduce the observed data, C. In general, the solution accuracy increases (while the stability of the clusters decreases) with the increase in the number of the sources (increasing the number of the parameters in the minimization).
In summary, NMFk calculates the average Silhouette width, S, and the average reconstruction error, R, for each choice of the unknown number of sources, in order to estimate the true number of release sources,Ñ s . NMFk determines N s to be equal to the number of sources that accurately reconstruct the observations (i.e., their relative reconstruction error, R, is small enough) and the clustering of the sets of solutions corresponding to N s , obtained with random initial conditions, to be sufficiently robust (i.e., the average Silhouette width, S, to be close to 1).

Generation of Synthetic Data
In order to validate the generalization of the hNMF method, we generated synthetic data sets using Eq. (3) for different types of diffusion (dictated by the values of α). The parameters that were used here to generate the synthetic data are typical for groundwater contamination [38]. The parameters correspond to the case of normal diffusion, and for the case of anomalous diffusion, the values of the Brownian dispersion coefficients were used as generalized diffusion coefficients. We used: D

Generalized hNMF is needed for the accurate identification of fBm sources
In order to illustrate the importance of the generalized hNMF method, we considered synthetic data generated using α = 0.8 (subdiffusion) and α = 1.2 (superdiffusion), the parameters listed in section 4, and the configuration of sources and detectors shown in Fig. 1. Both values of the diffusion exponent are close to normal diffusion, α = 1. We ran the hNMF with α set to 1 (i.e., removing the degree of freedom of the diffusion exponent). The reconstructed signals at the locations of three different detectors are depicted in Fig. (3). The synthetic data corresponding to α = 0.8 and α = 1.2 are presented by solid black lines in the left(right) panels, and the circles represent the reconstructed signals at these locations using 2, 3, or 4 point sources. Fig. (3) demonstrates that in all cases, the detector observations are well approximated despite the wrong propagator used and the wrong number of sources. It is, therefore, clear that the ability to approximate detector observations alone is not sufficient to determine the number of sources and their locations or the diffusion characteristics. If the α is not forced, we see slightly better reconstructions as generalized hNMF recognizes the correct number of sources with the appropriate diffusion characteristics. shows the same information for α = 1.2. The different rows correspond to different detectors whose coordinates are specified. The estimated signals are all for normal diffusion α = 1.
In Fig. 4, we present the reconstruction metrics, R, and average Silhouette, S, for the different numbers of sources (N s ) for α = 0.8 (subdiffusion) (upper row) and α = 1.2 (superdiffusion) (lower row). The left column corresponds to a forced α = 1, that is, to the old hNMF. The right column corresponds to the results of the generalized hNMF with the diffusion exponent derived by the minimization described in section 3.2. From Fig. 4, it can be concluded that with α set to one, (left column), both the reconstruction, R, and the average Silhouette, S, decrease as the number of sources increases, for both subdiffusion and superdiffusion. This observation implies that, despite the better fit of the data by the estimated parameters, the increase in the number of sources worsens the clustering quality (a reduced Silhouette). The reduction of the average Silhouette stems from the fact that the old hNMF model used to describe the data does not account for anomalous diffusion. In the right column, we show that when the generalized hNMF is used, with the diffusion exponent extracted by the minimization, for both subdiffusion and superdiffusion, the maximum value of the silhouette and the minimal value of the reconstruction are obtained for the correct number of sources (three) that was used to generate the synthetic data.

Generalized hNMF accurately identifies fBm sources and properties of the medium
We applied the generalized hNMF method to different configurations of point sources and detectors and found that the method provides a correct estimate of the number of sources; their strengths and positions; and medium characteristics (drift velocity, generalized diffusion coefficients, and the diffusion exponent). Figure 5 demonstrates the results of our extended hNMF method, for two different numbers of release point sources and spatial configurations, and for three values of the diffusion exponent corresponding to subdiffusion, normal diffusion, and superdiffusion. Subfigures The left column corresponds to the method with α set equal to one (i.e., looking for the best normal diffusion parameters to describe the measurements), and the right column corresponds to the full method with α being extracted by the method.
subfigures that our extended method correctly estimates the number of sources, their characteristics, and the transport characteristics for all the cases of the diffusion exponent considered here.
We also tested the performance of the method with configurations consisting of temporally extended and spatially extended sources. In all these cases, the generalized hNMF method provides accurate estimates of the duration of emission and the spatial extension of the sources. We also verified that for the synthetic data that was generated using point sources, when the method was implemented using the propagator for spatially extended sources, the estimated size of the sources was smaller than any spatial scale of the configuration (the dispersion radius between sequential measurements, the distance between detectors, the distance between sources, and the drift distance between sequential measurements).

Noisy data
Real data often includes noise. The noise can be due to the measuring device/s or due to other effects which are not accounted for in the Green function describing the process. In many realistic cases, the noise is proportional to the signal. In order to test the applicability and robustness of the hNMF, we considered noisy data. The data was generated in the same way described in section 4 for the configuration shown in Fig. 1. In order to add the noise, the precisely calculated concentration (at each time and position) was multiplied by a random number drawn from a uniform distribution according to: The probability density function of z ( r, t) is simply In Fig. 6, we present the reconstruction error and the average silhouette for different numbers of sources. The three panels correspond to the specified values of the diffusion exponent, α (corresponding to sub, normal and super diffusion).
For the cases of subdiffusion (α = 0.5) and normal diffusion (α = 1), the reconstruction error and the average silhouette show a similar behavior to that found for data in the absence of noise, and the correct number of sources is identified. For the superdiffusive case, where the mixing of the signals from the three sources is stronger, for some realizations of    the noise, the average silhouette does not show the same sharp decrease, and the identification of the number of sources is more challenging. For lower values of the noise, the hNMF shows the same performance as for the exact data.
The noise also introduces uncertainty in the estimated values of the parameters. In Fig. 7, we present the estimated values of the parameters for the same diffusion exponents shown in Fig. 6.
Each parameter was normalized by the true value in order to present the relative error and uncertainty. The uncertainty range is larger for larger diffusion exponents where the mixing of the signals from the different sources is larger. For the normal diffusion and the superdiffusion, the uncertainty in the estimation of the amplitude of the third source is the largest. The noise is proportional to the total signal, and the third source has the smallest amplitude. Therefore, the noise relative to the signal from this source is much larger and it affects the uncertainty associated with its estimated amplitude.

Conclusions
Inverse modeling of anomalous diffusion, is of great interest in many fields, including material science, physics, finances, biology, and many others. The value of an inverse modeling approach lies in its ability to correctly estimate the unknown number of release sources and their characteristics, as well the medium characteristics and the diffusion exponent. We demonstrated that the generalized hNMF introduced here, which is based on the integration of unsupervised learning and Green's function of the fBm governing equation, is capable of providing accurate estimates of the number of sources, their properties, and the medium characteristics that result in sub, Brownian, or super diffusion. The method was shown to work with different types of sources, including point sources, temporally extended sources, and spatially extended ones with various configurations. The generalized hNMF method also provides information regarding the validity of the type of the propagator that is used. If the propagator is not appropriate (e.g., if it is the propagator of normal diffusion), the method showed that it is impossible to have maximal Silhouette and minimal reconstruction metrics for any number of sources. When the adequate propagator is used, the optima that were found corresponded to the configurations that were used to generate the synthetic data. The results presented here and previous successful applications of the hNMF for inverse problems suggest that this method might be further extended to account for different mechanisms of anomalous diffusion (such as the continuous time random walk and quenched disorder). Of particular interest are generalizations of the hNMF that would enable the extraction of the scaling relations based on limited observations.