Regressive and generative neural networks for scalar field theory

We explore the perspectives of machine learning techniques in the context of quantum field theories. In particular, we discuss two-dimensional complex scalar field theory at nonzero temperature and chemical potential -- a theory with a nontrivial phase diagram. A neural network is successfully trained to recognize the different phases of this system and to predict the value of various observables, based on the field configurations. We analyze a broad range of chemical potentials and find that the network is robust and able to recognize patterns far away from the point where it was trained. Aside from the regressive analysis, which belongs to supervised learning, an unsupervised generative network is proposed to produce new quantum field configurations that follow a specific distribution. An implicit local constraint fulfilled by the physical configurations was found to be automatically captured by our generative model. We elaborate on potential uses of such a generative approach for sampling outside the training region.


I. INTRODUCTION
Deep learning with a hierarchical structure of artificial neural networks is a branch of machine learning aiming at understanding and extracting high-level representations of big data [1]. It is particularly effective in tackling complex non-linear systems with a high level of correlations that cannot be captured easily by conventional techniques. Traditionally employed for tasks like pattern recognition in images or speech, automated translation or board game-playing, applications of deep learning have been found recently in many areas of physics including nuclear [2][3][4][5], particle [6][7][8][9][10] and condensed matter [11][12][13][14][15][16][17][18][19][20] physics.
Significant progress has been made in utilizing machine learning methods for condensed matter systems like classical or quantum spin models. Specific tasks in these settings include the discrimination between certain phases and the identification of phase transitions [11][12][13][14][15], the compressed representation of quantum wave functions [16] or the acceleration of Monte-Carlo algorithms [17][18][19]. Recently, deep neural networks have also been considered in particle physics, for the processing of experimental heavy-ion collision datasets [2] and in the context of algorithmic development for numerical lattice field theory simulations [21][22][23].
Pattern recognition, especially classification and regression tasks, has been discussed previously for interacting many-body systems in condensed matter physics. In the present paper, we generalize the application of deep learning to the classification of phases in a lattice quantum field theoretical setting. We further demonstrate the capability of deep neural networks in learning physical observables, even with highly non-linear dependence on the field configurations and with only limited training data, providing an effective high-dimensional non-linear regression method. In addition, we proceed by implementing, for the first time, a Generative Adversarial Network (GAN) [24] for lattice field theory in order to generate field configurations following and generalizing the training set distribution. This is an unsupervised learning framework that uses unlabeled data to perform representation learning. Such a GAN-powered approach is not a full-fledged alternative to the Monte-Carlo algorithm, which possesses desired properties like ergodicity, reversibility and detailed balance. However, it can result in a one-pass direct sampling network where no Markov chain is needed. Our aim here is to provide a proof of principle that generative networks, if trained adequately, are capable of capturing and representing the distribution of configurations in a strongly correlated quantum field theory. On the practical side, generative networks would prove useful when combined with traditional approaches to accelerate simulation algorithms, e.g., by improving decorrelation for proposals in a Markov chain process. A further potential use of such setups would be to reduce large ensembles of field configurations into a single (highly trained) network as an efficient representation of the quantum statistical field ensembles, thereby significantly reducing storage requirements.
Specifically, we consider two-dimensional quantum scalar field theory discretized on a lattice, and implement a deep neural network for the investigation of field configurations generated via standard Monte-Carlo algorithms. We aim at testing whether the machine is capable of recognizing known features of the system including phase transitions and the corresponding behavior of various observables. More interestingly, we also look for hidden patterns discovered by the network, i.e. correlations of the gross features of the system with further low-level variables. Furthermore, we explore the approach of generating field configurations using GAN and reproducing physical distributions. This paper is structured as follows. In Sec. II we outline the scalar field theory setup, including the details of the configuration space and the definition of the observables. This is followed by Sec. III, where we describe our neural network hierarchy and discuss the classification and regression tasks, together with the details of the generative network approach. In Sec. IV we summarize our main findings and conclude. Two appendices contain the details of the dualization of scalar field theory (App. A) and the specifications of the GAN approach (App. B).

II. OBSERVABLES IN SCALAR FIELD THEORY
We consider a complex scalar field φ with mass m and quartic coupling λ in 1 + 1 dimensional Euclidean spacetime at nonzero temperature T. This system is studied in the grand canonical approach, introducing a chemical potential µ that controls how the charge density n fluctuates. For low temperatures, two different regimes of the parameter space can be distinguished: At low µ the density is suppressed, usually referred to as the Silver Blaze behavior [25], whereas above a threshold µ > µ_th the density increases considerably.¹ A similar behavior is conjectured to occur in the QCD phase diagram in the region of low temperatures and medium to high densities [25].
This interesting behavior is a non-perturbative phenomenon and cannot be observed directly, as the action becomes complex for µ ≠ 0, hindering standard simulations in terms of the field φ. However, using the worldline formalism, the partition function can be reexpressed using dual variables and the action rendered real and positive [26], see details in App. A. The dual variables are the integers k_ν(x) and ℓ_ν(x) that are associated to the links starting at the point x = (x_1, x_2) and lying in the direction ν = 1 (space) or ν = 2 (time). The total number of variables is therefore N ≡ 2 × 2 × N_1 × N_2, where N_ν denotes the number of lattice sites in the direction ν. The partition function becomes a sum over this N-dimensional configuration space, with the lattice action S_lat described in Eq. (A3). While the ℓ-integers can take arbitrary values, the k-integers satisfy a zero divergence-type constraint,

Σ_ν [k_ν(x) − k_ν(x − a ν̂)] = 0 for all x,    (2)

where ν̂ is the unit vector in the ν direction and a the lattice spacing.

¹ In the following we will refer to the pronounced change in the density at the threshold as a transition, keeping in mind that, in accordance with the Mermin-Wagner theorem, it is not connected to spontaneous symmetry breaking.
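The divergence constraint (2) is straightforward to check numerically. A minimal NumPy sketch, assuming the k-field is stored as an integer array of shape (2, N_1, N_2) with periodic boundaries (this storage layout is our assumption, not the authors'):

```python
import numpy as np

def divergence_violation(k):
    """Mean absolute lattice divergence of the k-field.

    k has shape (2, N1, N2): k[nu] holds the integer on the link leaving
    site x in direction nu; boundaries are periodic.  For physical
    configurations sum_nu [k_nu(x) - k_nu(x - nu^)] = 0 at every site.
    """
    div = np.zeros(k.shape[1:])
    for nu in range(2):
        # np.roll by +1 along axis nu gives k_nu evaluated at x - nu^
        div += k[nu] - np.roll(k[nu], shift=1, axis=nu)
    return np.abs(div).mean()

# a single closed loop winding once around the (periodic) time direction
N1, N2 = 10, 200
k = np.zeros((2, N1, N2), dtype=int)
k[1, 0, :] = 1          # k_2 = 1 on every time-like link of one spatial site
print(divergence_violation(k))  # -> 0.0, the constraint is satisfied
```

Breaking a single link (e.g. incrementing one k_2 entry) immediately produces a nonzero violation, which is the quantity monitored for the GAN outputs in Sec. III.3.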
The partition function (1) contains all information about the system. In particular, the expectation values of the particle density and of the squared field are related to derivatives of Z with respect to µ and to m², respectively. Using the explicit form of the action (A3), the corresponding operators read

n = (1 / (N_1 N_2)) Σ_x k_2(x),    (3)

|φ|² = (1 / (N_1 N_2)) Σ_x W[s(k, ℓ; x) + 2] / W[s(k, ℓ; x)],    (4)

where the weight W[s] and the function s(k, ℓ; x) are defined in Eq. (A4). Summarizing, in our representation of complex scalar field theory, one field configuration consists of N integers k_ν and ℓ_ν and the path integral is a sum over these field configurations, generated with the appropriate probabilities. The observables on a given configuration are obtained according to Eqs. (3)-(4): a simple sum over the k_2 variables for the density, and a highly nonlinear function of all k_ν and ℓ_ν variables for the squared field operator, see (A4).
We consider a low-temperature ensemble with N_1 × N_2 = 10 × 200 (the dimensionality of the configuration space is therefore N = 8000), generated with mass m = 0.1, coupling λ = 1.0 and a range of chemical potentials 0.91 ≤ µ ≤ 1.05 around the threshold value µ_th ≈ 0.94 (all dimensionful quantities are understood in lattice units). For µ < µ_th, ⟨n⟩ is almost zero and ⟨φ²⟩ is constant. In contrast, both observables rise approximately linearly beyond the threshold. This is demonstrated in Fig. 1. An additional remark about the density operator (3) is in order. Due to the constraint (2), the k_ν variables always form closed loops. The contribution of such loops to n depends on how many times the loop winds around the ν = 2 direction. Therefore, on any configuration the density operator may only assume values that are integer multiples of 1/N_1. For the presently investigated lattice geometry this means that n is an integer multiple of 0.1. Note that this discreteness of the density is a finite-volume artifact and n becomes continuous in the thermodynamic limit.
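This discreteness can be illustrated directly: a configuration whose time-like k-links form a loop winding w times around the temporal direction yields n = w/N_1. A sketch under the same assumed array layout as above, with the density normalized as a plain average of the k_2 links (our reading of Eq. (3)):

```python
import numpy as np

N1, N2 = 10, 200

def density(k):
    """Particle number density on a single configuration:
    n = (1/(N1*N2)) * sum_x k_2(x), cf. Eq. (3), in lattice units."""
    return k[1].sum() / (N1 * N2)

# a loop winding w times around the time direction contributes w/N1 to n
k = np.zeros((2, N1, N2), dtype=int)
k[1, 0, :] = 3                      # winding number w = 3
print(density(k))                   # -> 0.3, an integer multiple of 1/N1 = 0.1
```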

III. SCALAR FIELD THEORY IN A NEURAL NETWORK
In the present work we apply deep neural networks to the complex scalar field configurations generated with the dualization approach using standard Monte-Carlo methods, as described in Sec. II. Specifically, the lattice configurations consisting of N integers are considered as input to the machine. We investigate the ability of the neural network to perform different tasks, including phase transition detection and regression of physical observables. Finally, we propose a new configuration generation method using the Generative Adversarial Network (GAN) approach.

III.1. Classification of phases
We first employ the network to detect the transition between the low- and high-density phases of the system by performing a classification task. In particular, we train the neural network to identify the threshold chemical potential µ_th without specific physical guidance. The recognition of high-level abstract patterns in the data is essential for this classification, thus we consider a Convolutional Neural Network (CNN), which is usually designed for such tasks. We train a CNN for a binary classification: The configurations are either in the low-density "Silver Blaze" region (⟨n⟩ ≈ 0) or in the condensed region (⟨n⟩ ≠ 0). To perform a semi-supervised training, we feed the lattice configurations at µ = 0.91 ≪ µ_th and at µ = 1.05 ≫ µ_th as input to the CNN, whose topology is shown in Fig. 2. The training points are also highlighted in Fig. 1.
The input configurations are viewed as images with 4 channels representing the 4 integer field variables (k_1, k_2, ℓ_1 and ℓ_2) and lattice size 200 × 10. We use three convolutional layers, each followed by average pooling (except for the first convolutional layer, see Fig. 2), batch normalization (BN), dropout and PReLU activation. In the first convolutional layer there are 16 filters of size 3 × 3 scanning through the input configuration images and creating 16 feature maps of size 200 × 10. After BN and PReLU activation, these feature maps are further convolved in the second convolutional layer with 32 filters of size 3 × 3 × 16. The output of the second convolutional layer is down-sampled by a factor of two by a subsequent average pooling layer before further processing. Dropout is applied after the final convolutional layer and between the first two fully-connected layers. The weight matrices of the convolutional layers are initialized with a normal distribution and constrained with L_2 regularization. In a convolutional layer, each neuron connects only locally to a small chunk of neurons in the previous layer via a convolution operation; this is a key reason for the success of the CNN architecture. After the third convolutional layer and a second average pooling, the resulting 32 feature maps of size 50 × 2 are flattened and connected to a 256-neuron fully connected layer with batch normalization, dropout and PReLU activation. The final output layer is another fully connected layer with softmax activation and 2 neurons indicating the two configuration classes. Dropout, batch normalization, PReLU and L_2 regularization work together to prevent overfitting, which would hinder the generalization ability of the network.
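For concreteness, the layer stack described above can be sketched as follows. This is a PyTorch approximation of the described topology, not the authors' original implementation; the dropout rate and other unspecified hyper-parameters are assumptions:

```python
import torch
import torch.nn as nn

class PhaseCNN(nn.Module):
    """Sketch of the phase classifier: three conv layers with BN, dropout
    and PReLU, average pooling after the second and third conv layers,
    a 256-neuron dense layer, and a softmax output over the two classes."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1), nn.BatchNorm2d(16), nn.PReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.AvgPool2d(2),
            nn.BatchNorm2d(32), nn.Dropout(0.2), nn.PReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.AvgPool2d(2),
            nn.BatchNorm2d(32), nn.Dropout(0.2), nn.PReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 50 * 2, 256),   # 32 feature maps of size 50 x 2
            nn.BatchNorm1d(256), nn.Dropout(0.2), nn.PReLU(),
            nn.Linear(256, 2), nn.Softmax(dim=1),
        )

    def forward(self, x):                  # x: (batch, 4, 200, 10)
        return self.classifier(self.features(x))

net = PhaseCNN().eval()
p = net(torch.randn(8, 4, 200, 10))
print(p.shape)                             # (8, 2); each row sums to 1
```

The two output neurons give the class probabilities, of which P (the condensation probability) is read off in the analysis below.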
Supervised learning is applied here for this binary classification problem: the configurations at µ = 0.91 are labelled as (0, 1) and those at µ = 1.05 as (1, 0) in the training dataset. The cross entropy between the true label and the network output (a binary vector), which quantifies well the difference between distributions, is taken as the loss function l(θ) for training the network, where θ represents the trainable parameters of the neural network. Learning/Training is performed by updating θ → θ − α ∂l(θ)/∂θ to minimize the loss function, where α is the learning rate, with initial value 0.0001, adaptively changed using the AdaMax method. The training dataset consists of 30,000 configuration samples for each class and is fed into the network in batches of size 16. 20% of the training set are randomly chosen to serve as a validation set. In our study, the training runs for 1000 epochs, during which the model parameters are saved to a new checkpoint whenever a smaller validation error is encountered. The validation accuracy saturates at around 99%, with only small fluctuations.
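A single training step with this setup might look as follows. The tiny stand-in model keeps the sketch self-contained; only the loss function, optimizer choice, learning rate and batch size follow the text, the rest is illustrative:

```python
import torch
import torch.nn as nn

# stand-in model mapping a (4, 200, 10) configuration to two class logits
model = nn.Sequential(nn.Flatten(), nn.Linear(4 * 200 * 10, 2))
opt = torch.optim.Adamax(model.parameters(), lr=1e-4)   # AdaMax, alpha = 0.0001
loss_fn = nn.CrossEntropyLoss()      # cross entropy on raw logits + int labels

x = torch.randn(16, 4, 200, 10)      # one batch of 16 configurations
y = torch.randint(0, 2, (16,))       # labels: 0 = Silver Blaze, 1 = condensed

opt.zero_grad()
loss = loss_fn(model(x), y)          # l(theta)
loss.backward()                      # d l / d theta via backpropagation
opt.step()                           # theta -> theta - alpha * grad
print(float(loss))                   # finite scalar; decreases over epochs
```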
Once trained, we test the CNN by scanning through the configurations at different values of the chemical potential 0.91 < µ < 1.05. The output of the network for each configuration is identified as the probability P that the configuration in question corresponds to the condensed phase. In Fig. 3 we show P predicted by the network as a function of various quantities: the chemical potential, the number density and the squared field. Looking at P (µ) in the left panel of the figure, we observe that while for low/high chemical potentials the configurations unambiguously fall in one of the two classes (P = 0/P = 1), the ensembles at intermediate µ contain configurations from both sectors. This expresses the enhanced fluctuations in the vicinity of µ = µ th , as expected near a transition.
To understand what the deep neural network has learned for its decision making on phase classification, we investigate the correlation between the network's output and physical observables. Indeed, much more interesting trends are visible in the plots showing P as a function of n and of φ², see Fig. 3 (middle and right panels). The network outputs are strongly correlated with n and φ². Without any specific supervision about their role, the CNN has clearly learned the relevance of these observables for the transition. In other words, the designed CNN managed to identify highly non-linear features in the configurations that correspond to the physical observables (3) and (4). In particular, a nonzero value of the number density n is perfectly indicated by a nonzero probability P from the trained network.
We mention that we also tried reducing the number of convolutional layers in the network to two. In this case we observed similar signals for P with slightly worse performance.
As can be read off from Eq. (3), the particle number is given by the sum of all the k_2 variables. This simple pattern might be easily learned by the network. We thus performed the same binary classification task with a restricted training input, including only one of the remaining three variable sets: either k_1, ℓ_1 or ℓ_2. The results for the three restricted inputs, together with the full input, are visualized in Fig. 4, plotting the expectation value of the network-predicted condensation probability against the chemical potential. Clearly, the network succeeded in learning essentially the same features also using the restricted inputs, as ⟨P⟩ starts to rise at around the same threshold chemical potential µ_th ≈ 0.94 for all four cases. This inspires us to analyze the correlation between the number density and the analogously defined observables involving either k_1, ℓ_1 or ℓ_2. To this end we consider the normalized correlation coefficient

C(A, B) = ⟨(A − ⟨A⟩)(B − ⟨B⟩)⟩ / [⟨(A − ⟨A⟩)²⟩ ⟨(B − ⟨B⟩)²⟩]^{1/2},

which vanishes for decorrelated data and equals unity for complete correlation. As shown in Fig. 5, Σℓ_1, Σℓ_2 and also φ² are all strongly correlated with n, while Σk_1 is fully decorrelated. Still, the machine succeeds in classifying the configurations based only on k_1, as indicated in Fig. 4. Note that with conventional techniques, neither of the physical observables (3)-(4) sensitive to the transition can be constructed using only the k_1 variables. The excellent performance of our CNN, shown in Fig. 4, indicates the existence of strong hidden features in the k_1 variables that correlate with the phase of the system. According to these results, the network has the ability to decode these hidden correlations in a highly effective manner.
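The normalized correlation coefficient used here is the standard Pearson coefficient; a compact NumPy version:

```python
import numpy as np

def corr(a, b):
    """Normalized correlation coefficient C(A, B): zero for decorrelated
    samples, unity (in magnitude) for complete correlation."""
    da, db = a - a.mean(), b - b.mean()
    return (da * db).mean() / np.sqrt((da**2).mean() * (db**2).mean())

rng = np.random.default_rng(0)
n = rng.normal(size=10_000)
print(round(corr(n, 2 * n + 1), 3))   # -> 1.0  (complete correlation)
print(round(corr(n, rng.normal(size=10_000)), 1))  # ~ 0.0 (decorrelated)
```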

III.2. Non-linear regression of observables
Next, we consider a regression task for learning thermodynamic observables of the system, employing a similar training strategy as used above for the binary classification. In particular, supervised learning is applied with a CNN to regress the thermodynamic observables, including the particle number density and the squared field, based on the lattice configurations. As in Sec. III.1, the training dataset consists of configurations at µ = 0.91 and µ = 1.05. The generalization ability of the machine is investigated by testing the network predictions on configurations at intermediate values of the chemical potential.
To target this regression task, we change the CNN architecture slightly. Specifically, the batch normalization and pooling layers are removed, and one more fully connected layer with 32 neurons is inserted before the final output layer. The latter consists of 2 neurons representing the values of n and of φ 2 . The activation functions are all changed to ReLU and the loss function for the network is chosen to be the mean squared difference between the predictions and the true values. After 2000 epochs of training, our regression network is tested on previously unseen configurations at different values of the chemical potential. The results for the density and for the squared field are plotted in Fig. 6, showing the true values of the observables, calculated using Eqs. (3)-(4), against the network predictions.
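The modifications for the regression task can be sketched as follows. This is again a hedged PyTorch approximation: batch normalization and pooling are dropped, activations become ReLU, an extra 32-neuron layer precedes a linear 2-neuron output, and the loss is mean squared error; the filter counts and dense-layer width are carried over from the classifier sketch and are assumptions:

```python
import torch
import torch.nn as nn

# regression variant: no BN or pooling, ReLU activations, linear output
reg_net = nn.Sequential(
    nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 200 * 10, 256), nn.ReLU(),
    nn.Linear(256, 32), nn.ReLU(),           # extra 32-neuron layer
    nn.Linear(32, 2),                        # predictions for n and phi^2
)
loss_fn = nn.MSELoss()                       # mean squared difference
pred = reg_net(torch.randn(4, 4, 200, 10))
print(pred.shape)                            # (4, 2)
```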
As visible in Fig. 6, the network performs well over a broad range of chemical potentials, predicting n and φ² accurately, with a maximal deviation of around 5% for the density and around 7% for the squared field. Note that the training was performed using only two far-apart segments of the range of the target observables, corresponding to configurations at µ = 0.91 and µ = 1.05 (the expectation values of the observables for these ensembles are also indicated in the plots). On the one hand, the high quality of the regression for the density may seem natural owing to the linear dependence (3) of n on the individual variables. On the other hand, the squared field is a highly non-linear function of the high-dimensional input (R^{200×10×4} → R), making the excellent predictive ability of the network very non-trivial and surprising. Put differently, using limited training data (covering only a small range of the target domain), our CNN has the ability to correctly reproduce the whole target space mapping, which is curved and even changes dramatically close to the transition point. This means that the network has effectively encoded the configuration into a much flatter and more abstract latent space (the intermediate layers inside the network), where a linear interpolation can result in a non-linear regression in the final output layer.
Just as the middle panel of Fig. 3, the upper panel of Fig. 6 reflects the discreteness of the density operator n evaluated on any configuration. Notice that while the true values of n are indeed integer multiples of 0.1, small deviations (below 0.007 in magnitude) from this rule occur for the predicted values. Such deviations stem from the approximate nature of the regression network and are observed to decrease as the number of training epochs is increased. We return to this behavior below in the generative network analysis.

III.3. Configuration production using the Generative Adversarial Network
The Generative Adversarial Network (GAN) [24] is a deep generative model that aims to learn the distribution of input variables from the training data. It belongs to the unsupervised learning category within deep learning approaches. The GAN framework contains two non-linear differentiable functions, both of which are represented by adaptive deep neural networks. The first one is the generator G(z), which maps random noise vectors z from a latent space with distribution p_prior (usually a uniform or normal distribution over z) to the target data space with implicit distribution p_G (over data x) that approaches the desired distribution p_true through training. The second one is the discriminator D(x) with a single scalar output, which tries to distinguish real data x from generated data x̂ = G(z). These two neural networks are trained alternately, thus improving their respective abilities against each other in a two-player minimax game (also called a zero-sum game). An optimally trained GAN converges to the state (the Nash equilibrium of this game-theory problem) where the generator excels in 'forging' samples that the discriminator can no longer distinguish from real data. Such generative modeling-assisted approaches have been tested in various scientific contexts, including medicine [27,28], particle physics [29][30][31], cosmology [32][33][34] and condensed matter physics [35,36]. Here we employ, for the first time, the generative modeling GAN approach in a strongly correlated quantum field theory. To ensure training stability, we consider the Wasserstein-GAN architecture [37] with gradient penalty (WGAN-gp) [38] in this study, see Fig. 7 for the main architecture. The theoretical foundations of WGAN are outlined in App. B.
The generator and discriminator architectures are illustrated in Fig. 8. The generator takes as input a randomly sampled 512-dimensional latent vector z following a multivariate normal distribution, and gradually transforms z to the desired configuration space (of dimensionality 200 × 10 × 4). The up-sampling is done via transposed convolution (also known as fractionally-strided convolution), which effectively reverses the convolution operation. The kernel size of the convolutional layers is 3 × 3, while that of the transposed convolutional layers is 4 × 4. Batch normalization is included to standardize the outputs and to stabilize training. Apart from the last layer, we use the Leaky Rectified Linear Unit (LReLU) as activation function. The discriminator aims to evaluate the 'fidelity' of the configurations. The difference between the outputs for real and fake data is quantified using the Earth Mover (EM) distance (also called Wasserstein distance), which serves as the loss function here. Strided convolution is performed for the down-sampling. Note that for the first four convolutional layers a plain linear activation is used to let the discriminator reduce the dimensionality of the input configurations more effectively (functioning similarly to a PCA). This helps the GAN to capture the implicit multimodal distribution (of the physical observables), as we will see below.
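The gradient-penalty term of WGAN-gp, which pushes the critic's gradient norm towards one on samples interpolated between real and generated data, can be sketched in PyTorch as follows (the toy linear critic is ours, for illustration only; the full conv/transposed-conv architectures of Fig. 8 are not reproduced here):

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    """WGAN-gp penalty: lam * E[(||grad_x D(x~)||_2 - 1)^2], evaluated on
    samples x~ interpolated uniformly between real and generated data."""
    eps = torch.rand(real.size(0), 1, 1, 1)          # per-sample mixing
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d = critic(x_hat)
    grads, = torch.autograd.grad(d.sum(), x_hat, create_graph=True)
    return lam * ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

critic = torch.nn.Sequential(torch.nn.Flatten(),
                             torch.nn.Linear(4 * 200 * 10, 1))
real = torch.randn(8, 4, 200, 10)
fake = torch.randn(8, 4, 200, 10)
gp = gradient_penalty(critic, real, fake)
print(float(gp) >= 0.0)          # the penalty is a non-negative scalar
```

In training, this penalty is added to the critic's EM-distance loss; `create_graph=True` lets the penalty itself be backpropagated.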
Having trained the GAN, the generator can be used to convert samples from the prior distribution p_prior to data points lying in configuration space. To verify the effectiveness of the network and, in particular, whether the generated configurations are indeed physical, we first check the divergence-type constraint (2) for the complex scalar field. As shown in Fig. 9, the absolute divergence per site for the generated outputs is not exactly zero, but it decreases and converges to zero as the number of training epochs grows. Note that Eq. (2) represents a highly implicit physical constraint inside the training dataset, which is not provided as supervision during the training of the GAN. Instead, the network automatically learned to satisfy this constraint in a converging manner. The generation time for a single configuration using the GAN (on an Nvidia TitanXp GPU) is 0.2 ms.
Next we turn to the distribution of observables in the samples generated by the GAN and check to what extent it agrees with the training distribution. In the top panel of Fig. 10 we visualize the probability density distribution of the number density n from the GAN after training with one ensemble of configurations at µ = 1.05. We observe that the GAN has captured the discrete distribution of n quite well. The ensemble average of the particle number density from the GAN is estimated (using 1000 random samples) to be ⟨n⟩_GAN = 0.578, quite close to the Monte-Carlo value ⟨n⟩_MC = 0.580. As mentioned earlier, the particle number density (3) is simply the sum of the time components of the k variables in the configuration. In contrast, the squared field φ² calculated using Eq. (4) is highly non-linear in the input variables.
FIG. 11. The mean particle number density on the configurations generated by the cGAN with (blue) and without (red) rounding the configuration entries to their nearest discrete values, against the specified condition values for n.
Nevertheless, the multi-modal distribution of φ 2 is also well reproduced by the generative network, see the bottom panel of Fig. 10. The ensemble average of φ 2 from GAN (for the same 1000 samples above) ⟨φ 2 ⟩ GAN = 0.449, is also close to the Monte-Carlo result ⟨φ 2 ⟩ MC = 0.447. Figs. 9 and 10 clearly demonstrate that the generative adversarial network can be trained to capture the statistical distribution of the field configurations even on the level of physical observables.
The above GAN structure is designed to reproduce certain distributions present in the training dataset. Next we attempt to use the network to generalize beyond the distribution it was trained on. The discriminator is provided with relevant labels (in this case the value of the number density n) for the training dataset in order to condition the network (cGAN) [39]. Specifically, the training sample at µ = 1.05 contains the cases n = 0.4, 0.5, 0.6 and 0.7. After the training we test the generalization ability of the network by specifying desired number densities outside this set of values. Fig. 11 shows the performance of the cGAN for this generalization task. We stress that only the values 0.4 ≤ n ≤ 0.7 were provided for the training, yet the agreement between the desired n (the condition) and the measured n on the generated configurations is spectacular over a much broader range of density values. This generalization task might be viewed as converting the grand-canonical ensemble of configurations (at fixed µ) to a series of canonical ensembles (at various values of n).

IV. CONCLUSIONS
In this paper we proposed a set of novel techniques for the investigation of a lattice-regularized quantum field theory by means of deep neural networks, including discovering hidden correlations, learning observables and producing field configurations. Specifically, our analysis was carried out for the dualized representation of complex scalar field theory in 1+1 dimensions.
We first showed that a convolutional neural network can be used in a semi-supervised manner to detect the phase transition in this strongly correlated quantum field theory based on the microscopic configurations. We found that the network is capable of recognizing correlations between various observables and the phase classification without specific physical guidance. Very interestingly, the network discovered a correlation beyond the conventional analysis, which enabled it to decode information about the phase transition using only a restricted subset of the input variables (in particular, the k_1 variables alone).
We continued by designing a regressive neural network to learn physical observables (n and φ²) from limited training samples. The network achieved remarkable agreement with the physical observables and also revealed a great generalization ability when tested at chemical potentials beyond the training set. This approach provides an effective high-dimensional non-linear regression method even with limited data points (compared to the huge number of possible configurations), where traditional interpolation or regression would require much higher statistics, growing exponentially with the input dimensionality.
Finally, we proposed to generate new configurations following a specific distribution by adapting the modern deep generative modeling technique of GANs. We found that the generator in the GAN has the ability to automatically recognize the implicit but crucial physical constraint on the configurations in an unsupervised manner, and can represent the distribution of prominent observables with direct sampling. The generalization of configuration production to different parameter domains, e.g. towards a critical region where conventional techniques slow down considerably, is clearly a fascinating feature that deserves further investigation.

Appendix A: Dualization of scalar field theory

In the continuum, the Euclidean action of a complex 1+1-dimensional scalar field φ reads

S = ∫₀^{1/T} dx₂ ∫₀^L dx₁ [ |D_ν φ|² + m² |φ|² + λ |φ|⁴ ],    (A1)

where the covariant derivative is D_ν = ∂_ν + iµδ_{ν,2}, m denotes the mass of the charged scalar field, λ the quartic coupling and µ the chemical potential. The spatial (ν = 1) and time-like (ν = 2) coordinates are labelled by x_ν. In Eq. (A1), L denotes the spatial extent of the system and T the temperature, and we use periodic boundary conditions in both directions.
On a lattice with spacing a, the derivative operator is discretized by nearest-neighbor hoppings. The chemical potential assigns different weights to forward and backward hoppings in the time-like direction, so that the lattice regularization of the continuum action becomes (in lattice units)

S_lat = Σ_x [ (4 + m²) |φ(x)|² + λ |φ(x)|⁴ − Σ_ν ( e^{µδ_{ν,2}} φ*(x) φ(x + ν̂) + e^{−µδ_{ν,2}} φ*(x + ν̂) φ(x) ) ],    (A2)

where x = (x_1, x_2) labels the lattice sites that range over 0 ≤ x_ν < N_ν. The lattice sizes are related to the volume and the temperature as L = N_1 a and T = (N_2 a)^{−1}. The partition function of this system is defined by the path integral (1) over the (complex) field configurations. The chemical potential spoils the reality of S_lat, so that for µ ≠ 0 we face a complex action problem that leads to a highly oscillatory integrand under the path integral. This problem can be solved by the so-called worldline formalism or dualization approach [26]. The chief steps are an expansion of the exponential factors (for each x and each ν), a variable substitution to the polar representation φ = r e^{iϕ} and a subsequent integration in r and in ϕ. The final result is expressed as a sum over the integer expansion variables k_ν(x) and ℓ_ν(x) and reads [26]

Z = Σ_{{k,ℓ}} Π_x Z_{k,ℓ}(x),    (A3a)

with Z_{k,ℓ}(x) given by

Z_{k,ℓ}(x) = e^{µ k_2(x)} ⋅ W[s(k, ℓ; x)] ⋅ δ( Σ_ν [k_ν(x) − k_ν(x − ν̂)] ) ⋅ Π_ν 1 / [ (|k_ν(x)| + ℓ_ν(x))! ℓ_ν(x)! ].    (A3b)

In the second factor, W is a positive weight that depends explicitly on m and on λ,

W[s] = ∫₀^∞ dr r^{s+1} e^{−(4+m²) r² − λ r⁴},  with  s(k, ℓ; x) = Σ_ν [ |k_ν(x)| + |k_ν(x − ν̂)| + 2 ℓ_ν(x) + 2 ℓ_ν(x − ν̂) ].    (A4)

The third factor in Eq. (A3b) is a Kronecker-δ whose argument places a constraint on the k-variables, so that their discretized divergence must vanish at each point, as indicated in Eq. (2). Finally, the last factor in (A3b) is a positive combinatorial factor. Instead of the original complex field φ, the field variables are now the integers k and ℓ. According to Eq. (A3), the weight of any field configuration is real and positive, thus the system can be simulated using standard Monte-Carlo algorithms. Specifically, we employ a worm algorithm [26,40,41], which automatically satisfies the divergence constraint ∇ ⋅ k = 0 on all lattice sites.
Differentiating the partition function (A3), we obtain the representations (3)-(4) of the operators in terms of the integers k and ℓ.

Appendix B: Generative Adversarial Networks
The definition of the GAN involves the loss functions L_D and L_G for the discriminator and the generator, respectively. In a zero-sum game L_G = −L_D, and upon optimization of the respective parameters θ_G and θ_D the game converges to

min_{θ_G} max_{θ_D} [−L_D(θ_G, θ_D)].    (B1)

The original GAN uses the loss function

L_D = −E_{x∼p_true}[log D(x)] − E_{z∼p_prior}[log(1 − D(G(z)))],    (B2)

where E_{x∼p(x)}[⋅] denotes the expectation value over the normalized probability distribution p(x) and we used x̂ = G(z) for the generated samples. Note that for the generator, the first term of (B2) has no impact on G during gradient descent updates, as it only depends on θ_D. The expectation values in the above loss functions are computed from the mean over all training samples. The parameters of the discriminator and the generator are updated by back propagation with the gradients of the loss functions,

θ_D → θ_D − α (1/m) Σ_{i=1}^m ∂/∂θ_D [ −log D(x_i) − log(1 − D(G(z_i))) ],
θ_G → θ_G − α (1/m) Σ_{i=1}^m ∂/∂θ_G [ log(1 − D(G(z_i))) ],

where x_i is a sample from the training data and z_i the latent noise from the prior p_prior. For a given fixed generator G, the optimal discriminator under the above loss function can be derived to be

D*(x) = p_true(x) / [p_true(x) + p_G(x)].

From the information-theory point of view, the above objective for the discriminator (and thus the training criterion for the generator) is nothing else but the Jensen-Shannon (JS) divergence, which is a measure of similarity between two probability distributions,

JS(p_true ‖ p_G) = (1/2) KL(p_true ‖ (p_true + p_G)/2) + (1/2) KL(p_G ‖ (p_true + p_G)/2),

with the KL divergence given as

KL(p ‖ q) = ∫ dx p(x) log [p(x)/q(x)].

So the best state of the generator is reached if and only if p_G(x) = p_true(x), giving D* = 1/2 at the global optimum of the minimax game, also called the Nash equilibrium. In practice this default setup is difficult to train, especially when applied to high-dimensional cases. Using the above JS divergence measure, the discriminator D might not provide enough information to estimate the distance between the generated distribution and the real data distribution when the two distributions do not overlap sufficiently. Specifically, when the supports of p_G and p_true both rest on low-dimensional manifolds of the data space, the two distributions have a zero-measure overlap, which results in a vanishing gradient for the generator.
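For discrete distributions, the JS and KL divergences described here can be computed directly; a small NumPy illustration of their defining properties (zero for identical distributions, symmetric in its arguments):

```python
import numpy as np

def kl(p, q):
    """Discrete Kullback-Leibler divergence KL(p || q)."""
    return np.sum(p * np.log(p / q))

def js(p, q):
    """Jensen-Shannon divergence: symmetric, and zero iff p == q."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
print(js(p, p))                          # -> 0.0
print(abs(js(p, q) - js(q, p)) < 1e-12)  # True: symmetric
```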
This leads to a weak signal for updating G and to general instability. Mode collapse can also easily occur, where the generator learns to produce only a single element of the state space that maximally confuses the discriminator. In order to avoid such training failures, a multitude of different techniques have been developed recently, like ACGAN [42], WGAN [37] and the improved WGAN [38], which help stabilize and improve GAN training. We used the improved WGAN with gradient penalty [38] in this work. The most important difference of the WGAN compared to the original GAN lies in the loss function, where the Wasserstein distance (also called Earth Mover distance) provides an efficient measure for the distance between the two distributions (p_true and p_G) even if they do not overlap anywhere. The loss functions are now

L_D = E_{x̂∼p_G}[D(x̂)] − E_{x∼p_true}[D(x)] + λ E_{x̃}[ (‖∇_{x̃} D(x̃)‖₂ − 1)² ]

and

L_G = −E_{x̂∼p_G}[D(x̂)],

where the gradient penalty term with strength λ is computed in a linearly interpolated sample space,

x̃ = ε x + (1 − ε) x̂,

with ε uniformly sampled from (0, 1].