Classification and reconstruction of optical quantum states with deep neural networks

We apply deep-neural-network-based techniques to quantum state classification and reconstruction. We demonstrate high classification accuracies and reconstruction fidelities, even in the presence of noise and with little data. Using optical quantum states as examples, we first demonstrate how convolutional neural networks (CNNs) can successfully classify several types of states distorted by, e.g., additive Gaussian noise or photon loss. We further show that a CNN trained on noisy inputs can learn to identify the most important regions in the data, which potentially can reduce the cost of tomography by guiding adaptive data collection. Secondly, we demonstrate reconstruction of quantum-state density matrices using neural networks that incorporate quantum-physics knowledge. The knowledge is implemented as custom neural-network layers that convert outputs from standard feedforward neural networks to valid descriptions of quantum states. Any standard feed-forward neural-network architecture can be adapted for quantum state tomography (QST) with our method. We present further demonstrations of our proposed [arXiv:2008.03240] QST technique with conditional generative adversarial networks (QST-CGAN). We motivate our choice of a learnable loss function within an adversarial framework by demonstrating that the QST-CGAN outperforms, across a range of scenarios, generative networks trained with standard loss functions. For pure states with additive or convolutional Gaussian noise, the QST-CGAN is able to adapt to the noise and reconstruct the underlying state. The QST-CGAN reconstructs states using up to two orders of magnitude fewer iterative steps than a standard iterative maximum likelihood (iMLE) method. Further, the QST-CGAN can reconstruct both pure and mixed states from two orders of magnitude fewer randomly chosen data points than iMLE.

For quantum state characterization, even distinguishing two different quantum states can become challenging. For example, telling a coherent source of light and a thermal state apart can be difficult due to the close similarity of their the data in low-photon regimes [42]. Beyond just identifying properties of the quantum state or classifying them, reconstructing a full quantum state description presents an even more challenging task, called quantum state tomography (QST) [55][56][57][58]. The challenges arise mainly due to the exponentially large Hilbert-space dimension required to fully describe the state [34,[59][60][61]. For example, k two-level quantum systems (qubits) have a Hilbert space of dimension N = 2 k and require up to N 2 − 1 real numbers to fully determine a density matrix describing the state. Therefore, QST requires clever data processing to extract a good representation of a state from noisy data [62][63][64][65][66][67]. The presence of noise further complicates the problem; for additive Gaussian noise, one reconstruction method [68] has computational complexity O(N 4 ).
The success of NNs in other fields has prompted their application to several quantum state classification and reconstruction tasks. The motivation is that NNs are universal function approximators [69][70][71] that can learn maps from noisy input data to class labels, or act as variational ansätze for quantum states [72][73][74]. The variational ansätze can be learned from data by minimizing some loss metric between the predictions from the NNbased model and the data. From a computational learning perspective, approximately learning a quantum state has a linear scaling in the number of quantum bits [75].
In this work, outlined in Fig. 1, we connect the tasks of quantum state classification and reconstruction in a general way to discriminative and generative problems in ML. We demonstrate the feasibility of using DNNs for classification and reconstruction, showing how to flexibly adapt them for different scenarios, e.g., noise or scarce data. Crucial components of our methods include incorporating knowledge of quantum physics and other prior information into the network.
Many previous applications of DNNs for classifying quantum data [41][42][43]76], consider properties like nonclassicality or entanglement. In these works, more complicated noise models, beyond simple detection inefficiencies, are not considered. Since the classification task we tackle seems rather straightforward for DNNs, we attempt to go beyond the standard paradigm (training on simulated data, testing on new data) and demonstrate results with different types of noise for general states and measurements [see Fig. 1(a)-(d)]. We also propose an adaptive data-collection method using a trained DNN to extract interesting patterns in the data and leverage it for adaptive tomography [see Fig. 1 In quantum state reconstruction [see Fig. 1(g)], one of the most popular neural-network approaches is to use Restricted Boltzmann Machines (RBMs) to map the underlying Boltzmann probability distribution of an RBM to the distribution of measurement outcomes on a quantum state [72,73,77,78] [see Fig. 1(h)]. This technique has some shortcomings, e.g., difficulties with sampling and lack of straightforward training for larger models. Recently, there has therefore been proposals Figure 1. An outline of the topics discussed in this work. (a) The data required for quantum state tomography consists of the frequencies of measurement outcomes from observables represented as Hermitian operators. The aim is to reconstruct a description of the quantum state -usually a density matrix or the wave function. (b) Measurements of quasi-probability distributions (Wigner or Husimi Q) reveal interesting visual features in the data. Similarly, the histograms of measurement statistics, e.g., the photon-number distribution, can have patterns. Such features and patterns can be used for classification or reconstruction. (c) Several types of noise can corrupt the state or the data. Some types of noise, e.g., white noise, can be reduced by more data collection. Other types of noise, e.g., state-preparation-and-measurement (SPAM) noise, are more difficult to handle. (d) Classification tasks attempt to assign a label to data, classifying it according to its properties, e.g., if it is a Schrödinger-cat state, has Wigner negativity, or is an entangled quantum state. (e) Neural networks can be trained for classification of states or their properties. (f) Once it has been trained, analyzing how a neural network determines the class of a state can help to focus on the most important features in the data. This can be leveraged for adaptive data collection. (g) Reconstruction of quantum states connects to generative modelling tasks, where the goal is to learn the underlying probability distribution of the data to sample new data from it. In quantum state reconstruction, we aim to learn an underlying model, usually in the form of a wave function or density matrix that can generate statistics for any measurement operator. (h) Neural-network methods can be used for estimating or approximating underlying probability densities explicitly using restricted Boltzmann machines (RBMs) or variational autoencoders (VAEs). Training RBMs is not straightforward due to sampling requirements. This is resolved in VAEs using a reparameterization trick that allows gradient-based backpropagation for training. (i) Generative adversarial networks (GANs) provide a density-estimation technique, where we do not explicitly define the density nor require the reparametrization trick for training. We combine ideas from VAEs and GANs to propose a new quantum state tomography technique with conditional GANs -the QST-CGAN. Our QST-CGAN method allows for explicit estimation of the density matrix and computation of measurement statistics using two custom layers -DensityMatrix and Expectation.
to instead use feed-forward architectures, including recurrent neural networks (RNNs) and Transformers, for QST [79][80][81]. Unlike RBMs, such neural-network architectures are straightforward to train, without any need for sampling steps, using gradient-based optimization with backpropagation. However, state-of-the-art results for generative tasks in ML often use variational autoencoders (VAEs) [12,82] and generative adversarial net-works (GANs) [13,83], which only recently are beginning to be explored for learning quantum states [84][85][86][87] [see Fig. 1 Results on the reconstruction of multi-qubit states suggest several benefits of NN-based reconstruction over standard techniques [78,[88][89][90]. In Ref. [80], states with up to 90 qubits are reconstructed in simulation. The ideas follow from Ref. [79], where quantum state recon-struction using generative models, both RBMs and RNNs is combined with a tensor-network paradigm. Similarly, in Ref. [91], fully connected DNNs were used for denoising data and dealing with state-preparation and measurement (SPAM) errors. In Ref. [90], a CNN was trained on simulated data (with noise) and proved able to reconstruct two-qubit states directly from data, outperforming a standard Stokes reconstruction.
However, such demonstrations are usually on simple states, with limited error models, and/or do not fully ensure that the reconstructed states are physical. For example, Ref. [80] considers Greenberger-Horne-Zeilinger (GHZ) states, which only contain four non-zero elements in the density matrix. In Refs. [91] and [90], the Hilbertspace dimensions are restricted to six and four, respectively. Even then, techniques to include prior knowledge, such as the properties of quantum states or background noise, need to be explored. In Ref. [90], noise is handled by adding it to the simulated training dataset and properties of a quantum state are enforced using a similar idea to our proposal in Ref. [1] independently. In Ref. [80], where GHZ states are reconstructed using a Transformer neural network, some reconstructed states have fidelities exceeding unity, which indicates the lack of quantum-mechanical constraints on the state description. In an experimental two-qubit reconstruction with the RBM ansatz, an improvement was observed when the variational ansatz was restricted to physical states, but this added costs during learning [92].
Furthermore, many of the approaches discussed so far cater specifically to qubit-based tomography. For continuous-variable (CV) quantum systems, which currently are attracting much attention for implementation of quantum computing [93][94][95][96][97][98][99][100], special adaptations are required, as in, e.g., Ref. [78], where RBMs were adapted for CV systems, but required an exhaustive search of all possible configurations to train. Lastly, the reconstruction techniques usually either use DNNs to reconstruct a single state where data from one experiment is enough, or require training datasets [90,91]. We show in Ref. [1] adaptions that allow the same DNN to both reconstruct states from scratch or perform single-shot reconstructions by mapping data to the space of density matrices in a general way.
Our motivation here is thus to address some of the problems discussed above and realize a unified framework that can flexibly work with different types of quantum data in various settings. Our contribution is a general method that allows the use of neural networks to capture patterns in data and explicitly generate a densitymatrix description as an intermediate representation inside the network. This idea is inspired by developments in density estimation with neural networks [101], more specifically, the VAE architecture [12], which learns an underlying complex data distribution using a simple latent noise space to generate new data. We augment the latent space to be a full quantum state description (the density matrix) conditioned on inputs that are both the data samples and the operators that define the measurements. Using this conditioning allows us to have a very general technique that can further handle known noise -we simply add the noise as an input variable. We consider the role of loss functions for reconstruction and motivate our idea of using a learnable loss in the form of a neural network based on the idea of GANs [13,21,22]. Our QST-CGAN technique thus combines concepts from VAEs and GANs, as illustrated in Fig. 1(i). In this work, we present details of the implementation and results for noisy reconstruction, reconstruction of mixed states, and reconstruction from reduced data.
This article is organized as follows. In Sec. II, we briefly discuss the quantum state tomography and state discrimination problems in the context of generative and discriminative modeling. In order to demonstrate our methods, we consider optical quantum states as examples. The various types of data from optical quantum states that will be used throughout the article are presented in Sec. III, including possible sources of noise. In Sec. IV, we describe details of the neural-network architectures and training methods. In particular, we discuss the custom layers that we introduce for reconstruction here and in Ref. [1]. Then, we present the results for the classification task in Sec. V A, where we also analyse the impact of noise on the classification performance. In Sec. V B, we show the performance of the QST-CGAN on noisy data and the role played by various loss functions in the reconstruction. Finally, we conclude in Sec. VI and discuss, in Sec. VII, further possibilities and potential for development of the techniques presented here. In Table I, we list all the abbreviations used throughout the article for easy reference.

II. BACKGROUND
In this section, we set the stage for the paper by providing an overview of the problems of quantum state discrimination (QSD) and quantum state tomography (QST). We then discuss generative and discriminative modelling in machine learning, which is related to these problems. We compare different neural-network approaches to such modelling to motivate our choice of methods in this paper for tackling QST and QSD.

A. Quantum state discrimination
The task in QSD is to classify an unknown state ρ as being one of a given finite ensemble of states {ρ i }, from which states are chosen with probabilities {p i } such that i p i = 1 [102]. The classification is done by performing measurements on ρ, typically positive-operatorvalued measures (POVMs) {O i }, designed such that observing the outcome i, which occurs with probability p i = tr(O i ρ), corresponds to the state being ρ i . The problem of QSD can thus often be rephrased as finding the optimal measurement for discriminating between the {ρ i }. In case the states to be discriminated between are not orthogonal, perfect single-shot QSD is not possible. The optimal measurement should then instead maximize the probability of guessing the state correctly [103]. Note that the non-orthogonality of quantum states, which prevents perfect QSD, does not have a classical analogue; it cannot be explained by merely assuming overlapping probability distributions [102,104].
If repeated state preparation and measurement is possible, adaptive measurement schemes, where new measurements are chosen based on the results of previous measurements, may be optimal. In this paper, we will consider such a situation, where we can make repeated measurements and collect statistics for various POVMs. However, our aim in this paper is not to construct highly optimized complex POVMs or adaptive schemes, but to show that a neural network can learn to perform QSD well when working with limited measurement data from standard, simple measurements of complex optical quantum states. Insights gleaned from the neural-network performance could then be used to minimize the number of simple measurements needed in experiments to classify states with high certainty. Furthermore, rapid state classification could help find a good starting point and parameterization for full quantum state reconstruction. Previous work has shown that neural networks can distinguish thermal and coherent light sources with few measurements [42]; here, we present a general framework for applying such techniques to arbitrary measurements and states. Note that we do not only distinguish between two types of states, but between many types of states at the same time.

B. Quantum state tomography
The goal of QST is more ambitious than that of QSD: to fully characterize an unknown quantum state, usually by obtaining its density matrix ρ. A physical density matrix is Hermitian, positive semidefinite, and has unit trace. In an N -dimensional Hilbert space, N 2 − 1 real numbers have to be estimated from POVM outcomes to completely determine a general ρ. This can be seen clearly from the Cholesky decomposition which is extensively used in reconstruction methods to ensure positivity and Hermiticity. The matrix T is lowertriangular with complex-valued entries except on the diagonal, where the entries are real-valued. The measurement data used for reconstruction of ρ consists of single-shot outcomes from POVMs {O i }. By repeating the measurement on identically prepared quantum states, we can gather statistics. The frequencies d i of various measurement outcomes is proportional to the expectation value tr(O i ρ) and forms our data d. The reconstruction problem can therefore be stated as an inversion problem [105] where the sensing matrix A is given by the choice of measurement operators and ρ f is the flattened density matrix. The invertibility of Eq. (2) depends on the set of measurement operators. A set of measurement operators that enables inversion, and thus allow the complete characterization of the state, is called informationally complete (IC) [106]. For a state in an N -dimensional Hilbert space, up to ∼ N 2 POVMs may be needed for IC (and each measurement needs to be repeated multiple times to gather the statistics). However, with some a priori knowledge of the state, e.g., that ρ is low rank or that certain elements of ρ are zero, the measurements can be cleverly selected and their number reduced.
Reconstructing ρ from d is thus an estimation problem, which can be approached in many ways. Common reconstruction techniques include linear inversion [107], maximum likelihood estimation (MLE) [57,108,109], and Bayesian methods [110][111][112]. Linear inversion, while being straightforward, can fail due to noise in the data or a high condition number of A [105] and produce unphysical entries in the density matrix, e.g., negative diagonal elements [113]. Therefore statistical inference techniques such as MLE or Bayesian estimation are preferred. Such methods give an estimate ρ for the density matrix by optimizing the likelihood function In case of continuous-variable outputs, where d i is a real number, appropriate binning is necessary to apply MLE [114]. Alternatively, the mean squared error between the output and the expected value can be minimized [68].
Although MLE guarantees a physical ρ , it does not provide any error bars to quantify the uncertainty in the estimate. Recently, it has also been argued that MLE is not optimal and is an inadmissible estimator for common metrics such as fidelity, mean-squared error and relative entropy [115]. Bayesian methods for QST, on the other hand, can quantify the uncertainity in the parameters of the density matrix using a prior probability distribution over different states π(ρ) [110,111]. The initial prior π 0 (ρ) should be uniform, or as uninformative as possible, and is updated by applying the Bayes theorem using the likelihood L(ρ |d) to give a posterior π f (ρ) ∝ L(ρ|d)π 0 (ρ). The best estimate of the underlying state is given as the mean over all states ρ µ , defined by the posterior distribution π f weighted by the likelihood computed from observed data: Other examples of methods to optimize the likelihood function and obtain a density matrix estimate include diluted MLE [113], compressed sensing (CS) [116] and projected gradient descent [117]. The CS methods are motivated by simple parameter-counting arguments: we should only require O(rN ) measurements, with r being the (low) rank of the density matrix [118]. Examples of such low-rank states, common in experiments, are pure quantum states corrupted by local noise processes. Recently, other modifications of CS have been proposed and demonstrated experimentally for adaptive tomography [119,120], which only require the a priori information of the density-matrix dimension (an improvement over CS, which requires an a priori guess of r).
However, a good ansatz or model for the state can reduce the effort for reconstruction. If we consider classes of quantum states having particular properties or symmetries, we can write their descriptions with fewer parameters than the N 2 − 1 required for a general density matrix. Matrix-product-state (MPS) [64,121] and tensornetwork (TN) tomography [122,123] are methods that find efficient ansätze for states using MPSs or TNs, and permutationally invariant tomography [124,125] exploits permutational symmetries of the density matrix. Some other improvements assume a noise model, e.g., additive gaussian noise [68], and therefore these techniques are often restricted to specific situations, lacking versatility.
A different formulation from the above techniques comes from the idea of projected gradient descent (PGD) [117]. In this method, a cost function is constructed that distinguishes between model-predicted data and the true data to apply gradient-based optimization to find the best estimate for the model (the density matrix). The benefit of the PGD technique is that it quickly converges to the MLE state in a wider variety of scenarios, even when the problem is ill-conditioned. The PGD method also sets up this notion of a cost-function, thereby translating the QST problem into an optimization problem.
Neural-network-based reconstruction methods have also shown significant promise. In such approaches, neural networks are either used as an ansatz for the state to obtain probabilities of measurement outcomes [78,88,89], or to directly estimate ρ [90]. However, a general framework to study quantum state reconstruction using standard feedforward neural networks is missing. In this article, we present a framework that allows any standard neural network to be used for quantum state discrimination and reconstruction by adapting the generative and discriminative modelling framework from machine learning to QSD and QST.

C. Discriminative and generative modeling
Quantum state discrimination and reconstruction can be related to discriminative and generative tasks in machine learning. Consider a data space S from which we obtain samples x of a random variable X. The samples can be classified as having one of k different labels y. A data set can thus consist of a collection of pairs {x, y}.
A discriminative model attempts to predict the class label y for a data point x , i.e., finding the correct conditional probability p(y|x ). We loosely interpret this as identifying whether a data point belongs to one of k possible data distributions p [k] data . A generative model aims to generate new samples x that are similar to the observed data, which is assumed to be drawn from a data distribution defined by a probability density p data (x). In general, real-world data distributions can be very complex, making it a hard problem to model them in a way that is both easy to compute and expressive enough to capture subtleties of the data. In Sec. II D, we discuss how deep neural networks are used to tackle such challenging distributions for generative and discriminative tasks.
The ideas of discriminative and generative modelling can be connected to QSD and QST in the following way. First, we identify the data space S with the space of measurement outcomes for operators {O i }. The outcomes can be collected either as single shots or average values; we denote the collected outcomes by d. The expectation value O i = tr(O i ρ) replaces the classical expectation value Thus ρ takes the role of a probability density function for the quantum system. If the data comes from one of k different quantum states ρ [k] , we can assign it a label y. Our data set is then formed by pairs {d, y}.
The discrimination task of assigning one of the k labels to some observed data d is QSD. Reconstruction of ρ can be considered a generative modelling task, where we aim to generate outcomes d of new measurement operators {O i } after having observed some results of POVM measurements. To fulfil that task, we either need to obtain ρ directly or find some parameterization of ρ that lets us Just like complicated classical data distributions p data , ρ can depend on many parameters and be difficult to estimate. However, efficient parameterizations of the quantum state using matrix-product states [64,126], tensor networks [122,123], or neural networks [72,73,[77][78][79][80][81][90][91][92] have reduced data and computation costs for quantum state reconstruction. In this article, we provide a general method to obtain ρ as the output of neural networks, allowing the conversion of any neural-network architecture into a generative model for QST. Our ideas are applicable to any parameterization of ρ.

D. Neural networks as discriminative and generative models
Neural networks can approximate any function arbitrarily well [69]. They can be treated as functions that map an input space to a target space: where θ are parameters that are learned from training on (labelled) data samples {x, y}.
To use neural networks for discriminative tasks (classification) is fairly straightforward. In this case, the output f (x; θ) of the network is interpreted as the conditional probability p(y|x) [11]. Then, by constructing a loss function that quantifies the total error of predictions on a training set, we can optimize the parameters θ to minimize the classification error.
Using neural networks as generative models is not as simple as mapping input data to target labels. Since a standard feed-forward neural network is a deterministic function f (x; θ), it cannot be sampled to generate new data x . Early schemes used to circumvent this problem were neural networks with stochastic outputs, e.g., restricted Boltzmann machines (RBMs). Later, deterministic feedforward neural networks were adapted to give stochastic outputs for generative tasks; examples include variational autoencoders (VAEs) and generative adversarial networks (GANs). Below, we briefly discuss these methods to motivate our choice of using the conditional variant of GANs for quantum state reconstruction, and to show how our architecture also has connections to the other models.

Restricted Boltzmann machines
Restricted Boltzmann machines [127][128][129] are stochastic neural networks that can represent arbitrary data distributions. An RBM consists of visible (v) and hidden (h) units which give stochastic binary outputs v, h ∈ {0, 1}. In single evaluations of the RBM, the states of the hidden units h j are updated to 1 if the probability is greater than a random number uniformly distributed between 0 and 1 (sampled in each update step). Here σ is the sigmoid activation function and {b j , w i,j } are parameters determining the interaction between different units. A visible unit v i is similarly updated depending on the states of the hidden units and another parameter a i . The result of updating the RBM units iteratively in this way from a random initial state is that the states of the visible units converge to a Boltzmann distribution where Z(θ) = v,h e −E(v,h;θ) is the partition function and the energy is given by parameterized by θ = a, b, w. To train an RBM is to find parameters θ which make the probability distribution p(v; θ) mimic the data distribution p data , as measured by some statistical divergence, e.g., the Kullback-Leibler (KL) divergence. After training, new data points can then be generated by sampling p(v; θ). Since standard RBMs only output binary-valued data, continuousvalued data needs to be handled either in a binary encoding or by using variants like Gaussian-Bernoulli RBMs [130].
Although RBMs have been around for a long time, it was only recently that effective techniques for training them, e.g., contrastive divergence [129,131], were found and enabled them to play a significant role in the initial success of image processing with deep neural networks. These training methods have later been successfully applied to QST with RBMs [132]. However, RBMs are still not straightforward to train and are less flexible than feedforward or convolutional neural networks. In particular, the partition function Z(θ) can be difficult to compute since it involves a sum over an exponential number of states [133]. Furthermore, the sampling methods can have convergence issues for typical high-dimensional problems. These issues have stimulated the development of standard feedforward neural networks converted for generative modelling.

Variational autoencoders
Variational autoencoders are an early example of an adaptation of standard feedforward neural networks to generative modelling. The idea of VAEs is to generate new data by sampling from a latent space Z and mapping it to the data space S, The latent space is used to define the data distribution, parameterized by θ, as the marginal of a joint distribution p θ (x, z) over the data and latent variables [12,82,134]: The latent variable model p θ (x, z) can be specified by using some prior noise distribution p z (z), assuming the following factorization representing an infinite mixture model: In VAEs, a neural network g acts as a stochastic decoder to map the latent space to data: Even if the factors in Eq. (12) are simple, e.g., Gaussians, their mixture can be very expressive and thus capture complex data distributions. However, the marginal p θ (x) is typically intractable due to the integral in Eq. (11). Finding θ by some gradient-based optimization is thus not feasible. The intractable nature of the marginal stems from the intractability of the posterior In VAEs, this posterior is approximated using a stochastic encoding in an encoder neural network e, parameterized by φ, that maps the data space to the latent space: The VAE architecture thus closely resembles that of an autoencoder -a neural network that finds a compressed representation of data by encoding it in a latent space and reconstructing it back from there, In general, VAEs assume q φ (z|x) and p θ (x|z) to be Gaussians specified by g θ and e φ . The encoder network processes an input x to give the mean and covariance for a multi-dimensional Gaussian, which is sampled to obtain latent vectors z. The decoder then generates new data x from another multi-dimensional Gaussian with the mean and covariance determined by the sampled noise vector.
To obtain the parameters (θ, φ), we want to maximize log p θ (x), but since it is intractable, the variational approach maximizes the evidence lower bound or minimizes the loss (17) However, training such a variational model comes with its own challenges due to the stochastic nature of the encoder and decoder. Even using a reparameterization trick making backpropagation-based training work on VAEs, several critical issues leave VAEs susceptible to generate samples that do not match the data distribution well. In image-generation tasks, this leads to blurry images as the Gaussian mixtures, used for their simplicity, are not the best for representing natural data distributions.

Generative adversarial networks
Generative adversarial networks [13] and their conditional variant, conditional GANs (CGANs) [22], solve the problem of approximating data distributions in a different way than RBMs and VAEs. In the GAN framework, a standard feedforward neural network G, with parameters θ G , generates new data using noise vectors z: The network G is trained by letting a second neural network, the discriminator D, evaluate the outputs from G. Unlike in a VAE, the second neural network does not map the input space to the latent space. Instead, the discriminator directly trains the generator to find the map from the latent noise space to data. The discriminator D is a standard classifier network, parameterized by θ D , that takes an input x and outputs a probability D(x ; θ D ) that x comes from the data distribution. The parameters {θ G , θ D } are optimized in an alternating fashion until the generator produces outputs that the discriminator cannot distinguish from samples of the real dataset, i.e., both x ∼ p data and x ∼ p data . In each optimization step, θ D is first updated to maximize Then, θ G is updated to minimize When the training is completed, new samples can be generated from G using noise vectors z.
A standard GAN can only generate samples randomly according to the data distribution. However, we can modify the inputs to G and D by adding a conditioning variable c to guide the output. This leads to the CGAN architecture, where The optimization of parameters for the CGAN networks follows the same procedure as for the GAN, i.e., maximizing Eq. (19) and minimizing Eq. (20).
The CGAN architecture provides a very flexible method for modelling complex conditional maps between different spaces. The flexibility stems from using the discriminator network, instead of a fixed loss function, as an evaluator of the generator performance. Conditional GANs have been used successfully in many types of generative tasks, e.g., generating images [135,136], converting edges to images, converting day to night pictures [22], etc. In this article, we use CGANs for QST, but we also borrow ideas from VAEs for this task.

III. DATA
In this section, we define the data that we use for testing our methods of classification and reconstruction. We consider eight classes of optical quantum states, which we define in Sec. III A. The data for these states is given by measurements consisting of applying coherent displacements followed by sampling of the photon number distribution for the resulting state, as we explain in Sec. III B. We consider six types of noise, described in Sec. III C, that can distort the data.

A. Optical quantum states
Optical quantum states are states of photons, i.e., of bosonic fields. In general, such states live in an infinitedimensional Hilbert space, but we can obtain a finitedimensional description by introducing a cutoff on the energy of the state. In the Fock basis for a single bosonic mode, a harmonic oscillator, the state is written where n represents photon number, N is the size of the Hilbert space, and c n are complex-valued amplitudes such that |c n | 2 = 1. Pure and mixed states in this Hilbert space are represented as N × N density matrices ρ.
Throughout this paper, we use a Hilbert-space cutoff of N c = 32, except for some specific examples and demonstrations. We restrict the maximum photon number of the various states to 16 to avoid artefacts due to truncation once the displacements are applied to these states.
Below, we define the various types of states used in this work. The first three are well-known, basic classes of quantum optical states. The following four are from bosonic codes, i.e., states that are designed for quantum error correction. For these latter states, we adopt the definitions from Ref. [137], where µ = {0, 1} denotes whether the state encodes logical 0 or 1. Finally, we also use random states as noise for representing mixed states.

Fock states
The Fock states fock are the eigenstates of the Fock basis, We consider Fock states with photon number 1 ≤ n ≤ 16.

Coherent states
Coherent states coherent are displaced vacuum states, characterized by the complex displacement amplitude α: where D(α) = exp αa † − α * a is the displacement operator and a (a † ) is the annihilation (creation) operator of the bosonic mode. The parameter α gives the position of the state in phase space. We consider 10 −6 ≤ |α| ≤ 3 to keep the mean photon number |α| 2 well below the Hilbert-space cutoff.

Thermal states
Thermal states thermal are mixed states where the photon number distribution follows super-Poissonian statistics: where the probability distribution for the photons are given by where n th is the mean photon number. We consider thermal states with n th ∈ [0, 16].

Num states
Num states are a specific set of bosonic-code states, consisting of superpositions of a few Fock states, numerically optimized (hence the name num) for quantum error correction, and characterized by their average photon numbern ∈ {1.562, 2.696, 2.770, 4.149, 4.336} [137,138]. The logical states for each code are orthogonal; forn = 1.562, they are

Binomial states
Binomial states bin are bosonic-code states constructed from a superposition of Fock states weighted by the binomial coefficients [137,138]: Here the parameter N plays a similar role as α in coherent states, determining the size of the state. For the parameter S, we use integers in the range [1,10]. Together with the Hilbert-space cutoff N c , this determines a maximum value for N. We use 2 ≤ N ≤ N c /(S + 1) − 1.

Cat states
Cat states cat are bosonic-code states consisting of superpositions of coherent states, with the simplest example being (|α ± |−α ) up to a normalization. In general, we can define cat states, parameterized by an integer S and a complex displacement α, as projections given by where N is a normalization. The projections are on even or odd Fock states given by with the variable r ∈ {0, 1, 2, ..., 2S + 1}. The parameter S corresponds to the number of photon-loss errors that the code can correct for. A simpler formulation for large α, i.e., 2|α| sin( π S+1 ) >> 1, is an equal superposition of the 2(S + 1) coherent states αe i π S+1 k 2S+1 k=0 around a circle of radius |α|. We use S ∈ {0, 1, 2} and |α| ∈ [1, 3].

Gottesmann-Kitaev-Preskill states
Gottesmann-Kitaev-Preskill states gkp are bosoniccode states defined on a square grid in phase space [100,137,139]. The ideal gkp states can be seen as a superposition of vacuum states displaced to the points of this grid: However, a finite gkp state limits the lattice and adds a Gaussian envelope to make the state normalizable, thus parameterizing the state with a real parameter ∆ ∈ [0, 1] as where the complex amplitudes α are calculated from the grid K(µ) = π 2 (2n 1 + µ)) + i π 2 n 2 with some finite cutoff for n 1 , n 2 . We use n 1 , n 2 ∈ {−20, 20} and ∆ ∈ [0.2, 0.5].

Random states
Random states are mixed states generated using the QuTiP [140,141] function rand_dm by choosing a density (proportion of non-zero elements) for the density matrix ρ random . The elements of ρ random are sampled from a uniform distribution, ensuring that the density matrix is physical (Hermitian, positive semi-definite, and with unit trace). We choose the density 0.8 for all tasks in this paper and allow ρ random to be full-rank mixed states.

B. Measurements
Measurements on optical states are usually performed with a displace-and-measure technique. Applying a coherent displacement of amplitude β and measuring the photon number distribution gives the generalized Q function [142] From the generalized Q function we can easily obtain other quasi-probability distributions describing the state, e.g., the Husimi Q function (photon field quadratures) and the Wigner function (photon parity) (f) The corresponding Wigner function computed using the different Q β n as (2/π) n (−1) n Q β n . Note that even when choosing a Hilbert-space cutoff of 100 for this demonstration, the corners in the Wigner-function plot have spurious non-zero values at large displacements β ≈ ±5 ± 5i. To mitigate such effects, larger cutoffs are required for states that have a high photon number or we need to restrict the computation to smaller values of β. Other methods of computing the Wigner function from ρ do not suffer such problems even with a cutoff of 16 for this specific example. QuTiP provides several such implementations and we use one of them, the numerically stable Clenshaw method, to compute Wigner functions in the rest of the paper.
In Fig. 2, we plot the Q β n functions for a binomial state to illustrate the different types of data. We also show how combining the various levels of the generalized Q function leads to the Wigner function.
In this paper, we mostly consider classification and reconstruction of optical quantum states based on Husimi-Q-function data, but our methods can also be used with Wigner-function data (as we show when reconstructing a state from experimental data in Ref. [1]), generalized-Qfunction data, or data from any other observables.
In Fig. 3, we plot Wigner functions and Hinton plots of the density matrices for representative examples of all classes of states defined in Sec. III A above.

C. Noise
Noise is an inevitable factor in most experiments. Thus, methods for state classification and reconstruction should be made sufficiently robust against various types of noise. In this subsection, we define the different types of noise that we use to test our neural-network-based classification and reconstruction.
Noise can enter the problem at different stages. First, the preparation of the state to be classified or reconstructed could have errors that lead to a slightly different state, ρ → ρ noisy (state-preparation errors). Second, the measurement protocol could have errors due to calibration such that we are not measuring exactly what we sought out to measure, {O i } → {O noisy i } (measurement errors). Lastly, there can be errors in the data collection, e.g., errors incurred during amplification of the signal or photon shot noise which corrupts the data, d → d noisy (data errors).
The state-preparation and measurement (SPAM) errors can be systematic and thus hard to correct. Recently, deep neural networks have been demonstrated to be effective in learning such errors and correcting them [91] by training a supervised model to correct the data d noisy → DNN → d. The neural network is thus used as a sophisticated filter to de-noise experimental data which can be agnostic to the underlying SPAM noise. In this work, during reconstruction, we do not train our networks to correct SPAM errors; we only deal with specific errors on a case-by-case basis. But for classification, we show that the neural-network approach is robust against the various types of SPAM and data errors defined below.

Mixed states
In many experiments, thermal and other environmental noise will affect the quantum state. We model this noise by considering mixed states [see Fig. 4

(a)]
with σ ∈ [0, 0.5]. In the classification task, the correct label for such a mixed state is defined to be that of the class that ρ belongs to. In the reconstruction task, the aim would not be to reconstruct ρ, but to reconstruct ρ mixed , since that is the actual state created in the experiment.

Convolution with Gaussian noise during amplification
In a measurement scheme which uses linear amplification detectors, one of the effects of noise is modelled by considering additional bosonic modes coming from the amplifier channel [143]. The Husimi Q function in the presence of such linear noise channels [see Fig. 4 Figure 3. Representative examples from each class of optical quantum states considered in this paper. In the top row, we plot the Wigner function for the states, using the same scaling as Fig. 2(f). In the bottom row, we show the values of the density-matrix elements for each state as Hinton plots similar to Fig. 2(b). We can see that the Wigner functions and density matrices have characteristic patterns that a neural network can learn and use for classification or reconstruction. convolution where P h (β ) is the Glauber-Sudarshan P function [144] of the noise mode. While at optical frequencies the noise mode is nearly in the vacuum state, such that Q noisy (β ) ∼ Q(β ), at microwave frequencies, the noise mode is in a thermal state. In this case, Therefore, the effect of noise is simply applying a Gaussian convolution with the variance n th . Note that such a noise is also interpreted as detection efficiency error with the reduced detection efficiency η = 1/(1 + n th ). We consider such noise during reconstruction tasks by allowing it as an input which is easily estimated in experiments, e.g., the detector efficiency or thermal photons in the amplification channel.

Photon loss
If the optical quantum state is created in a lossy resonator, photons may leak out from this resonator before the measurement of the state is completed. We model such photon loss [see Fig. 4(c)] by letting the original state evolve for some time τ according to the master equationρ where H = ωa † a is the free resonator Hamiltonian, ω is the resonator frequency, γ is the photon loss rate, and Similar to the case of mixed states in Sec. III C 1, in the classification task, the correct label is defined to be that of the class that ρ(t = 0) belongs to, while in the reconstruction task, such a noise is not necessarily an error as the aim is to reconstruct ρ(t = τ ).

Affine transformations
An affine transformation is a geometric transformation that can be represented as a composition of a linear transformation and a translation. In two-dimensional (2D) images, it preserves lines and parallelism, but allows for effects such as rotations, displacements, reflections, scaling, and shearing. Our motivation for this type of noise is that such effects can mimic SPAM errors, e.g., poorly calibrated displacement pulses, squeezing, and rotations of the state. We therefore consider rotations, displacements, scaling, and shearing to distort the training data (2D images of Husimi Q or Wigner functions), see Fig. 4(d).
If (x, p) represent the position and momentum values in the phase space, i.e., (Re(β), Im(β)), the affine transformation (x, p) → (X, P ) can be parameterized by the scaling factors (s x , s y ), rotation angle θ, the shear Ω, and linear displacements ∆x, ∆p as We use the TensorFlow [145] implementation for data augmentation that applies such transformations, with the values of the parameters randomly selected within a certain range for each image augmentation: such that the pixels of the images are shifted up to 20 % of the image size.
The range for scaling the image (zoom) is set to 0.2 to allow shrinking or expanding the images within a factor [0.8, 1.2] of the original size. We also allow the images to be flipped horizontally and vertically. The data augmentation described here is only used in the classification task.

Additive Gaussian noise
Measuring the expectation value of a quantum observable often requires repeated measurements to find the average value with good precision. Thus, a limited number of measurements will reduce the precision. Moreover, the precision can also be reduced by binning of measurement results from nearby points in the phase space. We model these types of uncertainty in the data by adding randomly sampled values from a Gaussian distribution N with zero mean and standard deviation σ G to each data point as See Fig. 4(e) for an example.

Pepper noise
Salt-and-pepper noise represents a corruption of data where the signal changes drastically at a few points. We use pepper noise [see Fig. 4(f)] to represent dead pixels or missing data by selecting a random proportion of data points and setting them to zero.

IV. METHODS
In this section, we present the details of how we use deep neural networks for the two tasks -classification (quantum state discrimination) and reconstruction (obtaining the density matrix) using the data discussed in Sec. III. Three different neural-network architectures are considered: Classifier, Generator, and Discriminator. We provide the methods and parameters for training and evaluation of the networks that we have used to obtain our results in this paper (Sec. V) and in Ref. [1].

A. Classification
The problem of quantum state discrimination can be considered as a classification task, a task for which deep neural networks have shown impressive results. The input data d consists of observed frequencies for some measurement, which is related to the probabilities of outcomes of observables. The output is a label ∈ {fock, coherent, thermal, cat, bin, num, gkp}. The neural network we use for the classification is a standard convolutional neural network, which we train by minimizing cross-entropy loss using back-propagation.

Input and output data
Since we consider optical quantum states, the data, e.g., the Husimi Q or Wigner function of the state, can be rearranged into an image on a grid determined by the real and imaginary parts of the displacements β. Our training dataset is generated by randomly constructing states from the seven classes discussed in Secs. III A 1-III A 7, adding noise in the form of random mixed states (see Secs. III A 8 and III C 1) and then calculating the Husimi Q functions of the resulting states for the fixed set of β values evenly spaced in a 32×32 grid with β ∈ [−5, 5].
We use 43,762 states for training and 8,670 states for testing. The input values are normalized to the range [0, 1] by dividing each data instance with the maximum value. In the training phase, affine transformations (see Sec. III C 4) and additive Gaussian noise (see Sec. III C 5) with σ randomly selected between [0, 0.05] are applied to the data. The addition of noise has a dual purposepreventing overfitting and mimicing the effects of measurement noise. In the testing phase, we consider the impact of different types of noise separately.
The output labels are encoded in a 7-dimensional vector using a one-hot-encoding, {t i } with t i ∈ {0, 1} and t i = 1 denotes that the input state has been labelled as belonging to the class i.
Note that the full generalized Q function (see Sec. III B) could be represented as a multi-channel image n × n × N c , where n × n is the grid of β values and N c is the photon-number cutoff. Similarly, we can just input the flattened data vector d for other types of measurements that cannot be seen as an image. However, for data in such form, using convolutional layers in the neural network would not make much sense, since there may not be any spatial correlations in the data.  Figure 5. Sketch of the Classifier network which classifies optical quantum states from Husimi Q data (input image on the left). The blocks represents convolution operations, where filters (exemplified by boxes connecting one layer to another) extract features from the image. We use six such convolutional layers in our architecture (we only show three here). The extracted features are fed to the first of three fully connected layers. The outputs of the last layer are converted to a classification label. For the parameters used, see Table II.
The Classifier network, illustrated in Fig. 5 and detailed in Table II, is a convolutional neural network (CNN). Its first six layers consists of blocks of convolution [147] layers that extract geometric features from the input image. After the first six layers, the output is flattened and fed through two fully connected layers that output a 7-dimensional vector for each input image.
We use the activation function "LeakyReLU" [148] for all layers except the final output. The final output layer has 7 neurons, with outputs {y i }, one for each class. We apply a softmax activation to these outputs, to normalize the outputs such that they can be interpreted as the probability of the input data belonging to one of the seven classes. We assign the predicted label for the input state to the output that has the highest probability.

Training
The parameters of the Classifier network are trained by minimizing the average cross-entropy loss between the predicted probabilities softmax(y i ) in Eq. (45)  We use the gradient-based optimizer Adam [12] with a learning rate l = 0.0002 and exponential decay rates for first and second moment estimates, m 1 = 0.5, m 2 = 0.5, to minimize the cross-entropy loss.
During training, we apply the dropout regularization technique [149], where the output of a random fraction of neurons is ignored at each step of optimization, to prevent overfitting. We use 40 % dropout after the second, fourth, and sixth convolutional layers, and after the first dense layer. To further prevent overfitting, we also add a small (additive) Gaussian noise (σ = 0.005, see Sec. III C 5) after the second and fourth convolutional layers.

B. Reconstruction
We now show how a standard neural network can be used to reconstruct the density matrix ρ of a quantum state by adding custom layers to a generative model. The standard formulation of a generative model with feedforward neural networks (see Sec. II D 3) is a map between Figure 6. Sketch of the Generator G and Discriminator D neural networks adapted for quantum state reconstruction. The two inputs to G are the measurement statistics d and the observables {Oi}. The input d is taken as a flattened vector, which is reshaped to a 16 × 16 × 2 matrix after the first layer of G. Then, after successive transpose convolution operations, we obtain a 32 × 32 × 2 matrix. This intermediate output is converted into a lower-triangular matrix with real elements on the diagonal to obtain a Cholesky decompostion form, TG, that can yield a valid density-matrix representation ρG. The expectation values of the observables are then computed using the Born rule tr(OiρG). In the last layer, any known source of noise is added to the outputs. The inputs to D are a concatenation of d with either the generated data from G or d itself. The output is interpreted as a similarity score between the inputs (the score is ∼ 1 if they match, i.e., for inputs ∼ d). The weights of the two networks, θG and θD are updated alternatingly to minimize their respective loss functions. For the details of all the parameters, see Table III and Table IV. a latent space and the data space. Our data d consists of single shots or average values of measurement outcomes for operators {O i }. We construct a Generator network parameterized by weights θ G ,that first estimates a density matrix ρ G . We then use a custom Expectation layer that can generate the statistics for new measurements The Generator-network formulation, depicted in Fig. 6, resembles a VAE (see Sec. II D 2), but rather than modelling the data distribution using a parameterization with a mixture of Gaussians, we instead use the straightforward parameterization given by the estimated density matrix itself, ρ G . The mapping between the latent space of measurement operators {O i } and the outcomes is simply tr(O i ρ G ), which is the data generation map. Below, we define the input and output data for the Generator network. Then, we show the details of the network architecture with our customized layers that regularize the intermediate output ρ G to a valid density matrix and generates the correct output. Finally, we discuss the training methods used to optimize the parameters of the Generator network. The first training method focuses on minimizing the least-squares and cross-entropy loss between the expected output and generated output. The second method learns a more sophisticated loss function in the form a second, trainable neural network, a Discriminator, also illustrated in Fig. 6. This second training method is inspired by the idea of CGANs [13,22], which we use for quantum state tomography (QST-CGAN) [1].

Input and output data
The input data for reconstruction are the measurement statistics d and the operators {O i } that were measured. Similar to the classification task, we consider the Husimi Q function in a 32 × 32 grid with β ∈ [−5, 5]. The measurement operators O i are 32 × 32 complex-valued matrices. Therefore the input data for a single reconstruction is a combination of the flattened data vector d (1×1,024) and the set of operators {O i } (1 × 1,024 × 32 × 32). Note that it is easy to change the parameters in the data or the neural-network architecture to allow arbitrary phasespace grid sizes and Hilbert-space cutoffs; the fact that they are both set to 32 in most examples here does not have any special significance.
The training data for a single reconstruction thus requires only these 1,024 data points (real-valued num-bers) and the 1,024 operators (complex-valued matrices) as the input for each reconstruction. We consider noise on a case-by-case basis during training (described in Sec. IV B 3 below).
The output of the neural network is a (1×1,024) vector representing the expectation values for the measurements {O i }. Inside the Generator, the full density matrix of the state is estimated as a 1 × 32 × 32 complex-valued matrix ρ G determined by the outputs of an intermediate DensityMatrix layer.
In this way, we allow for a flexible architecture, which can reconstruct a single state with inputs shaped as (1 × 1024, 1 × 1024 × 32 × 32) for (d, {O i }) or allow multiple states as the input simply by concatenating the inputs. For example, to reconstruct 10 states simultaneously with 1,024 measurements each, we simply feed the network a batch of data points as (10 × 1,024, 10 × 1,024 × 32 × 32).
In this article, we only consider single reconstructions, so our inputs will always be of the shape 1 × n for the data d and 1 × n × N c × N c for the measurements, where n is the number of measurement settings and N c is the Hilbert-space cutoff. Note that we allowed the most general description of the measurement setting in the inputs as the full operator descriptions {O i }. We could also use alternative ways to specify the measurement settings, e.g., a set of complex displacements β i , and redefine our Expectation layer to use those β values. In the case of qubit tomography, these measurement settings can be replaced with a set of single-qubit measurement operators such as [Z, X, X, Z, . . .].

Network architecture
Our Generator network G is a modified version of the standard G(z; θ) formulation (see Sec. II D 3), where we first consider the conditional form G(z|(d, {O i }); θ). The conditioning variable is our data and the measurement settings represented as a vector and a set of matrix operators, respectively. Then, inspired by the pix2pix architecture [22], we remove the random noise z and just consider the data and measurement operators as inputs to define G(d, {O i }; θ) as the Generator.
The full architecture, detailed in Table III and depicted in Fig. 6, begins with a fully connected dense layer, which receives the flattened data vector d as input. The output of this layer is reshaped to a 16 × 16 × 2 tensor. This layer converts the input into a matrix with two channels that can be upsampled into the density matrix. The next layers are three blocks of two-dimensional transpose convolution operations (Conv2D-T) and instance normalizations [150] such that the final output is moulded to an estimate of the density matrix ρ G . All the layers described so far use LeakyReLU activation, except the final Conv2D-T layer, whose outputs are fed to a custom DensityMatrix layer.
The DensityMatrix layer converts the output of the final Conv2D-T layer to a valid density matrix. This Table III. Definitions, shapes, and number of trainable parameters for the layers of the Generator network. We denote the transpose convolution layers as Conv2D-T(f, k, s) where f , k, and s represent the filter size, kernel size, and strides respectively. After the first dense layer and the first three Conv2D-T layers, the activation function LeakyReLU is used. Instance normalization is used between the first two Conv2D-T layers. The output from the last Conv2D-T layer passes through two custom neural network layers: a DensityMatrix layer generating ρG, and an Expectation layer generating expectation values. A full implementation of the code as a TensorFlow model can be found in Ref. [146].

Layer
Output output is two matrices (32 × 32 × 2), which are combined into one 32 × 32 complex-valued matrix, T G . The upper triangular part of T G and the imaginary part of the diagonal are set to zero to obtain the Cholesky decomposition of a Hermitian matrix [Eq. (1)]. Finally, we divide the resulting matrix by its trace to obtain a valid density matrix. Therefore, the custom DensityMatrix layer can convert the real-valued outputs of any standard neural network to a Hermitian, positive-semidefinite matrix with unit trace. The final layer is another custom one, called Expectation. It takes as input {O i } during training (the other part of the input to the Generator) and outputs the expected values for measurement outcomes for each component of d as The last two layers, DensityMatrix and Expectation, do not contain any trainable parameters. The Discriminator network used to train the generator is detailed in Table IV. This network receives two inputs: the data d and the generated statistics d , and begins by concatenating the two. The concatenated input is then passed through four dense layers, with the final layer having 64 neurons. All the layers of the discriminator use the LeakyReLU activation, except the final layer, whose outputs are interpreted as a measure of the similarity between d and d. Note that the dimensions, or even the shape, of the final output layer can be arbitrary. The outputs simply need to be interpret as a similarity score between d and d which should be ∼ 1 if d ∼ d and ∼ 0 otherwise. We were inspired by the PatchGAN idea [22] for our Discriminator that motivates penalties at the scale of patches in the input. We have also concurrently found during the course of our work, that similar ideas were effectively demonstrated for X-ray tomography with promising results for denoising [151].

Training
The training for reconstruction can be done in two ways -either we reconstruct a single state or we reconstruct a set of different states using the same Generator network. This flexibility comes from our formulation of the Generator network and reshaping of the data to find a map from data space to the set of density matrices. In this article, we only show how to perform single reconstructions, but in Ref. [1], we show how the same Generator network can perform single-shot reconstructions for many different states.
For each reconstruction in this article, we only consider a single state ρ and the data from measurements of several operators on ρ as the inputs and outputs (see Sec. IV B 1). We train the Generator network to minimize a loss metric that gives some measure of how the reconstructed statistics d , calculated from an underlying ρ G , differ from the data d. If d i are the frequencies of measurements O i and d i = tr(O i ρ G ) are the computed probabilities from the generated density matrix, then maximizing the log-likelihood in Eq. (3) amounts to minimizing the cross-entropy loss between observed frequencies d and d : However, the cross-entropy loss assumes discrete-valued data, i.e., single-shot outputs of POVMs, whereas in many cases we may be looking at continuous-variable outputs instead. If we consider the data to be the expectation values of some continuous-valued observable, e.g., the homodyne current, metrics such as the mean squared error where q is the number of data points, are more suitable. For such continuous-valued data, the error in measurement can be assumed normally distributed with variance σ 2 i . Under this assumption, minimizing the L2 loss maximizes the likelihood where we consider the mean for each measurement outcome as the expectation value µ i = tr(ρ M) for some observable M . Speaking more generally, the loss function uses some metric to measure the distance between two probability distributions P and Q. Such metrics can be divided into two major classes: φ-divergences and integral probability metrics (IPMs) [152,153]. The first are of the form where φ is some convex function φ : R ≥0 → R ≥0 , while the latter are defined as where the class of functions G parameterizes some notion of distance.
In the deep-learning community, the study of such metrics in generative modelling is an area of active research [154,155]. It has been shown that the choice of loss function can greatly impact the quality of image reconstruction [156]. There are several recent attempts to gain better understanding of the role of different loss functions in GAN performance, e.g., using the Wasserstein metric [154] or IPMs [157].
Since the best choice of loss function is far from clear, we train the Generator network to minimize several different loss functions between predicted Husimi Q values and observed data. We first use the well-known L1, L2, and cross-entropy loss functions, as well as the KL divergence. The latter two are closely related, and belong to the class of φ-divergences. The L1 loss is both a φdivergence and an IPM.
Beyond these well-known loss functions, GANs allow for more complex loss functions to be learned. In our QST-CGAN architecture, we train the Generator using the Discriminator network combined with L1 loss, to minimize with the L1 loss coefficient λ L1 ∈ {0, 1, 10, 100}. The discriminator loss function maximizes Eq. (19) by minimizing where the last term is a gradient penalty [158] with weight λ ∆ = 10. We combined the inputs to the Discriminator as the vector x. Therefore, in each training iteration, we alternatively update the generator and discriminator weights using backpropagation with the help of some gradient-based optimizer. Since the choice of hyperparameters, e.g., optimizer or learning rate, can significantly affect the rate of convergence, we try to find settings that enable a fast convergence for all loss functions. To make a fair comparison, we keep the same parameters for optimization for all loss functions. We use the Adam optimizer with the exponential decay rates for first and second moment estimates as m 1 = 0.5, m 2 = 0.5. We also use an exponentially decaying learning rate as a function of iteration number i, with initial learning rate l 0 = 0.0002, decay constant C = 0.96, and s = 1000 steps.

V. RESULTS
In this section, we characterize the performance of our Classifier and Generator networks in various settings. We first check the performance of the Classifier network, including some types of noise, in Sec. V A 1. We then study, in Secs. V A 2 and V A 3, the impact of photon loss and additive Gaussian noise on classification performance. Finally, in Sec. V A 4, we analyze which parts of the data the Classifier bases its decision on. This provides information that can help reduce the number of measurements needed in an experiment or guide an adaptive scheme for tomography.
For reconstruction, we first investigate, in Sec. V B 1, the result of using different loss functions, including the Discriminator network, to train the Generator network. We show the performance of the Generator network against a standard maximum-likelihood-based reconstruction algorithm (iMLE) [159] under additive Gaussian noise in Sec. V B 2 and gaussian convolution noise in Sec. V B 3. Then, we show the results of reconstruction for mixed states in Sec. V B 4 and finally, in Sec. V B 5, demonstrate how few data points are needed for the network to reconstruct a state well.  The performance of the Classifier network on a test set is shown as a confusion matrix in Fig. 7(a). The test set consists of ∼ 1, 200 different instances of each of the seven classes in Secs. III A 1-III A 7, with noise in the form of state mixing (see Sec. III C 1) applied with σ ∈ [0, 0.5] and density 0.8.
The accuracy of the classification (number of correct classifications divided by the total number of classifications) on the whole test set is 98.6 %. For a validation set with the same states as the test set, but where we have added noise in the form of affine transformations (see Sec. III C 4) and additive Gaussian (see Sec. III C 4 with σ ∈ [0, 0.05]) on top of the state-mixing noise, the accuracy of the Classifier remains very high, 97.7 %.
It is clear from the confusion matrix in Fig. 7(a) that the class which presents challenges for the network is cat. All other classes are correctly identified in virtually every case, but the cat states are misclassified in about 9 % of the cases. In these cases, the network misidentifies the cat states as all other classes except thermal, with the most common mislabellings being coherent, fock, and binomial. The reverse misidentification, where a state is misclassified as cat, occurs for about 1 % of the binomial states.
A few examples of misclassifications are shown in Fig. 7(b), where we consider pure cat states with low values of α. These examples demonstrate that there are parameters for which states from different classes are very similar. For example, a cat(α = 4, S = 0) state ρ and a binomial(S = 1, N = 16) state ρ have a fidelity greater than 0.99. Note that the fidelity F reduces to the squared overlap | ψ|ψ | 2 for pure states. Similarly, the fidelity of cat(α = 3, S = 4) and fock(10) is greater than 0.996. It is thus not surprising that the network found some states hard to classify. A human quantum physicist would likely have made the same misclassifications from the data in Fig. 7(b).

Recognizing cat states with photon loss
We now investigate the performance of the Classifier network in the presence of photon loss (see Sec. III C 3). In Fig. 8(a), we show how well the Classifier manages to recognize a set of cat(α, S = 0) states, with µ = 0 and |α| ∈ [2,3], as more and more photons are lost. Before any photons are lost, the softmax probabilities for different labels show that the Classifier assigns the highest probability to the label cat. After ∼ 70 % of the photons have been lost, the probability of the state being classified as a cat decreases and the labels coherent and binomial become equally probable. It is an interesting question whether these probabilities reflect the characteristics of the state in such a way that it could be used as a starting point for reconstruction. When almost all the photons are lost, the classification label is always coherent.
Even though we did not include any photon-loss noise during the training phase, the Classifier is still able to identify cat states after many photons have been lost. It should be noted that once photons have been lost, it is not certain that the state can be considered a cat state anymore. A distinctive feature of cat states is the interference between the coherent states making up the super-  Husimi Q functions for one of the cat states in the dataset with, from left to right, 0 %, 20 %, and 100 % of photons lost with respective fidelities 0.76 and 0.19 for the states with photon loss. It is not straightforward to assert from just the Husimi-Q data when a cat state stops being a "cat" as it still possesses cat-like features (two coherent blobs) even after losing 20 % of the initial photons. (c) The photon-number occupation probabilities for the states in (b). Note that the occupation probability for the vacuum state is ∼ 1 but we set the limits of the y-axis to 0.5 for better distinguishability. position state. This interference results in zero probability of odd photon numbers in the state [see the left panel in Fig. 8(c)]. Once photon loss starts acting on the state, these occupation probabilities become non-zero [see the middle panel in Fig. 8(c)], but the Classifier network can still identify general features leading it to classify the data with the label cat. However, once more pho-  tons have been lost, the state ceases to be a cat state and is classified as a coherent state.
We note that the results presented here for classification under photon loss may be different if the network instead is trained on data in the form of Wigner functions (see Sec. III B) instead of Husimi Q functions. The Wigner function for cat states has characteristic interference fringes, some with negative values, between the coherent-state blobs. These features are not clearly seen in the Husimi Q function; it only takes very small nonzero values (∼ 10 −4 ) between the two coherent blobs in a cat(α = 2, S = 0, µ = 0) state. Another approach to identify the lossy cat states better would be to train a classifier to distinguish cat states and mixtures of coherent states from the Husimi Q function. Just like the Classifier, this does not require explicitly specifying criteria for what is or is not a cat, but works in the spirit of "Software 2.0" [160] -replacing explicit programming with learning from data.

Classification in the presence of additive Gaussian noise
Next, we test the performance of the Classifier in the presence of additive Gaussian noise. As explained in Sec. III C 5, this type of noise models uncertainty in the data due to averaging over a limited number of measurements and binning of data. In Fig. 9(a), we plot the classification accuracy as a function of the standard deviation σ of the added Gaussian noise (see Sec. III C 5). The dataset is the same as that in Fig. 7, but with the Gaussian noise added. In Fig. 9(b), we show an example of how the Gaussian noise impacts a cat state in the dataset.
The accuracy of the predictions from the Classifier remains high until σ ≈ 0.05 and then decreases gradually. However, even at σ = 1, the accuracy is almost 25 %, clearly better than ∼ 1/7, which is what one would obtain for a random guess among the seven classes. At these high levels of noise, the network can still correctly classify up to ∼ 60 % of the fock states, ∼ 30 % of the coherent states, and ∼ 55 % of the cat states in the test set. However, at such a high level of noise, the Classifier almost always predicts the label as one of fock, coherent, or cat. Hence accuracy is not the best indicator of performance in all scenarios.
Therefore, in addition to the accuracy, we also quantify the Classifier performance by considering the receiveroperating-characteristic (ROC) curve [161,162]. The ROC curve for a binary classification problem is a plot of the true positive rate (TPR, the ratio between correctly classified positive labels and the number of real positive labels) versus the false positive rate (FPR, the ratio between false predictions of positive labels and true predictions of negative labels). The area under the ROC curve gives an indication of the discriminative power of the classifier: the area is 1 for perfect classification and 0.5 for random guesses. For our multi-class problem, we use the one-vs-rest strategy in Scikit-learn [163] to calculate the area under the ROC curve. The result, averaged over all classes, is shown in Fig. 9(a). The area under the ROC curve shows a behaviour similar to the accuracy, but indicates a somewhat better performance than the latter does.

What does the network see?
The Classifier network is a highly nonlinear function that maps data to a label. In order to find the patterns in the data that the network uses to determine the label, we apply gradient-weighted class activation mapping (Grad-CAM) [164]. The Grad-CAM method works by fixing a class label and finding the gradients of the score for this target class (before the softmax activation) with respect to the last convolution layer of the network. This is a form of back-propagation that allows us to construct a heatmap of numbers showing which pixels of the input image influence the output the most. In Fig. 10(a), we show three examples of noisy input data, from which we calculate Grad-CAM heatmaps, shown in Fig. 10(b). These heatmaps are then used in Fig. 10(b) to show the parts of the noise-free input data that contribute the most to the classification. Affine transformations (see Sec. III C 4) and additive Gaussian noise (see Sec. III C 5) with σ = 0.2 have been applied to the input data to simulate an experiment with SPAM errors and little averaging. We chose to only show the parts of the data where the heatmap has high values (exceeding 0.9), to demonstrate that, even in the presence of significant noise, the Classifier makes its decision based on the data in the regions that contain the important patterns characterizing the state. A non-machine-learning way to achieve similar results would be to hand-craft an algorithm that can clean noisy data and detect the regions with a high signal using some boundary-finding al-gorithm. However, instead of hand-crafting solutions for each type of state and noise, our trained Classifier can easily adapt to a variety of different scenarios.
The Grad-CAM results suggest an interesting possibility for adaptive tomography: using Grad-CAM during the data collection in an experiment to identify regions that are important and then sample from these regions more that from other places. In this way, our Classifier network can identify specific POVMs (defined by displacements β) that give the most useful data for discriminating optical quantum states.

Impact of loss metric
We first investigate how the choice of loss function affects the performance of our neural-network reconstruction method. In Fig. 11, we compare the impact of different loss metrics used to train the Generator network. For each loss function, we train the network using the same data. We show the reconstruction fidelity for data from a binomial state reconstructed with six different methods (see Sec. IV B 3). In Fig. 11(a), we show results for the QST-CGAN with various weights λ L1 of the L1 loss term in Eq. (54). For all values of λ L1 , including λ L1 = 0, corresponding to pure Discriminator loss, the reconstruction fidelity converges to unity. The convergence is faster with L1 loss added than without it, but a large weight on the L1 part of the loss function leads to worse performance than a moderate weight. The best performance is seen for λ L1 = 1, when the network converges to the correct reconstruction in a little more than 100 iterations, i.e., 100 updates of our estimate for the density matrix. The standard iMLE method, shown in Fig. 11(b), also converges to unit fidelity, but does so two orders of magnitude slower than the best QST-CGAN in terms of the number of iterations required.
In Fig. 11(c)-(f), we plot the results of training the Generator using the cross-entropy, KL-divergence, L1, and L2 loss functions, respectively. In all cases, the reconstruction fidelity converges to unity. The Generator trained with cross-entropy loss [ Fig. 11(c)] displays the fastest convergence, on par with the best QST-CGAN. Training with KL-divergence loss [ Fig. 11(d)] gives almost as good results. The L1 [ Fig. 11(e)] and L2 [ Fig. 11(f)] loss functions result in slower convergence, but still perform better than iMLE for the example considered here. We note that the L1 and L2 loss functions lead to a wider distribution of the number of iterations required for convergence for the same data than any of the other methods.
To ensure that the results in Fig. 11 were not particular to the state used as input data there, we also show the results of reconstruction of a cat state in Fig. 12. The results in Fig. 12(a) are similar to those in Fig. 11(a): the QST-CGAN always converges to unit fidelity, and (e) it does so the fastest when L1 loss is added to the Discriminator loss with weight λ L1 = 1. The main difference to Fig. 11(a) is that the convergence with pure Discriminator loss is considerably faster in Fig. 12(a) and is almost as fast as when L1 loss is added. Just as in Fig. 11, the iMLE method, shown in Fig. 12(b), converges to unit fidelity about two orders of magnitude slower than the best QST-CGAN.
The plots in Fig. 12(c)-(f) show the results of training the Generator using the cross-entropy, KL-divergence, L1, and L2 loss functions, respectively. Whereas these methods all eventually lead to unit fidelity for the recon-  Fig. 11, here they all sometimes fail and end up in a state giving reconstruction fidelity zero instead. In the cases where they do end up at unit fidelity, the convergence is approximately as fast as in Fig. 11, perhaps somewhat faster for the L2 loss in Fig. 12(f).

struction in
In the cases where the standard loss functions lead the Generator to reconstruct a state with fidelity zero, the reconstructed state is a cat state, shown in the inset of Fig. 12(d), orthogonal to the cat state, shown in the inset of Fig. 12(b), that provides the data. The two cat states have the same α and virtually indistinguishable Husimi Q functions. The only difference between the two is that the correct state has non-zero values in a narrow line along Im(β) = 0 between the two prominent lobes in the Husimi Q function, while the orthogonal state has non-zero values at two narrow lines along Im(β) ≈ ±0.5 instead. The differences between the two states are more clearly seen if one plots their Wigner functions instead. We consider reconstruction from Wigner-function samples in Sec. V B 4 and from experimental data in Ref. [1].
For the KL divergence in Fig. 12(d) and the L1 loss in Fig. 12(e), the Generator network seems to start moving towards one of the two cat states and then eventually converge to that state. However, for the L2 loss in Fig. 12(e), there are some runs where the Generator network reconstructs an orthogonal state (with very low fidelity to the target), but then corrects and jumps to the correct state within a few iterations. In the specific case of a cat state, the orthogonal state is reached by applying the photon annihilation operator a to the correct state. It remains to be explored if the Generator network learns to represent quantum states in a way that it can apply such non-trivial quantum operations to find the correct state from an initially incorrect prediction.
In our attempts to tune the hyperparameters of the training, we have noticed that higher values of the parameters m 1 and m 2 for the Adam optimizer removes the behaviour seen in Fig. 12(c)-(f). Instead, for these values the Generator always finds the correct state and not its orthogonal counterpart, similar to how the QST-CGAN in Fig. 12(a) always converges to the correct state. Another way to achieve this convergence could be sampling more around Re(β) to focus on the data that distinguishes the two cat states. In any case, it is noteworthy that the Generator network could possibly apply nontrivial steps to quickly reconstruct the state while the iMLE converges in small steady steps.
To summarize the results in Figs. 11 and 12, the main finding is that the best QST-CGAN reaches unit fidelity orders of magnitude faster than the standard iMLE method. The QST-CGAN performs best when its loss function is an approximately equal mix of L1 loss and the trained Discriminator loss. Further tuning of hyperparameters leads to even better performance in some cases. Among the standard loss functions, cross-entropy loss and KL divergence lead to somewhat better performance for the Generator network than did L1 and L2 loss. However, as we will see in the following subsections, there are other situations, e.g., when the reconstruction is performed in the presence of noise, where these losses give better performance and where a different value for λ L1 may be more suitable for the QST-CGAN. The fact that different situations seem to require different loss functions is an important argument in favour of the flexibility of the Discriminator loss, which can adapt to the situation, allowing the QST-CGAN to perform well in a more general setting.

Reconstruction in the presence of additive Gaussian noise
We now compare how different loss functions affect the neural-network performance in the presence of additive Gaussian noise (see Sec. III C 5). In Fig. 13, we show representative results of reconstructing a binomial state from Husimi-Q-function data where Gaussian noise has been added. For each β in the Husimi Q function of the state, we add a random value sampled from a zero-mean Gaussian with standard deviation σ G = 0.05. Before adding the noise, we rescale the data to the range [0, 1] by dividing it with the maximum value of the Husimi Q function.
To enable the neural network to learn the state underlying the noisy data, we augment the Generator output with the known noise by introducing a GaussianNoise layer. This layer applies the same type of noise to the reconstructed data by sampling from a Gaussian with σ G = 0.05 at each gradient-descent step of the Adam optimization. Note that at each step the noise added has the same variance, but differs due to the random sampling. The application of this method in practice requires knowing the type of noise, and its variance, in the experimental setup, but we believe this is feasible to extract.
It might also be possible to simply let the neural network learn the noise. However, applying backpropagation techniques for training requires calculation of gradients with respect to the parameters. The automatic differentiation methods usually employed for gradient calculation in neural networks are not straightforward to apply when such stochastic noise layers are present in the networks. Nevertheless, methods such as the reparameterization trick [12] can still make it possible to learn the noise. However, we have not explored this possibility further in this work.
Looking at the reconstructed Husimi Q functions in Fig. 13(b)-(i), it appears that the Generator with L1 or L2 loss and the QST-CGAN with λ L1 = {0, 1, 10} outperform the Generator with cross-entropy or KL divergence loss, and clearly outperform the iMLE. However, a small difference in the appearance of the Husimi Q function does not necessarily mean that two states are similar (compare the orthogonal states depicted in the insets of Fig. 12). We therefore plot, in Fig. 13(j)-(r), the photon-number occupation probabilities corresponding to the noiseless data and the reconstructions in Fig. 13(b)-(i). The noiseless data has non-zero probabilities for 0, 3, 6, and 9 photons. This is only reproduced well by the Generator with L1 or L2 loss and the QST-CGAN with λ L1 = 1. The QST-CGAN with λ L1 = 1 also reproduces the equal probabilities of 6 and 9 photons in the data better than the QST-CGAN with λ L1 = {0, 10}.
To further investigate how different loss functions affect the neural-network performance in the presence of additive Gaussian noise, we plot, in Fig. 14, how the reconstruction fidelity develops, for all reconstruction methods, as a function of the number of iterations for 30 different realizations of the noise in the data (the same binomial state as in Fig. 13). The average reconstruction fidelities and standard deviations are summarized in Table V, where we exclude the reconstructions when the state converges to an orthogonal state [see, e.g., Fig. 14(c) for λ L1 = 10]. Note that we have not tuned the hyperparameters for training, which can lead to improvements for all methods.
However, it is clear from Fig. 14 that the average reconstruction fidelities alone do not give a complete picture of the performance. Looking at the best runs for each method, we see that the QST-CGANs, the iMLE, and the Generator with L1 or L2 loss all are able to reach fidelities very close to 1, while the the Generator with cross-entropy or KL-divergence loss never ends up above fidelity 0.9. Looking at the spread of results, we see that the iMLE is very unstable, while the QST-CGAN and the Generator with L1 loss have a small number of runs ending up at a very low fidelity. The Generator with L2 loss appears to produce high-fidelity reconstructions with the greatest consistency.
Finally, we can compare how fast the different methods reach high reconstruction fidelity. Here, we see the same trend in Fig. 14 as in Figs. 11 and 12: the QST-CGAN is faster than the Generator trained with standard loss functions, and the fastest QST-CGAN is the one with λ L1 = 1.
To summarize the results in Figs. 13 and 14, the QST-CGAN and the Generator trained with L1 or L2 loss outshone the other methods when reconstructing a state in the presence of additive Gaussian noise. For the Generator trained with standard loss functions, the results were thus different from the noiseless case in Sec. V B 1, when training with cross-entropy or KLdivergence loss gave better results. When comparing the best QST-CGAN, the one trained with λ L1 = 1, to the best Generator trained with standard loss functions, the one trained with L2 loss, we find similar performance in terms of reconstruction fidelities, but the QST-CGAN is faster to reach a good reconstruction.
The errors in reconstruction using the cross-entropy and KL-divergence loss are expected, because these loss functions, similar to iMLE, assume the incorrect likelihood [Eq. (3)] for the data, which does not include the Gaussian error model of Eq. (51). The QST-CGAN reconstruction performs better than these methods since it has the flexibility to learn an appropriate loss function. We only provide the overall objective of making the reconstructed statistics similar to the data.
The reason for the good performance using the L2 loss can be further understood from arguments presented in   Fig. 13 in the presence of additive Gaussian noise (σ = 0.05). We consider 30 different sets of noise for each type of loss function. The full trajectories for the fidelities, as each method iteratively updates the estimate of the density matrix, are shown in Fig. 14.

Loss
Mean There, the authors note that the expectation value for the loss function when using L2 loss remains unchanged if the targets are replaced by random numbers distributed such that their expectation value matches the target. The crucial insight is that "the training targets of a neural network can be corrupted with zero-mean noise without changing what the network learns". Similarly, the L1 loss recovers the median values of the targets and is thus not affected by outliers. As a final remark, we note that an important factor to consider is that fidelity may not be the best metric to compare results, since sometimes completely random quantum states can have a high fidelity with a desired state [166]. Moreover, for continuous-variable quantum states such as the optical states considered here, several states with high overlap can have very different characteristics [166,167].

Reconstruction in the presence of Gaussian convolution noise
The additive Gaussian noise discussed in the preceding section models statistical errors due to a low signal-tonoise ratio. Such noise can be reduced (averaged out) by taking more measurements. However, in many cases we have other types of noise that can corrupt the data. Removing such noise in the context of image processing constitutes an inverse problem that is often difficult or ill-posed and requires regularization techniques [168].
We now show how to deal with one such type of noise: Gaussian convolution noise (see Sec. III C 2) due to a linear amplification channel [143]. In such a setup, a background noise, which usually is easy to estimate, corrupts our signal via a convolution operation. Similar to Sec. V B 2, we consider this known background noise as an input to the Generator and augment the Generator with a GaussianConv layer such that the Generator output is convoluted with the noise in the same way as the data. This noise layer is not learned, but fixed to the pre-determined background noise. The addition of the noise layer forces the Generator network to learn a density matrix ρ that can generate similar statistics as the data after convolution with the background noise.
In Fig. 15(a), we show the results of reconstructing a single-photon Fock state from the Husimi-Q-function data after convolution with a background noise arising due to the amplification channel being in a thermal state. In the simulations considered in preceding sections, we used a 32×32 grid of measurements. However, this coarse grid led to numerical aberrations in the convolution operation. For the present section, we therefore considered an 81 × 81 grid instead. The results show that the underlying single-photon state is reconstructed perfectly with unit fidelity by a QST-CGAN with λ L1 = 10 despite the presence of significant noise .
However, since the inverse problem can be ill-posed, it is also possible to obtain a result that reconstructs the data well without getting the underlying state right. In Fig. 15(b), we show the one such reconstruction using a binomial state in the presence of the same Gaussian convolution noise as in Fig. 15(a). The reconstructed density matrix gives rise to measurement statistics that match the measured (simulated) data exactly. However, the state itself is incorrect with a fidelity of just 0.45. We note that the symmetries of the state are captured in the reconstruction, but due to the convolution operation, the information of the exact state is lost and the inversion is not unique.

Reconstruction of mixed states
So far, we have only considered pure quantum states, where the density matrix has rank r = 1. However, in real experiments, we will almost always be dealing with mixed states. Such states may be harder for a neural network to handle, since they do not admit as compact a representation as a pure state, which can be written ρ = |ψ ψ|. In this section, we therefore discuss how the QST-CGAN method performs for mixed states with rank r > 1.
In a realistic experiment, it is reasonable to assume that the mixed state will have a dominant part, e.g., a target state which decoheres due to photon loss. Figure 16 shows results for the QST-CGAN reconstruction on a mixed state with a cat state (the same state as in Fig. 12) being the dominant component. The figure shows that the QST-CGAN can reconstruct such a mixed state easily for ranks up to r = 4 with close to unit fidelity (≥ .99). For ranks 1 and 2, the QST-CGAN method converges almost two orders of magnitude faster than iMLE. As the rank increases, both QST-CGAN and iMLE show a slower convergence. Although we did not run the iMLE for enough iterations to be certain, the increase in the number of iterations required for convergence appears to be greater for iMLE than for the QST-CGAN.  Figure 15. Reconstruction in the presence of Gaussian convolution noise. We assume that the noisy amplification channel has a thermal noise with n th = 5 photons (see Sec. III C 2). This noise is convoluted with the data from the Husimi Q function in (a) a fock(n = 1) state and (b) a binomial(S = 2, N = 4) state. We had to use an 81 × 81 grid for the data since we noticed that the convolution operation leads to effects such as loss of radial symmetry in (a) when using a grid of 32 × 32. The image obtained by subtracting the background from the convoluted data shows the symmetries of the state. (c, d) Reconstruction of the underlying states in (a, b) by the QST-CGAN (λ L1 = 10) where the Generator output is convoluted with the same background noise. We recover the underlying state from the DensityMatrix layer of the Generator. Note that even though the convoluted outputs and subtracted counts match the data well in both cases, the fidelity between the underlying reconstructed state itself and the true underlying state is 1 in (c), but only 0.45 in (d).
To explore further for higher ranks, we consider in Fig. 17 the reconstruction of a full-ranked (r = 32) thermal state with a mean photon number n th = 1. Here, the iMLE method converges very fast, almost instantaneously, while the QST-CGAN requires several hundred iterations. Although both methods reconstruct the state with a high fidelity ≥ 0.99, the photon-number populations of the reconstructed state do not exactly match the expected super-poissonian distribution for thermal states for the higher photon numbers (the tail of the distribution), neither for iMLE nor for QST-CGAN. The Husimi Q function of the reconstructed states match well in Fig. 17(b), but the Wigner functions for the iMLE and QST-CGAN methods do not match the smooth Wigner function for the thermal state in Fig. 17(c). However, changing the input data for reconstruction to the Wigner function (displaced parity measurements; see Sec. III B) can lead to a better reconstruction as we discuss in Fig. 18. For the QST-CGAN, this change is as simple as replacing the input measurement operators to the Generator from projections on the coherent state (Husimi Q) to displaced parity operators (Wigner).
Having explored reconstruction performance for lowand high-rank density matrices, we next turn to intermediate rank. In Fig. 18, we consider a random density matrix of rank 11 and show its reconstruction from both types of input data (Husimi Q and Wigner). Here, the difference between using the two types of data becomes clear. With the QST-CGAN method, we obtain a reconstruction fidelity of 0.8 using Husimi Q function and ∼ 0.99 when using the Wigner function. In Fig. 18(h), we can clearly see that details of the Wigner function for the true state in Fig. 18(e) are not captured when we take Husimi Q as our data, even though the reconstructed Husimi Q function in Fig. 18(g) matches perfectly with the data in Fig. 18(d). However, there are big differences in how different ML models perform, within the limitations of the data; there is no silver bullet. In this particular case, we can argue that the Husimi Q represents a convolution over the Wigner function [169] and therefore using it for reconstruction could be ill-posed [168].

Data reduction
In Ref. [1], we show that for a particular pure cat state, the QST-CGAN method requires much fewer data points, ∼ 100, for reconstruction than the iMLE method, which requires more than 10,000. In Ref. [170], it is argued how homodyne tomography can be IC when the number of independent quadratures measured is equal to the dimension N of the density matrix. More specifically, IC requires N quadratures to be measured, each of which can be discretized into 2N − 1 bins. Therefore, a full-rank density matrix of dimension N requires O(N 2 ) measurements in the phase space for IC. However, for low-rank states the data requirements can scale as O(rN ) [171]. These arguments suggest that for states described with density matrices of dimension N = 32, thousands of measurements are required for IC when considering full-rank states. However, for low-rank or rank-1 pure states, the number of data points for IC could  Fig. 12) and fock(n) states, where r ≥ 2 denotes the rank. For r = 1, the state is just cat(α = 2, S = 0, µ = 0). The input data is the Husimi Q function of ρ measured in a 32 × 32 grid. (a, b, c, d) States with ranks 1, 2, 3, and 4, respectively. The solid lines show the mean and the shaded regions show one standard deviation from the mean for the QST-CGAN (red) and iMLE (blue) over 15 reconstructions. In each repetition, we use the same data, but start from a different random initial state for iMLE and random weights for the QST-CGAN. We choose the weight λ L1 = 1 for the QST-CGAN and keep all other training hyperparameters the same as used for previous results and described in Sec. IV B 3. The QST-CGAN runs are stopped when they have converged on a reconstructed state.
be much smaller (∝ 32r). Note that the IC limit does not necessarily specify which measurements are important and give maximum information; the limit also depends on the density-matrix dimensions, which we can set to have different cut-offs for optical quantum states. Our QST-CGAN approach consistently required only ∼ 100 measurements for reconstruction of pure (r = 1) states using a random set of measurement settings.
In this section, we benchmark the QST-CGAN performance further by testing how much data it needs to reconstruct states of higher rank. In general, a density matrix of size N × N with full rank r = N is specified by N 2 − 1 real numbers. The number of parameters that needs to be determined during reconstruction is significanly reduced if the state is pure or if we have some prior information about the state [170]. For example, if we know that the state that we are reconstruct-ing is a thermal state, then even if the density matrix is full rank (r = N ), we only need to estimate a single parameter, the mean photon number n th , to reconstruct the state. Such priors can thus make it easy to reconstruct the state with data from only a few measurements. Similarly, the analyticity of the Husimi Q function makes it possible to apply other reconstruction methods, e.g., Lagrange interpolation [172], which sample from the Husimi Q function and obtain the exact density matrix without requiring any iterations. Similarly, Ref. [173] reconstructed a state description in the Fock basis using Wigner-function-overlap measurements and semi-definite programming, requiring less data.
In Fig. 19, we show how reconstructing a rank-4 state from a random selection of 256 points of the Husimi Q function fails for iMLE, which gets stuck. The convergence of iMLE is not guaranteed since there could be steps which strictly reduce the likelihood, producing cycles where the method does not improve its estimate of the density matrix [113]. The QST-CGAN, on the other hand, reconstructs the state almost perfectly, as shown in Fig. 19(d). However, the QST-CGAN requires more than 1,000 iterations to converge, which is more than what was needed when it reconstructed the same state using all data [see Fig. 16(d)].
In Fig. 20, we show the reconstruction fidelity with QST-CGAN and iMLE for mixed states of rank r = 2, 3, 4 as the number of measurements (data points; values of the Husimi Q function at different β) is reduced. We choose the β values at random inside the circle |β| = 5. We note that there could be better ways to choose points to sample the Husimi Q function, e.g., the socalled Padua points discussed in Ref. [172].
The QST-CGAN clearly outperforms the iMLE in terms of the amount of data needed for reconstruction in Fig. 20. The QST-CGAN reaches fidelity close to unity with somewhat less than 100 data points, around 100 data points, and a little more than 100 data points, for states of rank 2, 3, and 4, respectively.
This slow growth in the number of data points needed appears consistent with previous results showing that reconstruction of low-rank states in the best case can be done with ∝ rN data points [118]. Meanwhile, the iMLE cannot reconstruct ρ even when given a large number of data points. However, this is not just due to the lack of information but also due to the random selection of the data points themselves.
Although the results here do not establish any bounds on the minimum number of data points necessary for the QST-CGAN method to reconstruct a quantum state, they show that the QST-CGAN approach can perform much better than conventional reconstruction methods when data is scarce. An intuitive explanation of this is that since neural networks are universal function approximators, the Generator network might learn to find an approximation for the state in terms of a few parameters, e.g., the mean photon number for a thermal state, and estimate it better. However, the theoretical underpinnings of the QST-CGAN performance for few data points needs to be explored further, which is beyond the scope of this work.

VI. CONCLUSION
We have shown how deep neural networks can assist in the characterization of quantum states. The states we considered here were optical quantum states, including bosonic error correction codes, but our methods are general and should be applicable also to systems with qubits.

A. Classification
We first showed how a neural network with convolutional layers followed by a dense layer can discriminate and classify several different types of quantum states with near-perfect accuracy. The input to the network was measurement data from phase-space descriptions of the states. The rare few misclassifications could be explained by the existence of parameter ranges where states from different classes are extremely similar and the problem of classification thus becomes ill-defined.
We further demonstrated the robustness of this classification method against two prominent noise sources Using Wigner Using Husimi Q

QST-CGAN reconstruction
True state -additive Gaussian noise and single-photon loss. For the former, the network performance remained almost perfect until the standard deviation of the added noise reached as high values as 20 % of the largest input data values. For the latter noise, we showed a specific example where the network could identify a cat state even after it had lost 70 % of its initial photons. By using the Grad-CAM method to extract and visualize which parts of the input phase space that the neural network bases its classification decision on, we proposed a simple adaptive technique for tomography that could significantly reduce the data-collection time for an experiment. Since the neural network learns the characteristic features of the states it is set to classify and can be trained to be robust against simple noise sources in the data, we can deploy it online at the initial stages of an experiment for guided data-collection during tomography.

B. Reconstruction
We next introduced, here and in Ref. [1], a densitymatrix-estimation technique using a combination of ideas from VAEs and GANs: the QST-CGAN method. This method uses custom neural-network layers that convert the output of any standard neural network into valid density matrices using the Cholesky decomposition. Therefore, we can convert any neural-network architecture into a variational map from input data to a density matrix. Following this scheme, we constructed a custom Gener- ator network that maps input data to a density matrix and computes statistics for measurement operators.
By training the Generator network using gradientbased methods, we showed that the density matrix for the underlying state can be easily reconstructed. Instead of using a standard straight-forward loss function that requires an assumption on the likelihood for the data, we used a second Discriminator neural network to help train the Generator. Our choice of this adversarial training framework was motivated by an analysis of how standard loss functions, e.g., L2 or KL-divergence loss, perform for different states and noise in the data. We found that some of these standard loss functions resulted in good performance in the absence of noise, while other loss functions gave better performance in the presence of certain types of noise, but none of the loss functions led to a consistently good performance in a general setting. However, we showed that the QST-CGAN method is flexible and can easily adapt to different noise, states, or measurement settings. We ascribe this flexibility to the ability of the Discriminator to learn a loss function suited to the situation at hand.
We showed that the QST-CGAN-based reconstruction can be up to two orders of magnitude faster than iMLE, counted in the number of iterations required for reconstruction. Although the actual time for each iteration in the QST-CGAN can depend on the design of the neural networks, this presents a significant advantage for data post-processing during tomography. We also note that the neural-network based method seems to be performing non-trivial operations during reconstruction, e.g., apply-ing a quantum operation to almost instantaneously jump from an orthogonal state to the correct state. This suggests that the neural networks learn to represent the state in a way that is well suited for the problem.
Having first benchmarked the reconstruction of pure states with no noise, we next considered how the QST-CGAN method can be augmented further to deal with noise in the data. We leveraged the flexibility of having a loss function that combines the Discriminator loss with a simple L1 loss, since our objective is simply to make the generated data look like the training data. For the case of additive Gaussian noise of up to 5% of the maximum signal value, our QST-CGAN method performs denoising and reconstruction much better than iMLE without needing any change in the architecture or loss function. Gaussian convolution noise corresponding to having a thermal state with mean photon number n th = 5 in a linear detection scheme was also tackled quite easily. The QST-CGAN only required the expected background noise as input which was added as special noise layers to the Generator network.
Lastly, we showed that the QST-CGAN method clearly outperforms iMLE also when reconstructing mixed states. The QST-CGAN proved superior not only in terms of how few iterations it needed to reach high reconstruction fidelity, but also in terms of how little input data it required to reconstruct the state well. For a cat state, the QST-CGAN required almost two orders of magnitude fewer data points than iMLE (as well as an RBM-based reconstruction shown in Ref. [78]) to achieve high reconstruction fidelity. It has been demonstrated . Reconstruction of a mixture of cat(α = 2, S = 0, µ = 0) and fock(n) states (the mixture is constructed using the same formula as in Fig. 16) with a reduced number of measurements. (a, b, c) Reconstruction fidelities for ranks 2, 3, and 4, respectively, for QST-CGAN (red) and iMLE (blue). The solid lines show the mean and the shaded regions show one standard deviation from the mean. The fidelity shown is the one reached after a certain convergence criteria set by a tolerance value. We choose a tolerance such that if the average fidelity in 100 iterations does not change by 10 −5 over 5 steps (i.e., 500 iterations) we stop the reconstruction. that iMLE method can become stuck in cycles for some choices of input data, but our QST-CGAN method works well even with random sets of measurements generating the input data for the examples considered.
In conclusion, by connecting ideas of generative and discriminative modelling to quantum state classification and reconstruction, we have attempted to bridge the gap between deep neural networks and quantum information and computing. We have shown how some of the latest ideas from deep learning can be quite easily adapted and applied to quantum-information tasks with just a few tweaks to incorporate the rules of quantum physics. This opens up a wealth of possible applications, as we discuss further below and in Ref. [1].

VII. OUTLOOK
Our work suggests several new practical possibilities in data analysis of quantum experiments. At the same time, it leads to new questions regarding the limits of using neural networks for quantum state characterization.
It is expected that image recognition algorithms will be good at distinguishing different optical quantum states from their phase-space data. However, the benefit of using neural networks is their resilience to known types of noise. If we have to classify a "cat" state, it is rather easy to see that two lobes and a connecting bridge in the phase space should be a "cat". But what if, due to noise, the phase-space plots are shifted or rotated? An algorithm that relies on the fixed definition of a cat state will see poor overlap between the definition and the data, and hence cannot recognize the cat even if all the features are present. The neural-network method, on the other hand, is implicitly taught the important features that characterize cats and therefore works even in the presence of systematic or random noise.
In the case of reconstruction, we see that the the QST-CGAN method is a very powerful alternative to RBMs. We leverage the universal approximation capabilities of a deep neural network to have a tractable representation of the state by explicitly constructing the full density matrix. Standard loss functions such as fidelity, L1, L2, cross-entropy, etc. will always have some shortcomings, since they require an assumption on the underlying likelihood for the data. Instead, with the CGAN framework, we let the loss metric be implicitly defined with the objective of simply making the data look similar to the generated data. However, we have not explored the theoretical underpinnings of using such a learned loss function for reconstruction. This remains to be analysed.
The future work that leverages these new ideas would go in two directions, beyond the suggestions for improvements and tweaks already mentioned in connection with the results. The first is further theoretical analysis of the techniques. For example, it remains to be well understood how the neural network can reconstruct states using much fewer data points than a standard maximum likelihood method or perform non-trivial operations during a reconstruction. The second direction is validation with more experimental data and comparison to other standard methods for reconstruction. For example, we have not explored thoroughly how the QST-CGAN method compares with RBM-based approaches for tomography. This would be an interesting comparison since much of the work in QST with machine learning is focussed on using RBMs.
Since we ask for the full density matrix during reconstruction, our method cannot directly scale up for very large quantum systems. Even if it is straightforward to replace the density-matrix description with other efficient ansätze, it remains to be answered how to obtain efficient representations such that one does not use the millions of parameters in the deep neural network to estimate a few hundred parameters of the density matrix.
The methods discussed here are ready to be applied to real experiments such that adaptive, online tomography schemes can be designed that can deal with noisy data. The techniques for classification and reconstruction could even be combined: the result of classifying a state with one neural network could be used to as a good starting point and parameterization for the training of another network for full quantum state reconstruction. We foresee that our ideas will lead to better techniques for quantum state characterization and bring the power of deep-learning-based tools to the quantum physicist.