How To Use Neural Networks To Investigate Quantum Many-Body Physics

Over the past few years, machine learning has emerged as a powerful computational tool to tackle complex problems in a broad range of scientific disciplines. In particular, artificial neural networks have been successfully used to mitigate the exponential complexity often encountered in quantum many-body physics, the study of properties of quantum systems built from a large number of interacting particles. In this article, we review some applications of neural networks in condensed matter physics and quantum information, with particular emphasis on hands-on tutorials serving as a quick start for a newcomer to the field. The prerequisites of this tutorial are basic probability theory and calculus, linear algebra, basic notions of neural networks, statistical physics, and quantum mechanics. The reader is introduced to supervised machine learning with convolutional neural networks to learn a phase transition, unsupervised learning with restricted Boltzmann machines to perform quantum tomography, and the variational Monte Carlo method with recurrent neural networks for approximating the ground state of a many-body Hamiltonian. For each algorithm, we briefly review the key ingredients and their corresponding neural-network implementation, and show numerical experiments for a system of interacting Rydberg atoms in two dimensions.


I. INTRODUCTION
"Quantum many-body physics" refers to the mathematical framework to study the collective behavior of large numbers of interacting particles. The emerging cooperative phenomena that result from seemingly simple interactions can produce an astounding variety of phases of matter, such as conventional metals and magnetically ordered states, as well as unanticipated states, including hightemperature superconductivity, strange metals, and spin liquids [1]. In addition to naturally occurring quantum systems, many-body physics studies synthetic quantum matter (e.g., ultracold atoms, superconducting qubits, and trapped ions), which simultaneously reveals new phenomena in highly controlled laboratory settings and advances the development of quantum computers and other quantum information processing devices.
In spite of the simplicity of the physical laws that govern such multiparticle quantum objects, the theoretical and experimental analysis of these systems confronts us with complexities that are ultimately rooted in the dimensionality explosion associated with the exponential scaling of the size of the space where quantum many-body states live. Traditionally, the study of many-body systems is performed with the help of tools designed to circumvent this complexity and to produce a succinct, low-dimensional description that captures the essential aspects of a quantum system. Such descriptions arise from the analysis of data generated by a wide range of theoretical, computational, and experimental devices. These include numerical simulations of model Hamiltonians based on quantum Monte Carlo methods [2-4] or variational algorithms [5-7], but also experimental data, such as the complex electronic-structure images obtained from spectroscopic imaging scanning tunneling microscopy, or measurements of quantum states prepared on a physical quantum computing platform [8-11].
The rapidly growing body of work at the interface of machine learning and many-body physics indicates that machine learning techniques may soon become commonplace in quantum many-body physics research, both in experiments and in numerical simulations. These clear trends call for the development of resources that help researchers familiarize themselves with the wealth of concepts, intuition, algorithms, hardware, software, and research culture entailed by the adoption of machine learning and neural networks in physics research. Here we take a step in this direction and develop a set of hands-on tutorials focused on recent prototypical applications of neural-network technology to problems in statistical physics, condensed matter physics, and quantum computing.

A. Outline
This article is organized as follows. Starting with a preliminary discussion, we introduce in Sec. II A some fundamental concepts in machine learning and neural networks. In Sec. II B we present a concise description of the physical system studied in our numerical experiments, a two-dimensional (2D) array of interacting Rydberg atoms. In Sec. III we discuss our first application, the classification of phases of matter with supervised machine learning of projective measurement data using a convolutional neural network (CNN), and demonstrate it on the quantum phase transition of the Rydberg atoms. In Sec. IV we introduce quantum state tomography (QST), and show how this problem can be phrased as an unsupervised machine learning task. Using the restricted Boltzmann machine (RBM), we show quantum tomography of the Rydberg ground states, as well as of the ground state of a small molecule from qubit measurement data. In Sec. V we present the simulation of the ground state of a many-body Hamiltonian using the variational Monte Carlo (VMC) method with a recurrent-neural-network (RNN) wave function. For each of these applications, we also show the key components of the underlying software, with full code tutorials available in an external repository [131].

II. PRELIMINARIES

A. Machine learning with neural networks
Artificial intelligence is the scientific discipline that deals with the theory and development of computer programs able to perform complex tasks usually associated with human intelligence, such as visual perception, game playing, speech recognition, decision-making, and translation between languages. Before its current widespread adoption, artificial intelligence first saw early success in problems that are relatively straightforward to formalize in an abstract way. The solutions to this breed of problems can typically be described by a list of very precise formal rules that computers process efficiently. As a remarkable example, computers have been beating humans at playing chess since 1997, in part because chess is completely specified by a compact set of formal rules.
Modern machine learning, instead, deals with the challenge of automating the solution of real-world tasks that may be easy for humans to perform but that are hard to describe formally by simple rules. These techniques have spurred a recent revolution in which algorithms trained using data have started to match the ability of humans to recognize objects in an image, decipher speech, or translate text between multiple languages, tasks that are difficult to formalize and articulate through simple rules.
A key element behind these recent developments can be largely traced back to a series of breakthroughs in the development of powerful neural-network models, where data are processed through the sequential combination of multiple nonlinear layers [132]. Such models solve a fundamental problem in learning real-world tasks: the problem of automatically extracting knowledge from raw noisy data rather than relying on hard-coded knowledge directly inscribed in the algorithms by a human. Neural networks automate the construction of sets of increasingly complex representations of the data, which can be understood as the computational disentangling of complex concepts (e.g., an object in a cluttered image) from simpler concepts (e.g., pixel values and basic shapes such as edges). These representations, in turn, lead to solutions of learning tasks with unprecedented success.
For practical purposes, machine learning algorithms can be divided into the categories of supervised, unsupervised, and reinforcement learning, all of which have found applications to quantum many-body systems [13]. While there is no formal difference between some of the algorithms in these categories when expressed in the language of probability [132,133], such a division is often used as a way to specify the details of the algorithms, the training setup, and the structure of the datasets involved.
Supervised learning tasks aim at predicting a target output vector $\mathbf{y}$ associated with an input vector $\mathbf{x}$, both of which can be discrete or continuous. The training data are thus a list of input-output pairs $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{M}$, where "target output" conveys that such a vector corresponds to the ideal output given the input vector [133]. Starting with a training dataset with $M$ entries, the learning algorithm outputs a function $\hat{\mathbf{y}} = f(\mathbf{x})$ that estimates the output values for unseen input vectors $\mathbf{x}$. Examples of supervised learning include classification, where the objective is to assign each input vector to one of a set of discrete categories, and regression, where the output is a vector with continuous entries. Examples of classification and regression include, respectively, the problem of recognizing images of handwritten digits and the problem of determining the orbits of bodies around the Sun from astronomical data.
Unsupervised learning deals with learning tasks where the training data are composed of a set of input vectors without a corresponding target output [133]. These algorithms are typically used to discover hidden structure in datasets. Examples of tasks in unsupervised learning include clustering, where the objective is to discover groups of similar examples within the data; density estimation, where the objective is to estimate the underlying probability distribution associated with the data; and low-dimensional visualization of high-dimensional data, which depicts complex data in two or three dimensions while trying to retain the key spatial characteristics of the original data.
Finally, reinforcement learning develops algorithms dealing with the problem of discovering actions that maximize a numerical reward signal [134]. The learning algorithms are not necessarily directly exposed to examples of optimal actions. Instead, they must discover them by a process similar to a guided trial and error. Reinforcement learning augmented by deep neural networks has successfully learned policies from high-dimensional sensory input for game playing, achieving human-level performance in several challenging games, including Atari 2600 games [135] as well as the board game Go [136]. Likewise, reinforcement learning has been applied to the control of quantum systems [67,137] as well as to the optimization of quantum error correction codes [110,114,115], one key ingredient in the development of fault-tolerant quantum computers. Because of the specific choice of many-body problems we entertain in this article, we do not discuss reinforcement learning techniques.
To follow this tutorial, a basic knowledge of machine learning is recommended. This includes fundamental properties of neural networks, basic Monte Carlo techniques, gradient-based optimization methods, and an elementary understanding of model overfitting and generalization. For a pedagogical and comprehensive introduction to these concepts, we suggest the reviews in Refs. [138-140], specifically phrased in the context of quantum physics.

B. Rydberg atoms in two dimensions
In the following sections, we detail how to repurpose machine learning algorithms based on neural networks to tackle some problems encountered in quantum many-body physics. We focus on a many-body system composed of interacting Rydberg atoms [141,142], which we showcase in the various numerical experiments and hands-on code tutorials. Because of the high degree of control and manipulation that can be achieved in experiments [143], these platforms have proven extremely useful for investigating a broad range of applications, including Ising-like quantum magnetism [9,144-146], quantum dynamics [147-149], spin liquid physics [10], and quantum computing [150,151].
Our choice of physical system is motivated by several reasons. First, quantum hardware based on neutral atoms provides a highly programmable platform for analog quantum simulations, and it is being increasingly explored for quantum information processing. Second, these systems are engineered in such a way that each atom can be found only in either the atomic ground state or a highly excited (Rydberg) state. Because the system is effectively described by a qubit wave function, the techniques involved are relatively simple. Third, the Rydberg-atom platform provides access to large volumes of data in the form of accurate projective measurements, which open the door to studying the physical properties of the system through machine learning algorithms. Finally, as explained below, the Rydberg Hamiltonian has the very special property that its ground-state wave function is real and positive in the atomic occupation number basis. Under this assumption, the definition of a neural-network wave function is also simplified. We will, however, describe the more generic case of a wave function with a sign structure in Sec. IV C.
We consider a square array with linear dimension $L$ containing $N = L^2$ atoms. Each atom is described by a local Hilbert space spanned by the states $\{|g\rangle, |e\rangle\}$, referring respectively to the atomic ground state and the highly excited Rydberg state. The atoms are subjected to a uniform laser drive with Rabi frequency $\Omega$ and detuning $\delta$, and they interact with one another via a van der Waals potential decaying as $r^{-6}$. The resulting many-body Hamiltonian is

$$\hat{H} = -\Omega \sum_{\mathbf{r}} \hat{S}^x(\mathbf{r}) - \delta \sum_{\mathbf{r}} \hat{n}(\mathbf{r}) + \sum_{\mathbf{r} < \mathbf{r}'} V(\mathbf{r} - \mathbf{r}')\, \hat{n}(\mathbf{r})\, \hat{n}(\mathbf{r}'), \quad (1)$$

where $\hat{n}(\mathbf{r}) = |e\rangle\langle e|_{\mathbf{r}}$ is the projector onto the Rydberg state at position $\mathbf{r}$, $\hat{S}^x(\mathbf{r}) = \frac{1}{2}\hat{\sigma}^x(\mathbf{r})$ are spin-$\frac{1}{2}$ operators, and $V(\mathbf{r} - \mathbf{r}') = V_0 / \|\mathbf{r} - \mathbf{r}'\|^6$ is the van der Waals potential between atoms at positions $\mathbf{r}$ and $\mathbf{r}'$. In the following, we assume $\Omega = 1$ MHz.
The phase diagram for the ground state of the Rydberg Hamiltonian is dictated by the mechanism of Rydberg blockade, a constraint that prevents two atoms at sufficiently short distances from being simultaneously excited to the Rydberg state. We can characterize the phase diagram in terms of the detuning $\delta$ and the interaction strength $V_0$. On the square lattice, several different orders have been detected by numerical simulations [152]. Here we specifically focus on the $\mathbb{Z}_2$ transition between a disordered phase at large and negative detuning, where all atoms are found in the ground state, and an ordered phase at large and positive detuning, where the system is found in one of the two symmetry-broken Néel states characterized by a checkerboard pattern in the atomic occupation number [Fig. 1(a)].
We perform numerical simulations of the ground state of Hamiltonian (1) using the density matrix renormalization group (DMRG) [5,6,153], implemented using the software package ITensor [154]. We adopt a matrix product state (MPS) variational wave function $|\Psi\rangle$ with a snakelike geometry, as shown in Fig. 1(b). We fix the interaction strength to $V_0 = 3$ MHz, and retain up to third-nearest-neighbor interactions. For several values of the detuning $\delta \in [-5, 5]$ MHz, we run DMRG to find an approximation of the ground state, using a singular-value-decomposition cutoff of $10^{-10}$ and a target energy accuracy of $10^{-5}$. To certify convergence to the ground state, each run is repeated for different initializations of the starting MPS.
We show the results of the simulations for an $8 \times 8$ array with open boundary conditions in Figs. 1(c)-1(f). We plot, as a function of the detuning, the ground-state energy per site $E_0/N = \langle \Psi_0 | \hat{H} | \Psi_0 \rangle / N$ and the staggered magnetization $\mathcal{M} = N^{-1} \sum_{\mathbf{r}} (-1)^{x+y} \langle \hat{S}^z(\mathbf{r}) \rangle$, which can be used to detect Néel order. Whenever all atoms are in the ground state, $\mathcal{M} \approx 0$, while for an ordered state with a checkerboard pattern one has $\mathcal{M} \approx 0.5$. We also show the average occupation number in momentum space $\langle \hat{n}(\mathbf{k}) \rangle$, where $\hat{n}(\mathbf{r}) = \frac{1}{2}\big(1 - 2\hat{S}^z(\mathbf{r})\big)$. We observe a peak at $\mathbf{k} = (\pi, \pi)$ for large detuning $\delta = 4$ MHz [Fig. 1(e)], and a featureless state at negative detuning (not shown).
The two phases of the Rydberg atoms are separated by a second-order quantum phase transition at a critical point $\delta_c$. We can extract an approximation of $\delta_c$ by measuring the energy gap $\Delta = |E_0 - E_1|$ between the ground state and the first excited state $|\Psi_1\rangle$. We compute $E_1$ by running DMRG on the Hamiltonian $\hat{H}' = \hat{H} + \omega |\Psi_0\rangle\langle\Psi_0|$, where $\omega$ is an energy penalty. From the energy gap curve, we estimate the detuning where $\Delta \approx 0$ to be $\delta_c \approx 1.3$ MHz. This approximate value is sufficient for the purpose of this article, although a more systematic scaling study with appropriate boundary conditions (to minimize finite-size effects) should be performed to accurately determine the critical point and critical exponents of the transition.
Once we have solved for the ground states of the Rydberg Hamiltonian, the corresponding MPSs can be used to generate data to train the neural networks for the different applications. In this case, the data consist of projective measurements in the atomic occupation number basis $|\sigma\rangle = |\sigma_1, \ldots, \sigma_N\rangle$, where $\sigma_j = 0$ and $\sigma_j = 1$ refer, respectively, to the $j$th atom being in the ground state and in the Rydberg state. Given a wave function $|\Psi\rangle$, the probability of observing an atomic pattern $\sigma$ following a measurement is given simply by the Born rule $P(\sigma) = |\langle \sigma | \Psi \rangle|^2$. Because of the intrinsic one-dimensional geometry of a MPS, it is possible to efficiently sample the probability distribution $P(\sigma)$ by using the chain rule of probabilities. Moreover, the sampling is exact in the sense that the samples are completely independent of one another [155].

III. LEARNING A QUANTUM PHASE TRANSITION
An important task in condensed matter physics and statistical physics is to characterize different phases of matter and the associated phase transitions between them. Typically, phases of matter are described in terms of simple real-space patterns and their associated order parameters, which are theoretically understood using the Landau symmetry-breaking paradigm [1]. While a wide array of theoretical and experimental tools to study interacting quantum systems have been constructed in relation to these patterns, there is an increasing set of states of matter whose theoretical and experimental understanding eludes the Landau symmetry-breaking paradigm. The characterization of these phases may rely on, for example, out-of-equilibrium properties of the system, as in the many-body localized phase [156,157], or on topological invariants, as in topological phases and spin liquids [1,158,159].
Machine learning provides an alternative route to the characterization of phases of matter and their associated phase transitions in a semiautomated fashion without the direct use of manually specified real-space patterns and/or other signatures, provided that a sufficiently large training set is available. In its simplest form [15], given the existence of a classical or quantum phase transition between two phases in a physical system, one can use supervised learning to attempt to classify experimental or numerical snapshots of the phases of matter separated by the transition. This task can be achieved using most classification algorithms, for example, those based on a neural network or a support vector machine [133], trained on snapshots of two phases of matter labeled according to the corresponding phase from which the snapshot originated.
The architectural choice of the classifier is flexible, but a natural strategy is to take into account the structural properties of the problem such as the symmetries of the physical system and its spatial dimensionality. These choices are important since they impact the computational complexity of the learning algorithms, where encoding the knowledge of the physical problem typically increases the overall complexity of the algorithm. For instance, we might be interested in classifying numerically generated snapshots of a large two-dimensional array of spins governed by an Ising model in two different phases, for which a natural choice is a 2D CNN [132], which accounts for the 2D structure and locality of the datasets.
Although here we explore only the simplest supervised learning strategy [15], we stress that machine learning approaches to studying phases and phase transitions have been significantly expanded and they do not require precise knowledge of the location of the critical point [16,160], can be fully automatized, and can discover ordered phases [15,18,161], topological phases [16,162,163], and phases such as the many-body localized phase, which is characterized by its dynamical properties [24,164,165].
The nature of the snapshots used to train the learning algorithms is vastly flexible, and hence these strategies are of wide applicability. The snapshots can include numerically generated configurations visited during a classical or quantum Monte Carlo simulation of the physical system [15,16,18-20,26,166], entanglement spectra [16,24], correlation matrices [167,168], tensors in a MPS [168], numerically generated projective measurements [169], high-resolution real-space snapshots of complex many-body systems obtained with quantum gas microscopes for ultracold atoms [29,170], single-shot experimental momentum-space density images of ultracold quantum gases [30], and spectroscopic imaging scanning tunneling microscopy data [28].

A. Convolutional neural networks and their training
Below we explore learning a quantum phase transition in a 2D array of interacting Rydberg atoms using projective measurements. Because of the 2D arrangement of the atoms, each measurement, obtained here from a numerical simulation, can be interpreted as an "image" whose pixels depict the state of each atom after the measurement. We use a convolutional classifier, an architecture known to be extremely effective at image classification [132]. It readily takes advantage of the structure of the classification task, since it exploits the 2D spatial arrangement of the Rydberg atoms and their locality, as well as the approximate translation invariance of the system.
CNNs use a mathematical operation called "convolution" to process information for data that have a natural gridlike topology [132]. A 2D convolutional layer implements the operation

$$h^{(q)}_{i,j,k} = F\left(\sum_{l, m_y, m_x} K^{(q)}_{i,l,m_y,m_x}\, h^{(q-1)}_{l,\, j+m_y,\, k+m_x}\right),$$

where the kernel $K^{(q)}_{i,l,m_y,m_x}$ at layer $q$ specifies the connection strength between a unit in channel $i$ of the output and a unit in channel $l$ of the input, with spatial offsets of $m_y$ rows (labeled $y$ direction) and $m_x$ columns (labeled $x$ direction) between the output and the input variables. The dimensions of the array $K^{(q)}_{i,l,m_y,m_x}$ are $I_{\text{out}}$, $L_{\text{input}}$, $M_y$, and $M_x$, which correspond to the number of output channels, the number of input channels, the dimension of the filter in the vertical direction, and the dimension of the filter in the horizontal direction, respectively. The activation at layer $q$ consists of elements $h^{(q)}_{l,j,k}$, where $j$ and $k$ label the vertical and horizontal directions, respectively, and $l$ specifies the channel; $q = 0$ corresponds to the raw projective measurement data. Finally, the nonlinear function $F(x)$, which in our examples is typically a rectified linear unit (ReLU) $F(x) = \max(0, x)$, is applied element-wise to each of the components of its input. A convolutional neural network equipped with two convolutional layers is schematically shown in Fig. 2.

FIG. 2. Architecture of the convolutional classifier. The input units $h^{(0)}_{l,j,k}$ correspond to the outcome of a projective measurement on the Rydberg system. The first operation is a convolutional layer with $M_y \times M_x = 3 \times 3$ kernels with $I_{\text{out}} = 32$ output channels and $L_{\text{input}} = 1$. This kernel is convolved with an input configuration with $N = 8 \times 8$ Rydberg atoms. Likewise, the second operation corresponds to a convolutional layer with $M_y \times M_x = 3 \times 3$ kernels with $I_{\text{out}} = 32$ output channels and $L_{\text{input}} = 32$. The output of the second convolutional layer is flattened and fed to a FC layer with a ReLU activation, followed by another FC layer with a softmax activation, which produces the prediction outcome.
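To make the indexing explicit, the following is a minimal NumPy sketch of the convolutional layer defined above. It is written with explicit loops for readability rather than performance, and the array names and test shapes are our own illustrative choices rather than anything taken from the tutorial code.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def conv2d_layer(h_prev, K):
    """Valid 2D convolution implementing h^(q) from h^(q-1).
    h_prev: input activations, shape (L_input, H, W)
    K:      kernel, shape (I_out, L_input, M_y, M_x)
    returns: output activations, shape (I_out, H - M_y + 1, W - M_x + 1)
    """
    I_out, L_input, M_y, M_x = K.shape
    _, H, W = h_prev.shape
    h = np.zeros((I_out, H - M_y + 1, W - M_x + 1))
    for i in range(I_out):                 # output channel
        for j in range(H - M_y + 1):       # vertical position
            for k in range(W - M_x + 1):   # horizontal position
                h[i, j, k] = np.sum(K[i] * h_prev[:, j:j + M_y, k:k + M_x])
    return relu(h)

# One 8x8 projective measurement (single input channel) and 32 filters of size 3x3:
sigma = np.random.randint(0, 2, size=(1, 8, 8)).astype(float)
K = 0.1 * np.random.randn(32, 1, 3, 3)
print(conv2d_layer(sigma, K).shape)  # (32, 6, 6)
```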
Following the convolutional layers, a CNN typically processes information using sets of fully connected (FC) layers that implement a matrix-vector operation followed by a nonlinearity $F$,

$$h^{(q)}_{i} = F\left(\sum_{l=1}^{L_{\text{FC}}} K^{(q)}_{i,l}\, h^{(q-1)}_{l}\right),$$

where the trainable parameters of the FC layer are the kernel $K^{(q)}_{i,l}$. To feed the output of a convolutional layer $h^{(q-1)}_{l,j,k}$ to a FC layer, the array is reshaped or "flattened" to $h^{(q-1)}_{l}$ so that all the original components are packed into a one-dimensional array with dimension $L_{\text{FC}}$. The last two layers of the CNN in Fig. 2 correspond to two fully connected layers with a ReLU and a softmax nonlinearity, respectively. The softmax function $S$ is given by

$$S(v_i) = \frac{e^{v_i}}{\sum_{j} e^{v_j}},$$

where $v_i$ are the components of a vector $\mathbf{v}$ and the exponential function acts element-wise on the components of the vector. The input to the CNN and its trainable parameters are real, so the outcome of the softmax layer can be interpreted as a probability distribution since $0 \le S(v_i) \le 1$ and $\sum_i S(v_i) = 1$. Finally, we interpret our CNN as a model for the conditional probability of assigning a phase of matter $y = 0, 1$ to a projective measurement outcome $\sigma = h^{(0)}$, i.e.,

$$P_{\theta}(y\,|\,\sigma) = S\left(\sum_{l} K^{(Q)}_{y,l}\, \mathrm{flatten}\big(h^{(Q-1)}\big)_{l}\right), \quad (5)$$

where $\mathrm{flatten}(\cdot)_l$ is the $l$th component of the vector that arises from reshaping the incoming argument of the function to a one-dimensional array, $Q$ labels the final layer, and $\theta$ denotes the full set of trainable parameters of the CNN.
To estimate the parameters of the CNN we use the maximum likelihood principle, where the parameters of a statistical model are selected by assigning high probability to the observed data. For a dataset with observations $\{\sigma_n, y_n\}_{n=1}^{M}$, where $y_n = 0, 1$ labels the phase of matter from which the projective measurement $\sigma_n$ was taken, the likelihood assigned by the model to the dataset can be written as

$$P_{\theta}(\mathbf{y}\,|\,\sigma_1, \ldots, \sigma_M) = \prod_{n=1}^{M} P_{\theta}(y_n\,|\,\sigma_n), \quad (6)$$

where $\mathbf{y} = (y_1, \ldots, y_M)$. Instead of attempting to maximize the likelihood, it is convenient to define a loss function by taking the negative logarithm of the likelihood, which gives the cross-entropy

$$E(\theta) = -\sum_{n=1}^{M} \ln P_{\theta}(y_n\,|\,\sigma_n). \quad (7)$$

To train the model, we minimize $E(\theta)$ using gradient descent techniques [133]. While it is possible to evaluate the gradients of $E(\theta)$ with respect to the parameters $\theta$ of the CNN analytically using the chain rule, a more convenient and less error-prone approach is to use automatic differentiation (AD), a set of techniques to numerically evaluate the derivative of a function specified by a computer program. A complete survey detailing AD can be found in Ref. [171].
In addition, instead of using the entire dataset in the calculation of $E(\theta)$ and its gradients, we use smaller batches of data of size $M_{\text{batch}} < M$, which means that the gradients used during optimization become stochastic, since they fluctuate from batch to batch. In the examples below we use $M_{\text{batch}} = 32$. The gradient update rule used in our examples is a modified version of the usual gradient descent method called "Adam" [172].

Code walk-through
We demonstrate supervised learning with a CNN to learn the quantum phase transition in the Rydberg array using the machine learning software library TensorFlow [173]. We first generate training data by sampling the MPS wave functions obtained from DMRG at different detunings δ. These data are then divided into a training set (used to update the neural-network parameters) and a test set (used to validate the performance of the model). Each dataset consists of a list of atomic occupation patterns σ and their "phase label" y.
We begin by importing the required functionalities and loading the data. Since we are using a CNN with a two-dimensional geometry, the atomic configurations in the training and test datasets need to be appropriately reshaped from the one-dimensional MPS structure.
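A sketch of this setup is shown below; the file names and array layouts are hypothetical stand-ins for the data produced by the DMRG sampling step.

```python
import numpy as np
import tensorflow as tf

L = 8                                    # linear size of the Rydberg array

# Hypothetical file names; each configuration is a flat array of N = L*L 0/1 entries.
x_train = np.load("configs_train.npy")   # shape (M, L*L)
y_train = np.load("labels_train.npy")    # shape (M,), 0 = disordered, 1 = ordered
x_test = np.load("configs_test.npy")
y_test = np.load("labels_test.npy")

# Reshape the flat MPS ordering into a 2D single-channel "image" for the CNN.
x_train = x_train.reshape(-1, L, L, 1).astype("float32")
x_test = x_test.reshape(-1, L, L, 1).astype("float32")
```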
Next we define the neural-network architecture by combining layers predefined in TensorFlow. After initialization, we proceed to implement the model of Eq. (5). First, the raw input data are processed by two stacked convolutional layers, each one with 3 × 3 filters and 32 channels using rectified linear units. The output of the second CNN layer is fed to two stacked fully connected layers with 64 hidden units and two output units. These last units correspond to the model output for the disordered and ordered phases. This set of hyperparameters is chosen using a scaling experiment until convergence (see Sec. VI).
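A minimal Keras definition consistent with this description is sketched below (continuing from the previous snippet); the layer hyperparameters are the ones quoted above.

```python
# Two 3x3 convolutions with 32 channels and ReLU activations, followed by a
# 64-unit dense layer and a 2-unit softmax output (disordered/ordered).
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(L, L, 1)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
```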

Once the architecture is defined, the model can be compiled by adding the cost function [i.e., the cross-entropy in Eq. (7)] and the optimizer used to update the model parameters, in this case the Adam optimizer [172]. Internally, TensorFlow builds a computational graph containing each operation executed from the input state to the final output of the architecture. The gradients of the cost function with respect to each network parameter are then evaluated using AD. At compilation time, we can also add a metric to be monitored during training, in this case the classification accuracy (i.e., the fraction of the test set samples that are classified correctly).
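In Keras this step reduces to a single call; sparse_categorical_crossentropy implements the cross-entropy of Eq. (7) for integer labels y = 0, 1.

```python
model.compile(optimizer="adam",                        # Adam update rule [172]
              loss="sparse_categorical_crossentropy",  # cross-entropy of Eq. (7)
              metrics=["accuracy"])                    # monitored during training
```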
The model is now ready to be trained using the Rydberg data for a set number of epochs, which is the number of passes of the entire training dataset the machine learning algorithm has completed. After training is complete, we can evaluate the model on the held-out test data to quantify its accuracy.
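A sketch of the corresponding calls, with an illustrative epoch count:

```python
model.fit(x_train, y_train, batch_size=32, epochs=10)  # M_batch = 32
test_loss, test_accuracy = model.evaluate(x_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```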
We show the results of the training in Fig. 3, evaluated on various test sets at different detuning values $\delta$. We plot the average output signal of the two output neurons in the topmost dense layer, i.e., an estimate of $f(y, \delta) = \sum_{\sigma} |\Psi(\sigma)|^2 P_{\theta}(y\,|\,\sigma)$. When the detuning is large and negative, the disordered neuron saturates to 1 and the ordered neuron is nearly 0, while the signals reverse at large and positive detuning. We can use the crossing point between the two curves to detect the critical point. We also show the accuracy on the test set, which exhibits a dip near the critical point. This is where the neural network is most uncertain about assigning a phase label to any given atomic configuration.
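For reference, the average output signal at a fixed detuning can be estimated by averaging the softmax outputs over a test set drawn at that detuning; configs_delta below is a hypothetical array of configurations sampled at a single value of $\delta$.

```python
probs = model.predict(configs_delta)          # shape (M, 2), softmax outputs
f_disordered, f_ordered = probs.mean(axis=0)  # estimate of f(y, delta)
```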

IV. QUANTUM STATE TOMOGRAPHY
Quantum characterization, verification, and validation is a framework for algorithms and routines used to assess the quality and the performance of experimental quantum hardware and characterize its components [174]. The workflow underlying these algorithms is inherently data driven: appropriate measurement data are first collected from the quantum device under examination, and are then processed by an algorithm running on a classical computer. Depending on the degree of complexity of the algorithm, different amounts of information can be gained. This could be a single figure of merit, such as the average error rate for a set of quantum gates [175-177], or the fidelity (or a proxy thereof) between a quantum state prepared by the hardware and a desired reference state [178-180]. One may also be interested in retrieving the full quantum state generated by a device. This procedure, the reconstruction of an unknown quantum state from measurement data, is called "quantum state tomography" [181-188].
There are two assumptions in QST: the ability to prepare many identical copies of the quantum state of interest and the ability to repeatedly perform measurements on it. The set of measurements is, in general, described by a positive-operator-valued measure (POVM) $\mathcal{M} = \{\hat{\Pi}_k\}$ [189]. The set $\mathcal{M}$ is said to be informationally complete (IC) if its elements span the full space of operators acting on the Hilbert space. Von Neumann measurements in the Pauli bases are an example of informationally (over)complete measurements, where for a single qubit $\mathcal{M}$ contains the six rank-1 projectors onto the eigenstates of the Pauli matrices. For an IC set, any quantum state $\rho$ can be uniquely identified by the probabilities of the measurements in $\mathcal{M}$, as specified by the Born rule $p(k) = \mathrm{Tr}[\hat{\Pi}_k\, \rho]$. The simplest method to perform QST is linear inversion, which reconstructs the quantum state by simply inverting the Born rule,

$$\vec{\rho} = A^{-1}\, \hat{\mathbf{p}},$$

where $\vec{\rho}$ is the vectorized density matrix, each row of the matrix $A$ is a vectorized POVM element $\langle\langle \hat{\Pi}_k|$, and $\hat{\mathbf{p}}$ is the vector of empirical measurement probabilities. One issue with linear inversion is that the reconstructed density operator $\rho$ is not necessarily positive, although negative eigenvalues can be appropriately removed to produce the positive state that is closest to the output of linear inversion [187]. A more powerful, but also more computationally intensive, approach is maximum likelihood estimation [182-184], where the state $\rho$ is found by minimizing the likelihood function for the observed data under the constraint $\rho \ge 0$. Traditional QST algorithms based on linear inversion or maximum likelihood suffer from a complexity that scales exponentially with the number of qubits or particles involved. This exponential scaling stems from two sources. First, the representation of the quantum state $\rho$ is inevitably exponential in the system size. Second, the sample complexity, that is, the number of measurements that need to be collected in an experiment to achieve a faithful reconstruction of the quantum state, also grows exponentially: typically, statistics from an IC set are required to get a good fit, and the size of such a set scales exponentially with the number of qubits, with provable upper and lower bounds [190,191]. For these reasons, traditional QST has remained limited to quantum systems containing only a small number of particles [192].
Several algorithms to overcome this severe complexity have been proposed over the last decade. Notable examples include compressed sensing tomography [193-196], permutationally invariant tomography [197,198], and tensor-network tomography [199-202], which rely, respectively, on the sparsity, translational invariance, and low entanglement of the target quantum state. More recently, a new framework built on neural networks and unsupervised learning was proposed [71]. It is based on the assumption that most physical states of interest typically contain some degree of structure (i.e., correlations, symmetries, etc.), in the sense that they can be described using a reduced number of parameters (much smaller than the dimension of the Hilbert space). The general idea is to leverage the capability of unsupervised machine learning to autonomously identify such structure in raw data, and to compress it into a neural-network representation of the quantum state.
In what follows, we focus on pure quantum states, and discuss the extension to mixed states at the end of this section. We consider a system of $N$ qubits (or any other two-level system) described by a wave function $|\Psi\rangle$ with amplitudes $\Psi(\sigma) = \langle \sigma | \Psi \rangle$ in an appropriate reference basis $|\sigma\rangle = |\sigma_1, \ldots, \sigma_N\rangle$ ($\sigma_j \in \{0, 1\}$). To circumvent the scalability issue of standard QST, we adopt a compact representation of the wave function expressed in terms of a neural network [50]. The resulting neural-network wave function is simply a highly nonlinear parametric function $\psi_{\theta}(\sigma)$ of the basis states, where $\theta$ is a set of parameters (e.g., weights and biases). Several types of neural networks have been successfully implemented to perform QST, including feed-forward neural networks [76,78], variational autoencoders [72], generative adversarial networks [82], recurrent neural networks [75,89], and transformers [84]. Here we examine the restricted Boltzmann machine because of its simplicity, high expressive power, and natural interpretation in describing correlated multivariate probability distributions.

A. The restricted Boltzmann machine
The restricted Boltzmann machine is an energy-based model introduced in the early 1980s for generative modeling [203,204], and is built on a connection between cognitive science and statistical mechanics [205-207]. The RBM features two layers of stochastic binary units: a visible layer $\sigma = (\sigma_1, \sigma_2, \ldots)$ and a hidden layer $h = (h_1, h_2, \ldots)$, containing, respectively, $N$ and $n_h$ neurons (or units). The two layers of the RBM are fully connected by a symmetric weight matrix $W$, with no intralayer connections (hence its "restricted" nature). The visible units are used to represent the variables relevant to the specific problem at hand, such as the pixel values in an image (or the computational basis states for qubits). The size of the hidden layer is a natural control parameter for the representational power of the model. Since RBMs are universal function approximators [208], they can capture any discrete distribution provided the number of hidden units is sufficiently large (possibly exponential in the number of visible units).
The RBM associates with each configuration of the visible and hidden layers $(\sigma, h)$ the energy

$$E_{\theta}(\sigma, h) = -\sum_{ij} W_{ij}\, \sigma_i h_j - \sum_i b_i \sigma_i - \sum_j c_j h_j,$$

where $\theta = (W, \mathbf{b}, \mathbf{c})$ is the set of parameters, and we have also introduced biases $\mathbf{b}$ and $\mathbf{c}$ for the visible and hidden units, respectively. Given this energy functional, the (stochastic) RBM units are distributed according to the Boltzmann distribution at inverse temperature $\beta = 1$,

$$p_{\theta}(\sigma, h) = \frac{e^{-E_{\theta}(\sigma, h)}}{Z_{\theta}},$$

where the partition function is $Z_{\theta} = \sum_{\sigma, h} e^{-E_{\theta}(\sigma, h)}$. Importantly, because the network architecture is restricted, we can trace out the hidden layer explicitly, obtaining the marginal probability distribution over the visible space,

$$p_{\theta}(\sigma) = \frac{1}{Z_{\theta}} \sum_{h} e^{-E_{\theta}(\sigma, h)} = \frac{e^{-\mathcal{E}_{\theta}(\sigma)}}{Z_{\theta}}, \quad (12)$$

where we have defined an "effective energy"

$$\mathcal{E}_{\theta}(\sigma) = -\sum_i b_i \sigma_i - \sum_j \ln\left(1 + e^{\, c_j + \sum_i W_{ij} \sigma_i}\right).$$

The main purpose of the RBM is generative modeling, which is the task of learning a representation of an unknown probability distribution from data, allowing the neural network to produce new data points. In other words, the RBM training attempts to discover low-dimensional features in the data to allow generalization beyond the finite-size dataset. Let us consider a dataset $D = \{\sigma_k\}$ with underlying (unknown) probability distribution $q(\sigma)$. The RBM can be trained using unsupervised learning to minimize the distance between the two distributions. Such a distance measure is typically expressed in terms of the Kullback-Leibler (KL) divergence [209],

$$D_{\mathrm{KL}}(q \,\|\, p_{\theta}) = \sum_{\sigma} q(\sigma) \ln \frac{q(\sigma)}{p_{\theta}(\sigma)},$$

with $D_{\mathrm{KL}}(q \,\|\, p_{\theta}) \ge 0$ for all $q, p_{\theta}$, and $D_{\mathrm{KL}}(q \,\|\, p_{\theta}) = 0$ if and only if $p_{\theta} = q$. The exponentially large sum over the full configuration space is approximated using the available data, leading to the cost function

$$C(\theta) = -\frac{1}{|D|} \sum_{\sigma \in D} \ln p_{\theta}(\sigma),$$

where $|D|$ is the size of the dataset. Up to a constant dataset entropy term $H_D$, the KL divergence thus reduces simply to the negative logarithm of the likelihood function $L(D \,|\, p_{\theta})$.
The RBM can be trained using one of the many flavors of gradient descent. It is possible to show that the gradients of the cost function are given by

$$\nabla_{\theta} C(\theta) = \big\langle \nabla_{\theta} \mathcal{E}_{\theta}(\sigma) \big\rangle_{D} - \big\langle \nabla_{\theta} \mathcal{E}_{\theta}(\sigma) \big\rangle_{p_{\theta}},$$

where the gradients $\nabla_{\theta} \mathcal{E}_{\theta}(\sigma)$ can be computed exactly for any sample $\sigma$. We see that the gradients $\nabla_{\theta} C(\theta)$ have two components. The average over the data points, $\langle \nabla_{\theta} \mathcal{E}_{\theta}(\sigma) \rangle_{D}$, is fast to compute. In contrast, the average over the model distribution requires the partition function, whose calculation is, in general, intractable. However, this expectation value can be approximated using the Monte Carlo method by drawing $N_S$ samples $\{\sigma_i\} \sim p_{\theta}(\sigma)$ from the model distribution,

$$\big\langle \nabla_{\theta} \mathcal{E}_{\theta}(\sigma) \big\rangle_{p_{\theta}} \approx \frac{1}{N_S} \sum_{i=1}^{N_S} \nabla_{\theta} \mathcal{E}_{\theta}(\sigma_i).$$

This is the most computationally intensive step of the training, and depending on the specific distribution to be learned, advanced Monte Carlo algorithms may be required to collect sufficiently uncorrelated samples.
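The following self-contained NumPy sketch implements one such gradient step, approximating the model average with a short block Gibbs chain started from the data batch (a contrastive-divergence-style shortcut); the sizes and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
N, n_h = 64, 64                    # visible and hidden units
W = 0.01 * rng.standard_normal((N, n_h))
b = np.zeros(N)                    # visible biases
c = np.zeros(n_h)                  # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_effective_energy(sigma):
    """Batch-averaged gradients of the effective energy for sigma of shape (M, N)."""
    p_h = sigmoid(sigma @ W + c)                          # P(h_j = 1 | sigma)
    dW = -np.einsum("mi,mj->ij", sigma, p_h) / len(sigma)
    db = -sigma.mean(axis=0)
    dc = -p_h.mean(axis=0)
    return dW, db, dc

def gibbs_sample(sigma, n_steps=10):
    """Block Gibbs sampling sigma -> h -> sigma to draw approximate model samples."""
    for _ in range(n_steps):
        h = (rng.random((len(sigma), n_h)) < sigmoid(sigma @ W + c)).astype(float)
        sigma = (rng.random((len(sigma), N)) < sigmoid(h @ W.T + b)).astype(float)
    return sigma

def gradient_step(data_batch, lr=0.01):
    model_batch = gibbs_sample(data_batch.copy())   # chains initialized at the data
    g_data = grad_effective_energy(data_batch)
    g_model = grad_effective_energy(model_batch)
    for theta, gd, gm in zip((W, b, c), g_data, g_model):
        theta -= lr * (gd - gm)   # grad C = <grad E>_data - <grad E>_model
```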

B. Reconstruction of Rydberg atoms
Now that we have introduced the main features of the RBM and its training, we are ready to explore its use for QST. As a first application, we examine the reconstruction of Rydberg-atom wave functions. An important property of the ground-state wave function $|\Psi\rangle$ of the Rydberg Hamiltonian (1) is that it is positive in the occupation number basis, $\Psi(\sigma) \ge 0$ (where $\sigma_j = 0$ and $\sigma_j = 1$ refer to the ground and excited states). This property follows directly from the representation of the Hamiltonian in this basis, in which all of its off-diagonal elements can be gauged to be negative (i.e., the Hamiltonian is stoquastic in this basis [210]).
The positivity of the target state implies that we may parametrize the neural-network wave function simply as $\psi_{\theta}(\sigma) = \sqrt{p_{\theta}(\sigma)}$ for any normalized probability distribution $p_{\theta}(\sigma)$. Here we choose the RBM probability distribution [Eq. (12)], where the visible units correspond to the atomic occupations. Moreover, because the wave function is positive, measurement data from a single basis are sufficient to characterize the state. This means that the QST problem, under these assumptions, is equivalent to unsupervised learning of projective measurement data in the atomic occupation number basis. The tomographic reconstruction of the quantum state is done by iteratively changing the RBM parameters to minimize the cost function

$$C(\theta) = -\frac{1}{|D|} \sum_{\sigma \in D} \ln p_{\theta}(\sigma).$$

When convergence in the training is reached, the trained RBM can be used to estimate various properties of interest. For a generic observable $\hat{O}$, the expectation value on the RBM wave function reduces to an average over the RBM distribution,

$$\langle \hat{O} \rangle = \langle \psi_{\theta} | \hat{O} | \psi_{\theta} \rangle = \sum_{\sigma} p_{\theta}(\sigma)\, O_{\mathrm{loc}}(\sigma), \quad (21)$$

where we have introduced the so-called local observable

$$O_{\mathrm{loc}}(\sigma) = \sum_{\sigma'} \frac{\psi_{\theta}(\sigma')}{\psi_{\theta}(\sigma)}\, \langle \sigma | \hat{O} | \sigma' \rangle.$$

The expectation value in Eq. (21) can then be approximated with a Markov chain using Monte Carlo sampling, similarly to the evaluation of the gradients. The evaluation of the local observable $O_{\mathrm{loc}}(\sigma)$ remains efficient as long as the matrix representation of $\hat{O}$ in the reference basis is sufficiently sparse. This measurement procedure is also useful for monitoring different observables during training, such as average densities and correlation functions. These metrics can be used to assess convergence, since in general the calculation of the Kullback-Leibler divergence is intractable, as it also requires estimation of the partition function $Z_{\theta}$.
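As an illustration, the sketch below estimates the transverse magnetization $\langle \hat{S}^x \rangle = \frac{1}{2N}\sum_j \langle \hat{\sigma}^x_j \rangle$ with the local-observable formula above; it reuses the W, b, c arrays and the gibbs_sample function from the previous sketch, and the number of samples is an arbitrary choice.

```python
def effective_energy(sigma):
    """Effective energy of the RBM marginal, using softplus = logaddexp(0, x)."""
    return -sigma @ b - np.sum(np.logaddexp(0.0, sigma @ W + c), axis=-1)

def sx_local(sigma):
    """Local estimator (1/2N) sum_j psi(sigma^(j)) / psi(sigma), where sigma^(j)
    flips spin j and psi'/psi = exp(-(E' - E)/2) for psi = sqrt(p_theta)."""
    E0 = effective_energy(sigma)
    total = 0.0
    for j in range(N):
        flipped = sigma.copy()
        flipped[j] = 1.0 - flipped[j]
        total += 0.5 * np.exp(-(effective_energy(flipped) - E0) / 2.0)
    return total / N

samples = gibbs_sample(rng.integers(0, 2, (100, N)).astype(float), n_steps=100)
print(np.mean([sx_local(s) for s in samples]))  # Monte Carlo estimate of <S^x>
```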

Code walk-through
We perform QST on the Rydberg-atom data using the PYTHON package NetKet [211]. First, we import the library and define the relevant parameters of the numerical experiments. For instance, we consider the reconstruction of a square array with linear size $L = 8$ (containing $N = 64$ spins), with Hamiltonian parameters $V_0 = 3.0$ MHz, $\Omega = 1.0$ MHz, and $\delta = 2.0$ MHz.
We then define the lattice structure and the Hilbert space for the neural-network wave function, and load the training data from a file. Since we are training in a single measurement basis, the arrays bases and rotations are "trivial." Otherwise, these variables would contain, respectively, an integer encoding of each distinct basis and its associated (local) unitary rotations.
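A setup sketch is shown below. It follows the NetKet 2 interface in use at the time of writing; names and signatures may differ in other NetKet versions, and the data file name is a hypothetical placeholder.

```python
import numpy as np
import netket as nk

L = 8                                  # linear size, N = 64 atoms

# Square-lattice graph and the spin-1/2 Hilbert space on its sites.
g = nk.graph.Hypercube(length=L, n_dim=2, pbc=False)
hi = nk.hilbert.Spin(s=0.5, graph=g)

# Occupation-number measurement data; a single (trivial) measurement basis.
training_samples = np.loadtxt("rydberg_samples.txt")
training_bases = np.zeros(len(training_samples), dtype=int)  # all in the same basis
rotations = []                                               # no local basis rotations
```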
Next we define the main components of the QST algorithm: the neural-network wave function, the sampler used to approximate the gradients, and the optimizer for the parameter updates.
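A sketch of these components, again assuming the NetKet 2 interface (the exact signature of the Qsr tomography driver should be checked against the NetKet documentation):

```python
ma = nk.machine.RbmSpinReal(hilbert=hi, alpha=1)  # alpha = n_h / N, real parameters
ma.init_random_parameters(seed=1234, sigma=0.01)

sa = nk.sampler.MetropolisLocal(machine=ma)       # single-spin-flip Metropolis
op = nk.optimizer.AdaDelta()                      # AdaDelta optimizer [212]

qst = nk.Qsr(
    sampler=sa,
    optimizer=op,
    samples=training_samples,
    rotations=rotations,
    bases=training_bases,
    n_samples=1000,        # model samples per gradient estimate
    n_samples_data=1000,   # data samples per gradient estimate
)
# H_rydberg: the Rydberg Hamiltonian as a NetKet operator (construction omitted).
qst.add_observable(H_rydberg, "Energy")
for step in qst.iter(n_iter=1000):     # run the tomography for 1000 epochs
    pass
```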
In the definition of the RBM (with real-valued network parameters), the parameter $\alpha = n_h / N$ represents the density of hidden units. We use the AdaDelta optimizer [212] and a Metropolis sampler based on simple single-spin flips. The tomography parameters n_samples_data and n_samples refer, respectively, to the number of training samples and the number of samples drawn from the model distribution to compute the gradients for a single parameter update (i.e., for batch gradient descent). We refer the reader to the NetKet documentation for additional details. Finally, we generate the Rydberg Hamiltonian to be measured during the learning, and run the QST for a fixed number of training iterations (epochs).

We perform QST on datasets of projective measurements generated using the MPSs obtained from DMRG at different detunings. Each dataset contains $10^5$ measurements. We train each RBM separately using the hyperparameters reported above, and measure various observables of interest at each training iteration. We show the results in Fig. 4, where we plot the average energy per spin $\langle \hat{H} \rangle / N$ and the average magnetizations $\langle \hat{S}^{z/x} \rangle = \sum_j \langle \hat{S}^{z/x}_j \rangle / N$ after the training has converged. Each data point is obtained by averaging the expectation values of the observables over the last 100 iterations of the training. The reconstruction shows overall good agreement with the exact values computed with the MPS wave functions.

C. Reconstruction of a molecular wave function
We now move to the more general case of a wave function that is nonpositive or complex valued. To accommodate this different setup, we first need to modify the variational ansatz to allow for complex-valued amplitudes $\psi_{\theta}(\sigma)$. One way to achieve this is to use the RBM to parametrize the probability distribution in the reference basis (as before) and couple it with an additional RBM that parametrizes the phases, $\psi_{\theta\mu}(\sigma) = \sqrt{p_{\theta}(\sigma)}\, e^{\, i \log p_{\mu}(\sigma)/2}$ [71]. A different strategy, which we adopt here, is to promote the weights and biases to complex values [50]. For the latter representation, we can no longer interpret the RBM as a probabilistic graphical model.
The presence of phases in the target wave function leads to an additional overhead in the measurement requirements. Projective measurements in the computational basis do not carry enough information to uniquely identify the state. To obtain information about the phases, measurements in additional bases are required. To account for this, we can write the training dataset as $D = \{x_k\}$, where each single-shot measurement is given by $x = (\tau, \sigma)$, with $\tau$ and $\sigma$ referring, respectively, to the measurement basis and the binary measurement outcome. For example, for the case of Pauli measurements, the data point $x = (\sigma^x_1 = 0,\ \sigma^z_2 = 1)$ refers to a measurement of qubit 1 in the eigenbasis of the Pauli $X$ operator with outcome 0, and of qubit 2 in the eigenbasis of the Pauli $Z$ operator with outcome 1.
The learning algorithm proceeds in a similar fashion to the case of a positive wave function, reducing to the optimization of the negative log-likelihood cost function

$$C(\theta) = -\frac{1}{|D|} \sum_{x \in D} \ln p_{\theta}(x).$$

The important difference is that, to evaluate measurement probabilities in bases other than the reference one, we need to appropriately rotate the neural-network wave function. We can write this measurement probability as

$$p_{\theta}(x) = \big|\langle \sigma | \hat{U}(\tau) | \psi_{\theta} \rangle\big|^2,$$

where $\hat{U}(\tau)$ refers to the unitary transformation that rotates the reference basis into the measurement basis $\tau$. We also assume local measurements, leading to the factorization $\hat{U}(\tau) = \bigotimes_{j=1}^{N} \hat{U}(\tau_j)$ into single-qubit unitary rotations $\hat{U}(\tau_j)$. For the example described above, with $x = (\sigma^x_1 = 0,\ \sigma^z_2 = 1)$, the neural-network probability is (up to a normalization factor) $p_{\theta}(x) \propto |\langle 01 | \hat{H}_1 \otimes \hat{\mathbb{1}}_2 | \psi_{\theta} \rangle|^2$, where $\hat{H}$ is the Hadamard gate (i.e., the rotation into the $\sigma^x$ basis). For a full derivation of the reconstruction with generic measurement bases, we refer the reader to Ref. [80].
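As a concrete check of this formula, the following NumPy snippet evaluates the rotated-basis probability for the two-qubit example above, with a random state standing in for $\psi_{\theta}$.

```python
import numpy as np

H1 = np.array([[1, 1], [1, -1]]) / np.sqrt(2)  # Hadamard: rotation into the X basis
I2 = np.eye(2)
U = np.kron(H1, I2)                            # U(tau) = H on qubit 1, identity on qubit 2

psi = np.random.randn(4) + 1j * np.random.randn(4)
psi /= np.linalg.norm(psi)                     # generic normalized two-qubit state

psi_rot = U @ psi                              # rotate into the measurement basis
index = 0b01                                   # outcome (sigma_1, sigma_2) = (0, 1)
p = np.abs(psi_rot[index]) ** 2                # p(x) = |<01| H (x) 1 |psi>|^2
```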
We show neural-network QST of a nonpositive wave function for the case of the electronic ground state of a small molecule. The Hamiltonian in second quantization is given by

$$\hat{H} = \sum_{\alpha\beta} t_{\alpha\beta}\, \hat{c}^{\dagger}_{\alpha} \hat{c}_{\beta} + \frac{1}{2} \sum_{\alpha\beta\gamma\delta} u_{\alpha\beta\gamma\delta}\, \hat{c}^{\dagger}_{\alpha} \hat{c}^{\dagger}_{\beta} \hat{c}_{\gamma} \hat{c}_{\delta},$$

where $\hat{c}^{\dagger}$ and $\hat{c}$ are fermionic creation and annihilation operators, and $t_{\alpha\beta}$ and $u_{\alpha\beta\gamma\delta}$ are electronic integrals. This Hamiltonian can be mapped into a qubit Hamiltonian using one of several mappings (e.g., Jordan-Wigner or Bravyi-Kitaev),

$$\hat{H} = \sum_{k} c_k\, \hat{P}_k,$$

where the $c_k$ are interaction coefficients and the $\hat{P}_k$ are operators belonging to the $N$-qubit Pauli group. We specifically look at the beryllium hydride molecule (BeH$_2$) in the STO-3G basis, mapped to $N = 6$ qubits [213]. To generate the training data, we first obtain the full ground-state wave function $|\Psi\rangle$ by exact diagonalization of the qubit Hamiltonian. Given the ground state, we can generate measurement data by sampling the full probability distribution obtained from the Born rule. A single measurement is obtained by first selecting a measurement basis $\tau$, then rotating the wave function accordingly, $|\Psi_{\tau}\rangle = \hat{U}(\tau) |\Psi\rangle$, and finally sampling the measurement outcome $\sigma \sim P_{\tau}(\sigma) = |\langle \sigma | \Psi_{\tau} \rangle|^2$. We choose the measurement bases according to the Pauli operators $\hat{P}_k$ appearing in the Hamiltonian, each basis being selected randomly from this set.
We show the results of the QST experiment in Fig. 5, where we plot the fidelity between the neural-network wave function and the target wave function during the training. The ability to faithfully learn this class of wave functions from limited measurement data has important implications for simulations of quantum chemistry with near-term hardware (e.g., using a variational quantum eigensolver approach [214]). This is due to the variational property of parametric quantum states (more details are given in Sec. V). Measuring the energy from a chemistry quantum simulation requires the independent measurement of a (typically) large number of noncommuting Pauli operators, which inevitably leads to a large variance in the energy estimators [213]. In contrast, by using the measurement data to train a RBM wave function instead, energy estimators with extremely low variance can be produced (provided the training is sufficiently accurate). This, however, introduces a systematic bias due to the training imperfection on limited-size datasets and the RBM approximation. The trade-off between variance and bias was studied in Ref. [80], which demonstrated a substantial decrease in measurement overhead for sufficiently pure quantum states.

FIG. 5. Quantum state tomography of the beryllium hydride molecule. We show the fidelity $F = |\langle \Psi | \psi_{\theta} \rangle|^2$ between the neural-network wave function $\psi_{\theta}$ and the exact molecular wave function at each iteration during training.

D. Discussion
We have shown that for wave functions whose amplitudes can be gauged to be real and positive, QST is equivalent to unsupervised learning of projective measurement data in a single basis. For more general wave functions with a sign structure or complex-valued amplitudes, measurements in multiple bases are required to reconstruct the phases. These are processed by the neural network by appropriately rotating the parametrized wave function with a unitary $\hat{U}(\tau) = \bigotimes_{j=1}^{N} \hat{U}(\tau_j)$ composed of single-qubit basis rotations $\hat{U}(\tau_j)$. In practice, this operation entails an exponential cost in the number of nontrivial rotations $\hat{U}(\tau_j) \ne \hat{\mathbb{1}}_j$, which means that only a small fraction of any IC set of bases can be used to train the neural network. Nevertheless, this reduced amount of information may be enough to reconstruct sufficiently structured quantum states. For example, a ground state of a local gapped Hamiltonian can be identified by the statistics of projective measurements in a set of bases corresponding to the decomposition of the Hamiltonian in the Pauli group.
An important assumption that was made is the purity of the quantum state under reconstruction, which is violated in any practical setting. There are cases where an approximate pure state reconstruction of an experimental quantum state may be justified, and could still provide valuable insights [77]. However, for benchmarking and noise characterization tasks, one needs to reconstruct the full density matrix. This can be achieved by introducing a neural-network parametrization of a density operator ρ θ (σ , σ ), which is trained in an analogous way. The positivity of ρ θ can be enforced using a purification scheme, where ρ θ is purified by additional latent units in the neural network [73]. An iterative procedure to discover the dominant eigenstates of density operators using RBMs has also been proposed [81].
A different approach for reconstructing generic mixed quantum states consists in directly parametrizing the probability distribution $p(\alpha) = \mathrm{Tr}[\hat{\Pi}_{\alpha}\, \rho]$ of an IC POVM set $\{\hat{\Pi}_{\alpha}\}$ [75]. In contrast to the purification scheme, this approach does not require unitary rotations, and thus removes the exponential complexity associated with processing data from arbitrary local bases. This, however, comes at the cost of possibly violating the positivity of the learned density matrix, since positivity cannot be enforced at the level of the POVM distribution parametrization.
Finally, in the examples shown above, we always know the exact target quantum state. This clearly facilitates the validation of the reconstruction. In a more general setting, where the state under reconstruction is not known, convergence of the reconstruction needs to be assessed heuristically using finite-size scaling experiments [79]. Once a given representative metric is chosen (e.g., the energy or correlation functions), its estimator should be evaluated while asymptotically increasing the number of parameters in the neural network, as well as the number of training samples.

V. VARIATIONAL GROUND-STATE OPTIMIZATION
The variational principle in quantum mechanics states that the expectation value of the Hamiltonian of a physical system of interest over any valid wave function is always greater than or equal to the ground-state energy of the system. This principle indicates that a strategy for finding an approximation to the ground-state energy of a system is to start from a parametrized wave function and vary its parameters until it yields the minimum possible energy. Here we specifically consider the VMC method [7], where the expectation value of the energy (used for optimizing the trial state) is evaluated by Monte Carlo sampling.
Historically, the choice of the wave function, which is critical to the success of the algorithm, has been motivated in close connection to a physical understanding of the problem, e.g., from approximate mean-field solutions supplemented with some form of additional correlations (such as a Jastrow factor [7]). More recently, motivated by their representational power, efficiency, and generality, neural networks have been explored as trial wave functions [50]. In particular, and as we explore below, RNNs are naturally well suited to the study of systems exhibiting strong correlations, such as those prevalent in condensed matter physics and statistical physics [47,61,62,75]. Furthermore, the ability to draw unbiased samples from a RNN probability distribution greatly empowers VMC approaches, as it removes the need for sophisticated sampling algorithms to obtain sufficiently small autocorrelation times.

A. Recurrent neural networks
A RNN models a probability distribution $p(\sigma) = p(\sigma_1, \ldots, \sigma_N)$ using a sequential structure according to the chain rule of probabilities,

$$p(\sigma) = \prod_{j=1}^{N} p(\sigma_j \,|\, \sigma_{<j}),$$

where $\sigma_{<j} = (\sigma_1, \ldots, \sigma_{j-2}, \sigma_{j-1})$. While specifying every conditional $p(\sigma_j | \sigma_{<j})$ gives a full characterization of any possible distribution $p(\sigma)$, the computational resources required to store and manipulate such a representation grow exponentially with the system size $N$. To alleviate this problem, a recurrent neural network parametrizes the conditional probability distributions $p(\sigma_j | \sigma_{<j})$ at every time step $j$. The elementary building block of a RNN is a recurrent cell [Fig. 6(a)]. In its simplest form, a "vanilla" recurrent cell is a nonlinear function that maps the direct sum (or concatenation) of an incoming hidden vector $\mathbf{h}_{j-1} \in \mathbb{R}^{n_h}$ [Fig. 6(a)] and an input $\mathbf{x}_j = \boldsymbol{\sigma}_{j-1}$ to an output hidden vector $\mathbf{h}_j$ such that

$$\mathbf{h}_j = f\big(W [\mathbf{h}_{j-1}; \boldsymbol{\sigma}_{j-1}] + \mathbf{b}\big), \quad (28)$$

where $f$ is a nonlinear activation function acting element-wise on the components of its argument and $[\mathbf{h}_{j-1}; \boldsymbol{\sigma}_{j-1}]$ is the concatenation operation. The variational parameters of this simple RNN are given by the weight matrix $W \in \mathbb{R}^{n_h \times (n_h + 2)}$, the bias vector $\mathbf{b} \in \mathbb{R}^{n_h}$, and the vectors $\mathbf{h}_0$ and $\boldsymbol{\sigma}_0$ that initialize the recursion. The vector $\boldsymbol{\sigma}_{j-1}$ is a one-hot encoding of the input configuration such that, for example, $\boldsymbol{\sigma}_{j-1} = (1, 0)$ or $(0, 1)$ for a configuration where the input is 0 or 1, respectively. The goal of the internal state vector $\mathbf{h}_j$ is to encode the dependency of the conditional $p(\sigma_j | \sigma_{<j})$ on all the previous variables $\sigma_{<j}$. Thus, in our setting, the internal state vector provides the mechanism through which the RNN ansatz encodes correlations and entanglement in the quantum state, since, for example, setting $\mathbf{h}_j = 0$ results in a product state. As implied below, the dimension $n_h$ of $\mathbf{h}_j$ determines the number of parameters in the RNN and its expressive power.

Instead of the vanilla RNN of Eq. (28), we specifically consider gated recurrent units (GRUs) [215]. The GRU was introduced to solve the vanishing and exploding gradient issues of RNNs encountered during the learning process, where the neural network's weights receive updates proportional to the partial derivative of the error function with respect to the current weight. In a vanilla RNN, the gradients can become vanishingly small or extremely large, effectively preventing the weights from changing or causing numerical overflow, respectively [216]. A schematic of a GRU is shown in Fig. 6(b). Given a visible input state $\boldsymbol{\sigma}_{j-1}$ and a hidden vector $\mathbf{h}_{j-1}$, the GRU at time step $j$ processes them according to a sequence of operations and outputs an updated latent vector $\mathbf{h}_j$. The first two operations are the so-called update gate and reset gate,

$$\mathbf{z}_j = \mathrm{sig}\big(W_z [\mathbf{h}_{j-1}; \boldsymbol{\sigma}_{j-1}] + \mathbf{b}_z\big),$$
$$\mathbf{r}_j = \mathrm{sig}\big(W_r [\mathbf{h}_{j-1}; \boldsymbol{\sigma}_{j-1}] + \mathbf{b}_r\big),$$

where $\mathrm{sig}(x) = (1 + e^{-x})^{-1}$ is the sigmoid function, and $W_z, W_r \in \mathbb{R}^{n_h \times (n_h + 2)}$ and $\mathbf{b}_z, \mathbf{b}_r \in \mathbb{R}^{n_h}$ are variational parameters. These gates are used to control how much information about previous time steps is kept encoded in the latent vector. Intuitively, if the reset gate is close to 0, the hidden state is forced to omit the information encoded in the previous hidden state and utilize the current input only. This effectively encourages a more compact and efficient utilization of the capacity of the RNN, since the hidden state can drop irrelevant information as needed. On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state.
Next, given the input visible state σ_{j−1} at the current time step, an internal latent state is created according to

$$ \tilde{h}_j = \tanh\left( \tilde{W} [r_j \odot h_{j-1}; \sigma_{j-1}] + \tilde{b} \right), \qquad (31) $$

where a ⊙ b denotes the element-wise multiplication of two vectors, and W̃ ∈ R^{n_h×(n_h+2)} and b̃ ∈ R^{n_h} are new sets of parameters. The new latent vector (the output of the GRU) is generated as

$$ h_j = z_j \odot \tilde{h}_j + (1 - z_j) \odot h_{j-1}, \qquad (32) $$

and is sent to the GRU at time step j + 1. Finally, the conditionals are computed using a softmax layer,

$$ p(\sigma_j \,|\, \sigma_{<j}) = \mathrm{Softmax}\left( U h_j + c \right) \cdot \sigma_j, \qquad (33) $$

where U ∈ R^{2×n_h} and c ∈ R^2 are the variational parameters of the softmax layer and σ_j is the one-hot encoding of the spin at site j.

Given the above parametrization of a probability distribution p(σ), we can now promote RNNs to quantum mechanical wave functions ψ(σ). As noted in the QST examples, we stress that stoquastic many-body Hamiltonians have ground states |Ψ⟩ with strictly real and positive amplitudes in the standard computational basis [210]. This class of states can be represented in terms of probability distributions,

$$ |\Psi\rangle = \sum_{\sigma} \sqrt{p(\sigma)}\, |\sigma\rangle. \qquad (34) $$

For such a family of quantum states, which includes the ground states of the Rydberg system considered in this work, it is natural to approximate p(σ) with a RNN.
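To make this construction concrete, the following is a minimal TensorFlow sketch of such a GRU wave function, including autoregressive sampling and the evaluation of log ψ(σ). The class and method names are our own illustrative choices and not the tutorial's actual code; we set reset_after=False in the Keras GRUCell so that its gating matches the convention of Eqs. (29)-(32).

import tensorflow as tf

class RNNWavefunction(tf.keras.Model):
    """Sketch of a GRU wave function psi(sigma) = sqrt(p(sigma)) for N binary sites."""

    def __init__(self, n_sites, n_hidden):
        super().__init__()
        self.n_sites = n_sites
        # reset_after=False matches the gating convention of Eqs. (29)-(32)
        self.gru = tf.keras.layers.GRUCell(n_hidden, reset_after=False)
        # softmax layer producing the conditionals of Eq. (33)
        self.out = tf.keras.layers.Dense(2, activation="softmax")

    def sample(self, n_samples):
        """Draw n_samples configurations autoregressively from p(sigma)."""
        h = tf.zeros([n_samples, self.gru.units])  # h_0 = 0
        sigma = tf.zeros([n_samples, 2])           # sigma_0 = 0 initializes the recursion
        spins = []
        for _ in range(self.n_sites):
            y, [h] = self.gru(sigma, [h])          # h_j from (h_{j-1}, sigma_{j-1})
            probs = self.out(y)                    # p(sigma_j | sigma_{<j})
            s = tf.random.categorical(tf.math.log(probs), 1)[:, 0]
            sigma = tf.one_hot(s, 2)               # one-hot input for the next step
            spins.append(s)
        return tf.stack(spins, axis=1)             # shape (n_samples, n_sites)

    def log_psi(self, samples):
        """log psi(sigma) = (1/2) sum_j log p(sigma_j | sigma_{<j}), cf. Eq. (34)."""
        n = tf.shape(samples)[0]
        h = tf.zeros([n, self.gru.units])
        sigma = tf.zeros([n, 2])
        log_p = tf.zeros([n])
        for j in range(self.n_sites):
            y, [h] = self.gru(sigma, [h])
            probs = self.out(y)
            one_hot = tf.one_hot(samples[:, j], 2)
            log_p += tf.math.log(tf.reduce_sum(probs * one_hot, axis=1))
            sigma = one_hot
        return 0.5 * log_p

Because each configuration is generated in a single left-to-right pass, sample draws independent configurations directly from |ψ(σ)|², with no Markov chain and no autocorrelation time, which is the property exploited by the VMC method below.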

B. Variational Monte Carlo simulation of Rydberg atoms
The goal of the VMC method is to iteratively optimize an ansatz wave function to approximate ground states of local Hamiltonians [7]. The VMC method uses a trial wave function |Ψ_θ⟩ endowed with parameters θ. Here we consider a GRU-RNN wave function. Crucially, we exploit the fact that the RNN wave function allows efficient sampling from the square of the amplitudes of |Ψ_θ⟩.
The VMC method iteratively minimizes the expectation value of the energy,

$$ E = \frac{\langle \Psi_\theta | \hat{H} | \Psi_\theta \rangle}{\langle \Psi_\theta | \Psi_\theta \rangle}. \qquad (35) $$

The minimization is done using the gradient-descent method or any variant of it. Since the RNN wave function is normalized, ⟨Ψ_θ|Ψ_θ⟩ = 1, the expectation value in Eq. (35) can be written as

$$ E = \sum_{\sigma} |\psi_\theta(\sigma)|^2 \, E_{\rm loc}(\sigma) \approx \frac{1}{N_S} \sum_{i=1}^{N_S} E_{\rm loc}(\sigma^{(i)}), \qquad (36) $$

where the local energy is given by

$$ E_{\rm loc}(\sigma) = \sum_{\sigma'} H_{\sigma \sigma'} \frac{\psi_\theta(\sigma')}{\psi_\theta(\sigma)}, \qquad (37) $$

and the configurations σ^{(i)} are sampled from |ψ_θ(σ)|².

The last expression in Eq. (36) is a sample average of the local energy E_loc(σ). The gradients ∂_θ E can be similarly written as

$$ \partial_\theta E \approx \frac{2}{N_S} \sum_{i=1}^{N_S} E_{\rm loc}(\sigma^{(i)}) \, \partial_\theta \log \psi_\theta(\sigma^{(i)}). \qquad (38) $$

An optimization step involves drawing N_S samples {σ^{(1)}, σ^{(2)}, ..., σ^{(N_S)}} from |ψ_θ(σ)|², followed by an estimation of the energy gradients using automatic differentiation [171] and an update of the parameters according to

$$ \theta \rightarrow \theta - \alpha \, \partial_\theta E, \qquad (39) $$

with a small learning rate α. Instead of this simple gradient-descent rule, we can also use the Adam optimizer [172] to implement the parameter updates. The stochastic evaluation of the gradients in Eq. (38) implies that they may exhibit high variance, which can potentially slow down the convergence of the algorithm. This problem can be alleviated through the introduction of a baseline term in Eq. (38) that helps reduce the variance of the gradients [61]:

$$ \partial_\theta E \approx \frac{2}{N_S} \sum_{i=1}^{N_S} \left[ E_{\rm loc}(\sigma^{(i)}) - \bar{E} \right] \partial_\theta \log \psi_\theta(\sigma^{(i)}), \qquad (40) $$

where Ē = (1/N_S) Σ_i E_loc(σ^{(i)}) is the sample mean of the local energies. This estimator has improved variance, stabilizes the convergence of the algorithm, and remains unbiased [61]. In the limit where E_loc(σ^{(i)}) ≈ E near convergence, the variance of the gradients ∂_θ E goes to zero, as opposed to the nonzero variance of the estimator in Eq. (38).
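In an automatic-differentiation framework, the variance-reduced estimator of Eq. (40) is conveniently obtained by differentiating a surrogate scalar in which the local energies are held constant. The sketch below assumes the RNNWavefunction class sketched above; the function name surrogate_loss is our own.

import tensorflow as tf

def surrogate_loss(model, samples, e_loc):
    # Differentiating this scalar with respect to the network parameters
    # reproduces Eq. (40): tf.stop_gradient ensures the (centered) local
    # energies are treated as constants by automatic differentiation.
    log_psi = model.log_psi(samples)
    e_centered = tf.stop_gradient(e_loc - tf.reduce_mean(e_loc))
    return 2.0 * tf.reduce_mean(e_centered * log_psi)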

Code walk-through
We provide an implementation of the VMC method using a RNN wave function based on the TensorFlow library. We first define all the relevant parameters of the Rydberg Hamiltonian and the RNN training.
The RNN parameters are the learning rate for the optimizer, the number of hidden units in the GRU cell, the number of samples used to approximate the energy (and its gradients) at each training iteration, and the total number of epochs. The VMC module is then initialized accordingly.
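Since the tutorial's listings are not reproduced here, the following sketch illustrates this setup; all variable names and values are hypothetical placeholders chosen for illustration (the 8 × 8 array and n_h = 100 match the experiments of Fig. 7).

# Rydberg Hamiltonian parameters (hypothetical names and values)
Lx, Ly = 8, 8          # linear dimensions of the square atom array
Omega = 1.0            # Rabi frequency
delta = 1.0            # detuning
Rb = 1.2               # blockade radius, in units of the lattice spacing

# RNN training parameters
lr = 1e-3              # learning rate for the optimizer
n_hidden = 100         # number of hidden units in the GRU cell
n_samples = 500        # samples per energy/gradient estimate
n_epochs = 1000        # total number of training epochs

# initialize the VMC ansatz: here we reuse the RNNWavefunction sketch above
model = RNNWavefunction(n_sites=Lx * Ly, n_hidden=n_hidden)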
We can now perform the VMC simulation by training the RNN parameters. For a given number of epochs, we first sample the RNN distribution to generate ns samples. These can be used to evaluate the cost function of the optimization and its gradients, computed here using the AD functionalities from TensorFlow. The parameters are then updated according to the gradients.
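Putting the pieces together, a schematic version of this training loop reads as follows; the helper local_energy, which would implement Eq. (37) for the Rydberg Hamiltonian, is assumed rather than shown, and n_samples plays the role of ns in the text.

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=lr)

for epoch in range(n_epochs):
    samples = model.sample(n_samples)      # draw samples from |psi_theta|^2
    e_loc = local_energy(model, samples)   # assumed helper computing Eq. (37)
    with tf.GradientTape() as tape:
        loss = surrogate_loss(model, samples, e_loc)  # Eq. (40) via AD
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    if epoch % 100 == 0:
        print(epoch, float(tf.reduce_mean(e_loc)))    # monitor the energy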
We show in Fig. 7 the results of VMC simulations for an 8 × 8 array of Rydberg atoms. Figure 7(a) shows that increasing the number of hidden units in the RNN leads to increased accuracy in the training. In Fig. 7(b), we plot the average energy after convergence as a function of the detuning using n_h = 100, and compare it with the values obtained from DMRG.

VI. FREQUENTLY ASKED QUESTIONS
In this section, we expand on a brief set of questions typically encountered during the training and evaluation of machine learning models in the context of quantum physics.

A. What neural-network architecture should I use?
The architecture and characteristics of the neural network depend directly on the features of the problem at hand, such as whether the represented degrees of freedom are subject to geometric constraints or are constrained by a set of symmetries. For instance, a CNN is naturally well suited to capturing two-dimensional systems with translational invariance [15]. Several other symmetries have also been implemented in neural-network representations of wave functions, including U(1) charge conservation [89], SU(2) spin symmetry [217], and gauge symmetries [218,219]. Similarly, one may choose to adopt an autoregressive neural network (such as the RNN) for problems where sampling may be a bottleneck in the algorithm, as it often is in VMC simulations of strongly interacting quantum matter and spin glasses [47,61,220].

B. How do I certify convergence in the training?
There is no general answer to this question, but there are some guidelines one should follow when implementing data-driven algorithms for quantum problems. There are usually two complexity scales involved in training a neural-network model: the representational power of the model (i.e., the number of network parameters) and the sample complexity, i.e., how much data one needs to learn the underlying features of a state with small error (for VMC simulations, one can interpret the number of Monte Carlo samples per gradient step as data). The assessment of convergence is problem specific, and depends on whether there are tractable estimators to probe the convergence. For a VMC simulation, for example, the value of the energy is a natural estimator of convergence, and the energy variance can be used to estimate how close the parametric wave function is to an energy eigenstate: zero variance indicates that the algorithm has found an exact eigenstate.
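As a concrete illustration for the VMC case, given the local energies e_loc collected at the end of a run (as in the training-loop sketch above), this variance check amounts to two lines:

import tensorflow as tf

e_mean = tf.reduce_mean(e_loc)                 # variational energy estimate
e_var = tf.reduce_mean((e_loc - e_mean) ** 2)  # approaches 0 for an exact eigenstate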
In contrast, in QST tasks the optimization objective (e.g., the Kullback-Leibler divergence) is not accessible when using models with intractable densities (such as the RBM). In this scenario, one should instead monitor other observables that are relevant to the problem at hand, such as local densities, magnetizations, or various types of correlation functions. A direct comparison between the model prediction and the measurement data can be obtained using observables that are diagonal in the measurement basis. In both cases, convergence can be assessed by performing a scaling analysis of a relevant observable of choice against both the number of measurements and the number of network parameters [79].

VII. CONCLUSIONS
In this article, we presented applications of machine learning algorithms based on neural networks to quantum many-body physics. We focused on numerical demonstrations and provided hands-on code tutorials based on open-source software [154,173,211], with the goal of facilitating learning for researchers new to the field and accelerating the adoption of machine learning in quantum physics.
We showcased distinct machine learning paradigms, implemented with different neural-network architectures. As a test bed for the numerical experiments, we chose a system of interacting Rydberg atoms arranged in a two-dimensional square array. First, we demonstrated supervised learning of atomic occupation data, which were generated from ground states of the Rydberg Hamiltonian obtained using the density matrix renormalization group. Using a convolutional neural network trained on labeled data, we showed how to learn the quantum phase transition between a disordered phase and the antiferromagnetically ordered phase.
We presented unsupervised learning of unlabeled data using the restricted Boltzmann machine. We showed that, for pure quantum states with real and positive amplitudes (e.g., the Rydberg ground states), this procedure is equivalent to quantum state tomography based on a neural-network representation of a quantum state. Using atomic occupation data, we trained Boltzmann machines to learn the ground states of the Rydberg Hamiltonian. We also described a simple extension of this approach to learn quantum states with a sign structure, and demonstrated it in the context of the simulation of chemistry with quantum computers, where we learned the ground state of a molecule from qubit measurements.
The final application we explored is the Monte Carlo optimization of a variational wave function to estimate the ground state of a many-body Hamiltonian. We parametrized a wave function using a recurrent neural network and trained its parameters to lower both the expectation value and the variance of the Rydberg Hamiltonian.
It is becoming increasingly evident that machine learning is full of thrilling opportunities and conceptual advances with great potential to energize computational and experimental physics. As these notions continue to spread through the research landscape of strongly correlated quantum matter and quantum information science, we hope this tutorial will provide a useful first step into the expanding domain of artificial intelligence for the study of quantum many-body systems.

ACKNOWLEDGMENTS
The numerical simulations were performed at the Simons Foundation Super-Computing Center using the following software libraries: ITensor [154] and PastaQ [221] (for the simulation of the Rydberg atoms and data generation), TensorFlow [173] (for supervised learning and variational Monte Carlo simulations), and NetKet [211] (for quantum tomography). The Flatiron Institute is supported by the Simons Foundation. J.C. acknowledges support from the Natural Sciences and Engineering Research Council of Canada, the Shared Hierarchical Academic Research Computing Network, Compute Canada, the Google Quantum Research Award program, and the CIFAR AI Chairs program.