Gaussian Process States: A data-driven representation of quantum many-body physics

We present a novel, non-parametric form for compactly representing entangled many-body quantum states, which we call a `Gaussian Process State'. In contrast to other approaches, we define this state explicitly in terms of a configurational data set, with the probability amplitudes statistically inferred from this data according to Bayesian statistics. In this way the non-local physical entanglement features of the state can be analytically resummed, allowing for exponential complexity to underpin the ansatz, but efficiently represented in a small data set. The state is found to be highly compact, systematically improvable and efficient to sample, representing a large number of known variational states within its span. It is also proven to be a `universal approximator' for quantum states, able to capture any entangled many-body state with increasing data set size. We develop two numerical approaches which can learn this form directly: a fragmentation approach, and direct variational optimization, and apply these schemes to the Fermionic Hubbard model. We find competitive or superior descriptions of correlated quantum problems compared to existing state-of-the-art variational ansatzes, as well as other numerical methods.


I. INTRODUCTION
Representing entangled quantum many-body states efficiently and compactly is a major challenge, with the need for tractable approaches underpinning a diverse set of fields including quantum computation, electronic or nuclear structure, and quantum chemistry. Advances in these fields are often punctuated by the discovery and exploitation of efficient wave function forms for capturing certain physics with polynomial-scaling resources. These non-linear parameterizations avoid the famed 'exponential wall' in the complexity of the general solution. This includes forms (and physical features) such as the Gutzwiller wave function [1] (suppression of local double occupancy), tensor network [2][3][4] (low entanglement), Laughlin [5] (fractionalized excitations), or coupled-cluster states [6] (low rank reducible correlations). However, it is not just the quantum many-body problem which encounters a curse of dimensionality. Increasingly, computational fields are turning to statistical inference and machine learning approaches to circumvent this bottleneck, and to allow a more general framework to capture arbitrary functional forms in high-dimensional spaces. Following this spirit, in this work we espouse a new philosophy, whereby wave functions are not defined and constrained by a fixed parameterization, but rather defined explicitly in terms of 'data', with their parameters considered statistically as random variables. This leads to the description of entangled wave functions in a systematically improvable and non-parametric fashion. The expressiveness, accuracy and compactness of the resulting state is found to be competitive or surpass that of * These two authors contributed equally to this work. † george.booth@kcl.ac.uk state of the art wave function parameterizations, encoding many important correlated states in a simple form.
The data used to define this wave function consists of a small 'training' sample of underlying many-body configurations and their (approximate) probability amplitudes. The question we wish to address first is how to optimally use this training set to infer the wave function amplitudes on any other configuration in the exponentially large Hilbert space, in a statistically rigorous fashion. This allows us to define a complete wave function spanning all configurations, which can then be efficiently sampled to extract any desired property of the system. In this way, the state is explicitly parameterized by the data set, rather than this data used instead to optimize a fixed functional form. It is clear that this datadriven approach will not be possible for a random state where no structure exists in the form of the amplitudes, but it is known that physical many-body states exhibit much structure which can be effectively learnt in this approach, emerging from the underlying principles encoded in the Hamiltonian. After this inferred wave function is defined and its limits discussed, we turn the idea into a practical and general method for quantum problems, and demonstrate its accuracy and improvability in describing full N -body quantum correlations by application to the Fermionic Hubbard model of the cuprates and strongly correlated solids. In this work, we consider two sources for the data: configurations from the solution to a small fragment of the full system, and a self-consistent approach to iterative selection and refinement of the data. Furthermore, we show that an automatic selection of the relevant data required to capture the complexity of the many-body correlations allows for a compact description of this state compared to other approaches, with significant potential for a wide range of applications.

II. GAUSSIAN PROCESS REGRESSION FOR WAVE FUNCTIONS
In order to describe the amplitudes of a quantum state, Ψ(x), evaluated for any many-body configuration x in the underlying Hilbert space, we work within a Bayesian inference framework known as Gaussian process (GP) regression. This approach falls under the umbrella of 'kernel' methods, which constitute the alternate paradigm to neural networks in machine learning. The great flexibility and ease-of-use of GPs and other kernel methods has brought them to prominence in a range of fields, including the interpolation of potential energy surfaces and other high-dimensional functions [7][8][9][10]. Advantages over neural networks include a rigorous underlying probablistic model and uncertainty bounds, the ease of incorporation of symmetries and conserved quantities, well understood regularization procedures, and robust and theoretically justified algorithms [11]. In this non-parametric approach, an unknown target function is learnt from samples at a finite number of input points, allowing for a statistical inference of an estimator for the underlying function. The weights of the model are considered as random variables, which define a distribution over all linear models of the state in a high-dimensional space of 'features', defined by the map Φ(x). The scalar product of two configurations in this feature space define a kernel function, which can be evaluated between any two configurations, as k(x, x ) = Φ(x) · Φ(x ). Estimating the expectation of the Gaussian process then involves evaluating a linear combination of kernel functions k(x, x t ) between the target configuration x and each configuration x t of the training database.
We introduce a 'Gaussian Process State' (GPS) as the exponential of a GP estimator, where a configurational probability amplitude, Ψ g (x), is given as Many existing parameterized states can be re-expressed in this form. For instance, the Gutzwiller Ansatz can be considered via a feature map into a single dimensional space, given by the number of doubly occupied sites in the configuration. However, the key to circumvent the complexity of a general mapping is to note that the only quantity required for predictions of the configurational amplitudes is the scalar product of two feature vectors, defined by the kernel function. By constructing the kernel function directly, the feature vectors are only ever implicitly defined, allowing for an arbitrary number of features to define the state, and a definition of the model explicitly and solely in terms of the underlying data {x t }. This GPS is a 'universal approximator', meaning that it can approximate a quantum state with arbitrary accuracy when provided with sufficient data. A GPS can also be recast as a feed-forward neural network with a infinitedimensional single hidden layer [12,13], connecting the approach to the recently developed neural-network states [14][15][16][17][18][19][20][21][22]. These have recently come to prominence, particularly in the form of a 'Restricted Boltzmann Machine' (RBM) which also forms a universal approximator for quantum states [23][24][25]. The exponential form in Eq.
(1) is reminiscent of other wave function forms, such as the Jastrow [26] or coupled-cluster ansatzes [6]. However in contrast, the GPS is defined explicitly in terms of Nbody configurations, rather than parameterized in terms of low-order features. The exponential form also ensures that the overall state for our additive kernel is appropriately product separable in terms of its local features, and size extensive allowing its properties to appropriately scale to the thermodynamic limit. As a scalar product of the configurational features, the kernel function k(x, x t ), aims to quantify the similarity between any two configurations in the system, and it is clear that the specific choice of kernel function, as well as training configurations {x t } and weights w t , is of paramount importance for the expressibility and success of the model.

A. The kernel function
The kernel function defines the set of entanglement features for the state which are optimally modelled from the data, without explicitly applying the corresponding feature map into this space. With a kernel including all features we are able to reproduce any state with our ansatz, assuming we have a complete set of exact data points. More importantly though, we can introduce a set of kernels which gives rise to a systematically improvable ansatz which can accurately approximate the state using a far smaller set of training configurations.
We can write a configuration x in terms of its local Hilbert space on each spatial degree of freedom, i. As a specific example of Fermionic lattice models used later, we can consider four local Fock states for each lattice site, with x = x i where x i ∈ {·, ↑, ↓, ↑↓} (for a spin or qubit model, there would just be two local states, x i ∈ {↑, ↓}). The kernel function compares these local occupations between two configurations, extracting multi-site entanglement features common to both. To do this, the kernel k(x, x ) is written as a linear combination of products of delta functions δ xix j which we define to be 1 if the local occupation x i is equal to the local occupation x j , and 0 otherwise. We do not assume that the dimension of the two configurations are the same, and can therefore compare configurations with different numbers of lattice sites.
The simplest kernel just compares the local occupancies, given by where L and L refer to the size of the lattices corresponding to x (denoting the test amplitude to infer) and x (the training point), and where summing over the lattices is optional but enforces translational symmetry in both the training data and overall GPS amplitudes if desired. The superscript '(1)' emphasises that this kernel extracts features corresponding to the Fock space of local, single-site plaquettes, therefore implicitly mapping each configuration to a four-dimensional space corresponding to its number of each local state. This therefore extracts the features used in the Gutzwiller ansatz (number of doubly occupied sites), with the only difference that the weighting of the feature is given implicitly by the data amplitudes (w t ), rather than explicitly in this feature space as is traditionally done. In order to include higher order entanglement features, we can include comparisons of the occupations of plaquettes over multiple sites for the two configurations (see figure 1). Here, the advantage of using a kernel function to extract descriptors implicitly becomes very prominent; while the number of possible multi-site features grows exponentially with the number of sites in the plaquette considered, the kernel can still be efficiently evaluated. Multi-site features can simply be included via products of delta functions evaluated for different displacements of sites i and j. Exploiting this, we can analytically resum the contribution to the kernel function from all possible topologies and ranks of entanglement descriptors defined up to a maximum number of sites (controlled by the hyperparameter p), and maximum length (controlled by a set of displacements D), giving a general kernel form In this kernel, all pairs of sites i and j in the two configurations are compared; if their local occupancy is the same (δ xixj = 1), then their local surroundings (x i and x j ) are also compared by means of the 'local' kernelk(x i , x j ). The local comparison includes all displacements d ∈ D of i and j, running over the |D| closest sites. In this work, we choose D to include all possible displacements over the full lattice unless otherwise stated. Raising the local kernel to the power of p, we obtain a linear combination of all possible multi-site plaquette entanglement features in our implicit parameterization of the state in the form of an 'additive' kernel, which heirarchically includes all lower-rank features [27]. As p → (L − 1), the implicit feature map of each configuration, φ(x), is injective, fully characterizing the configuration. Further flexibility in this form is provided by the hyperparameter θ, controlling the relative weighting of high-order vs low-order features, as well as the function f (d), which in this work we choose to be the inverse of the displacement distance, f (d) = 1/|d|, to weight the characterization of short-range entanglement features more than longer-range. The complexity of the underlying parame-terization of the GPS is therefore controlled in this work by hyperparameters θ and p. These will optimally model the entanglement features in the data up to p + 1 sites, with θ controlling the relative importance in the rank of these features.
Training config.
Test config. The introduced kernel can be shown to model well known ansatzes under specific limits of these hyperparameters. For θ → ∞ or p → 0, we recover the k (1) kernel introduced in Eq. (2) which gives rise to a purely local Gutzwiller underlying form. For θ = 0 and p = 1, we recover a generalised two-body Jastrow factor, and finally θ = 0 and p → ∞ generates an Entangled Plaquette State (EPS) [28,29] with all plaquette sizes up to |D|+1. As D is then allowed to increase in size to include all system displacements, then the kernel is at its most expressive, allowing all features to be considered, and an underlying (over)-parameterization of the complete wave function space.
Other important wave functions are also compactly expressed within a GPS formalism. The n-qubit W state [30] is a multipartite entangled state, important in quantum computing, and cannot be efficiently represented as a Matrix Product State (MPS) or EPS [31]. This can however be represented as a GPS (θ = 0, p → ∞) with a single data configuration. Similarly, strongly-interacting states such as the Laughlin wave function [5,32] describing the topologically-ordered fractional quantum Hall effect can be exactly represented as a product of two GPS forms acting on a single data configuration. A full derivation of these limiting cases of the kernel are found in Appendix A and B. Symmetries of the system can also be simply included in the kernel form. For instance, to incorporate the spin symmetry of the paramagnetic Hubbard model considered later into our ansatz, we can use the symmetrized kernelk wherex represents the spin-reversal of configuration x . Key to the performance of the GPS, is that despite the fact that an exponential number of entanglement features can underpin the state, the kernel presented in Eq.
(3) can be computed efficiently in only O[L L |D|] time (which can be reduced to O[L L ] when only a local change to x is made to a known kernel value, as is common when sampling the space). A full configurational amplitude evaluation will require N t kernel computations, as given in Eq. (1). This construction of the state in the space of N −body data configurations therefore allows for its flexibility and expressibility. We now turn to the consideration of how compact the set of data configurations can be, and computation of their weights as key for the efficiency of the model.

B. Training and selection of data
Similar to the constructive algorithm for a matrix product state via successive SVD and decimation steps, we aim to devise a deterministic algorithm which allows for the compression of an arbitrary state into GPS form. This requires selection of a set of training configurations, and associated trained weights to optimally represent the original wave function. Ideally, we want this training set to be sparse for a compact description of the state, by exploiting the underlying structure in the entanglement features of the target wave function. The required size of the training set will depend on the choice of kernel hyperparameters and desired accuracy in reproducing the state. In particular, the quality of the wave function can be improved if a more complex kernel is used with a larger set of underlying entanglement features (e.g. by increasing p). However, we expect this will consequently require a larger training set of configurations to appropriately fit -a common tenet of machine learning.
We base our approach on the Relevance Vector Machine (RVM) [33], a Bayesian algorithm for sparse learning. Within the RVM, a different prior distribution is assigned to every weight of a large set of candidate training configurations. The variances of such distributions can be efficiently optimized through the maximization of the marginal likelihood of the model [34], with a large optimal variance for the configuration indicating that it can be pruned away from the data set, whilst still faithfully reproducing the wave function amplitudes up to a desired fidelity. Concurrently, the optimal weights of the resulting configurations, w t , are given by the mean of the posterior distribution for each data point. Complex Gaussian forms for this distribution allow for the optimization of complex weights, which admits signed amplitudes in the overall model. This yields an approach to automatically pick a highly compact training set of support configurations and optimized weights for the GPS, in order to describe a state to a desired accuracy. This is a systematically improvable approach to GPS construction, whereby increasing the complexity of the kernel and hence underlying features parameterizing the state, automatically results in a corresponding increase in the resulting data set size from the RVM. Details on the implementation of the RVM are given in Appendix C. We demonstrate the compression of a correlated state to GPS form by considering the paradigmatic Fermionic Hubbard model, described by the Hamiltonian (6) Figure 2 shows the fidelity in compressing the ground state of this model to GPS form as the kernel complexity is increased via the p hyperparameter, for a small system which can be exactly solved. The error is defined as the mean squared error over all configurations, while the corresponding size of the resulting training set is also shown. With p = 0, only single-site local entanglement features are represented, resulting in the selection of only two relevant training configurations to specify the state with this level of description, as the entire (four-dimensional) feature space is rapidly saturated by the data. As the kernel complexity increases, larger training set sizes are chosen, with a rapid corresponding decrease in the error of the inferred GPS. For the most complex kernel containing all range and rank of correlations, all 3202 symmetry-inequivalent configurations are selected in the training set, resulting in the exact wave function reproduced as a GPS to numerical precision. This represents a key feature of the GPS, that a desired kernel complexity can be used to automatically construct a sparse training set and weights required to optimally represent this complexity of correlations in the state. It is encouraging that the convergence of the GPS is so rapid with training set size, only requiring ∼ 100 training configurations to represent all amplitudes with a mean squared error of ∼ 10 −8 (p = 2) for this system. If we assume that the relevant features required to model the wave function do not change as the lattice increases in size, then we do not expect the GPS error to increase as the sys-tem size grows, for the same fixed test configurational set and kernel complexity. While this will not always be the case, this assumption can be used within a practical construction of a GPS in the large system size limit for this model.

III. PRACTICAL GAUSSIAN PROCESS STATES
A. Extrapolated GPS While the compact representation of a GPS has been shown to have a number of promising features, we turn to practical algorithms for quantum many-body problems, where the desired state is not generally known in advance. This requires a method to choose a necessarily approximate initial state to compress to GPS form, before potential iterative refinement. We first consider an approach whereby we can exactly solve only a fragment of the system, and use the information encoded about the correlations within that fragment to infer the wave function amplitudes in GPS form on a larger simulated system -an approach which has recently been attempted also for neural network states [35]. We denote this fragment the 'training' lattice, and the full system of interest the 'simulation' lattice (e.g. in the thermodynamic limit). The approach is possible as the kernel function can compare entanglement features on training and simulation configurations in different sized Hilbert spaces, as well as the multiplicatively separable features of the GPS ensuring a size extensive and translationally invariant overall description. It is also important to ensure that single-particle physics beyond the lengthscale of the training lattice is still incorporated, which is done via a simple, fixed free particle product state, which has the exact sign structure for this model. The full wave function on the simulation lattice is therefore the product of this free-fermion state and a GPS defined with a training data set on a small sub-lattice, which is designed just to infer the short-ranged entanglement corrections to the underlying single-particle description. This mirrors the use of other successful fragmentation approaches for strong correlation, such as dynamical mean-field theory, where correlations in a small fragment are replicated across a larger system. However here, no symmetries are broken in the simulation lattice, and a correlated wave function is defined over all configurations.
We use this approach to model the ground state of the Hubbard model for larger lattices, where the energy per site is well converged. In 1D, numerically exact results are accessible via the DMRG [2,36], allowing rigorous benchmarking of the accuracy of the resulting GPS state. For this work, we choose an L = 12 site training lattice for the data set, where the system can be exactly solved, but restrict the displacements in the local kernel to include only six-site features (|D| = 5). This ensures that a complete set of all features in this range are spanned, including all particle number and spin fluctua- tions, with the other sites effectively acting as an 'entanglement bath' to ensure a representation in the data of all the six-site features. We then use the RVM to select the training set for different values of p from the exact solution over the training lattice, without further optimization or adjustable parameters. Since the resulting GPS on the simulation lattice is easily specified, a simple Markov chain Monte Carlo is performed to sample the energy of the state, without a sign problem, giving the results in Fig. 3.
For comparison, using standard variational Monte Carlo methodology developed in the 'mVMC' software package [37,38] we also optimize translationallysymmetric, real-valued Gutzwiller, Jastrow and RBM states over the simulation lattice on top of the same (fixed) free-Fermion state, to allow for unbiased comparison of the accuracy and compactness of these states. For the GPS, p = 0 is shown to be essentially Gutzwiller accuracy, while p = 1 is slightly worse than the Jastrow description due to the neglect of two-site correlated features beyond a distance of six sites. As expected, increasing p increases the quality of results up to p = 5, at which point all of the correlations within the considered 6-site plaquettes on the training lattice are captured, resulting in a relative energy error of ∼ 2 × 10 −3 . This leads to an increase in the size of the training set to ∼ 450 configurations.
To mitigate the fact that the parameters of the model (the weights) were deterministically trained by the RVM on the solution of a small sub-lattice, we can also variationally optimize these for the overall state keeping the chosen training configurations fixed (blue squares in Fig. 3). This removes the bias from the weights having been optimized on the smaller lattice, but the improvement is rather marginal. The remaining error therefore primarily stems from the lack of entanglement features in the kernel with a range beyond six sites, for which a larger training lattice for the data is required. By comparison, the RBM state is optimized with increasing numbers of hidden nodes, in principle allowing an improvable description of the ground state. However, the increasing numbers of highly non-linear variational parameters renders the stochastic optimization difficult to reliable converge to the ground state of the model. To mitigate this, the values shown are the lowest achieved from 5 independent optimizations starting from different initial states, but a non-monotonic behaviour with increasing numbers of hidden nodes is still found. This difficulty further underlines the importance of a compact description of the state, which correlates with both the sampling cost and ability to numerically optimize these parameters.
Compact GPS descriptions with competitive or superior accuracy compared to other explicitly variational ansatzes can be found across a range of correlation strengths in this approach, as demonstrated in Fig. 4. We again use a training lattice of N = 12, and a kernel with θ = 0, |D| = 5, p = 3, as the effective Coulomb strength U/t is varied. While the character of the ground- 3) The RVM is used to drastically sparsify the dataset without changing the learned function. 4) Configurations with high local energy variance are added to the sparsified set (green crosses). Note that the final three panels model the function to essentially the same accuracy, allowing systematic improvement in the refining of the data set.
state changes significantly in this range, the GPS results are consistent, outperforming Gutzwiller and Jastrow parameterizations for all U/t, even without any variational optimization of the weights. The α = 3 RBM results are slightly better for small-to-intermediate U/t, where the correlation lengths exceed the size of the training lattice features, while the increasing importance of short-range correlations at larger U/t favour the GPS with this kernel form. An alternative numerical approach to GPS construction is now considered which removes these constraints imposed by a small training lattice, and can thus include correlated features at all lengthscales and rank in an iterative approach.

B. Bootstrapped Optimization of GPS
We propose a numerical procedure for a series of optimization, compression and data reselection steps to iteratively refine a GPS and its data set, illustrated for modelling a continuous function in Fig. 5. The approach is defined directly on the target system, and therefore avoids the need to define a fragment system or restrict to short range features. We first initialise the GPS ansatz with a random set of data configurations with w t = 0, whose weights are then optimized variationally. The data set required for the optimized GPS is then sparsified via the RVM procedure, which prunes redundant data from the original specification which is not required to describe the state to the target accuracy (as given by the kernel complexity). This optimized GPS is then augmented with additional data points outside the original set, in order to acquire additional flexibility, before all weights are then reoptimized variationally and the process is re- peated until convergence. There are a number of possible criteria for selection and addition of relevant new data configurations to improve the GPS; the largest uncertainty in the GPS (obtained in a Bayesian framework), highest local energy, or largest contribution to the variance of the local energy. All these criteria indicate that the configuration is relatively poorly represented in the current GPS data set, which would benefit from explicit inclusion and optimization of these points. We choose the latter criteria, with the selection made from those configurations sampled during the previous Monte Carlo optimization of the GPS. We choose the number of configurations added to the data set to be a constant fraction (in this work 25%) of the size of the compressed data set, whose maximum size is the initial data set size. This ensures that it never exceeds its initial size, and that a large fraction of the data at any point has been already deemed relevant by the RVM, whilst allowing continued exploration of the relevant data and feature space.
We apply this 'bootstrapped' GPS approach to the 2D square Hubbard lattice at U = 8t, where the electronic structure problem is far more challenging. To avoid restriction of the feature space, we use the p → ∞ limit of the kernel, and let θ become an additional (single) variational parameter to optimize at each step. By doing this, the improvability of the bootstrapped GPS now depends solely on the desired size of the data set, with the complexity of the kernel variationally dictated via θ by this data size. We also improve the flexibility of the orbital part of the wave function by using a simultaneously optimized Pfaffian state, together with translational and spin-symmetry projections on this part of the wave func-tion [38][39][40]. This allows for more freedom to alter the nodal structure, and we restrict the weights of the GPS and RBM forms to real values. Fig. 6 shows the relative energy error of the bootstrapped GPS with respect to the number of variational parameters used, as the data set size increases for a 6 × 6 half-filled lattice. Comparison values are obtained with the Gutzwiller, Jastrow and RBM states, all using the same form for the orbital and symmetry-projected reference, while auxiliary-field QMC is able to provide unbiased benchmarks at half-filling [41]. The GPS systematically improves with increasing data set size, surpassing the Gutzwiller accuracy with two data points. As the number of data points increases, it can obtain a description of the same level as Jastrow with similar numbers of parameters, and can ultimately achieve a compact form with an energy error comparable to the best RBM, with an order of magnitude fewer parameters. The fact that the RBM and GPS reach a similar accuracy points to this being the likely limit of the expressiveness of the model, with remaining 0.5% error resulting from the nodal surface constraints of the Pfaffian state. This could be further improved with standard backflow or Lanczos step technology [21,42], while in the future we will look to also describe the sign structure of these systems within the GPS model [43,44]. However, even without these, the results demonstrate the convergence and improvability of the GPS relative to other approaches, and underline the advantage of this data-driven approach, where an exponential number of correlated features can be modelled with very few parameters, for a sparse and accurate representation.
Finally, we consider the tricky n = 0.875 hole-doped parameter regime of the model where agreement between the most accurate numerical methods is substantially harder to achieve compared to the half-filled case [45]. Furthermore, no exact results are available for benchmarking, with many competing low-energy inhomogeneous phases claimed as the ground state [46][47][48][49]. We increase the lattice size to 8 × 8, and compare the convergence of the bootstrapped GPS to other variational approaches in Fig. 7. The database size is grown to 50 configurations, with the additional θ optimization resulting in only 51 parameters. Again, agreement with the RBM is excellent, indicating the improvability to the likely limit of the expressiveness of the model. Interestingly, the Jastrow factor also performs well here for this model, only a little worse than the RBM and GPS states. This is corroborated by the GPS, where we can project the optimized kernel function into the explicit feature space, to extract out the relative weighting of local, nearest-neighbour two-body versus three-body features in the state. From this analysis of the GPS state, we find these three-body features weighted ∼20 times less than the corresponding two-body feature. Overall, the difference in energy between these states is smaller than the error to the best estimates of the ground state in this system [46], and therefore ongoing work of importance for both GPS and RBM states is concerned with ensuring sign information in the flexibility of these approaches [43].

IV. SUMMARY AND OUTLOOK
We present a compact and efficient representation of quantum many-body wave functions, the Gaussian Process State. By allowing the parameters of the model to follow a statistical distribution, we derive a Bayesian inference framework for the probability amplitudes of the state in terms of an exponentially resummed space of physical entanglement features. These are expressed in a kernel function which can compare any two configurations in this feature space, and allow for their amplitudes to be learnt from a data set. The state can be deterministically and efficiently constructed with a manifestly compact data set from any initial state. As the data set is allowed to increase in size, the state is then shown to be able to universally approximate any entangled quantum state. In this way, these states complement the neuralnetwork states, also derived from machine learning principles to approximate any quantum state, but which instead only indirectly depend on their training data.
We also present two numerical approaches which exploit this wave function form directly to approximate an unknown quantum state. In the first, the model is trained on amplitudes from a small subsystem, whose features are then implicitly extrapolated to define a state over the whole system. This is numerically simple, however the restriction to relatively short-ranged entanglement features inspired the development of the bootstrapped iterative optimization approach, which does not rely on any previous wave function information. This has broad applicability, and is shown to be efficient, sparse and accurate for both the half-filled and hole-doped strongly correlated regimes of the Hubbard model. We expect these schemes to work similarly well for other systems relevant for a broad range of scientific domains, and opens a new route for the compact representation of quantum many-body systems directly in terms of data. A.G. and Y.R. contributed equally to this work.

Appendix A: Limits of the kernel function
The kernel function is one of the key ingredients of the GPS ansatz proposed in this work since it implicitly models the entanglement and correlations that can be represented. Its general form, repeated here for convenience, depends on three hyper-parameters: p, θ and the number of considered displacements |D|. The "power" p controls the maximum rank of the correlations modelled (i.e., the maximum number of sites for which an explicit feature is present); the "lengthscale" θ allows tuning of the importance of low-rank and high-rank features in the ansatz; and D determines the set of displacements between two sites which are included in the correlations. In this work, this is generally chosen to be the set of all possible displacements. It is instructive to look at the behaviour of the kernel for certain limiting cases of the hyperparameters just described, since these give rise to well known physical descriptions. a. Gutzwiller. Limit θ → ∞ or p = 0 . A Gutzwiller representation is recovered by either letting θ go to infinity or, equivalently, by taking p to be zero. Indeed, in both cases the kernel in (A1) takes the very simple form This kernel scans over the sites in the two configurations x and x checking for identical occupancies. It is straightforward to check that the above kernel can be written as the scalar product where s is a specific occupancy s ∈ (·, ↑, ↓, ↑↓) and Φ G (x) therefore simply counts the number of sites in x which have a given occupancy s. This means that the kernel k G implicitly defines a linear model on the features Φ G (x). This can be considered a generalised Gutzwiller form which favours or suppresses a given configuration depending on the number of sites in any given state s. b. Jastrow. Limit θ = 0 and p = 1. A generalisation of the Jastrow representation is recovered by setting θ = 0 and p = 1. Indeed, for such a choice of hyperparameters the kernel takes the form As in the case of the Gutzwiller limit just discussed, also in this case it is possible to write the above kernel as an explicit scalar product of some features. In this case the features are defined by the following rank-3 tensor (A6) The component s 1 , s 2 , d 12 of Φ J increases by a factor f (d 12 ) whenever two sites separated by a displacement d 12 are found to be in the states s 1 and s 2 . Similarly to the the Jastrow factor this representation depends on two sites and their relative positions. However, Φ J can be considered of generalized Jastrow form, since it depends on the specific occupation of the two sites, rather than their densities. It should also be stressed that the f (d il ) does not determine the overall amplitude of the feature, but rather the importance in trying to fit these to the data. Its overall weight is instead determined by the weights over the training configurations which exhibit that entanglement feature. c. Squared exponential kernel. Limit p → ∞ Setting the hyperparameter p to finite values is a physically meaningful way to restrict the space of correlations defined by the kernel function. However, in certain cases (e.g., in the bootstrapping calculations described in this work) it might be convenient to take the p → ∞ limit, obtaining a very flexible kernel controlled by the single hyperparameter θ. To evaluate this limit, we exploit the identity e x = lim n→∞ (1 + x n ) n , obtaining the following kernel The above kernel can be seen as a squared exponential kernel on the distance induced by the scalar product S(x i , x j ). Such a distance metric is related to the concept of a Hamming distance [50] in discrete strings. This squared exponential kernel includes by construction all possible correlation features, depending on a single hyperparameter θ which models the relative importance of fitting higher order to lower order features in the training data. d. Entangled Plaquette States. Limit p → ∞ and θ → 0. The Entangled Plaquette States ansatz (also known in the literature as correlator product states or complete graph tensor networks) decomposes the overall amplitude of a configuration into a product of contributions coming from a set of plaquettes over the degrees of freedom [28,29]. Each possible state local to each plaquette is given its own variational parameter, with the number of parameters therefore growing exponentially with plaquette size. Spanning the space of these features is recovered from the kernel in (A1), in the limit of p → ∞ and θ → 0. We have shown that the limit p → ∞ evaluates to Eq. (A7). For θ → 0, the exponential e −γ 2 S (xi,x j )/2θ in Eq. (A7) tends towards a delta function δ D xix j evaluating to zero unless the local occupations of x around i and x around j are identically equal, up to the set of displacements defined by D. We can hence write the resulting kernel as The combined Kronecker symbols δ xix j δ D xix j only evaluate to one if all the sites within a given cutoff in the two configurations are exactly the same and we can hence write it as δ (xi,xi),(x j ,x j ) . Similar to the Gutzwiller and Jastrow limits, the above kernel can also be rewritten as an explicit scalar product. In this case, the corresponding features are defined by the tensor where now the index s = (s 1 , s 2 , . . . ) runs over all Fock states within the |D| + 1 site plaquette around site i. Linear regression on Φ G (x) gives rise to an entangled plaquette states ansatz, favouring or suppressing a given amplitude depending on each "plaquette" of |D| + 1 sites that is found within the corresponding configuration. It is important to note here that while the parameterization of a direct EPS ansatz grows exponentially with |D| (roughly as 4 |D| ), the number of parameters when modelled indirectly through the above kernel only scales with the size of the data set. The amplitudes of these features are then directly inferred from the weights over the test configurations.
Appendix B: Representability of specific states W state: The GPS can also be used to represent certain many-qubit entangled states central to quantum information theory. As an example, here we show how to represent the entangled W state of n qubits, with the GPS. The W state can be represented using a single training configuration, namely one of the n basis configurations which generate the state, e.g.
and a kernel function with p → ∞ and θ → 0 where the delta function only distinguishes between the two local occupancies |0 and |1 . With this choice, the kernel k(x , x) always either evaluates to 1, if x is one of the n translations of the configurations x which create the W state, or to 0 otherwise. Setting the weight associated with the data configuration to the normalisation w x = ln 1 √ n , we therefore obtain a GPS representation of the W state using only a single data configuration.
Laughlin wavefunction: Modifying the kernel function used also allows representation of other known ansatzes in very compact forms. As an example, we consider the Laughlin wavefunction discretized on a realspace lattice, an ansatz describing n spinless electrons on a two-dimensional lattice of L sites. Expressing the lattice position of site k as a complex number Z k = x k +iy k , where x k (y k ) is the x(y) coordinate of the site, this discretized Laughlin wavefunction can be written as [28,32] where q and α are fixed by the external model and desired state, with indices i, j and k referring to occupied sites in a sampled configuration x. We can express the Laughlin wavefunction as a GPS using a single data configuration with a modified kernel, or equivalently as a product of two GPS forms. This may be of interest as a starting point to build more flexible ansatzes for this problem. We define the modified kernel as which is a linear combination of the one and two-body kernels k 1B and k 2B (with k 1B weighted by the parameter α). The one-body kernel is given as where the sum runs over all sites of the lattice. This definition is equivalent to the local kernelk of Eq. (A2) with p = 1 and θ = 0, and with f (d) = |d| 2 . To define the two-body kernel, we express the displacements in the local kernel as complex values (d x + id y ), and use the kernel of Eq. A1 with p = 1, θ = 0, and a displacement weighting function of f (d) = ln(d x +id y ) q , where translational symmetry over the test configurations is removed, referring instead always to the occupation of a reference site. This gives a two-body kernel as Here the sum is taken over all possible displacements for which the displacement of site i gives a site with larger index, i.e. D = {d : d(i) > i}. Using a single data configuration, x , in which all sites are occupied x = |1, 1, . . . , 1 , the factor e k 2B (x,x ) , reproduces the two-body factor of the Laughlin wavefunction, n i<j (Z i − Z j ) q = e n i<j ln((Zi−Zj ) q ) , and the factor e −α k 1B (x,x ) gives the one-body factor e −α n k=1 |Z k | 2 . With the introduced definition of the kernel function k L , the Laughlin wavefunction is therefore obtained as the GPS representation wave function amplitude for each configuration, t. The goal of the Relevance Vector Machine (RVM) is to find a sparse set of weights w from the training data that optimally represent the state. As a standard modelling assumption, we model the likelihood of the logarithm of the state as a Gaussian with fixed variance where we have defined k(x) = (k(x, x 1 ), . . . , k(x, x Nt )) T . This denotes that we are after a constant precision from the data in the description of the logarithm of the wave function across all amplitudes, governed by the variance of the likelihood distribution, σ 2 . To enforce sparsity, RVM further models the prior distribution of each parameter w i with a normal distribution centred around zero and an independent variance for each parameter of α −1 i . This gives rise to the following sparse prior on the weights Having specified prior and likelihood distributions, the posterior distribution of the weights can be formally written down using Bayes' theorem as Since prior and likelihood are Gaussian, the posterior is also Gaussian and can be computed in closed form [33] p(w | ln Ψ, σ 2 , α) = N (µ, Σ), with µ = σ −2 Σk T ln Ψ (C7) Σ = (σ −2 k T k + A) −1 (C8) The mean µ of the posterior distribution is the best estimate for the value of the parameters w. However, these are not optimized directly, rather the value of the variances σ 2 and α (the so called hyperparameters of the model) are optimized by a procedure called type-II maximum likelihood. This consists of the maximization of the denominator of Eq. (C5), the marginal likelihood, with respect to these hyperparameters. The value of the marginal likelihood (or its logarithm) can also be given in closed form [34]. An efficient algorithm is used, whereby we initialize with an empty data set, and data is 'greedily' added by ensuring the update to a parameter α i gives rise to the maximum increase in the log marginal likelihood. This also guarentees convergence to a local maxima within the RVM. During this optimization, the sparse prior dictates that many α t will be optimized at values of infinity, indicating that the basis configuration x t is not relevant to describe the wave function within the feature space of the kernel. This configuration can be effectively ignored within the ansatz in Eq. (C1), as its features are spanned effectively by the other data points. The maximum of the marginal likelihood with respect to σ 2 can also be found in closed form, and its value is also adjusted simultaneously during the optimization. While this approach will accurately find a sparse data representation, which minimizes the error in the logarithm of the state in Eq. (C1), this does not fully encompass the aim we are trying to achieve. This is because ensuring a constant precision, the σ 2 hyperparameter, on the logarithm of the wave function amplitudes does not correspond to a constant precision on the amplitudes themselves. As an example, a wave function with amplitudes ranging from 10 −5 to 10 −1 will have amplitude logarithms (in base 10) ranging from -5 to -1. Fitting everything to accuracy σ = 0.1 means that, roughly speaking, the largest and the smallest amplitudes will be allowed to fluctuate between 10 −1.1 and 10 −0.9 and between 10 −5.1 and 10 −4.9 respectively. Clearly these fluctuations are much larger for the larger amplitudes, meaning that they will be represented to a lower precision in the true amplitudes compared to the smaller amplitudes. This also results in the selected data set being less sparse than it could be, as small weighted amplitudes are proportionally being very finely resolved.
To overcome this problem, we can let each log amplitude ln Ψ i be represented to a different precision σ 2 i . The precisions σ 2 i can be chosen to precisely compensate for the fact that we are fitting to the logarithm of the state, by ensuring that the likelihood distribution of a given log wave function amplitude has a smaller variance if the magnitude of its expectation is larger. The variance of a given amplitude, Ψ i , under the assumption of Eq. (C2) that its logarithm is normally distributed, is This demonstrates that the variance of Ψ i increases for increasing Ψ i , which we aim to compensate. By assuming that the expectation value for the amplitude is the same as the one given by the data set, we can fix the variance of each amplitude as Var(Ψ i ) =σ 2 , and solve for the amplitude-dependent precision in the likelihood, as This allows us to define a constant variance,σ 2 , to model all wave function amplitudes, and correspondingly use the configuration-dependent σ 2 i above to model the variance in the likelihood distribution of ln Ψ i . We typically set the value ofσ 2 to a specific value to reach a desired accuracy or sparsity with the fit. Alternatively, we can determine a sensible choice forσ 2 for a given data set and kernel choice, as the error incurred on a subset of the data.