Can neural quantum states learn volume-law ground states?

We study whether neural quantum states based on multi-layer feed-forward networks can find ground states which exhibit volume-law entanglement entropy. As a testbed, we employ the paradigmatic Sachdev-Ye-Kitaev model. We find that both shallow and deep feed-forward networks require an exponential number of parameters in order to represent the ground state of this model. This demonstrates that sufficiently complicated quantum states, although being physical solutions to relevant models and not pathological cases, can still be difficult to learn to the point of intractability at larger system sizes. This highlights the importance of further investigations into the physical properties of quantum states amenable to an efficient neural representation.

Introduction.-The exponential complexity of representing general quantum many-body states is a key challenge in computational quantum physics. To simulate systems beyond small sizes tractable by exact diagonalization methods, it is necessary to find an efficient representation of quantum states of interest. This is made possible by the fact that physically relevant states usually possess a high degree of structure, compared to an arbitrary Hilbert space vector. As a prominent example, ground states of local, gapped Hamiltonians exhibit an area law of the entanglement entropy, i.e., an entanglement entropy that scales like the boundary of the subregion instead of its volume. For systems with a low dimensionality, typically 1D, the area law allows for an efficient representation of the wave function as a matrix product state, which can be simulated by algorithms such as the density matrix renormalization group (DMRG) [1][2][3][4][5].
However, many quantum states of physical interest display a volume-law scaling of the entanglement entropy [6], for which generally applicable efficient representations are not known to this date. One class of variational approximations that has been studied to overcome this challenge are neural quantum states (NQS) [7], which are based on an artificial-neural-network representation of the wave function's probability amplitudes [8][9][10] and have shown promising results for the study of discrete lattice models even beyond one dimension [11][12][13][14][15][16][17][18][19]. Notably, it has been shown that a shallow NQS ansatz is able to efficiently represent quantum states featuring volume-law entanglement [20,21], suggesting that this method could complement tensor network techniques for the purpose of uncovering the physics of highly entangled states. Nevertheless, while for matrix product states and more general tensor-network-based approaches it is known how the entanglement scaling limits the representation capabilities of the ansatz [3], there is so far no analogous physical property that directly relates to the ability of an NQS to learn a given quantum state. Universal approximation theorems, which have been proven for several broad classes of neural networks, guarantee that, in the limit of infinite network size, a neural network ansatz can theoretically represent any continuous function to arbitrary precision [22][23][24][25]. Still, these results do not provide bounds on the scaling of the required number of parameters with the system size. For practical applications of NQS, it is thus a central question to determine which classes of quantum many-body states that are impossible to tackle with other established variational ansätze can be efficiently represented.
In this Letter, we investigate the capabilities of NQS based on shallow and deep feed-forward neural networks (FFNNs) to represent ground states of the Sachdev-Ye-Kitaev (SYK) model [26][27][28], which is a paradigmatic model for quantum chaos and non-Fermi liquid behavior [29] and which features a volume-law entanglement in the ground state [30]. We present a systematic study of the representation accuracy achieved by the FFNN as a function of the network hyperparameters. We find an exponential dependence on the system size for the number of network parameters required to learn the SYK ground state. This demonstrates limitations of fully general NQS to learn complicated quantum ground states of physical interest.
Model.-The SYK model describes strongly correlated fermions on L sites and is defined by the Hamiltonian in Eq. (1) [26][27][28], where $\hat c_i^{(\dagger)}$, $i \in \{1, \dots, L\}$, are fermionic ladder operators. The vertices $J_{ij;kl}$ have the symmetries $J^*_{ij;kl} = J_{lk;ji}$ and $J_{ij;kl} = -J_{ji;kl}$ and are random, uncorrelated, all-to-all couplings that are drawn from a Gaussian unitary ensemble (GUE) [31] with mean $\mathbb{E}[J_{ij;kl}] = 0$ and variance $\mathbb{E}[|J_{ij;kl}|^2] = 1$ [29]. Consequently, quantities of physical interest are expectation values over the ensemble of couplings J, which are evaluated after the quantum expectation value. The ground state of the SYK model describes a strongly correlated non-Fermi liquid without quasi-particle excitations [29] that exhibits volume-law entanglement entropy [32,33]. In the thermodynamic limit the model becomes self-averaging and exactly solvable, but despite this exact solvability, the ground state is not a Gaussian state, i.e., not a product of single-particle wave functions [34]. At finite sizes, particularly studied in the context of quantum chaos [35,36] and experimental realizations [37], no exact solutions are known. Different variational ansätze to represent the ground state have been proposed recently [34,38]. Here the model can be analyzed by employing approximations, or numerically, by drawing a set of couplings $\{J^{(n)}\}_{n=1}^{N}$ from the GUE, constructing the corresponding Hamiltonians $\hat H_{\mathrm{SYK}}(J^{(n)})$, and solving for the ground states $|\Psi_{\mathrm{GS}}(J^{(n)})\rangle$. Finally, the properties of interest, such as expectation values, are averaged over this ground state ensemble. Because of the self-averaging property of the SYK model, it suffices to evaluate expectation values for a single realization of J in the thermodynamic limit [29].
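For orientation, one commonly used convention for the complex-fermion SYK Hamiltonian is sketched below. The $(2L)^{-3/2}$ normalization and the omission of a chemical-potential term are assumptions taken from that standard convention and may differ from the normalization used in this work.

```latex
\hat H_{\mathrm{SYK}}(J) = \frac{1}{(2L)^{3/2}} \sum_{i,j,k,l=1}^{L}
    J_{ij;kl}\, \hat c_i^{\dagger} \hat c_j^{\dagger} \hat c_k \hat c_l ,
\qquad
\mathbb{E}\left[J_{ij;kl}\right] = 0 , \quad
\mathbb{E}\left[|J_{ij;kl}|^2\right] = 1 .
```

With the coupling symmetry $J^*_{ij;kl} = J_{lk;ji}$ stated above, an operator of this form is Hermitian, and since every term contains two creation and two annihilation operators it conserves the total fermion number, which is what allows a restriction to a fixed filling.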
Network architecture.-We use a fully-connected FFNN [Figs. 2(a), 3(a)], which is a composition of µ layers $f^{(l)}$, each applying an affine transformation and a scaled exponential linear unit (SELU) activation function φ [39] as pointwise nonlinearity. Each layer has αL neurons, where α is the fixed hidden unit density. The output of the final layer is reduced to a (scalar) log-probability amplitude with respect to the computational basis $\{|x\rangle\}$ by an exponential sum. Here, θ denotes the vector of all variational parameters, which contains all entries of the weight matrices $W^{(l)}$ and bias vectors $b^{(l)}$. The variational parameters and therefore network outputs are complex numbers, with the activation function being applied separately to real and imaginary parts. The total number of network parameters scales as $N_{\mathrm{par}} = O(\mu \alpha^2 L^2)$. We choose the occupation number basis (as has been done in previous NQS studies of fermionic molecular Hamiltonians [40][41][42]) at half filling, which fixes the fermion number to L/2. Therefore, the input to the neural network (2) is a vector of occupation numbers $x \in \{0, 1\}^L$ such that $\sum_i x_i = L/2$. We have verified our results for several variations of this network architecture. In particular, we have evaluated using tanh as nonlinear activation function as well as the addition of skip connections, which can be used to counteract the increased training complexity of networks beyond a certain depth [43,44]. These variations did not achieve better results compared to those presented in the main text. Details can be found in Section III of the supplemental material (SM) [45].
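The architecture described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used for the results: the "exponential sum" is assumed to be a log-sum-exp over the final layer, α is assumed to be an integer, the Gaussian initialization is arbitrary, and the helper names (`init_params`, `log_amplitude`, `n_params`, `hilbert_dim`) are hypothetical.

```python
import cmath
import math
import random

# Standard SELU constants.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(t):
    """Real-valued SELU."""
    return SELU_SCALE * (t if t > 0 else SELU_ALPHA * (math.exp(t) - 1.0))

def complex_selu(z):
    """SELU applied separately to real and imaginary parts, as in the text."""
    return complex(selu(z.real), selu(z.imag))

def init_params(L, alpha, mu, seed=0):
    """Complex weights and biases for mu layers of width alpha*L (assumed init)."""
    rng = random.Random(seed)
    widths = [L] + [alpha * L] * mu
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        s = 1.0 / math.sqrt(n_in)
        W = [[complex(rng.gauss(0, s), rng.gauss(0, s)) for _ in range(n_in)]
             for _ in range(n_out)]
        b = [0j] * n_out
        params.append((W, b))
    return params

def log_amplitude(params, x):
    """log psi_theta(x): affine map + SELU per layer, then log-sum-exp (assumed)."""
    h = [complex(v) for v in x]
    for W, b in params:
        h = [complex_selu(sum(w * v for w, v in zip(row, h)) + bi)
             for row, bi in zip(W, b)]
    m = max(z.real for z in h)  # shift for numerical stability
    return m + cmath.log(sum(cmath.exp(z - m) for z in h))

def n_params(L, alpha, mu):
    """Number of complex parameters; grows as mu * alpha^2 * L^2."""
    width = alpha * L
    return (L * width + width) + (mu - 1) * (width * width + width)

def hilbert_dim(L):
    """Dimension of the half-filling sector, binomial(L, L/2)."""
    return math.comb(L, L // 2)
```

Comparing `n_params(L, alpha, mu)` with `hilbert_dim(L)` gives a quick sense of when a network of a given width and depth holds more parameters than the state vector has amplitudes.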
Optimization.-The network representation of the ground state is obtained by numerically minimizing the overlap difference between the variational state $|\psi_\theta\rangle$ and the ground state $|\psi_{\mathrm{GS}}(J)\rangle$ with respect to the variational parameters θ using Adam [46]. We work with system sizes up to L = 18 sites, which are accessible via exact diagonalization (ED) and thus enable training using a supervised learning (SL) protocol targeting the overlap with the ED ground state $|\psi_{\mathrm{GS}}(J)\rangle$ [47]. The system size allows us to evaluate the loss function (4) by summation over the full Hilbert space (preventing any potential errors arising from Monte Carlo sampling) and to assess the quality of our results using the relative energy error compared to the target ground state energy $E_{\mathrm{GS}}(J) = \langle\psi_{\mathrm{GS}}(J)|\hat H_{\mathrm{SYK}}(J)|\psi_{\mathrm{GS}}(J)\rangle$. Details on the optimization scheme are reported in Section II of the SM [45].
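As an illustration of full-summation training targets, the sketch below evaluates an overlap-based loss and the relative energy error for state vectors stored as plain lists. The specific loss form (one minus the normalized squared overlap) is an assumption chosen for illustration and is not necessarily the loss of Eq. (4); the function names are hypothetical.

```python
def overlap_loss(psi, target):
    """Assumed loss: 1 - |<target|psi>|^2 / (<psi|psi> <target|target>).

    Invariant under rescaling and global phase changes of psi, and
    evaluated by summing over the full basis (no sampling).
    """
    inner = sum(t.conjugate() * p for t, p in zip(target, psi))
    norm_psi = sum(abs(p) ** 2 for p in psi)
    norm_target = sum(abs(t) ** 2 for t in target)
    return 1.0 - abs(inner) ** 2 / (norm_psi * norm_target)

def energy(psi, H):
    """Variational energy <psi|H|psi> / <psi|psi> for a dense Hermitian H."""
    n = len(psi)
    norm = sum(abs(p) ** 2 for p in psi)
    num = sum(psi[i].conjugate() * H[i][j] * psi[j]
              for i in range(n) for j in range(n))
    return (num / norm).real

def relative_energy_error(E, E_gs):
    """Relative deviation of a variational energy E from the ED value E_gs."""
    return abs((E - E_gs) / E_gs)
```

Because the loss is insensitive to norm and global phase, a state such as `[2j * t for t in target]` gives a loss of zero, consistent with the gauge freedoms of the wave function.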
Results.-To start, we discuss the minimum energy error $\delta E_{\min} = \min_{t \in [0, t_{\max}]} \delta E(\theta(t), J)$ reached within a maximum number of iterations $t_{\max}$ of the optimization protocol. We use $\delta E_{\mathrm{threshold}} = 10^{-3}$ as a threshold error to assess successful convergence to the desired ground state. With this threshold, one can see in Figs. 2(b) and 3(b) that at any fixed number of training iterations $t_{\max}$ there is a systematic improvement of the accuracy with respect to increasing both α and µ, as one would expect given the increased representation capabilities of the network at larger sizes.
Next, we determine the minimum number of variational parameters at which the network is able to learn the ground state with the desired energy accuracy $\delta E_{\mathrm{threshold}}$. Especially for the smallest system sizes, there is a clear transition between regimes where the network is able or unable to learn the state (in particular as a function of α in the shallow network). For larger system sizes, it is somewhat more difficult to assess convergence. While very small and very large networks converge to energy errors above and below the desired threshold, respectively, within a reasonable optimization time, there is an intermediate regime where the energy error gets close to the threshold but only converges at very long time scales. In order to systematically identify a value of α or µ at that boundary, we have developed a criterion used to truncate optimization runs after a reasonable optimization time when those runs are predicted to ultimately converge to a δE(θ, J) higher than $\delta E_{\mathrm{threshold}}$. See Section II B of the SM [45] for details.
In Fig. 4 we show the number of network parameters at the critical $\alpha_{\min}$ or $\mu_{\min}$ at which the network is able to reach the target energy accuracy threshold. This allows for a comparison of network expressiveness for both varying width and depth on equal footing. We find that for both the shallow and deep network, an exponentially growing number of parameters is needed to achieve the target energy error. A comparison with the Hilbert space dimension reveals that the network only reaches this threshold once the number of variational parameters exceeds the number of probability amplitudes contained in the respective state vector. Hence, we find that our deep feed-forward NQS ansatz as trained here does not learn a more efficient representation of the SYK ground state than the full state vector representation. It is conceivable, in particular given the fully-connected nature of our ansatz, that there is some redundancy in the learned variational parameters, which could be used to achieve a degree of compression after training. In order to investigate this possibility, we have performed a low-rank approximation based on singular value decomposition of the weight matrices [48], the details of which are reported in Section V of the SM [45]. This analysis, however, has not revealed such a redundancy.
Our scaling results cannot be interpreted as an immediate consequence of the entanglement scaling of the SYK model, as NQS are known to be able to efficiently represent some volume-law quantum states [20], while they seem to fail for others (as shown here). While a particular realization of the SYK Hamiltonian is of significantly higher complexity than a low-dimensional local lattice Hamiltonian (both because of its fully connected structure and the $\propto L^4$ randomly drawn interaction matrix elements), its ground state still exhibits more structure than a random Hilbert space vector. Since it is well known that deep (and, in fact, already two-layer) networks are able to memorize even completely random data once the number of network parameters exceeds the number of data points [49], these results provide evidence that our FFNN ansatz does not learn to utilize any of this structure but only manages to learn the state as unstructured random data. This is in stark contrast to more structured lattice Hamiltonians, where it is clear from previous works that neural quantum states can approximate ground state energies with sub-exponential scaling and thus do manage to make use of structure present in the quantum ground state [50,51], although exponential scaling as a function of real time has previously been found for time-evolved states in a one-dimensional lattice spin model [52]. We have found comparable sub-exponential behavior when evaluating our training procedure on the ground state of the Heisenberg spin model $\hat H_{\mathrm{Heis}} = \sum_{i=1}^{N} \hat{\mathbf{S}}_i \cdot \hat{\mathbf{S}}_{i+1}$ on a one-dimensional chain with periodic boundary conditions, diagonalized in the same zero-magnetization subspace used for the SYK computations. The scaling of the required number of parameters to reach $\delta E_{\mathrm{threshold}}$ in this model is also reported in Fig. 4.
In this case, a relatively small and fixed α = 1 and µ = 2, independent of the system size, are sufficient to reach this threshold, implying a polynomial scaling of the required number of parameters $N_{\mathrm{par}} = O(L^2)$. This corresponds to an effective compression of the information contained in the exact state vector and allows one to study sizes beyond those tractable by full state simulation [7,50]. However, the same approach fails to be useful in the more complex SYK model case.
Discussion.-We have tackled the prototypical SYK model using an NQS variational ansatz, presenting a systematic study of the ability of deep FFNNs to learn the volume-law entangled ground states of this model. Focusing on the scaling of the required number of parameters to describe the ground state to a desired and fixed accuracy, we find that the size of the FFNN ansatz needs to grow exponentially in the system size. With this we show explicitly that the neural network ansatz is unable to efficiently represent SYK ground states in larger systems in spite of general results raising such hopes. We have performed this analysis using a variety of training techniques (as detailed in the SM [45]), showing that the observed scaling is robust to such implementation choices. While the proven capability of random RBMs to represent volume-law quantum states [20,21] indicates that NQS methods have the potential to tackle problems out of the reach of established tensor-network based methods, our results demonstrate that the entanglement entropy is not the property that determines whether or not a physical quantum state can be efficiently represented by an NQS. It remains an intriguing open question which other properties of a physical quantum state determine the efficient applicability of NQS-based methods. NQS ansätze more specifically tailored to fermionic systems could potentially achieve better scaling [42,53]. Studies in this direction would help elucidate to what extent the nonlocal parity structure inherent to fermionic models [54] affects the learnability of the SYK ground state. Separating this influence from other sources of complexity, such as the lack of spatial structure and the disorder induced by random couplings, and thereby exploring the intermediate region between states that can be learned with compression (such as in the Heisenberg and similar spin models) and states that cannot (such as the SYK results presented here) can provide an improved understanding of the complexity of physical quantum states.
Computations were performed on the HPC system Ada at the Max Planck Computing and Data Facility (MPCDF). The authors also gratefully acknowledge computing time granted by the JARA Vergabegremium and provided on the JARA partition part of the supercomputer JURECA at Forschungszentrum Jülich [58] under the project ID enhancerg. We acknowledge support by the Max Planck-New York City Center for Nonequilibrium Quantum Phenomena.
Given the ED solution $|\psi_{\mathrm{GS}}\rangle$, it is possible to train the network to reproduce the correct probability amplitudes (up to norm and phase gauge freedoms) via supervised learning [4]. In practice, we have done so by optimizing the loss function using the Adam optimizer [5]. Another training scheme can be implemented by directly minimizing the expected energy $E(\theta, J) = \langle \hat H_{\mathrm{SYK}}(J) \rangle_{|\psi_\theta\rangle}$ of the system as the loss function. For this loss function, it is possible to extend the training to system sizes beyond the reach of ED studies through variational Monte Carlo (VMC) sampling [6], which is not possible in the SL scheme employed here due to its reliance on the knowledge of the full solution vector. Since all our system sizes could be treated via full summation, we have not relied on VMC sampling in this work. As our goal is to find the network size required to represent the SYK ground states at a given system size, we have compared training runs for both SL and energy minimization routines, finding SL to be the best-performing choice in the regime under consideration (compare Section III B). Therefore, we have selected the SL results for presentation in the main text. Our observations regarding the exponential scaling of required network size hold also for the variational energy optimization runs.
The optimization problem for a neural network is challenging on its own and there is unfortunately no general way to systematically identify the best choice of hyperparameters. Here, we have chosen to study the scaling with regard to width α and depth µ of the network by performing two sets of runs: varying the width for a two-layer network and varying the depth of a network of fixed width α = 4. Considering the trajectory of weights $\{\theta(t)\}_{t \in [t_{\max}]}$ (where $[t_{\max}] := \{1, \dots, t_{\max}\} \subseteq \mathbb{N}$) over the sequence of optimization steps, we can then define the optimal relative energy error as $\delta E_{\min}(\theta, J) = \min_{t \in [t_{\max}]} \delta E(\theta(t), J)$. In Fig. S1 we report $\delta E_{\min}$ for SL protocols for three different numbers of total iterations $t_{\max}$ as a function of α. All the data reported in this plot have been obtained with µ = 2, keeping the learning rate of the Adam optimizer constant throughout all simulations. From this analysis we can make two relevant observations. The first is that up to the network sizes simulated there is a monotonic improvement of the error with increasing width α. The second observation is that at each fixed number of simulation steps there is an exponential scaling of the $\alpha_{\min}$ required to bring $\delta E_{\min}$ below an arbitrary threshold ($10^{-3}$ in our example) as a function of the system size. With the data reported in this paragraph, it cannot be ruled out that this exponential cost in network size is due to an exponential scaling of the number of steps required to bring the small α values to convergence. However, through an analysis of the training curves it is possible to make a strong argument to identify the $\alpha_{\min}$ discussed in the main text as a quantity independent of the number of training steps. This analysis is explained in the following section.

B. Truncation scheme
In the first row of Fig. S2, we report δE(θ(t), J) for four different α values distributed around what we are going to define as the critical $\alpha_{\min}$. The light blue line corresponds to the raw data obtained from the training process. In order to remove the fluctuations around the moving average, we apply a flat window filter over the raw error, obtaining the profile corresponding to the dark blue line. This running average error is smooth and monotonically decreasing. Furthermore, the slope of these curves is also decreasing (in absolute value), implying a slowing rate of convergence of the optimization. In order to conserve computational resources, we truncate the runs that are expected to converge to a δE(θ, J) above the energy threshold of $10^{-3}$ based on the following criterion: Assuming that the absolute value of the slope is monotonically decreasing throughout all the training iterations, it is possible to estimate, at any step t, a lower bound for the number of steps required to reach the error threshold as $t^*_{\min}(t) = [\delta E(\theta(t), J) - \delta E_{\mathrm{threshold}}] / |\partial_t \delta E(\theta(t), J)|$ (S6). Within the assumption of constant slope, the equality $\delta E(\theta(t + t^*_{\min}(t)), J) = \delta E_{\mathrm{threshold}}$ holds. If we now define a control interval ∆t big enough to average out the noise around the training curve, one can check whether a candidate run is trending towards convergence below the error threshold based on the slope of $t^*_{\min}(t)$. The runs that are expected to reach convergence are those that satisfy the criterion (S8). We simulated all runs for at least $t_{\max} = 2 \times 10^5$ steps. For those runs that were still above the threshold at that step, we used the criterion (S8) (with $\Delta t = 10^5$) in order to assess whether they can still be expected to reach the error threshold, based on their current rate of convergence. The runs for which convergence below the threshold was ruled out by our criterion were stopped in order to conserve computational resources, while we performed additional blocks of $10^5$ steps for the remaining simulations, until the truncation criterion (S8) applied or the error threshold was reached. The resulting $\alpha_{\min}$ values reported in Fig. 4 of the main text are the smallest simulated value of α for which convergence below the threshold was reached. The µ scaling analysis was performed in an equivalent way.
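The smoothing and extrapolation steps of this scheme can be sketched as follows. This is a minimal illustration, assuming a simple boxcar filter and a finite-difference slope; the function names and the finite-difference step `dt` are hypothetical, and the acceptance criterion on the slope of the estimated step count is deliberately not implemented here.

```python
import math

def moving_average(values, window):
    """Flat-window filter; returns len(values) - window + 1 smoothed points."""
    acc = sum(values[:window])
    out = [acc / window]
    for i in range(window, len(values)):
        acc += values[i] - values[i - window]
        out.append(acc / window)
    return out

def steps_to_threshold(smooth, t, threshold, dt=1):
    """Constant-slope estimate of the remaining steps to reach the threshold.

    Uses the backward finite difference over dt steps as the slope and
    extrapolates linearly from the smoothed error at step t. If the slope
    magnitude keeps shrinking, the true number of steps is at least this.
    """
    if smooth[t] <= threshold:
        return 0.0
    slope = (smooth[t] - smooth[t - dt]) / dt
    if slope >= 0:
        return math.inf  # error is not decreasing: threshold never reached
    return (smooth[t] - threshold) / abs(slope)
```

For an error curve that decreases linearly, the estimate reproduces the exact remaining step count; for the slowing curves described above it underestimates it, which is why it serves as a lower bound.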

III. OVERVIEW OF ALTERNATIVE TRAINING METHODS INVESTIGATED
In the previous paragraphs we have described the protocol and data analysis necessary to reproduce the results that we discussed in the main text. Here we present alternatives to the training scheme and hyperparameters. The alternatives which we have tested achieved either worse accuracy or no significant improvement compared to the results we selected for the main text.

A. Different learning rate
We have used the Adam optimizer with a constant learning rate throughout the simulations reported in the main text. One advantage of Adam with respect to stochastic gradient descent (SGD) is that it automatically adapts the step size based on the history of loss gradients. Still, Adam has to be initialized with a specific learning rate and this can potentially influence the effectiveness of the training. While determining the optimal learning rate for our simulations, we have also tried schemes involving an update of the learning rate during the simulation. We report an example of an unconverged training curve in Fig. S3 for which, after $4 \times 10^5$ steps at a fixed learning rate, we performed a further $10^5$ steps with three different adjusted learning rates. These changes of learning rate did not result in an improvement of training performance.

B. Variational energy optimization
Besides the SL protocol described above, we have also considered directly optimizing the variational energy $E(\theta, J) = \langle \hat H_{\mathrm{SYK}}(J) \rangle_{|\psi_\theta\rangle}$ (S9). For disambiguation, we refer to this loss function as the variationally optimized energy (VOE). Note that this loss does not involve the ED solution of the model and can be applied to system sizes beyond those limits, if it is used together with MC sampling of the quantum states [6]. However, as we are interested in comparing the performance of the SL and VOE protocols in the ED regime, we only use the loss (S9) obtained by full summation in the following. Within the parameter range tested, the SL protocol achieved systematically lower errors than the VOE loss. As an example of this, we show in Fig. S4 the resulting $\delta E_{\min}$ for a given system size (L = 18) and a fixed number of iteration steps. Although the trend is similar between the two curves, the SL reaches the convergence threshold at lower α values.

C. Deep networks with skip connections
In the main part of this work, we presented an extensive discussion of the results obtained when training a FFNN with the architecture defined in Section I. Among the two different hyperparameter dependencies discussed there, the scaling with the number of layers µ requires some extra attention. It is known that while adding more layers enhances the network's expressive capability, at the same time it can make the training more difficult [7]. One common remedy to this is the addition of skip connections between the layers of the neural network [8]. We report in Fig. S5 a comparison at a fixed number of iteration steps $t_{\max} = 5 \times 10^4$ between the FFNN discussed in the main text and a fully-connected network with skip connections. Specifically, our network is divided into $n_B$ blocks, each containing a fixed number of fully-connected layers. The total number of layers µ is thus $n_B$ times the number of layers per block. After the application of each block, the content of the original input layer (with the first affine transformation applied) is added to the output before it is passed to the following block. For the SL simulation runs considered (Fig. S5), this architecture does not give an improvement in energy error when compared to a simple FFNN. We have found improved convergence (compared to the simple FFNN) of the skip connection networks in some runs using the VOE loss. However, in those cases the number of layers already exceeded the critical µ (around which the network with skip connections still performed worse), so no improvement with regard to compression was obtained this way.

D. Hyperbolic tangent activation function
As already stated, for all the data discussed in the main part, the implemented FFNNs followed the definition presented in Eq. (S1), using SELU as activation function φ. While in principle there are many viable choices of activation function, we opted to restrict our studies to a particular activation in order to investigate extensively the dependence of the representational power on the network hyperparameters while keeping the number of simulations manageable. This required choosing an activation function able to achieve reliable performance for both deep and shallow networks. We present here a comparison of the performance achieved by the same training protocol and network architecture while using another common activation function, the hyperbolic tangent. In Fig. S6, we present training profiles for the relative energy error δE as a function of the iteration step t. We show this for both a network with SELU and a network with tanh activation function for several numbers of layers µ. As one could expect from the vanishing gradient problem that affects tanh, we observe that, also for the network sizes relevant to our problem, above a certain number of layers the tanh-activated networks are no longer able to improve the energy error. At the same time, SELU displayed a monotonic improvement of the energy error with the number of layers (up to scales relevant for our specific problem) and thus is an effective choice (compare Fig. S1).

IV. SYK BIPARTITE ENTANGLEMENT
Following the standard definition, the bipartite entanglement of a density matrix ρ corresponds to $S_{\mathrm{bipartite}} = -\mathrm{Tr}[\rho_A \ln \rho_A]$, where $\rho_{A(B)} = \mathrm{Tr}_{B(A)}[\rho]$ is the reduced density matrix obtained from the density matrix ρ by tracing over the degrees of freedom of the subpartition B(A) of the total Hilbert space, in the special case of dim(A) = dim(B). We compare the scaling of $S_{\mathrm{bipartite}}$ for four independent finite-size SYK model realizations with the Page value [9], which quantifies the entanglement of a pure random state and is known in closed form for a bipartition. As one can see in Fig. S7, the entanglement of the SYK model follows a volume-law scaling that does not saturate the Page value, showing that the SYK states exhibit structure beyond a pure random state.
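For reference, Page's closed-form result for the average entanglement entropy of a random pure state can be evaluated directly. The sketch below takes the two subsystem Hilbert space dimensions as inputs; which dimensions are appropriate for the fixed-particle-number sector used in this work is left open, and the function name `page_entropy` is a hypothetical choice.

```python
import math

def page_entropy(dim_a, dim_b):
    """Average entanglement entropy (natural log) of a random pure state.

    Page's formula for subsystem dimensions m <= n:
        S = sum_{k=n+1}^{m*n} 1/k - (m - 1) / (2 n)
    """
    m, n = sorted((dim_a, dim_b))
    harmonic_tail = sum(1.0 / k for k in range(n + 1, m * n + 1))
    return harmonic_tail - (m - 1) / (2.0 * n)

def page_entropy_large_dim(dim_a, dim_b):
    """Leading large-dimension approximation, ln(m) - m / (2 n)."""
    m, n = sorted((dim_a, dim_b))
    return math.log(m) - m / (2.0 * n)
```

For an equal bipartition of dimension d, the approximation reduces to ln d - 1/2, so a random state falls short of the maximal entropy ln d by only a constant, whereas Fig. S7 shows the SYK ground state staying below this value.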

V. NETWORK COMPRESSION
Trained neural networks can be amenable to compression in order to reduce the total number of parameters. This is usually done for the purpose of reducing memory and computational requirements in order to run a network on a lower-powered device. Here, we have explored a compression scheme based on a lower-rank approximation of the weight matrices in each layer [10] in order to gauge whether any potential redundancy in the learned variational parameters can be identified this way. This is done by performing a singular value decomposition (SVD) $W^{(l)} = U^{(l)} S^{(l)} V^{(l)\dagger}$, where $U^{(l)}$, $V^{(l)}$ are unitary matrices and $S^{(l)} = \mathrm{diag}(\sigma^{(l)}_1, \dots, \sigma^{(l)}_M)$ with $\sigma^{(l)}_1 \geq \dots \geq \sigma^{(l)}_M \geq 0$. We truncate this spectrum by discarding all singular values below a threshold λ relative to the largest singular value, i.e., based on the criterion $\sigma^{(l)}_i / \sigma^{(l)}_1 < \lambda$. For simplicity, we chose λ uniformly for each layer. As a measure of the approximation error, we report in Fig. S8 the relative energy error after truncation as a function of the fraction $q = (\text{singular values retained}) / (\text{total singular values})$ (S13) of singular values retained over all weight matrices of the network for different system sizes and choices of hyperparameters. This data shows that already a limited amount of truncation (0.95 ≤ q < 1.0) results in a significant

FIG. 1. Cartoon representation of the SYK model. Gray circles represent lattice sites and every different color shown has two corresponding lines in total connecting four sites. Each color represents one element of the coupling matrix $J_{ij;kl}$ of the SYK model defined by Eq. (1).
Figure 2(b) shows the dependence of $\delta E_{\min}$ on the network width α for a network with a fixed number of µ = 2 layers, while Fig. 3(b) shows the results as a function of network depth µ for deep networks with constant width α = 4. We select $\delta E_{\mathrm{threshold}} = 10^{-3}$ as the threshold error.

FIG. 2. (a) Shallow fully-connected feed-forward neural network; α denotes the hidden unit density of each layer and thus parametrizes the width of the network. (b), (c), (d) Relative ground state energy error δE as function of the network width α for several system sizes and random initializations after (b) $5 \times 10^4$, (c) $10^5$, and (d) $2 \times 10^5$ simulation steps, respectively. The color of each set of data points corresponds to the average over four independent realizations of the network's initial weights, for the system size L as indicated in the legend. The colored areas give the maximum and minimum values of δE for the four independent runs. Black bars indicate $\delta E_{\mathrm{threshold}} = 10^{-3}$.

FIG. 3. (a) Deep fully-connected feed-forward neural network; µ denotes the number of layers and thus the network depth. (b), (c), (d) Relative energy error δE as function of the number of layers µ for several system sizes and random initializations after (b) $5 \times 10^4$, (c) $10^5$, and (d) $2 \times 10^5$ simulation steps, respectively. The color of each set of data points corresponds to the average over four independent realizations of the network's initial weights, for the system size L as indicated in the legend. The colored areas give the maximum and minimum values of δE for the four independent runs. Black bars indicate $\delta E_{\mathrm{threshold}} = 10^{-3}$.

FIG. 4. Minimum number of parameters $N_{\mathrm{par}}$ required for the FFNN to learn the ground state of the SYK model as function of the system size L. Results are shown for the scaling with network width in a shallow (µ = 2) network (blue lines) and for the scaling with network depth for fixed α = 4 (red line). In both cases, an exponential scaling in the system size is observed, which matches the scaling of the full Hilbert space dimension dim H (dashed line). The $N_{\mathrm{par}}$ scaling for the ground state of the Heisenberg model (blue) and the associated quadratic polynomial law are reported for comparison.
FIG. S1. Relative energy error δE as function of [(a), (b), (c)] the network width α and of [(d), (e), (f)] the number of layers µ for several system sizes and random initializations after [(a), (d)] $5 \times 10^4$, [(b), (e)] $10^5$, and [(c), (f)] $2 \times 10^5$ simulation steps, respectively. Here we report the full dataset from which the average value plotted in the main text has been derived. The color of each set of data points corresponds to the system size L as indicated in the legend, while different shades of the same color belong to independent realizations of the initial network weights.
FIG. S2. (a) Relative energy error δE as function of the iteration step for shallow networks with µ = 2 layers and width α as reported in the respective panels. The system size is L = 16. In light blue we report the raw data, in dark blue the result of a window smoothing filter applied to the raw data. (b) Absolute value of the numerical derivative $|\partial(\delta E)/\partial t|$. (c) Estimated minimum number of steps required for convergence, $t^*_{\min}$, as defined by Eq. (S6).
FIG. S3. (a) Overlap $\langle\Psi_\theta|\Psi_{\mathrm{exact}}(J)\rangle$ and (b) relative energy error δE for a training protocol in the interval $t \in [4 \times 10^5, 5 \times 10^5]$ steps. Here, the same pre-trained network has been updated with the three different learning rates reported in the legend.
FIG. S4. Minimum relative energy error $\delta E_{\min}$ obtained using the SL (green) and VOE (red) loss functions with $t_{\max} = 5 \times 10^4$ steps as function of the network width α. For the data reported in this plot, we considered system size L = 18 and a fixed µ = 2.
FIG. S6. Relative energy error δE as function of the iteration step t using the SL training protocol with SELU (light blue) and tanh (light green) activation functions. The figure shows the data for (a) L = 14, µ = 2; (b) L = 16, µ = 6; and (c) L = 18, µ = 10.
FIG. S7. Bipartite entanglement entropy averaged over 4 independent realizations of the SYK model (error bars show the standard deviation) compared to the bipartite entanglement of a random state given by the Page value (red).