Neural network wave functions and the sign problem

Neural quantum states (NQS) are a promising approach to study many-body quantum physics. However, they face a major challenge when applied to lattice models: convolutional networks struggle to converge to ground states with a nontrivial sign structure. We tackle this problem by proposing a neural network architecture with a simple, explicit, and interpretable phase Ansatz , which can robustly represent such states and achieve state-of-the-art variational energies for both conventional and frustrated antiferromagnets. In the latter case, our approach uncovers low-energy states that exhibit the Marshall sign rule and are therefore inconsistent with the expected ground state. Such states are the likely cause of the obstruction for NQS-based variational Monte Carlo to access the true ground states of these systems. We discuss the implications of this observation and suggest potential strategies to overcome the problem.

Over the last decade, machine learning has had a profound impact on nearly all aspects of life as well as the physical sciences [1].In condensed matter physics in particular, using neural networks as ansätze for quantum many-body wave functions has emerged as an exciting application of machine learning techniques to solve challenging problems.Since the first demonstration [2] of the restricted Boltzmann machine (RBM) [3] as a practical wave function ansatz for obtaining ground states of many-body Hamiltonians using variational Monte Carlo (VMC) techniques, such neural quantum states (NQS), including deep convolutional networks [4][5][6][7], have become an important branch of many-body numerical techniques, competitive with, and sometimes even outperforming, state-of-the-art tensor network (TN) methods.NQS approaches are fundamentally appealing because both RBMs [8] and deep neural networks [9] can represent almost any function accurately, without a priori limitations like the Monte Carlo sign problem [10] or the area law entanglement of TNs [11,12].RBM states have also been successfully used in higher dimensions [2,4] and for chiral topological states [13], both of which pose well-known difficulties to TN techniques.
All promising properties notwithstanding, NQS approaches are not without their own difficulties.In particular, while NQS ansätze can in principle represent nontrivial sign structures, actually learning them appears to pose significant challenges, especially in frustrated systems.This appears less pointedly in RBMs and other shallow, fully connected architectures, which are able to reach low variational energies even for Hamiltonians with a severe sign problem [3,[14][15][16][17][18][19].By contrast, deep convolutional networks (which are desirable for cutting-edge applications due to their better scalability and explicit translational invariance [20]) quite often fail to converge to ground states with near-zero average signs (e.g., of antiferromagnetic or fermionic systems) [4].Attempting to learn such states can generate unphysically rough amplitude profiles, resulting in poor convergence or even complete breakdown of the protocol.Successful variational learning of NQS ground states in these cases requires transforming the Hamiltonian to remove its sign problem, severely limiting the usefulness of the method [4,10,21].
The origins of this "sign problem" remain poorly understood, and its existence is counterintuitive given the success of convolutional networks in a range of machine learning applications [22].Some insight has recently been offered in Ref. 18, which studied the ability of NQS ansätze to reconstruct the exact ground state from partial data in a supervised learning scenario.In certain frustrated phases, a range of architectures show poor generalisation properties, especially when it comes to representing their highly nontrivial sign structures.This also impedes the convergence of VMC algorithms, which rely on reconstructing quantum expectation values based on a small sample of the Hilbert space [23].Further work to understand these phenomena and develop more robust NQS ansätze is therefore crucial to deploy neural network-based VMC algorithms to study many interesting and challenging problems in condensed matter physics.
In this paper, we make a contribution to this quest by introducing a novel NQS architecture, and a corresponding variational optimisation protocol, which can reliably find lowenergy variational states of antiferromagnetic Hamiltonians without any prior knowledge of the sign structure of the ground state wave function.Our ansatz has a simple, explicit, and interpretable phase representation, whose convergence properties improve even upon deep convolutional networks.We benchmark our approach on the spin-1/2 J 1 -J 2 Heisenberg antiferromagnet (HAFM) on the square lattice, and achieve variational energies comparable to the state of the art [4,24,25] both at the unfrustrated point J 2 = 0 and in the fully frustrated quantum spin liquid phase at J 2 /J 1 = 0.5.
In the unfrustrated case, our approach is able to learn the expected Marshall sign rule (MSR) with excellent accuracy.This is a crucial improvement over previous VMC protocols based on convolutional NQS, which suffer from the sign problem even in this simpler case [4].Our approach, therefore, paves the way for systematically studying conventional phases, including critical ones [26], where the area law entanglement of TNs is a serious impediment [11].
In the frustrated case, we again achieve an excellent variational energy; however, we find a state that obeys the same MSR, even though the true ground state is expected to deviate from it significantly.The existence of such "MSR-like" low-energy variational states, and the ease and stability with which the VMC algorithm homes in on them, highlight the risks of using the energy as the only criterion for assessing the Φ σ 2 4 k e r n e l s FIG. 1. Single-layer convolutional network used to represent the phase Φ of the wave function.The spins σ are mapped through convolutional kernels spanning the entire lattice (with periodic boundary conditions).Each entry in the images is then taken as the argument of a unit complex number; Φ is given by the argument of their sum.accuracy of variational wave functions and may explain the poor generalisation properties of supervised NQS learning in frustrated regimes [18].We suggest potential improvements towards the end of this work [27].
Our approach.We start by separating the amplitudes and phases of the wave function into two neural networks [28].For convenience [4,23,29], we take the networks to represent the logarithm of the wave function |ψ : where |σ are σ z basis states, and A(σ) and Φ(σ) are two functions (represented as real-valued neural networks) mapping the basis state to the log-modulus and phase of its complex amplitude, respectively.This ansatz is then optimised in two stages.First, the phases are optimised to minimise the variational energy while keeping the amplitudes of all σ z basis states equal.This allows moving away from the initial guess for Φ (which resembles the ground state of a ferromagnetic Hamiltonian) and approaching the correct phases without scrambling the corresponding amplitude profile.Optimal sign structures depend weakly on the imposed amplitudes, a prime example being the MSR for antiferromagnets on bipartite lattices [30], which minimises the variational energy for any given set of amplitudes.Therefore, a well-converged result of this first stage is expected to be a good initial guess in the second one, where A and Φ are optimised simultaneously to approach the true ground state.
To make use of this protocol, however, the phases Φ have to be represented in a way that can approach the true ground state sign structure starting from an initial guess very far from it.We propose a single-layer convolutional architecture, visualised in Fig. 1, where the activation layer is replaced by summing the convolutional output as phasors: where w n,r and b n are the real-valued weights and uniform biases of the convolutional filters, respectively.
Once an appropriate phase structure is found, good variational energies can readily be obtained using any typical neural network architecture to represent A(σ).Similar to most machine learning tasks [22], we found that deeper, wider networks generally perform better.In our numerical experiments, a sixlayer convolutional network was used, details of which are given in the supplement [31].
We performed the optimisation using stochastic reconfiguration (SR), which approximates the imaginary time evolution of the initial state [23].For neural network quantum states, this protocol has been shown to be superior to stochastic gradient descent and other commonly used algorithms in deep learning [32].SR is described and details of the optimisation are given in the supplement [31].The simulations were implemented using the NetKet library [29].
Numerical experiments.We deployed our method to the spin-1/2 HAFM on the square lattice with nearest and nextnearest neighbour interactions: where J 1 , J 2 ≥ 0 and i j and i j refer to nearest and second neighbour sites, respectively.We considered a 10 × 10 lattice with periodic boundary conditions and set J 1 = 1 without loss of generality.
Our first benchmark was the nearest-neighbour Hamiltonian J 2 = 0.In this case, the sign problem can be cured by rotating all spins on a chequerboard sublattice A of the square lattice by π around the σ z axis: as this makes the coefficients of all off-diagonal terms σ + i σ − j negative.As a result, σ| i ∈ A σ z i |GS is positive (up to an overall phase) for all σ z basis states |σ .In terms of the phase structure Φ(σ), the resulting Marshall sign rule (MSR) can be written as We find that our phase structure ansatz (2) converges reliably to (5) in the first stage of the optimisation, as shown in Fig. 2.This is to be expected as the same sign structure attains optimal variational energy for any set of amplitudes [30], including the uniform one used here.Other convolutional networks, by contrast, fail to approach the MSR, which in turn leads to instabilities in the amplitude optimisation, as shown in the supplement [31].
In the subsequent optimisation of amplitudes and phases, we achieved a variational energy of −0.671275 per spin, 2.7×10 −4 higher than the numerically exact energy given by stochastic series expansion [33].This energy is only slightly above the one attained by the convolutional network of Ref. [4], even though the latter has substantially more variational parameters (7676 compared to our 5145) and is preconditioned with the exact MSR.
We then used our approach to study the fully frustrated phase of the model at J 2 /J 1 = 0.5 [24,25].We achieved a variational energy of −0.494757 per spin, 2.8 × 10 −3 higher than the best energies obtained using Lanczos iterating a Gutzwiller projected fermionic wave function [24].Our result again compares favourably with the best NQS-based variational energy in the literature [34], where the corresponding error is 1.8×10 −3 .Surprisingly, however, we find that the converged phase structure Φ(σ) recovers the MSR to a high accuracy, and no bimodality consistent with having both positive and negative amplitudes can be seen (see Fig. 2).This is at odds with the fact that the frustrated Hamiltonian (3) remains non-stoquastic even after the Marshall transformation (4), and as such its average sign should fall below 1 [31,35].
In addition to variational energies, we evaluated the total spin ì S 2 of our converged variational wave functions, as well as the antiferromagnetic order parameter for ì q = (π, 0) and (π, π), which correspond to stripy and Néel orders, respectively.These, together with a range of stateof-the-art variational energies, are listed in Table I.Since the true ground state of any HAFM is a singlet, deviations from ì S 2 = 0 measure the quality of the converged variational state.In both cases, we achieve similar figures to Ref. [4]; notably, ì S 2 is substantially higher in the frustrated case.Finally, we calculated parity and spatial symmetry quantum numbers, which are shown in the supplement [31].Both wave functions are predominantly parity even, and transform according to the trivial representation of the point group D 4 , consistent with eigenstates of the Hamiltonian being symmetry eigenstates as well.
Discussion.We developed a robust and efficient protocol for finding low energy states with a nontrivial sign structure using convolutional neural quantum states without any prior knowledge on the sign problem of the Hamiltonian.We used an ansatz with two neural networks that represent the amplitudes and phases separately, and optimised it in two stages, first generating an approximate phase structure, from which the entire wave function can readily converge without encountering severe instabilities.We demonstrated our approach by attempting to learn the ground states of the square lattice spin-1/2 J 1 -J 2 HAFM both at the unfrustrated point J 2 = 0 and at J 2 /J 1 = 0.5, inside the fully frustrated spin liquid phase.In both cases, we reached variational energies comparable to the best NQS energies reported in the literature [4]; the difference might be attributed to the smaller number of variational parameters in our ansatz.Importantly, we used a fully convolutional architecture: This automatically imposes translational invariance, a useful inductive bias that speeds up the convergence to a robust state representation [20] and allows for resolving the lowest energy states in different symmetry sectors [14].Furthermore, the convolutional structure reduces the number of variational parameters from the O(N 2 ) typical for RBMs and other fully connected architectures [19,36] to O(N), which keeps VMC algorithms viable for larger system sizes.At J 2 = 0, our phase structure ansatz (2) learns the Marshall sign rule with better generalisation properties than other convolutional networks, both deep and shallow [31], which is crucial for finding ground states reliably [18].A possible origin of the underlying inductive bias is the following: σ + i σ − j exchanges a positive and a negative value of σ r in (1), leading to each phasor changing its phase by ∆φ n,r = 2(w n,r−r i − w n,r−r j ), where the dummy variable r covers the entire lattice.The change of the overall phase Φ is an "average" of these.While the energy of an antiferromagnetic interaction is optimised if ∆Φ = π, this can be realised by a range of distributions of the ∆φ centred on π, suggesting that the MSR can be encoded in a robust way by such an architecture.Indeed, the weights w r produced by VMC in the unfrustrated case show a distinct chequerboard pattern that produces ∆φ ≈ π for all nearest neighbour pairs, consistent with the MSR (see Fig. 3).By contrast, deep neural networks have an inductive bias for functions that only change significantly upon large-scale changes of the input [22,37].While this is desirable for most machine learning applications, it is detrimental for learning a nontrivial quantum phase structure.
In the frustrated case, the same approach fails to find the appropriate ground state sign structure, homing in instead on the MSR.The existence of low-lying variational states with "simple" sign structures deep within frustrated phases parallels the poor generalisation of the corresponding ground states in supervised learning scenarios [18], hinting at a possible bias of NQS ansätze towards such states and the corresponding ubiquity of this "residual sign problem".Other physical observables, such as the antiferromagnetic order parameters (6), also remain surprisingly accurate, highlighting the need to benchmark NQS approaches beyond the usual tests of energy and correlation functions.A notable exception to this observation, J 2 /J 1 GWF [24] GWF+RBM [34] CNN [4] best [24,33] this work ì TABLE I. Variational energies and spin correlation functions attained in this paper compared to other state-of-the-art energies.Our approach consistently outperforms plain Gutzwiller projected fermionic wave functions (GWF) [24], and achieves similar accuracy to the RBM-enhanced GWF of Ref. 34 and the convolutional networks of Ref. 4 (CNN).The "best" energy is obtained using numerically exact stochastic series expansion for J 2 = 0 [33] and Lanczos-corrected GWF for J 2 /J 1 = 0.5 [24].The antiferromagnetic order parameters S 2 ( ì q) of our wave functions are very close to those reported in Ref. 4; ì S 2 is 0 for the true ground state.
Weights w r of a typical convolutional kernel converged for J 2 = 0 (left) and J 2 /J 1 = 0.5 (right).The chequerboard pattern of the former is a direct consequence of the Marshall sign rule [31]; the latter shows an admixture of a stripy antiferromagnetic pattern, consistent with the MSR for the opposite limit, J 1 = 0.
both in our work and in Ref. 4, is the total spin quantum number ì S 2 , which is an order of magnitude higher in the frustrated case.While this may be a straightforward consequence of the gapless ground state [24], it might also be necessary to reconcile low variational energies with the MSR and serve as a signature of the residual sign problem for both ansätze.If this is the case, minimising the variational energy and total spin ì S 2 simultaneously could steer the learning process away from the low-energy MSR-like states towards a more accurate representation of the ground state.
We also note that the representation of the MSR learned in the two cases is starkly different (see Fig. 3, and the supplement [31] for a detailed analysis): While the unfrustrated sign structure attains the simplest possible representation of the MSR, stripe features consistent with the MSR of the opposite unfrustrated limit, J 1 = 0, appear in the frustrated case.This hints at an (ultimately failed) attempt at learning a "compromise" between the two limits, consistent with our qualitative understanding of frustrated phases.More detailed insight into this learning dynamics may open up the possibility of finding similarly simple ansätze with significantly better generalisation abilities.
Finally, we believe that the simplicity, interpretability, and robustness of our phase representation, as well as the insight it affords us about the sign problem of NQS ansätze, make it a useful resource to guide efforts to design novel network architectures that will one day reliably learn frustrated ground states with complex phase structures.The performance of our approach might also be improved substantially by imposing symmetries, either through making the wave function explicitly symmetric [4,14,19,38], or using the variational protocol to project out states that do not have the right symmetry (e.g., ones for which ì S 2 0).Successfully learning the MSR suggests that our method can readily be used for large-scale simulations of conventional phases, including excited states [14,16,19] and gapless or critical systems [26], without the entanglement limitations of tensor networks [12].Most exciting of all, Ref. 18 suggests that the ground state sign structure of certain important geometrically frustrated spin models is amenable to existing NQS ansätze in a supervised learning scenario.We are confident that our network would also be able to learn such ground states variationally, paving the way of using NQS approaches for large-scale simulations of these challenging frustrated magnetic systems.
TCM Group, Cavendish Laboratory, University of Cambridge, Cambridge CB3 0HE, United Kingdom

SI. STOCHASTIC RECONFIGURATION
Stochastic reconfiguration (SR) is a generic method for optimising parametrised trial wave functions |Ψ({α k }) so as to minimise their energy with respect to a Hamiltonian H [1,2].It is commonly used in variational Monte Carlo studies owing to its more reliable convergence compared to other common optimisation protocols, such as stochastic gradient descent [2,3].
The method proceeds by approximating the imaginary time evolution of the trial wave function using Monte Carlo sampling.Namely, given |Ψ({α k }) , we want to find a new set of the real parameters where η is a small positive number playing the role of the learning rate in machine learning language.Since we only want to project out all excited states, the Trotterisation error in (S1) is irrelevant.|Ψ is optimised by maximising the overlap of the (unnormalised) wave functions |Ψ exact and |Ψ : To linear order in both η and δα k , the condition In order to find δα k numerically, we want to rewrite the expectation values in (S3) as Monte Carlo averages with respect to the quantum probability distribution p(σ) = σ|Ψ 2 / Ψ|Ψ .This can readily be done by inserting a resolution of the identity; for example, we have where we introduce and the local energy The expectation value (S4) can now be estimated as the Monte Carlo average of E * loc O k for samples distributed according to p(σ); the others follow from analogous considerations, resulting in To improve numerical stability, the covariance is calculated as Since the covariance matrix S depends entirely on the parametrisation of the wave function rather than its energy under the Hamiltonian, it can be thought of as a metric tensor on the parametrised Hilbert space [1].Eq. (S7) is thus analogous to the natural gradient approaches used to stabilise gradient descent in other machine learning contexts [3].We note that equivalent expressions can be derived for trial wave functions that are (piecewise) analytic functions of complex parameters [1,4].The result is identical to (S7), omitting the real-part signs.
In our setup, log σ|Ψ = A(σ, {κ}) + iΦ(σ, {λ}), where both A and Φ are real-valued functions of real parameters.It follows that O κ is real for all κ and O λ is imaginary for all λ, and so all entries of the covariance matrix S connecting a κ and a λ vanish.This allows us to solve (S7) for the δκ and the δλ separately, speeding up the algorithm.
Solving (S7) for δα k requires inverting the covariance matrix S, which, while positive semidefinite, tends to be illconditioned even for a large number of Monte Carlo samples [1].This can be resolved either by using the pseudoinverse, or, more commonly, by adding a small positive constant to the diagonal entries in order to make the matrix invertible.In our case, however, the entries of the S matrix corresponding to the parameters of the amplitude and phase have vastly different values (separated by up to eight orders of magnitude).To keep the optimisation of both parts viable, we add 10 −5 times the average diagonal entry (i.e., the trace divided by the number of parameters) to both blocks of the matrix:  S1.Convergence of our optimisation scheme with various neural network ansätze (described in Secs.SII and SIII) for the nearestneighbour (top panel) and the J 2 /J 1 = 0.5 (bottom panel) square lattice HAFM.The shaded area shows the full spread of energy estimates used by the SR algorithm, the thicker lines show 100-step moving averages.The background shading indicates the learning rate η (white -0.001, yellow -0.01, purple -0.05).The phase structure described in the main text (blue curve) converges reliably to energies close to the true ground state; other ansätze (red and green curves; see Sec.SIII for details), both shallow and deep, fail to reach either a consistent variational energy, or one close to the ground state.Energies are compared to exact stochastic series expansion for J 2 = 0 [5] or Lanczos-corrected Gutzwiller projected fermionic wave functions (GWF) for J 2 /J 1 = 0.5 [6].For reference, variational energies are also shown for the convolutional network of Ref. [7] (CNN), plain [6] and RBM-improved [8] GWF, as well as DMRG [9] for the frustrated case.

SII. DETAILS OF NUMERICAL EXPERIMENTS
We used a six-layer convolutional neural network to represent the amplitude structure A(σ).The layers consist of 8, 7, 6, 5, 4, and 3 10 × 10 lattice replicas, respectively, which are connected by convolutional filters with real-valued kernels spanning 4 × 4 sites in periodic boundary conditions, and (real-valued) ReLU activation functions: The amplitude is given by the modulus of the product of all entries in the last convolutional layer.Since NetKet is implemented using the logarithm of wave functions [10], this is achieved using a final activation function f 6 (x) = ln |x|, followed by summing all entries.All convolutional layers before the last one are initialised with Gaussian distributed random numbers of zero mean, and standard deviation chosen so as to preserve the typical magnitude of backpropagated derivatives [11].The last set of kernels are initialised with a uniform bias of 1.0 and Gaussian distributed kernel entries with standard deviation 2 × 10 −4 .This results in amplitudes uniformly close to 1 upon initialisation.
The phases Φ(σ) are given by the ansatz described in the main text.We employ 24 lattice replicas.Similarly good results are achieved using fewer replicas, but the wider network allows for faster and more reliable convergence.The convolutional filters are initialised with Gaussian distributed random numbers of standard deviation 0.043.
The ansatz is converged to the ground state in two stages.First, the phases are optimised with a uniform amplitude distribution (i.e., setting A ≡ 0): 10 000 such SR steps with learning rate η = 0.01 were followed by 10 000 steps with η = 0.05.Next, both A and Φ were optimised starting from the phase distribution achieved in the first stage: for this, we used 5 000 steps with η = 0.001, 5 000 steps with η = 0.01, and, finally, 50 000 steps with η = 0.05.The learning rate was increased during the optimisation, because the imaginary time evolution emulated by SR results in infinitesimal temperature (and thus energy) reduction close to the ground state.In both stages, the Monte Carlo averages in (S7) were evaluated using 5 000 samples obtained via the Metropolis-Hastings algorithm as implemented by NetKet [10].
The convergence of this ansatz to the ground state is shown by the blue curves in Fig. S1.The first stage quickly attains an approximately constant minimum of variational energy.We find, however, that any residual optimisation speeds up the next stage significantly, which is desirable as a single, simple neural network can be evaluated an order of magnitude faster than the full ansatz.The second stage also reaches a nearly converged variational energy in about 20 000 steps; however, the variational energy is further reduced slightly throughout the procedure.
Once the wave function had converged, the distribution of phases was estimated using 100 000 samples, while the expectation values of the Hamiltonian and the spin correlators defined in the main text were evaluated using 1 000 000 samples.For the latter, we exploited the translational invariance of the wave function to rewrite them as where both sums include i = 0 (note, however, that ì σ 0 • ì σ 0 ≡ 3/4).

SIII. COMPARISON OF PHASE STRUCTURE ANSÄTZE
To demonstrate the advantage of our phase structure ansatz over other convolutional networks, we considered two alternative architectures: 1. 24 convolutional filters spanning the entire lattice, followed by a ReLU activation layer and summation 2. The architecture used for the amplitudes, except for the last layer, where the ln |x| activation is also replaced by ReLU.
Amplitudes are encoded using the same ansatz as described in Sec.SII.The performance of these architectures under the protocol described in Sec.SII is shown by the green and red curves in Fig. S1, respectively.In the first stage, neither of them approach the optimal variational energy found with the phasor sum ansatz; subsequently, the amplitude network also fails to approach the ground state, even though it is capable of representing it closely, as found previously.We also point out that as the Monte Carlo sampling is restarted after changing the learning rate η, the estimates of the variational energy change substantially, leading to discontinuities in Fig. S1.This indicates that the amplitude structure had developed several unphysically strong peaks, which make subsequent Monte Carlo sampling unable to recover the correct wave function.
In the unfrustrated case, we also probe the phase structures learned by the various ansätze by comparing them directly to the exact Marshall sign rule.The distribution of the differences Φ − Φ MSR is shown in Fig. S2.While the architecture used in the main paper learns the MSR to a high accuracy (up to an irrelevant overall phase), the alternatives produce essentially random phases.

SIV. KERNELS OF THE GROUND STATE PHASE STRUCTURES
As discussed in the main text, the change in the phase ansatz upon exchanging an up and a down spin separated by R is a kind of average of the change ∆φ = 2(w n,r+R − w n,r ) in each elementary phase that enters (S10).(The factor of 2 is due to representing up and down spins as ±1 rather than ±1/2.)Since the energy of an antiferromagnetic interaction +Jσ + i σ − j is optimised if ∆Φ = π upon exchanging spins i and j, we expect the corresponding set of ∆φ to have a distribution centred upon π, especially in the unfrustrated case, where all interactions in the Hamiltonian can be optimised simultaneously.
In particular, one can represent the Marshall sign rule corresponding to the unfrustrated limit J 2 = 0 exactly by requiring that all ∆φ ≡ π (mod 2π), i.e., w r+R − w r ≡ π/2 (mod π) for all nearest neighbour pairs, or R = x, ŷ.This is readily achieved by a chequerboard pattern of weights w and w + π/2, provided both sides of the lattice are of even length.Indeed, the phase of each elementary phasor in this setup is where N is the number of lattice sites, A and B are the two chequerboard sublattices of the lattice, and we repeatedly use the fact that σ z = 0.In the last line, we also note that the total number of down spins (which is measured by the sums for each sublattice) is N/2, which is even; therefore, the parity of the sum for the two sublattices is the same.That is, all terms in (S10) have the same phase, which is also the MSR up to an overall phase πN/4; it follows that Φ equals this phase, and thus recovers the MSR.
Beyond the nearest neighbour case, there is a Marshall sign rule corresponding to the other unfrustrated limit, J 1 = 0, arising from the requirement that ∆Φ = π upon exchanging next-nearest neighbour up and down spins.At this point, the two chequerboard sublattices decouple, each being effectively a square lattice with diagonal axes.Using the above construction, appropriate kernels w r can be found for both of them; however, the offset between the two sublattices is arbitrary, as it only contributes to an unimportant overall phase.Upon reintroducing a weak nearest-neighbour coupling, however, the model develops a stripy antiferromagnetic order [6,9].Consistently, the sublattices in the convolutional weights w are expected to lock so as to form rows or columns with equal w, shifted from one another by π/2 (mod π).
We now consider the convolutional kernels generated by the variational Monte Carlo protocol both in the unfrustrated limit and at J 2 /J 1 = 0.5.The weights w n,r are plotted for each kernel in Figs.S3 and S4, respectively.In both cases, we also plot the differences between horizontally and vertically nearest-neighbour kernel entries.
At J 2 = 0, the chequerboard pattern derived for the exact MSR can be seen in almost all kernels to a very good approximation; consistently, ∆φ ≈ π upon exchanging nearest neighbour spins, both horizontally and vertically.This suggests that this representation of the MSR is especially stable; presumably, it sits at the bottom of a wide basin of the learning landscape, making it easy to find for optimisation algorithms [13].The only surprising feature in Fig. S3 is a kernel that develops a three column wide "topological fault", in which the Marshall-adjusted convolutional weights wind around the vertical direction.This results in a number of ∆φ far from the desired π.The large number of kernels, however, allows these to be corrected through slight deviations of the other kernels from the exact MSR.Furthermore, the additional kernels probably play a key role in eliminating such detrimental structures: Unwinding a "topological fault" requires large-scale changes in the individual phasors, which substantially increase the variational energy, unless corrected for by the other kernels.Indeed, our attempts to use a single   [15].convolutional kernel in (S10) were plagued by robust "topological faults" spanning the entire kernel, leading to convergence far above the ground state energy.
The kernels obtained in the frustrated case are substantially more complex, with many "topological faults" and some kernels that show no discernible pattern.Many kernels, however, retain the chequerboard pattern consistent with the MSR, and the ∆φ upon exchanging nearest neighbour spins vertically are clearly dominated by values close to π.Both of these are consistent with the fact that the kernels represent the MSR rather than the true frustrated sign structure.Nevertheless, we observe several columns of the stripy pattern consistent with the MSR of the J 1 = 0 limit, as discussed above.These result in ∆φ ≈ 0 for horizontal nearest neighbour exchanges, leading to a much more varied pattern in these differences.It is surprising that this more diverse distribution has no apparent effect on the overall sign structure learned by the network.Furthermore, the striking difference between the horizontal and the vertical direction is not warranted by any fundamental property of the ansatz, so it might hint at spontaneous point group symmetry breaking at higher variational energies that ultimately leads to learning an incorrect sign structure.A more detailed analysis of the learning dynamics is necessary to better understand and overcome any such problems.

SV. AVERAGE SIGNS IN EXACT DIAGONALISATION
To estimate the average Marshall-adjusted sign of the true ground state of the 10 × 10 frustrated model, we obtained the exact ground states for 4×4 and 4×6 lattices using the Lanczos algorithm as implemented in SciPy [14], and calculated the average sign both for the original and the Marshall-adjusted ground states, |GS and i ∈ A σ z i |GS .These average signs are given in Table SI, together with an extrapolation to the 10 × 10 lattice, assuming an exponential decay of both s [15].The average sign of the original wave function decays extremely fast and becomes negligibly small for our lattice size.By contrast, the average Marshall-adjusted sign remains close, but clearly distinct from, 1.For a 10 × 10 lattice, the expected average sign is ≈ 0.88: Since this is the difference of the statistical in, the fully converged NQS wave functions in the unfrustrated limit J 2 = 0 and for J 2 /J 1 = 0.5.Both states are predominantly parity even and transform according to the trivial representation A 1 ; the weight of states with odd parity and/or different spatial symmetry is ≈ 0.001 in the unfrustrated case and ≈ 0.01 in the fully frustrated case, consistent with the better energy convergence of the former.
weights of positive and negative amplitude states, we expect those to be about 94% and 6%, respectively.

SVI. SYMMETRY QUANTUM NUMBERS
Further to the total spin quantum number and the antiferromagnetic order parameters considered in the main text, we also evaluated expectation values of parity and spatial symmetry operators.Since these all commute with the Hamiltonian, the ground state should be an eigenstate of all of them [or a set of degenerate states transforming according to an irreducible representation (irrep) of the nonabelian symmetry group].Therefore, the expectation values of these unitary operators provide us with a measure of how well the numerically obtained state captures key properties of the true ground state.This is especially interesting for the D 4 point group symmetries of the frustrated case, where the starkly different pattern of ∆φ for horizontal and vertical directions hints at the possibility of spontaneous breaking of rotational and mirror invariance.
The parity operator P = i σ x i commutes with all point group symmetries as well as the Hamiltonian.Therefore, the parity of a wave function is fully characterised by the expectation value P , without any possibility of symmetryprotected degeneracies.We evaluated this expectation value using 4 000 000 Monte Carlo samples of the fully converged wave function in both cases, as shown in Table SII.Both are predominantly parity even, with a small admixture of oddparity states (≈ 0.0008 at J 2 = 0 and ≈ 0.005 at J 2 /J 1 = 0.5, cf.Sec.SV).
In the case of point group symmetries, symmetry-protected degeneracies may arise since the symmetry group D 4 is nonabelian, limiting the usefulness of symmetry operator expectation values.Instead, we evaluated the statistical weight of eigenstates transforming according to the different irreps α of the symmetry group, using the projection operators [16]  where we used the fact that the projector (S13) is Hermitian and squares to itself.The weights of all five irreps were evaluated using the same samples as above and are listed in Table SII for both fully converged ground states.They transform predominantly according to the trivial irrep A 1 , with small admixtures from other ones (≈ 0.0014 at J 2 = 0 and ≈ 0.011 at J 2 /J 1 = 0.5).This clearly shows that our ansatz breaks neither spatial nor parity symmetry, and thus recovers physical properties of the true ground state to a good approximation.
FIG.S3.Weights w n,r of all convolutional kernels n converged for J 2 = 0 (top three rows), and the differences w n,r+x − w n,r and w n,r+ŷ − w n,r (middle and bottom three rows, respectively).All but one kernels show a clear chequerboard pattern, which results in ∆φ ≈ π upon exchanging a neighbouring up and down spin, consistent with the Marshall sign rule.In the only exception (fifth kernel of the third row), a "topological fault" spanning three columns appears, causing some ∆φ to wind from 0 to 2π: these have little effect on the overall ∆Φ, and might persist due to a topologically invariant winding number.The kernel shown in the main text is the first one in the second row.(Perceptionally uniform colour maps were chosen following Ref.12).

TABLE SI .
Average sign s and Marshall-adjusted sign s MSR of the ground state of the frustrated model for 4 × 4 and 4 × 6 lattices, calculated by exact diagonalisation.The 10×10 lattice is extrapolated from these, assuming that s decays exponentially in the number of spins

TABLE SII .
Average parity P of, and statistical weight of the different irreps of the point group D 4