Ab initio solution of the many-electron Schrödinger equation with deep neural networks

Given access to accurate solutions of the many-electron Schrödinger equation, nearly all chemistry could be derived from ﬁrst principles. Exact wave functions of interesting chemical systems are out of reach because they are NP-hard to compute in general, but approximations can be found using polynomially scaling algorithms. The key challenge for many of these algorithms is the choice of wave function approximation, or Ansatz, which must trade off between efﬁciency and accuracy. Neural networks have shown impressive power as accurate practical function approximators and promise as a compact wave-function Ansatz for spin systems, but problems in electronic structure require wave functions that obey Fermi-Dirac statistics. Here we introduce a novel deep learning architecture, the Fermionic neural network, as a powerful wave-function Ansatz for many-electron systems. The Fermionic neural network is able to achieve accuracy beyond other variational quantum Monte Carlo Ansatz on a variety of atoms and small molecules. Using no data other than atomic positions and charges, we predict the dissociation curves of the nitrogen molecule and hydrogen chain, two challenging strongly correlated systems, to signiﬁcantly higher accuracy than the coupled cluster method, widely considered the most accurate scalable method for quantum chemistry at equilibrium geometry. This demonstrates that deep neural networks can improve the accuracy of variational quantum Monte Carlo to the point where it outperforms other ab initio quantum chemistry methods, opening the possibility of accurate direct optimization of wave functions for previously intractable many-electron systems.


I. INTRODUCTION
The ground state wavefunction ψ(x 1 , x 2 , . . ., x n ) and energy E of a chemical system with n electrons may be found by solving the time-independent Schrödinger equation, Ĥψ(x 1 , . . ., x n ) = Eψ(x 1 , . . ., x n ) (1) where x i = {r i , σ i } are the coordinates of electron i, with r i ∈ R 3 the position and σ i ∈ {↑, ↓} the spin, ∇ 2 i is the Laplacian with respect to r i , and R I and Z I are the position and atomic number of nucleus I.We work in the Born-Oppenheimer approximation, 6 where the nuclear positions are fixed input parameters, and Hartree atomic units are used throughout.The Schrödinger differential operator is spin independent but the electron spin matters because the wavefunction must obey Fermi-Dirac statistics -it must be antisymmetric under the simultaneous exchange of the position and spin coordinates of any two electrons: ψ(. . ., x i , . . ., x j , . ..) = −ψ(. . ., x j , . . ., x i , . ..).
Many approaches in quantum chemistry start from a finite set of one-electron orbitals φ 1 , . . ., φ N and expand the many-electron wavefunction as a linear combination of antisymmetrised tensor products (Slater determinants) of those functions: where {φ k 1 , . . ., φ k n } is a subset of n of the N orbitals, the sum in Eqn. 2 is taken over all permutations P of the electron indices, and the sum in Eqn. 3 is over all subsets of n orbitals.The difficulty is that the num-ber of possible Slater determinants rises exponentially with the system size, restricting this "full configurationinteraction" (FCI) approach to tiny molecules, even with recent advances. 7o address problems of practical interest, a more compact representation of the wavefunction is needed.The choice of function class used to approximate the wavefunction is known as the wavefunction Ansatz.For most applications of quantum Monte Carlo (QMC) methods to quantum chemistry, the default choice is the Slater-Jastrow Ansatz, 8 which takes a truncated linear combination of Slater determinants and adds a multiplicative term -the Jastrow factor -to capture close-range correlations.The Jastrow factor is normally a product of functions of the distances between pairs or triplets of particles.There are many alternative Ansätze, 9,10 but for continuous-space many-electron problems in three dimensions the Slater-Jastrow Ansatz remains the default.
Here, we greatly improve the accuracy of the variational quantum Monte Carlo (VMC) method by using a neural network we dub the Fermionic Neural Network, or Fermi Net, as a more flexible Ansatz.This avoids the use of finite basis set, a significant source of error for other Ansätze, and models higher-order electron-electron interactions compactly.The use of neural networks as a compact wavefunction Ansatz has been studied before for lattice spin systems 4,11,12 and small systems of bosons in continuous space. 13Applications of neural networks to continuous fermionic systems have been limited to date, presumably due to the complexity of Fermi-Dirac statistics.Existing work has been restricted to very small numbers of electrons, 14 or has been of very low accuracy. 15nlike these other approaches, we use the Slater determinant as the starting point for our Ansatz, and then extend it by generalising the single-electron orbitals to include generic exchangeable nonlinear interactions of all electrons.
The Fermi Net is not only an improvement over existing Ansätze for VMC, but is competitive with and in some cases superior to more sophisticated quantum chemistry algorithms.Projector methods such as diffusion quantum Monte Carlo (DMC) 8 and auxiliary field quantum Monte Carlo (AFQMC) 16 generate stochastic trajectories that sample the ground state wavefunction without the need for an explicit representation, although accurate explicit trial wavefunctions are still required for good performance and numerical stability.We find the Fermi Net is competitive with projector methods on all systems investigated, in contrast with the conventional wisdom that VMC is less accurate.Coupled cluster methods 5 use an Ansatz that multiplies a reference wavefunction by an exponential of a truncated sum of creation and annihilation operators.This proves remarkably accurate for equilibrium geometries, but conventional reference wavefunctions are insufficient for systems with many low-lying excited states.We evaluate the Fermi Net on a variety of stretched systems and find that it outperforms coupled cluster in all cases.

A. Fermionic Neural Network Architecture
To construct an expressive neural network Ansatz, we note that nothing requires the orbitals in the matrix in Eqn. 2 to be functions of the coordinates of a single electron.The only requirement for the determinant of a matrix-valued function of x 1 , x 2 , . .., x n to be antisymmetric is that exchanging any two input variables, x i and x j , exchanges two rows or columns of the output matrix, leaving the rest invariant.This observation allows us to replace the single-electron orbitals φ i (x j ) in the determinant by multi-electron functions φ i (x j ; x 1 , . . ., x j−1 , x j+1 , . . ., x n ) = φ i (x j ; {x /j }), where {x /j } denotes the set of all electron states except x j , so long as these functions are invariant to any change in the order of the arguments after x j .The construction of a set of these permutation-equivariant functions with a neural network is the main innovation of the Fermi Net.A diagram of the full Fermi Net architecture is given in Figure 1.
The Fermionic Neural Network takes features of single electrons and pairs of electrons as input.As input to the single-electron stream of the network, we include both the difference in position between each electron and nucleus r i − R I and the distance |r i − R I |.The input to the two-electron stream is similarly the differences r i −r j and distances |r i − r j |.Adding the absolute distances between particles directly as input removes the need to include a separate Jastrow factor after a determinant.As the distance is a non-smooth function at zero, the neural network is capable of expressing the non-smooth behavior of the wavefunction when two particles coincidethe wavefunction cusps.Accurately modeling the cusps is critical for correctly estimating the energy and other properties of the system.We denote the concatenation of all features for one electron h 0 i , or h 0α i if we explicitly index its spin α ∈ {↑, ↓}; the features of two electrons are denoted h 0 ij or h 0αβ ij .Intermediate layers of the Fermionic Neural Network mix information together in a permutation-equivariant way, by taking the mean of activations from different streams of the network.These mean activations are then concatenated together and appended to the singleelectron streams of the network.For a single layer this is a purely linear operation, but when combined with a nonlinear activation function after each layer it becomes a flexible architecture for building permutationequivariant functions 17 .Information from both the other one-electron streams and the two-electron streams are fed into the one-electron streams.However, to reduce the computational overhead, no information is transferred between two-electron streams -these are multilayer perceptrons running in parallel.If the outputs of the oneelectron stream at layer with spin α are denoted h α i and outputs of the two-electron stream are h αβ ij , then Output of layer Input to layer ``+ 1  . . .

1"
n"1 ⌦ FIG.1: The Fermionic Neural Network (Fermi Net).Top: Global architecture.Features of one or two electron positions are inputs to different streams of the network.These features are transformed through several layers, a determinant is applied, and the wavefunction at that position is given as output.Bottom: Detail of a single layer.The network averages features of electrons with the same spin together, then concatenates these features to construct an equivariant function of electron position at each layer.
the input to the one-electron stream for electron i with spin α at layer + 1 is which is the concatenation of the mean activation for spin up and down parts of the one and two electron streams, respectively.This concatenated vector is then input into a linear layer followed by a tanh nonlinearity.A residual connection is also added between layers of the same shape, for both one and two electron streams.
After the last intermediate layer of the network, a final spin-dependent linear transformation is applied to the activations, and the output is multiplied by a weighted sum of exponentially-decaying envelopes, which enforces the boundary condition that the wavefunction goes to zero far away from the nuclei: where ᾱ is ↓ if α is ↑ or vice versa, h Lα j is an output from the L-th (final) layer of the intermediate single-electron features network for electrons of spin α, and w kα i (g kα i ) are the weights (biases) of the final linear transformation for determinant k.The learned parameters π kα im and Σ kα im ∈ R 3×3 control the anisotropic decay to zero far from each nucleus.The functions {φ kα i (r α j )} are then used as the input to multiple determinants, and the full wavefunction is taken as a weighted sum of these determinants: Eq. 6 uses the fact that the full determinant det[Φ] = det φ i (x j ; {x /j }) may be replaced by a product of spin-up and spin-down terms since the spin does not appear explicitly in the Schrödinger equation 8 : The new wavefunction is only antisymmetric under exchange of electrons of the same spin, {r ↑ } or {r ↓ }, but nevertheless yields correct expectation values of spinindependent observables and the fully antisymmetric wavefunction can be reconstructed if required.This factorisation allows spin-dependence to be handled explictly rather than as input to the network.

B. Wavefunction Optimisation
As in the standard setting for wavefunction optimisation for variational Monte Carlo, we sought to minimise the energy expectation value of the wavefunction Ansatz: , where θ are the parameters of the Ansatz, Ĥ is the Hamiltonian of the system as given in Eqn.I, and X = (x 1 , . . ., x n ) denotes the full state of all electrons.As Ĥ is time-reversal invariant and Hermitian, its eigenfunctions and eigenvalues are real.If the minimisation is taken over all real normalisable functions, the minimum of the energy occurs when ψ θ (X) is the ground-state eigenfunction of Ĥ; for a more restricted Ansatz, the minimum lies above the ground-state eigenvalue.When samples from the probability distribution defined by the wavefunction Ansatz p(X) ∝ ψ 2 θ (X) are available, unbiased estimates of the gradient of the energy with respect to θ can be computed as follows: where E L is the local energy and we have dropped the dependence of ψ on θ for clarity.For all wavefunction Ansätze used in this paper, the determinants were computed in the log domain, and the final network output gave the log of the absolute value of the wavefunction, along with its sign.The local energy was computed directly in the log domain using the formula: where V (X) is the potential energy of the state X and the index i runs over all 3N dimensions of the electron position vector.
To optimise the wavefunction, we used a modified version of Kronecker Factorised Approximate Curvature (KFAC), 18 an approximation to natural gradient descent 19 appropriate for neural networks.KFAC maintains the advantage over first order methods that other second order and hybrid methods have for wavefunction optimisation 20 (see Figure 2), but unlike exact second order methods, KFAC scales linearly in the number of layers of a neural network.Natural gradient descent updates for optimising L with respect to parameters θ have the form δθ ∝ F −1 ∇ θ L(θ), where F is the Fisher Information Matrix (FIM): This is equivalent to stochastic reconfiguration 21 when the probability density is unnormalised (see Supplementary Information) and closely related to the linear method of Toulouse and Umrigar. 22or large neural networks with thousands to millions of parameters, solving the linear system Fδθ = ∇ θ L becomes impractical.KFAC ameliorates this with two approximations.First, any terms F ij are assumed to be zero when θ i and θ j are in different layers of the network.This makes the FIM block-diagonal and significantly more efficient to invert.The second approximation is based on the structure of the gradient for a linear layer in a neural network.If W is the weight matrix for layer of a network, then the block of the FIM for that weight is, in vectorised form: where a are the forward activations and e are the backward sensitivities for that layer.KFAC approximates the inverse of this block as the Kronecker product of the inverse second moments: Further details can be found in Martens and Grosse (2015). 18he original KFAC derivation assumed the density to be estimated was normalised, but we wish to extend it to stochastic reconfiguration for unnormalised wavefunctions.In the supplementary information, we show that if we only have access to an unnormalised wavefunction, terms in the FIM can be expressed as: where O i = ∂log|ψ| ∂xi .The terms in the FIM for the weights of a linear neural network layer would then be: For clarity, a moving average of the energy is given over the last 1000 iterations.Note that the small dip early in optimisation with KFAC is due to the slow equilibration of the MCMC chain and goes away with a larger Metropolis-Hastings proposal step size.
We use a similar approximation for the inverse to that of conventional KFAC, replacing the uncentered second moments with mean-centered covariances: where

C. Experimental Setup
For all experiments, a Fermionic Neural Network with four layers was used, not counting the final linear layer that outputs the orbitals.Each layer had 256 hidden units for the one-electron stream and 32 hidden units for the two electron stream.A tanh nonlinearity was used for all layers, as a smooth function is needed to guarantee that the Laplacian is well defined and nonzero everywhere.16 determinants were used where not otherwise specified.For comparison, the conventional VMC results in Table I 23 use 50 configuration state functions (CSF).While the exact number of determinants in a CSF will depend on the system, generally this will be on the order of hundreds to thousands of determinants.With this configuration of the Fermi Net there were approximately 700,000 parameters in the network, although the exact number depends on the number of atoms in the system due to the way we construct the input features and exponentially-decaying envelope.
For the baseline Slater-Jastrow network , multilayer perceptrons with 3 hidden layers of 128 units were used for the orbitals.The same multiplicative envelope employed in the Fermionic Neural Network was included; this can be seen as an extension to the electron-nuclear Jastrow factor.The Jastrow factor is of the standard form: 24 where χ j , u αβ and f αβ k are all 3-layer perceptrons with 64 hidden units.
To sample from ψ 2 (X) we used the standard Metropolis-Hastings algorithm. 8The proposed moves were Gaussian distributed with a fixed, isotropic covariance.All electron positions were updated simultaneously.Due to slow equilibration of the Markov Chain Monte Carlo (MCMC) sampling, the computed energy sometimes overshot the true value, but always reequilibrated after a few thousand iterations.We experimented with Hamiltonian Monte Carlo to give faster mixing and lower bias in the gradients, but found this led to significantly higher variance in the local energy and lower overall performance.
Before using the local energy as an optimization objective we pretrained the network to match Hartree-Fock (HF) orbitals computed using PySCF 25 .There were two reasons for this.First, we found that the numerical stability of the subsequent local energy optimization was improved.On large systems, the determinants in the Fermionic Neural Network would often numerically underflow if no pretraining was used, causing the optimisation to fail.Pretraining with HF orbitals as a guide meant that the main optimization started in a region of relatively low variance, with comparitively stable determinant evaluations and electron walkers in representative configurations.Second, we found that time was saved by not optimizing the local energy through a region that we knew to be physically uninteresting, given that it had an energy higher than that of a straightforward mean field approximation.The pretraining did not seem to strand the neural network in a poor local optimum, as the energy minimisation always gave consistent results capturing roughly 99% of the correlation energy.This is consistent with the conventional wisdom in the machine learning community that issues with local minima are less severe in wider, deeper neural networks.Further, stochasticity in the optimization procedure helps break symmetry and escape bad minima.
The pretraining loss is: where φ HF iα r α j denotes the value of the i-th Hartree-Fock orbital for spin α at the position of electron j, ᾱ is ↓ if α is ↑ or vice versa, and φ kα i r α j ; {r α /j }; {r ᾱ} is the corresponding entry in the input to the k-th determinant of the Fermionic Neural Network.We use a minimal (STO-3G) basis set for the Hartree-Fock computation as we require only a stable initialisation in the rough vicinity of the mean field solution, not an accurate mean field result.The probability distribution p pre (X) is an equal mixture of the product of Hartree-Fock orbitals and the output of the Fermionic Neural Network: We chose not to use the distribution from the Hartree-Fock determinant because we wanted sample coverage at every point where the orbitals were large, but in practice the difference to using the anti-symmetrized distribution was marginal.The inclusion of the neural network density helps to increase the sampling probability in areas where the neural network wavefunction is spuriously high.We approximate the expectation for the loss by using MCMC to draw half the samples in the batch from ψ 2 and half from the product of Hartree-Fock orbitals using MCMC.
Initial MCMC configurations were drawn from Gaussian distributions centred on each atom in the molecule.Electrons were assigned to atoms according to the nuclear charge and spin polarisation of the ground state of the isolated atom, with the atomic spins orientated such that the total spin projection of the molecule was correct, which was possible for systems studied here.We used ADAM with default parameters as the optimiser.After pretraining, we reinitialized the electron walker positions and then had a burn in MCMC period with target distribution ψ 2 before we began local energy minimization.
For the Fermi Net, all code was implemented in Ten-sorFlow, and each experiment was run in parallel on 8 V100 GPUs.With a smaller batch size we were able to train on a single GPU but convergence was significantly slower.For instance, ethene converged after just 2 days of training with 8 GPUs, while several weeks were required on a single GPU.We expect further engineering improvements will reduce this number.10 Metropolis-Hastings steps were taken between every parameter update, and it typically required O(10 5 − 10 6 ) parameter updates to reach convergence (results in the paper used 2 × 10 5 parameter updates).Conventional VMC wavefunction optimisation will perform O(10 1 − 10 2 ) parameter updates and O(10 4 − 10 6 ) MCMC steps between updates, so we require roughly the same number of wavefunction evaluations as conventional VMC.
Accurate and stable convergence was highly dependent on the hyperparameters used; the default values for all experiments are included in Table III.These hyperparameters do seem to be generalisable -we have observed good performance on every system for which we used this configuration (all systems except bicyclobutane, due to memory limitations).For some larger systems, stability was improved by using more pretraining iterations.Getting good performance from KFAC requires careful tuning, and we found that the damping and norm constraint parameters critically affect the asymptotic performance.If the damping is too high, KFAC behaves like gradient descent near a local minimum and converges too slowly.If the damping is reduced, training quickly becomes unstable unless the norm constraint (a generalisation of gradient clipping) is lowered in tandem.Surprisingly, we found little advantage to using momentum, and sometimes it even seemed to reduce training performance, so we set it to zero for all experiments.
To reduce the variance in the parameter updates, we clipped the local energy when computing the gradients but not when evaluating the total energy of the system.This is a commonly used strategy to improve the accuracy of QMC 26 .We computed the total variation of each batch, where ẼL is the median local energy of that batch.This is to the 1 norm what the standard deviation is to the 2 norm, and we prefer it to the standard deviation as it is more robust to outliers.We clip any local energies more than 5 times further from the median than this total variation and compute the gradient in Eqn. 8 with the clipped energy in place a b FIG.3: Comparison of the Fermi Net against the Slater-Jastrow Ansatz.a: First-row atoms with a single determinant.Baseline numbers are from Chakravorty et al. 32 .The Slater-Jastrow neural network yields slightly lower energies than VMC with a conventional Slater-Jastrow Ansatz, while the Fermi Net is substantially more accurate.b: The CO and N 2 molecules (bond lengths 2.17328 a 0 and 2.13534 a 0 respectively) with increasing numbers of determinants.All-electron CCSD(T)/CBS results are used as the baseline.No matter how many determinants are used, the Fermi Net far exceeds the accuracy of the Slater-Jastrow net.
of E L .The aforementioned KFAC norm constraint enforces gradient clipping in a manner which respects the information geometry of the model.
We used PySCF 25 to perform all-electron CCSD(T) calculations on atoms and dimers (TableI).PSI4 27 was used to perform all-electron CCSD(T) calculations on methane, ethene, ammonia and bicyclobutane, and CCSD(T) and FCI calculations on H 4 .Cholesky decomposition 28 was used to reduce the memory requirements for bicyclobutane, which we verified introduces an error in the total energies of O(10 −5 ) hartrees with the aug-cc-pCVTZ basis set.The H 4 calculations used a cc-pVXZ (X=T, Q, 5) basis set.All other CCSD(T) calculations used aug-cc-pCVXZ (X=T, Q, 5) basis sets.An unrestricted Hartree-Fock reference was used for atoms and dimers, with restricted Hartree-Fock used otherwise.We extrapolated energies to the CBS limit using standard methods 29,30 .CBS Hartree-Fock energies for Li, Be and Li 2 were taken from aug-cc-pCV5Z calculations, in which the basis set error was below 10 −4 hartrees.CBS Hartree-Fock energies for other systems were obtained by fitting the function E HF (X) = E HF (CBS)+ae −bX , where X is the cardinality of the basis; CCSD, CCSD(T) and FCI correlation energies were extrapolated to the CBS by fitting the energies from quadruple-and quintuple-zeta basis sets (triple-and quadruple-zeta for bicyclobutane) to the function E c (X) = E c (HF) + aX −3 .The total energy is given by the sum of the Hartree-Fock energy and correlation energy.To compare the dissociation potential of N 2 against experiment, we used the MLR 4 (6, 8) potential from Le Roy et al. (2006), 31 which is based on fitting 19 lines of the N 2 vibrational spectrum.

A. Comparison of Slater-Jastrow and Fermi Net Ansätze
To demonstrate the expressive power of the Fermi Net, we investigated its performance relative to a standard Slater-Jastrow Ansatz with varying numbers of determinants.Rather than using Hartree-Fock orbitals and a closed-form Jastrow factor with few free parameters, our Slater-Jastrow Ansatz used a neural network to represent the one-electron orbitals and Jastrow factor, making it much more flexible.To fairly compare our calculations against previous work, we first looked at singledeterminant Ansätze for first-row atoms.Figure 3a compares the Fermi Net results with numbers already available in the literature. 23The neural network Slater-Jastrow Ansatz already outperforms the numbers from the literature by a few milli-Hartrees (mE h ).This could be due to the lack of basis set approximation error when using a neural network to represent the orbitals and Jastrow factor.The Fermi Net cuts the error relative to both the Slater-Jastrow neural network and the baseline by almost an order of magnitude.Just a single Fermi Net determinant is sufficient to come within a few mE h of chemical accuracy, defined as 1 kcal/mol (1.594 mE h ), which is the typical standard for a quantum chemical calculation to be considered "correct." Not only is the Fermi Net a significant improvement over the Slater-Jastrow Ansatz with one determinant, but only a few Fermi Net determinants are necessary to saturate performance.Figure 3b shows the Slater-Jastrow network and Fermi Net energies of the nitrogen and carbon monoxide molecules as functions of the number of determinants.As FCI calculations are impractical for these systems, we compare against the unrestricted coupled cluster singles, doubles, and pertur-  TABLE II: Ground state energy at equilibrium geometry for diatomics and small molecules.The percentage of correlation energy captured by the Fermi Net relative to the exact energy (where available) or CCSD(T) is given in the rightmost column.If no citation is provided, the number was from our own calculation.* Due to computational limitations, the Fermi Net was run with half as many determinants and half the batch size for bicyclobutane as for other systems.Geometries for larger molecules are given in Supplementary Information.
bative triples method (CCSD(T)) in the complete basis set (CBS) limit, typically considered the gold standard for molecules of this size at equilibrium geometry.As the Slater-Jastrow network optimises all orbitals separately, the results from the Slater-Jastrow network should be a lower bound on the performance of a Slater-Jastrow Ansatz with a given number of determinants.As expected, the Slater-Jastrow network is still far from the accuracy of CCSD(T) at 64 determinants.The 64-determinant Fermi Net, in contrast, comes within a few mE h of CCSD(T).The Fermi Net energies begin to plateau after only a few tens of determinants, suggesting that large linear combinations of Fermi Net determinants are not required.Despite recent advances in optimal determinant selection, 33,34 conventional Slater-Jastrow VMC calculations typically require tens of thousands of determinants for systems of this size and rarely match CCSD(T) accuracy even then.

B. Equilibrium Geometries
Tables I and II show that the same 16-determinant Fermi Net with the same training hyperparameters generalises well to a wide variety of atoms and diatomic and small organic molecules.As a baseline, we used a combination of experimental and exact computational results where available, 32,36,37 and all-electron CBS CCSD(T) otherwise.On all atoms, as well as LiH, Li 2 , methane and ammonia, the Fermi Net error was within chemical accuracy.In comparison, energies from VMC using a conventional Slater-Jastrow-backflow Ansatz for firstrow atoms 23 are uniformly worse than the Fermi Net, despite using at least an order of magnitude more determinants.The VMC-based Fermi Net energies are more comparable in quality to diffusion Monte Carlo (DMC), which is typically much more accurate than VMC.On the largest molecules investigated, ethene (C 2 H 4 ) and bicyclobutane (C 4 H 6 ), we recovered over 98% and 95% of the correlation energy respectively -remarkably good for a variational calculation.Bicyclobutane is an especially challenging system due to its high ring strain and large number of electrons -too many for exact methods like FCI.We also computed the first ionisation potentials, E(X + ) − E(X) for element X, and electron affinities, E(X) − E(X − ), for first-row atoms (Table I) and compare to experimental data 35 with relativistic effects removed.Agreement with experiment is excellent (mean absolute error of 0.09 mE h for ionisation potentials and 0.66 mE h for electron affinities), demonstating that the Fermi Net Ansatz is capable of representing charged and neutral species with similar accuracy.

C. The H4 Rectangle
While CCSD(T) is considered the gold-standard for equilibrium geometries, it often fails for molecules with low-lying excited states or stretched, twisted or otherwise out-of-equilibrium geometries.Understanding these systems is critical for predicting many chemical properties.A model system small enough to be solved exactly by FCI for which coupled cluster fails is the rectangle of four hydrogen atoms, parametrised by the distance R of the atoms from the centre and the angle θ between neighbouring atoms. 38FCI shows that the energy varies smoothly with θ and is maximised when the atoms are at the corners of a square (θ = 90 o ).The coupled cluster results are qualitatively incorrect, predicting an energy minimum with a non-analytic downward-facing cusp at 90 o , caused by a crossing of two Hartree-Fock states with different symmetries. 39Figure 4 shows that the Fermi Net does not suffer from the same errors as coupled cluster and is in essentially perfect agreement with FCI.We given in the bottom panel.In the region of largest UCCSD(T) error, the Fermi Net prediction is comparable to highly-accurate r 12 -MR-ACPF results. 40tribute the small discrepancy between the Fermi Net and FCI energies to errors arising from the basis set extrapolation used for the FCI energies.

D. The Nitrogen Molecule
A problem more relevant to real chemistry that troubles coupled cluster methods is the dissociation of the nitrogen molecule.The triple bond is challenging to describe accurately and the stretched N 2 molecule has several low-lying excited states, leading to errors when using single-reference coupled cluster methods. 41Experimental values for the dissociation potential have been reconstructed from spectroscopic measurements using the Morse/Long-range potential. 31These closely match calculations using r 12 -MR-ACPF, 40 which is highly accurate but scales factorially.A comparison between unrestricted CCSD(T), the Fermi Net, and these high-accuracy methods is given in Figure 5.The total Fermi Net error is significantly smaller than UCCSD(T), and in the region of largest UCCSD(T) error the Fermi Net reaches accuracy comparable to r 12 -MR-ACPF, but scales much more favourably with system size.While coupled cluster could in theory be made more accurate by extending to full triples or quadruples, or using multireference methods, CCSD(T) is generally considered the largest coupled cluster approximation that can reasonably scale beyond small molecules.This shows that, without any specific tuning to the system of interest, the Fermi Net is a clear improvement over coupled cluster for modelling a strongly correlated real-world chemical system.

E. The Hydrogen Chain
Finally, we investigated the performance the Fermi Net on the evenly-spaced linear hydrogen chain.The hydrogen chain is of great interest as a system that bridges model Hamiltonians and real material systems and may undergo an insulator-to-metal transition as the separation of the atoms is decreased.Consequently, results obtained using a wide range of many-electron methods have been rigorously evaluated and compared. 42We compare the performance of the Fermi Net against many of these methods in Figure 6.Of the two projector methods studied by Motta et al. , AFQMC gave slightly better results than lattice regularized DMC and so we omit the latter for clarity.Without changing the network architecture or hyperparameters, we are again able to outperform coupled cluster methods and conventional VMC and obtain results competitive with state-of-the-art approaches.

IV. DISCUSSION
We have shown that antisymmetric neural networks can be constructed and optimised to enable highaccuracy quantum chemistry calculations of challenging systems.The Fermionic Neural Network Ansatz makes the simple and straightforward VMC method competitive with DMC, AFQMC and CCSD(T) methods for equilibrium geometries and better than CCSD(T) for many out-of-equilibrium geometries.Importantly, one network architecture with one set of training parameters has been able to attain high accuracy on every system examined.The use of neural networks means that we do not have to choose a basis set or worry about basisset extrapolation, a common source of error in computational quantum chemistry.There are many possible applications of the Fermi Net beyond VMC, for instance as a trial wavefunction for projector QMC methods.We expect further work investigating the tradeoffs of different antisymmetric neural networks and optimisation algorithms to lead to greater computational efficiency, higher representational capacity, and improved accuracy on larger systems.This has the potential to bring to quantum chemistry the same rapid progress that deep learning has enabled in numerous fields of artificial intelligence.able to accurately match the scaling, suggesting that for systems of this size the computation is dominated by the O(N 2 ) evaluation of the two-electron stream of the Fermi Net, while the determinant only becomes dominant for much larger systems.

B. Equivalence of Natural Gradient Descent and Stochastic Reconfiguration
Here we provide a derivation illustrating that stochastic reconfiguration is equivalent to natural gradient descent for unnormalised distributions.Though many authors have investigated extensions of the Fisher information metric to quantum systems, 43 this particular connection between methods in machine learning and quantum chemistry seems not to be well appreciated by either community.
We denote the density proportional to ψ 2 (X) by p(X), and the normalizing factor by Z(θ).In addition, let p(X) = ψ 2 (X) denote the unnormalized density.In stochastic reconfiguration, the entries of the preconditioner matrix M have the form where The term E p(X) [O i ] can be expressed in terms of the normalizing factor: Plugging this into the expression for M ij yields which, up to a constant, is the Fisher information metric for p(X).

C. Numerically Stable Computation of the Log Determinant and Derivatives
For numerical stability, the Fermionic Neural Network outputs the logarithm of the absolute value of the wavefunction (along with its sign), and we compute log determinants rather than determinants.Even if some of the matrices are singular, this is not an issue for numerical stability on the forward pass, because these matrices will have zero contribution to the overall sum of determinants the network outputs: We use the "exp-normalise trick" to compute the sum -that is, we subtract off the largest log determinant before exponentiating and computing the weighted sum, and add it back in after the logarithm at the end.This avoids numerical underflow if the log determinants are not well scaled.
Naively applying automatic differentiation frameworks to compute the gradient and Laplacian of the log wavefunction will not work if one of the matrices is singular.However, the first and second derivatives are still well defined, and we show how to express these derivatives in closed form appropriate for reverse-mode automatic differentiation.Several of the results used here, as well as the notation, are based on the collected matrix derivative results of Giles (2008) 44 .
From Jacobi's formula, the gradient of the determinant of a matrix is given by where C is the reverse-mode sensitivity.Unfortunately, this expression becomes undefined if the matrix A is singular.Even so, both the cofactor matrix and its derivative are still well defined.To see this, we express the cofactor in terms of the singular value decomposition of A. Let UΣV T be the singular value decomposition of A, then Cof(A) = det(A)A −T = det(U)det(Σ)det(V)UΣ −1 V T .
Since U and V are orthonormal matrices, their determinant is just the sign of their determinant.To avoid clutter, we drop the det(U) and det(V) terms until the very end.Let σ i be the ith diagonal element of Σ, then we have det(Σ) = i σ i , and cancelling terms in the expression, we get (up to a sign factor) where Γ is a diagonal matrix with elements γ i defined as because the σ −1 i term in Σ −1 cancels out one term in det(Σ).
The gradient of the cofactor is more complicated, but once again terms cancel.Again neglecting a sign factor, the reverse-mode gradient can be expanded in terms of the singular vectors as: where M = V T CT U, and we have taken advantage of the invariance of the trace of matrix products to cyclic permutation in the last line.Now, in the expression inside the square brackets in the last line, terms conveniently cancel that prevent the expression from becoming undefined should σ i = 0 for some singular value.Denote this term Ξ, the off-diagonal terms of Ξ only depend on the second term Σ −1 MΓ: and the diagonal terms have the form Putting this all together, we get Ā = Sgn(det(U))Sgn(det(V))UΞV T , with otherwise, This allows us to compute second derivatives of the matrix determinant even for singular matrices.To handle degenerate matrices gracefully, we fuse everything from the computation of the log determinant to the final network output into a single TensorFlow op, with a custom gradient and gradient-of-gradient that includes the expression above.

D. Molecular structures
Molecular structures were taken from the G3 database 45 where available.We reproduce the atomic positions for all molecules studied in Tables IV-VII.

FIG. 4 :
FIG.4:The H 4 rectangle, R = 3.2843a 0 .Coupled cluster methods incorrectly predict a cusp and energy minimum at Θ = 90 • , while the Fermi Net approach agrees with exact FCI results.

FIG. 5 :
FIG.5:The dissociation curve for the nitrogen triple-bond.The difference from experimental data31 is given in the bottom panel.In the region of largest UCCSD(T) error, the Fermi Net prediction is comparable to highly-accurate r 12 -MR-ACPF results.40

FIG. 6 :
FIG. 6: The H 10 chain.All energies except the Fermi Net are taken from Motta et al. (2017) 42 .Highly accurate MRCI+Q+F12 results are used as reference energies in the bottom panel, where the shaded region indicates an estimate of the basis-set extrapolation error.The errors in the coupled cluster and conventional VMC energies are large at medium atomic separations but the Fermi Net remains to AFQMC.

FIG. 7 :
FIG.7: Comparison of the runtime for one optimisation iteration on atoms up to zinc.Polynomial regressions up to fourth order are fit to the data.The small difference between the cubic and quartic fit suggests that the determinant computation is not the dominant factor at this scale.
∂det(A) ∂A = det(A)A −T = Adj(A) T = Cof(A), where Cof(A) is the cofactor matrix of A. Let C = Cof(A).Then, by the product rule, we can express the reverse-mode gradient of Cof(A) as Ā = A −T Tr CT Cof(A) I − CT Cof(A) , Ā = A −T Tr CT Cof(A) I − CT Cof(A) = UΣ −1 V T Tr CT UΓV T I − CT UΓV T = U Tr CT UΓV T Σ −1 − Σ −1 V T CT UΓ V T = U Tr (MΓ) Σ −1 − Σ −1 MΓ V T , Optimisation progress for neon atom with KFAC (orange) vs. ADAM (blue), with exact numbers in black.The qualitative advantage of KFAC is obvious.

TABLE I :
35ound state energy, ionisation potential and electron affinity for first-row atoms.The QMC method (Fermi Net, conventional VMC or DMC) closest to the exact ground state energy for each atom is in bold.Electron affinities for Be, N and Ne are not computed as their anions are unstable.Experimental ionisation potentials and electron affinities have had estimated relativistic effects35removed.All ground state energies are within chemical accuracy of the exact numerical solution, and all electron affinities and all ionisation potentials except neon are within chemical accuracy of experimental results.If no citation is provided, the number was from our own calculation.

TABLE III :
Default hyperparameters for all experiments in the paper.For bicyclobutane, the batch size was halved and the pretraining iterations were increased by an order of magnitude.

TABLE IV :
Atomic positions for NH 3 .

TABLE V :
Atomic positions for CH 4 .