Kohn-Sham equations as regularizer: building prior knowledge into machine-learned physics

Including prior knowledge is important for effective machine learning models in physics, and is usually achieved by explicitly adding loss terms or constraints on model architectures. Prior knowledge embedded in the physics computation itself rarely draws attention. We show that solving the Kohn-Sham equations when training neural networks for the exchange-correlation functional provides an implicit regularization that greatly improves generalization. Two separations suffice for learning the entire one-dimensional H$_2$ dissociation curve within chemical accuracy, including the strongly correlated region. Our models also generalize to unseen types of molecules and overcome self-interaction error.

Differentiable programming [1] is a general paradigm of deep learning, where parameters in the computation flow are trained by gradient-based optimization. Based on the enormous developments in automatic differentiation libraries [2][3][4][5], hardware accelerators [6], and deep learning [7], this emerging paradigm is relevant for scientific computing. It keeps rigorous components where we have extremely strong physics prior knowledge and well-established numerical methods [8], and parameterizes the approximation by a neural network, which can approximate any continuous function [9]. Recent highlights include discretizing partial differential equations [10], structural optimization [11], sampling equilibrium configurations [12], differentiable molecular dynamics [13], differentiable programming tensor networks [14], optimizing basis sets in Hartree-Fock [15], and variational quantum Monte Carlo [16][17][18].
Density functional theory (DFT), an approach to electronic structure problems, took an enormous step forward with the creation of the Kohn-Sham (KS) equations [19], which greatly improved accuracy over the original DFT [20][21][22]. The results of solving the KS equations are reported in tens of thousands of papers each year [23]. Given an approximation to the exchange-correlation (XC) energy, the KS equations are solved self-consistently. Results are limited by the quality of such approximations, and a standard problem of KS-DFT is to calculate accurate bond dissociation curves [24]. The difficulties are an example of strong correlation physics, as electrons localize on separate nuclei [25].
Naturally, there has been considerable interest in using machine learning (ML) methods to improve DFT approximations. Initial work [26,27] focused on the KS kinetic energy, as a sufficiently accurate approximation would allow bypassing the solution of the KS equations [28,29]. For XC, recent works focus on learning the XC potential (not functional) from inverse KS [30] and using it in the KS-DFT scheme [31][32][33][34]. An important step forward was made last year, when it was shown that a neural network could find functionals using only three molecules, by training on both energies and densities [35], obtaining accuracy comparable to human-designed functionals and generalizing to yield accurate atomization energies of 148 small molecules [36]. But this pioneering work does not yield chemical accuracy, nor approximations that work in the dissociation limit. Moreover, it uses gradient-free optimization, which usually suffers from poor convergence behavior on the large number of parameters used in modern neural networks [37][38][39].
Here, we show that all these limitations are overcome by incorporating the KS equations themselves into the neural network training by backpropagating through their iterations: a KS regularizer (KSR) for the ML model. In a traditional KS calculation, the XC is given, the equations are cycled to self-consistency, and all previous iterations are ignored in the final answer. In other ML work, functionals are trained on either energies alone [41][42][43][44] or even densities [32,33,45], but only after convergence. By incorporating the KS equations into the training, thereby learning the relation between density and energy at every iteration, we find accurate models with very little data and much greater generalizability.
Our results are illustrated in Figure 1, which is for a one-dimensional mimic of H$_2$ designed for testing electronic structure methods [46]. The distribution of curves of the ML model directly predicting E from geometries (direct ML) in (a) clearly fails to capture the physics. For local density approximation (LDA) and generalized gradient approximation (GGA) calculations similar to Nagai et al. [35] in (b-c), the effect of the KSR yields reasonably accurate results in the vicinity of the data, but not outside. But when a global XC functional is included in (d), chemical accuracy is achieved for all separations, including the dissociation limit. Similar results can be achieved for H$_4$, the one-electron self-interaction error can easily be made to vanish, and the interaction of a pair of H$_2$ molecules can be found without any training on this type of molecule (all discussed below).
Modern DFT finds the ground-state electronic density by solving the Kohn-Sham equations,
$$\Big(-\tfrac{1}{2}\nabla^2 + v_{\rm S}[n](\mathbf{r})\Big)\,\phi_i(\mathbf{r}) = \epsilon_i\,\phi_i(\mathbf{r}), \qquad (1)$$
where $v_{\rm S}[n](\mathbf{r}) = v(\mathbf{r}) + v_{\rm H}[n](\mathbf{r}) + v_{\rm XC}[n](\mathbf{r})$ is the KS potential consisting of the external one-body potential and the density-dependent Hartree (H) and XC potentials. The XC potential $v_{\rm XC}[n](\mathbf{r}) = \delta E_{\rm XC}/\delta n(\mathbf{r})$ is the functional derivative of the XC energy functional $E_{\rm XC}[n] = \int \epsilon_{\rm XC}[n](\mathbf{r})\, n(\mathbf{r})\, d\mathbf{r}$, where $\epsilon_{\rm XC}[n](\mathbf{r})$ is the XC energy per electron. The total electronic energy $E$ is then given by the sum of the non-interacting kinetic energy $T_{\rm s}[n]$, the external one-body potential energy $V[n]$, the Hartree energy $U[n]$, and the XC energy $E_{\rm XC}[n]$. The electronic density is obtained from the occupied orbitals,
$$n(\mathbf{r}) = \sum_i^{\rm occ} |\phi_i(\mathbf{r})|^2. \qquad (2)$$
The KS equations are in principle exact given the exact XC functional [19,47], which in practice is the only term approximated in DFT. From a computational perspective, the eigenvalue problem of Eq. (1) is solved repeatedly until the density converges to a fixed point, starting from an initial guess. We use linear density mixing [48] to improve convergence; Figure 2(a) shows the unrolled computation flow. We approximate the XC energy per electron using a neural network $\epsilon_{{\rm XC},\theta}[n]$, where $\theta$ represents the trainable parameters. Together with the self-consistent KS iterations in Figure 2(b), the combined computational graph resembles a recurrent neural network [49] or a deep equilibrium model [50] with additional fixed computational components. Density mixing has the same form as residual connections in deep neural networks [51]. In addition to improving convergence for the forward problem of KS self-consistent calculations, density mixing helps backpropagate gradients efficiently through the long computational procedure.
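As a concrete illustration, the unrolled graph of Figure 2(a) can be written as an ordinary differentiable program. The following JAX sketch assumes a callable `single_ks_step` (not part of this paper's code) that solves the eigenvalue problem of Eq. (1) for the current density and the neural functional parameters; the loss shown is only a schematic stand-in for the trajectory and density losses described in the next paragraph.

```python
import jax
import jax.numpy as jnp

def ks_iterations(single_ks_step, theta, n_init, num_iterations=15):
    """Unrolled, differentiable KS loop with linear density mixing.

    `single_ks_step(theta, n)` builds the KS Hamiltonian from density `n` and the
    neural XC functional with parameters `theta`, solves Eq. (1), and returns the
    output density and total energy of that iteration.
    """
    n = n_init
    energies = []
    for k in range(num_iterations):
        n_out, energy = single_ks_step(theta, n)
        alpha = 0.5 * 0.9 ** k            # decaying linear-mixing factor (see supplement)
        n = n + alpha * (n_out - n)       # linear density mixing, a residual-style update
        energies.append(energy)
    return n, jnp.stack(energies)

def ksr_loss(theta, single_ks_step, n_init, n_exact, e_exact, weights, dx):
    """Schematic trajectory-energy loss plus final-density loss (exact form in the text)."""
    n_final, energies = ks_iterations(single_ks_step, theta, n_init)
    loss_e = jnp.sum(weights * (energies - e_exact) ** 2)
    loss_n = jnp.sum((n_final - n_exact) ** 2) * dx
    return loss_e + loss_n

# Gradients with respect to theta flow back through every unrolled KS iteration.
ksr_grad = jax.grad(ksr_loss)
```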
If the neural XC functional were exact, KS self-consistent calculations would output the exact density, and the intermediate energies over iterations would converge to the exact energy. This intention can be translated into a loss function, and the neural XC functional can be updated end-to-end by backpropagating through the KS self-consistent calculations. This procedure differentiates through KS calculations and is general regardless of the dimensionality of the system. Throughout, experiments are performed in one dimension, where accurate quantum solutions can be generated relatively easily via the density matrix renormalization group (DMRG) [52]. The electron-electron repulsion is $A\exp(-\kappa|x - x'|)$, and the attraction to a nucleus at $x = 0$ is $-A\exp(-\kappa|x|)$ [40]. We design the loss function as an expectation over training molecules of an energy trajectory loss $L_E$ and a density loss $L_n$, each normalized by the number of electrons $N_e$. $L_n$ minimizes the difference between the final density and the exact density. $L_E$ optimizes the trajectory of energies over the total $K$ iterations. The neural XC functional needs not only to output accurate $\epsilon_{\rm XC}$ in each iteration, but also to drive the iterations to converge quickly to the exact energy. The trajectory loss also makes backpropagation more efficient by directly flowing gradients to early iterations [53]. The $w_k$ are arbitrary non-negative weights associated with each iteration. The optimal neural network parameters are selected as those with the minimal mean absolute energy error per electron on the validation set.

Hundreds of useful XC functional approximations have been proposed by humans [54]. Here we build a neural XC functional from several differentiable components with physics intuition tailored for XC, shown in Figure 2(c). A global convolution layer captures the long-range interaction,
$$G[n](x, \xi) = \frac{1}{2\xi}\int dx'\, n(x')\, e^{-|x - x'|/\xi}.$$
Two special cases retrieve known physical quantities: the Hartree energy density, $G(n(x), \kappa^{-1}) \propto \epsilon_{\rm H}$, and the electronic density, $G(n(x), 0) = n(x)$. The global convolution contains multiple channels, and the $\xi_p$ of each channel is trainable to capture interactions on different scales. Although the rectified linear unit [55] is popular, we use the sigmoid linear unit (SiLU) [56] (also known as swish [57]), $f(x) = x/(1 + \exp(-x))$, because the infinite differentiability of SiLU guarantees the smoothness of $v_{\rm XC}$ (the first derivative) and of the second and higher order derivatives of the neural network used in the L-BFGS training [58]. We do not enforce a specific choice of $\epsilon_{\rm XC}$ (sometimes called a gauge [59]), but we do enforce some conditions, primarily to aid convergence of the algorithm. We require $\epsilon_{\rm XC}$ to vanish whenever the density does, and that it be negative if at all possible. We achieve the former using the linearity of SiLU near the origin and turning off the bias terms in the convolution layers. We softly impose the latter by a negative transform layer at the end, where a negative SiLU makes most output values negative. Finally, we design a self-interaction gate (SIG) that mixes in a portion of $-\epsilon_{\rm H}$ to cancel the self-interaction error. The portion is a gate function $\beta(N_e) = \exp(-(N_e - 1)^2/\sigma^2)$. When $N_e = 1$, $\epsilon_{\rm XC}^{\rm (out)} = -\epsilon_{\rm H}$. For more electrons, $\sigma$ can be fixed or adjusted by the training algorithm to decide the sensitivity to $N_e$. For H$_2$ as $R \to \infty$, $\epsilon_{\rm XC}$ tends to a superposition of the negative of the Hartree energy density at each nucleus, and approaches half that for H$_2^+$.
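To make the components of Figure 2(c) concrete, here is a schematic JAX sketch of the global XC functional. The kernel in `global_conv` is the exponential form consistent with the two special cases quoted above; the pointwise layers stand in for the convolution layers of the actual network, and the parameter container, layer sizes, and the precise mixing used by the self-interaction gate are illustrative assumptions rather than the published architecture.

```python
import jax.numpy as jnp

def silu(x):
    """Sigmoid linear unit, f(x) = x / (1 + exp(-x))."""
    return x / (1.0 + jnp.exp(-x))

def global_conv(n, xi, grid, dx):
    """G[n](x, xi): exponentially weighted average of the density at length scale xi."""
    kernel = jnp.exp(-jnp.abs(grid[:, None] - grid[None, :]) / xi) / (2.0 * xi)
    return (kernel @ n) * dx

def hartree_energy_density(n, grid, dx, amplitude=1.071295, kappa=1.0 / 2.385345):
    """eps_H(x) = (1/2) * integral of n(x') A exp(-kappa |x - x'|) dx'."""
    interaction = amplitude * jnp.exp(-kappa * jnp.abs(grid[:, None] - grid[None, :]))
    return 0.5 * (interaction @ n) * dx

def neural_xc(params, n, grid, dx, num_electrons, sigma=1.0):
    """Schematic global XC energy per electron, eps_XC[n](x)."""
    # Global convolution: the raw density plus channels at trainable scales xi_p.
    channels = [n] + [global_conv(n, xi, grid, dx) for xi in params["xi"]]
    h = jnp.stack(channels, axis=-1)
    # Pointwise layers with SiLU activations and no bias (so eps_XC -> 0 as n -> 0).
    for w in params["weights"]:
        h = silu(h @ w)
    eps = -silu(h[..., 0])                 # negative transform: mostly negative output
    # Self-interaction gate: for one electron the output reduces to -eps_H.
    beta = jnp.exp(-((num_electrons - 1.0) ** 2) / sigma ** 2)
    return beta * (-hartree_energy_density(n, grid, dx)) + (1.0 - beta) * eps
```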
Now we dive deeper into the outstanding generalization we observed in a simple but not easy task: predicting the entire H$_2$ dissociation curve from two points, as shown in Figure 1. It is not surprising that the direct ML model completely fails. Neural networks are usually underdetermined systems, as there are more parameters than training examples. Regularization is crucial to improve generalization [61,62], especially when data is limited. Most existing works regularize models with particular physics prior knowledge by imposing constraints via feature engineering and preprocessing [63,64], constraints on the network [65][66][67][68], or physics-informed loss terms [69,70]. Another regularization strategy is to generate extra data for training using prior knowledge: in image classification problems, data are augmented by operations like flipping and cropping, given the prior knowledge that labels are invariant to those operations [71]. However, it is not clear how to generate extra data for physics problems solved by specific methods, e.g., electronic structure problems with KS equations. We found that training by differentiating through KS self-consistent calculations regularizes the model. Although the exact densities and energies of only two separations are given, KSR naturally samples different trajectories from an initial density to the exact density at each training step. More importantly, KSR focuses on learning an XC functional that can lead the KS self-consistent calculations to converge to the exact density from the initial density. Figure 3 visualizes the density trajectories sampled by KSR for one training separation, R = 3.84. The functional with untrained parameters (t = 0) samples densities near the initial guess but soon learns to explore broadly and finds trajectories toward the vicinity of the exact density.
In contrast, most existing ML functionals learn to predict a single step from the exact density, which is a poor surrogate for the full self-consistent calculations [72]. These standard ML models have two major shortcomings. First, the exact density is unknown for new systems, so the model is not expected to behave correctly on unseen initial densities for KS calculations. Second, even if a model is trained on many densities for single-step prediction, it is not guaranteed to converge the self-consistent calculations to a good solution. Research in imitation learning shows that error accumulation from single steps quickly pushes the model out of its interpolation region [73]. On the other hand, since KSR gives the model access to all the KS iterations, it learns to optimize the entire self-consistent procedure and avoid the error accumulation from greedy optimization of single steps. A further comparison of training neural XC functionals without or with "weaker" KSR is given in the supplemental material [40].
Next we retrain our neural XC functional with KSR on $N_{\rm train}/2$ examples each of H$_2$ and H$_4$ molecules. Figure 4 shows the prediction accuracy of KSR with both energy and density loss (full KSR), in comparison to KSR with only energy loss (energy-only KSR) and the direct ML model. We compute the energy mean absolute error on the holdout sets of H$_2$ ($R \in [0.4, 6]$) and H$_4$ ($R \in [1.04, 6]$). The average mean absolute error of H$_2$ and H$_4$ with various $N_{\rm train}$ is shown in Figure 4(a). Full KSR has the lowest error at the minimum $N_{\rm train} = 4$, reaching chemical accuracy at 6. As the size of the training set increases, energy-only KSR reaches chemical accuracy at $N_{\rm train} = 10$, but the direct ML model never does (even at 20). Then we test models on unseen types of molecules. In Figure 4(b), both KSR models generalize to the unseen H$_2^+$ molecules with the trained neural XC functionals, while direct ML models always have large errors. Finally, we take a pair of equilibrium H$_2$ molecules and separate them by $R = 0.16$ to $9.76$ Bohr, denoted H$_2$H$_2$. KSR models generalize much better than direct ML for this "zero-shot" prediction [74], where H$_2$H$_2$ has never been exposed to the model during training.
Why is the density important in training, and what use are the non-converged iterations? The density is the functional derivative of the energy with respect to the potential, so it gives the exact slope of the energy with respect to any change in the potential, including stretching (or compressing) the bond. Thus the density implicitly contains much energetic information, and reproducing the density accurately guarantees finding the correct derivative at that point on the binding curve. Iteration of the KS equations before convergence produces abundant information about the functional in the vicinity of the minimum, exploring its shape. Even before $\theta$ is close to its final value, the network is learning to construct a functional with both the correct minimum and all correct derivatives at this minimum. In the paradigm of differentiable programming, the density is the hidden state carrying information through the recurrent structure in Figure 2(a). Correct supervision from $L_n$ greatly helps generalization from very limited data; see $N_{\rm train} \le 6$ in Figure 4. But as $N_{\rm train}$ increases, KSR both with and without $L_n$ performs well in energy prediction. To show more detail, we present the solution of H$_4$ with $R = 4.32$ in Figure 5. With $L_n$, the density is clearly much more accurate than for KSR without $L_n$ ($\int (n^{\rm KS} - n^{\rm DMRG})^2\, dx = 9.2 \times 10^{-5}$ versus $9.8 \times 10^{-2}$). We then compute the corresponding exact $v_{\rm S}$ using the inverse KS method [30]. Neither functional reproduces the exact $v_{\rm S}$; however, the functional trained with $L_n$ recovers most of the KS potential. We must stress that this is nontrivial: unlike previous works [32][33][34] that explicitly included the KS or XC potential in the loss function, our model never uses the exact KS potential. In our KSR setup, the model aims at predicting $\epsilon_{\rm XC}$, from which the derived $v_{\rm S}$ yields an accurate density. Predicting $v_{\rm XC}$ is therefore a by-product. We also note some concerns with training explicitly on $v_{\rm XC}$. Generating the exact $v_{\rm S}$ requires an additional inverse calculation, which is known to be numerically unstable [30]; Schmidt et al. [32] observe outliers when they generate training $v_{\rm XC}$ from inverse KS. While $v_{\rm XC}$ is a fascinating and useful object for theoretical study, because its relation to the density is extremely delicate, it is far more practical simply to train on the density [35].
Differentiable programming blurs the boundary between physics computation and ML. Here we showed that treating KS self-consistent calculations as a differentiable program acts as a regularizer, incorporating a physics prior and resulting in remarkable generalization of the neural XC functional trained with it. The results serve as a proof of principle for rethinking physics computation in the new era of computing enabled by achievements in automatic differentiation software, hardware, and theory. An exciting next step is to apply this idea to real molecules, as an end-to-end differentiable electronic structure method. Besides finding density functionals, all the heuristics in the calculations, e.g., the initial guess, density update, preconditioning, basis sets, even the entire self-consistent calculation as a meta-optimization problem [53], can be learned and optimized while keeping the rigorous physics and mathematics in the rest of the algorithm, getting the best of both worlds.
The authors thank Michael Brenner, Sam Schoenholz, Lucas Wagner, and Hanjun Dai for helpful discussions. K.B. was supported by NSF CHE-1856165, and R.P. by DOE DE-SC0008696.

I. 1D MODEL SYSTEMS
In 1D, we utilize exponential Coulomb interactions to mimic the standard 3D Coulomb potential,
$$v_{\exp}(x) = A\,\exp(-\kappa |x|), \qquad {\rm (S1)}$$
where $A = 1.071295$ and $\kappa^{-1} = 2.385345$ [1]. Within this model, the external one-body potential for a nucleus of atomic number $Z$ at position $X$ is $-Z\, v_{\exp}(x - X)$. The external potential for arbitrary molecular systems and geometries is modeled as
$$v(x) = -\sum_j Z_j\, v_{\exp}(x - X_j). \qquad {\rm (S2)}$$
For example, a 1D H$_2$ molecule at separation $R = 4$ can be represented by $v(x) = -v_{\exp}(x - 2) - v_{\exp}(x + 2)$. The repulsion between electrons at positions $x$ and $x'$ is given by the two-body potential $v_{\rm ee}(x - x') = v_{\exp}(x - x')$. We represent all systems on a 1D grid of $m = 513$ points, each separated by a distance $h = 0.08$. The center grid point is at the origin, $x = 0$, and the grid spans $x \in [-20.48, 20.48]$. For consistency, all nuclear positions reside on grid points in the calculations. In this convention, all molecules in this work are symmetric about either $x = 0$ or $x = 0.04$, depending on the separation between nuclei.
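For concreteness, a short numerical sketch of the grid and potentials defined in this section (constants taken from above; the helper names are ours):

```python
import numpy as np

A = 1.071295              # amplitude of the exponential interaction
KAPPA = 1.0 / 2.385345    # inverse length scale, kappa^{-1} = 2.385345

h = 0.08
grid = np.arange(513) * h
grid -= grid[256]         # center grid point at x = 0, so the grid spans [-20.48, 20.48]

def v_exp(x):
    """Exponential 'Coulomb' interaction, Eq. (S1)."""
    return A * np.exp(-KAPPA * np.abs(x))

def external_potential(grid, nuclear_positions, charges):
    """v(x) = -sum_j Z_j v_exp(x - X_j), Eq. (S2)."""
    v = np.zeros_like(grid)
    for x_nuc, z in zip(nuclear_positions, charges):
        v -= z * v_exp(grid - x_nuc)
    return v

# Example: 1D H2 at separation R = 4.
v_h2 = external_potential(grid, nuclear_positions=[-2.0, 2.0], charges=[1.0, 1.0])
```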

II. DMRG CALCULATION DETAILS
The real-space interacting Hamiltonian for a 1D system of lattice spacing $h$ becomes, in second-quantized notation,
$$\hat{H} = \sum_{\sigma}\Big( t_0 \sum_j n_{j\sigma} + t_1 \sum_{\langle jk \rangle} c^\dagger_{j\sigma} c_{k\sigma} + t_2 \sum_{\langle\langle jk \rangle\rangle} c^\dagger_{j\sigma} c_{k\sigma} \Big) + \sum_j v_j\, n_j + \sum_{j<k} v_{\rm ee}(x_j - x_k)\, n_j n_k + v_{\rm ee}(0) \sum_j n_{j\uparrow} n_{j\downarrow}, \qquad {\rm (S3)}$$
where the operator $c^\dagger_{j\sigma}$ creates (and $c_{j\sigma}$ annihilates) an electron of spin $\sigma$ on site $j$, $n_{j\sigma} = c^\dagger_{j\sigma} c_{j\sigma}$, $n_j = n_{j\uparrow} + n_{j\downarrow}$, and $v_j = v(x_j)$ is the external potential on site $j$. The single and double brackets below the sums indicate sums over nearest and next-nearest neighbors, respectively. The on-site and hopping coefficients $t_0$, $t_1$, and $t_2$ are determined by the fourth-order central finite-difference approximation to the second derivative. The Hamiltonian is solved using DMRG to obtain highly accurate ground-state energies and densities. Calculations are performed using the ITensor library [2] with an energy convergence threshold of $10^{-7}$ Ha.

A. Local Density Approximation
In our 1D model the electron repulsion is an exponential interaction. To implement a local density approximation (LDA) for this interaction, we use Ref. [1], which provides the exchange energy of the exponentially repelling uniform gas analytically and an accurate parameterized model for the correlation energy. We use this specific implementation for all LDA calculations.

B. Initial density
We solve the Schrödinger equation of the non-interacting system with the external potential $v(x)$ defined in Eq. (S2),
$$\Big(-\tfrac{1}{2}\tfrac{d^2}{dx^2} + v(x)\Big)\,\phi_i(x) = \varepsilon_i\, \phi_i(x). \qquad {\rm (S4)}$$
The density is the sum of the squares of all occupied orbitals, $n(x) = \sum_i |\phi_i(x)|^2$. In all KS self-consistent calculations presented in this work, we use the density of the non-interacting system with external potential $v(x)$ as the initial density.
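A minimal sketch of this initial-guess calculation on the grid, reusing the fourth-order finite-difference kinetic operator of Section II (helper names and the spin-restricted filling are our assumptions):

```python
import numpy as np

def kinetic_matrix(num_grids, h):
    """-1/2 d^2/dx^2 with the fourth-order central finite-difference stencil."""
    second_derivative = np.zeros((num_grids, num_grids))
    stencil = np.array([-1.0, 16.0, -30.0, 16.0, -1.0]) / (12.0 * h ** 2)
    for offset, coeff in zip(range(-2, 3), stencil):
        second_derivative += np.diag(np.full(num_grids - abs(offset), coeff), k=offset)
    return -0.5 * second_derivative

def noninteracting_density(v, h, num_electrons):
    """Initial density from the non-interacting Schroedinger equation, Eq. (S4)."""
    hamiltonian = kinetic_matrix(len(v), h) + np.diag(v)
    _, orbitals = np.linalg.eigh(hamiltonian)        # columns are eigenvectors, ascending
    orbitals = orbitals / np.sqrt(h)                  # normalize so that sum |phi|^2 h = 1
    density = np.zeros(len(v))
    for i in range(num_electrons // 2):
        density += 2.0 * orbitals[:, i] ** 2          # doubly occupy the lowest orbitals
    if num_electrons % 2:
        density += orbitals[:, num_electrons // 2] ** 2
    return density
```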

C. Linear density mixing
Linear density mixing is a well-known strategy to improve the convergence of the KS self-consistent calculation,
$$n^{(k+1)}(x) = n^{(k)}(x) + \alpha_k \big(n^{(k)}_{\rm out}(x) - n^{(k)}(x)\big),$$
where $n^{(k)}_{\rm out}$ is the density obtained from solving the KS equations at iteration $k$. In this work, we apply an exponential decay to the mixing factor, $\alpha_k = 0.5 \times 0.9^{\,k-1}$.

D. Symmetry
The training molecules used in this paper are symmetric about their centers. We define a symmetry operation on functions on the grid, $S : \mathbb{R}^m \to \mathbb{R}^m$, which flips a function about the center and averages it with itself, $S(f)(x) = \tfrac{1}{2}\big(f(x) + f(2x_c - x)\big)$, where $x_c$ is the center. In each KS iteration, we enforce the symmetry on the XC functional, $\epsilon_{\rm XC}[n] \to S\big(\epsilon_{\rm XC}[S(n)]\big)$, so that $E_{\rm XC}$ and $v_{\rm XC}$ are computed from the symmetrized functional. Before the output of each KS iteration, the density is likewise symmetrized, $n \to S(n)$.
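On the grid this operation is a simple flip-and-average, e.g. (our helper name):

```python
import jax.numpy as jnp

def symmetrize(f):
    """S(f): flip a grid function about the center of the grid and average with itself."""
    return 0.5 * (f + jnp.flip(f))

# Enforcing the symmetry on the functional amounts to
#   eps_sym = symmetrize(neural_xc(params, symmetrize(n), ...)),
# and the density is symmetrized the same way before each iteration's output.
```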

IV. TRAINING, VALIDATION AND TEST

A. Weights in trajectory loss
We use $w_k = 0.9^{K-k}\, H(k - 10)$, where $H$ is the Heaviside step function and $K$ is the total number of iterations.
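In code (taking $H(0)=1$, an assumption about the convention used):

```python
import jax.numpy as jnp

def trajectory_weights(num_iterations):
    """w_k = 0.9^(K - k) * H(k - 10) for k = 1..K: zero weight on the early iterations."""
    k = jnp.arange(1, num_iterations + 1)
    return jnp.where(k >= 10, 0.9 ** (num_iterations - k), 0.0)

# Example: K = 15 iterations; only iterations k >= 10 contribute to the trajectory loss.
weights = trajectory_weights(15)
```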

B. Number of KS iterations
All the KS calculations use a fixed number of iterations that is sufficient for convergence. The numbers of iterations for the different molecules are listed in Table I.

The H$_2$ dissociation curves in Figure 1 are trained from the exact densities and energies of two molecules: a compressed H$_2$ ($R = 1.28$) and a stretched H$_2$ ($R = 3.84$). The optimal checkpoint is selected using a validation molecule with $R = 2.96$.

Training molecules
The distances between nearby atoms for training molecules used in Figure 4 are listed in Table II.

Validation molecules
The distances between nearby atoms for validation molecules used in Figure 4 are listed in Table III.

Test molecules
The distances between nearby atoms for test molecules used in Figure 4 are listed in Table IV. We extend the plot of test errors in Figure 4 to $N_{\rm train} = 20$ in Figure S1 and list all the numerical values in Table V.

V. MODEL ARCHITECTURES

Figure S2 illustrates the model architectures used in Figure 1. In the ML model that directly predicts energy from geometry (Figure S2(a)), we first solve Eq. (S4) to obtain a density for a particular molecular geometry. We use this density as a smooth representation of the geometry. The first few layers (global conv-conv-SiLU-conv-SiLU) are identical to those of KSR-global in Figure S2(d). Next, we use convolution layers with 128 channels and dense layers to increase the capacity of the model. Finally, a dense layer with a single unit outputs the scalar E.
The KSR-LDA and KSR-GGA approaches do not have the global convolution layer, because the use of global information would violate the local and semi-local approximations. The first layer of KSR-LDA is a convolution with filter size 1. It mimics the physics of the standard LDA approach by mapping the density value to the XC energy density at the same point, $\epsilon^{\rm LDA}_{{\rm XC},\theta} : \mathbb{R}^1 \to \mathbb{R}^1$. KSR-GGA uses a convolution layer with filter size 3 to map the density values of three nearby points to the XC energy density at the center point, $\epsilon^{\rm GGA}_{{\rm XC},\theta} : \mathbb{R}^3 \to \mathbb{R}^1$. The XC energy density in the entire space is also computed pointwise,
$$\epsilon_{\rm XC} = \big(\epsilon^{\rm GGA}_{{\rm XC},\theta}[n(x_{-1}, x_0, x_1)], \ldots, \epsilon^{\rm GGA}_{{\rm XC},\theta}[n(x_{m-2}, x_{m-1}, x_m)]\big) \in \mathbb{R}^m.$$
The remaining network structure of KSR-LDA and KSR-GGA is identical to KSR-global, except for the self-interaction gate (SIG).

Convolution
Filter weights in the 1D convolutions are initialized by He normal initialization [3]. The stride is one. The edges are padded with zeros to ensure that the size of the output spatial dimension is the same as the size of the unpadded input spatial dimension. There is no bias term.

Global convolution
Global convolution contains multiple channels to capture interactions on different scales. The operation in each channel is
$$G[n](x, \xi_p) = \frac{1}{2\xi_p}\int dx'\, n(x')\, e^{-|x - x'|/\xi_p},$$
where $\xi_p$ is trainable and controls the scale of the interaction. We parameterize $\xi_p = a + (b - a)\,\sigma(\eta_p)$ using the sigmoid function $\sigma(x) = 1/(1 + \exp(-x))$ to bound $\xi_p \in (a, b)$. $\eta_p$ is initialized using the normal distribution $\mathcal{N}(0, 10^{-4})$. For the 16-channel global convolution layer used in this work, the first channel preserves the input density ($\xi_1 \equiv 0$), and the remaining $\xi_p \in (0.1, \kappa^{-1})$ are trainable.
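A sketch of the bounded parameterization (the bounds and initialization follow the values quoted above; treating the first channel as an exact identity is our reading of the text):

```python
import jax
import jax.numpy as jnp

def bounded_xi(eta, low=0.1, high=2.385345):
    """xi_p = a + (b - a) * sigmoid(eta_p), keeping each trainable scale in (a, b)."""
    return low + (high - low) / (1.0 + jnp.exp(-eta))

# 16 channels: the first is the identity (xi = 0 gives G[n](x, 0) = n(x)); the other 15
# scales are trainable, with eta_p drawn from N(0, 1e-4) and xi_p bounded to (0.1, kappa^-1).
eta = 1e-2 * jax.random.normal(jax.random.PRNGKey(0), (15,))
xi_trainable = bounded_xi(eta)
```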

Dense
The dense layers are only used in the direct ML model (Figure S2(a)). The weights are initialized with Glorot normal initialization [4], and the bias terms are initialized using the normal distribution $\mathcal{N}(0, 10^{-4})$.

C. Checkpoint selection
Each calculation is repeated with 25 random seeds. The model is trained by L-BFGS [5] as implemented in SciPy [6], scipy.optimize.fmin_l_bfgs_b with factr=1, m=20, pgtol=1e-14. Parameter checkpoints are saved every 10 steps until L-BFGS stops. The optimal checkpoint is the one with the lowest average energy error per electron, $|E - E_{\rm DMRG}|/N_e$, on the validation sets, where $E$ is the final energy from the KS calculation, $E_{\rm DMRG}$ is the exact energy, and $N_e$ is the number of electrons. The validation sets $\{S_{\rm val}\}$ and the molecules $M$ in each set are listed in Table III.
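The optimizer call reads roughly as follows; the loss-and-gradient function is a placeholder for the KSR loss and its gradient (e.g. obtained from `jax.value_and_grad`), and the checkpointing is only indicated in a comment:

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def loss_and_grad(flat_params):
    """Placeholder: return (scalar KSR loss, flat gradient) for the current parameters."""
    return 0.0, np.zeros_like(flat_params)

initial_params = np.zeros(1000)   # flattened neural XC parameters (size is illustrative)
optimal_params, final_loss, info = fmin_l_bfgs_b(
    loss_and_grad, initial_params, m=20, factr=1, pgtol=1e-14)
# Checkpoints are saved every 10 L-BFGS steps; the checkpoint with the lowest mean
# absolute energy error per electron on the validation molecules is selected.
```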

VI. TRAINING A NEURAL XC FUNCTIONAL WITHOUT KS REGULARIZATION
Schmidt et al. [7] proposed a neural XC functional that can be used in a KS self-consistent calculation at the inference stage. Unlike our work, which trains the network through KS calculations, they train the network on a single step. Their training set contains 12800 molecules and their validation set contains 6400 molecules; the molecules are exact solutions of one-dimensional two-electron problems in the external potential of up to three random nuclei (Equation 4 in Schmidt et al. [7]). The exact $v_{\rm XC}$ for each molecule is computed by an inverse KS method. They input the exact ground-state density and train the network to predict the XC energy per length, $e_{\rm XC}$, by minimizing a loss function that is a weighted combination of the mean square errors (MSE) of the XC energy, the XC potential, its numerical spatial derivative, and the difference between the XC energy and the integral over the potential (Equation 5 in Schmidt et al. [7]), where $\alpha = 1.0$, $\beta = 100.0$, $\gamma = 10.0$, and $\delta = 1.0$ are the weights used in Schmidt et al. [7].
It is natural to pose the question: does the generalization from the two H$_2$ training molecules in Figure 1 result from using KS self-consistent calculations at the inference stage rather than at the training stage? This is a reasonable concern because the XC energy is usually a small portion of the total energy. To address it, we first use inverse KS to obtain the exact $v_{\rm XC}$ for the two H$_2$ molecules used in the H$_2$ experiment. Then, we take the model architecture [8] and loss function of Schmidt et al. [7] and attempt to learn the entire dissociation curve of H$_2$ from two molecules. Figure S3 compares the results from KS self-consistent calculations using functionals trained with (a) single-step training and (b) Kohn-Sham regularization. It is not surprising that, even though both approaches use KS self-consistent calculations at the inference stage, the model trained on a single step fails to generalize in the small training set limit (1/6400 of the training set size of the original paper). The neural XC functional is a many-to-many mapping, which is very hard to learn with limited data. Moreover, KS self-consistent calculations start from an initial density that is not the exact ground-state density, which is clearly outside the interpolation region of a model that has only seen the exact densities of two molecules.
We would like to emphasize that this comparison aims to show that using KS calculations in training, the Kohn-Sham regularizer, is crucial to the generalization. A single-step model can work well, as reported in Schmidt et al. [7], with a larger training set and exact $v_{\rm XC}$. Unlike other methods that build physics prior knowledge into the model through constraints, KSR "augments" densities for the model during training. Thus, there is no single coefficient that explicitly controls the strength of the regularization. A straightforward idea for controlling the KSR strength is to change the total number of iterations $K$ in the KS self-consistent calculations. However, a small $K$ may not be sufficient to converge the KS calculations, making it ambiguous whether worse performance comes from weaker regularization or from unconverged KS calculations. Here we design an approach to control the strength of KSR by stopping the gradient flow in backpropagation while keeping $K$ fixed.
Stop-gradient is a common operation in differentiable programming. It acts as an identity in the forward pass, so it does not affect the KS calculations; in the backward pass, it sets the gradient passing through it to zero. As shown in Figure S4, we stop the gradient before a certain KS iteration $k = k^*$, so all previous iterations $k < k^*$ have no access to gradient information. Since the gradient may still flow into iteration $k$ from $L_E$ through its energy output $E_k$, we also stop the gradient on $E_k$ for $k < k^*$. To simplify the graph, we remove the arrows between $E_k$ and $L_E$ for $k < k^*$. In Figure S4(a), the neural XC functional is updated only from the gradient information flowing through the final iteration; (b) is similar to (a) but has access to the gradient flowing through the last two iterations.
No stop-gradient is applied in (c), which is identical to the computational graph used in the main text. We repeat the same experiment as in Figure 1. Figure S5 shows the H$_2$ dissociation curves trained with the three stop-gradient settings of Figure S4. In Figure S5(a), L-BFGS converges quickly, as there is not sufficient gradient information for training. By including the gradient information in the $(K-1)$-th iteration, the distribution of dissociation curves predicted by the model during training gets closer to the true curve in (b). For comparison, we show the distribution of dissociation curves from the model without stop-gradient in (c), previously shown in Figure 1(d), where the physics of the true dissociation curve is captured.
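In a JAX-style implementation, this truncation amounts to wrapping the carried density and the per-iteration energies with a stop-gradient for all iterations before $k^*$; a schematic modification of the earlier unrolled loop (same assumed `single_ks_step` callable):

```python
import jax
import jax.numpy as jnp

def ks_iterations_truncated(single_ks_step, theta, n_init, num_iterations, k_star):
    """Unrolled KS loop in which gradients are blocked for iterations k < k_star."""
    n = n_init
    energies = []
    for k in range(num_iterations):
        n_out, energy = single_ks_step(theta, n)
        alpha = 0.5 * 0.9 ** k
        n = n + alpha * (n_out - n)
        if k < k_star:
            # Identity in the forward pass; zero gradient in the backward pass.
            n = jax.lax.stop_gradient(n)
            energy = jax.lax.stop_gradient(energy)
        energies.append(energy)
    return n, jnp.stack(energies)
```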
FIG. 1. One-dimensional H$_2$ dissociation curves trained from two molecules (red diamonds). (a) An ML model that directly predicts E from geometries clearly fails to capture the physics from very limited data. (b) Comparison of the LDA found with KSR and that from the uniform gas (brown); (c) same as (b) but for GGA; (d) the global XC approximation found with KSR. Enn is the nucleus-nucleus repulsion energy. Grey lines denote 15 sampled functionals during training, with darker lines denoting later samples. Functionals with optimal parameters validated from the molecule at R = 3 (black triangles) are highlighted in orange, green, pink, and blue, respectively, in each panel. KSR-global yields chemical accuracy (grey shadow), shown in the lower panels. Atomic units are used throughout. See the supplemental material [40] for more details.

FIG. 2 .
FIG. 2. (a) KS-DFT as a differentiable program. Black arrows are the conventional computation flow of KS self-consistent calculations with linear density mixing (purple diamonds). The gradients flow along red dashed arrows to minimize the energy loss LE (green hexagon) and density loss Ln (orange hexagon). (b) In each single KS iteration, the neural XC functional produces $v_{{\rm XC},\theta}[n]$ and $E_{{\rm XC},\theta}[n]$. (c) Architecture of the global XC functional $\epsilon_{{\rm XC},\theta}[n]$.

FIG. 3 .
FIG. 3. (a) t-SNE visualization [60] of density trajectories (grey dots) sampled by KSR during training for R = 3.84, from initial guess (cross) to exact density (red diamond). Darker trajectories denote later optimization steps t. Densities from each KS step in the trajectories are plotted in the corresponding highlighted colors for (b) untrained t = 0, (c) optimal t = 220 in Figure 1, and (d) overfitting t = 560.

FIG. 5 .
FIG. 5. Density and KS potential of H$_4$ with R = 4.32 from neural XC functionals trained with (a) full KSR and (b) energy-only KSR on a training set of size Ntrain = 20. vS are shifted by a constant for better comparison.

FIG. S2. Model architectures used in Figure 1.

FIG
FIG. S3. Training the neural XC functional using (a) single-step training and (b) Kohn-Sham regularization. Both functionals use Kohn-Sham self-consistent calculations in the inference stage.

FIG
FIG. S5. H$_2$ dissociation curves trained from two molecules (red diamonds) with the corresponding computational graphs shown in Figure S4.

TABLE I .
Number of iterations for different molecules.

TABLE II .
Distances between nearby atoms for training molecules used in Figure 4.

TABLE III .
The distances between nearby atoms for validation molecules used in Figure 4. The validation set is fixed for calculations with 4 ≤ Ntrain ≤ 20.

TABLE IV .
The distances between nearby atoms for test molecules used in Figure 4. The test set is fixed for calculations with 4 ≤ Ntrain ≤ 20.

TABLE V .
Numerical values of the test errors plotted in Figure S1.

FIG. S1. Test generalization of ML models as a function of the total number of training examples Ntrain: full KSR (blue), energy-only KSR (pink), and direct ML (orange) on (a) holdout H$_2$ and H$_4$, and unseen types of molecules (b) H$_2^+$ and (c) H$_2$H$_2$. Black dashed lines show chemical accuracy. All the numerical values are listed in Table V.