Accelerating Finite-temperature Kohn-Sham Density Functional Theory with Deep Neural Networks

We present a numerical modeling workflow based on machine learning (ML) which reproduces the the total energies produced by Kohn-Sham density functional theory (DFT) at finite electronic temperature to within chemical accuracy at negligible computational cost. Based on deep neural networks, our workflow yields the local density of states (LDOS) for a given atomic configuration. From the LDOS, spatially-resolved, energy-resolved, and integrated quantities can be calculated, including the DFT total free energy, which serves as the Born-Oppenheimer potential energy surface for the atoms. We demonstrate the efficacy of this approach for both solid and liquid metals and compare results between independent and unified machine-learning models for solid and liquid aluminum. Our machine-learning density functional theory framework opens up the path towards multiscale materials modeling for matter under ambient and extreme conditions at a computational scale and cost that is unattainable with current algorithms.


INTRODUCTION
Multiscale materials modeling [1] provides fundamental insights into microscopic mechanisms that determine materials properties.A multiscale modeling framework operating both near first-principles accuracy and across length and time scales would enable key progress in a plethora of applications.It would greatly advance materials science research-in gaining understanding of the dynamical processes inherent to advanced manufacturing [2,3], in the search for superhard [4] and energetic materials [5,6], and in the identification of radiationhardened semiconductors [7,8].It would also pave the way for generating more accurate models of materials at extreme conditions including shock-compressed materials [9,10] and astrophysical objects, such as the structure, dynamics, and formation processes of the Earth's core [11], solar [12][13][14][15][16][17][18] and extra-solar planets [19,20].This requires faithfully passing information from quantum mechanical calculations of electronic properties on the atomic scale (nanometers and femtoseconds) up to effective continuum material simulations operating at much larger length and time scales.Atomistic simulations based on molecular dynamics (MD) techniques [21] and their efficient implementation [22] for large-scale simulations on high-performance computing platforms are the key link.For atomic-scale systems, MD results can be directly validated against quantum calculations.At the same time, MD can be scaled up to the much larger scales on which the continuum behaviors start to emerge (either micrometers or microseconds, or even both [23]).
MD simulations require accurate interatomic poten-tials (IAPs) [24].Generating them based on machine learning (ML) techniques has only recently become a rapidly evolving research area.It has already led to several ML-generated IAPs such as AGNI [25], ANI [26], DPMD [27], GAP [28], GARFfield [29], HIP-NN [30], SchNet [31], and SNAP [32].While these IAPs differ in their flavor of ML models, for instance, non-linear optimization, kernel methods, structured neural networks, and convolutional neural networks, they all rely on accurate training data from first-principles methods.First-principles training datasets are commonly generated with Kohn-Sham density functional theory (DFT) [34] and its generalization to finite electronic temperature [35].Due to its balance between accuracy and computational cost, it is the method of choice [36] for computing properties of molecules and materials with close to chemical accuracy [37,38].
Despite its success [39], DFT is limited to the nanoscale due to its computational cost which scales as the cube of the system size.At the same time, the amount of data needed to construct an IAP increases exponentially with the number of chemical elements, thermodynamic states, phases, and interfaces [40].Likewise, evaluation of thermodynamic properties using DFT is computationally expensive, typically 10 5 to 10 6 core-hours for each point on a grid of temperatures and densities [41].Furthermore, fully converged DFT calculations at low densities, very high temperatures, or near phase transitions are considerably more difficult.
Pioneering efforts in applying ML to electronic structure calculations focus on mining benchmark databases generated from experiments, quantum chemistry methods, and DFT in order to train ML models based on kernel ridge regression and feed-forward neural networks.These efforts enabled ML predictions of molecular properties [42][43][44] and crystal structures [45].Only limited efforts have gone into actually speeding up DFT calcu-FIG.1. Illustration of the ML-DFT workflow listed in Table I.The raw input data is generated from finite-temperature DFT calculations for snapshots of atomic configurations, i.e., positions and chemical identities of atoms (a).SNAP fingerprints in terms of bispectrum components are generated for each of several million Cartesian grid points in the snapshot volume (b).The trained ML-DFT feed-forward neural network takes the input vector f composed of SNAP fingerprints for one Cartesian grid point and outputs the predicted LDOS vector d at the grid point (c).The primary output of the ML inference is the LDOS on a Cartesian grid (d) which is used to compute the total energy (e) based on Eq. ( 9).The embedded Figure 6 displays our central result-the capability of a single ML-DFT model to predict the total energy in both solid and liquid phases of aluminum with accuracy comparable to standard DFT.
lations by directly approximating the solutions of the Kohn-Sham differential equations through ML models.Prior work has focused on approximating the kinetic energy functional using ML regression [46][47][48] and deeplearning models [49].Most recently, ongoing efforts have demonstrated the usefulness of ML kernel techniques on the electronic density [50] and feed-forward neural networks on the density of states (DOS) [45,51,52], and the local density of states (LDOS) [53] for creating computationally efficient DFT surrogates.
In this work, we develop ML-DFT, a surrogate for DFT calculations at finite electronic temperature using feedforward neural networks.We derive a representation of the Born-Oppenheimer potential energy surface at finite electronic temperature (Section II A) that lends itself to an implementation within an ML workflow (Section II B).Using the LDOS as the central quantity, ML-DFT enables us to compute the DFT total free energy in terms of a single ML model (Section II B), applicable to coupled electrons and ions not only under ambient conditions, but also in liquid metals and plasmas.For the sake of simplicity, we will refer to the DFT total free energy as defined in Eq. ( 6) as total energy.Utilizing a generalization of the SNAP bispectrum descriptors on a Cartesian grid (Section II E), the feed-forward neural network of ML-DFT (Section II F) is trained on DFT data (Section II C).Integrated quantities such as the band energy and the total energy of an atomic configuration at a given electronic temperature are in turn computed based on the LDOS predicted by ML-DFT.Using the LDOS in conjunction with the developed expression for total energy avoids the need to introduce another ML model to map densities to kinetic energies [50] and goes significantly beyond prior ML efforts related to the LDOS [53].ML-DFT yields DFT-level accuracy, but comes at a negligible computa-tional cost by avoiding the O(N 3 ) eigensolver step.We demonstrate this for aluminum (Section III).Trained on atomic configurations of aluminum at the melting point, a single ML-DFT model yields accurate results (LDOS, electronic density, band energy, and total energy) for both crystalline and liquid aluminum.Notably, band energies and total energies predicted by ML-DFT agree with the results of conventional DFT calculations to well within chemical accuracy, which is traditionally defined as 1 kcal/mol = 43.4meV/atom.Hence, ML-DFT meets all requirements to become the backbone of the computational infrastructure needed for high-performance multiscale materials modeling of matter under ambient and extreme conditions (Section IV).
The ML-DFT workflow is illustrated in Figure 1.It starts with input data generated from finite-temperature DFT calculations for a given atomic configuration (a).SNAP fingerprints are generated for each of several million Cartesian grid points in the snapshot volume (b).Given a SNAP fingerprint as input, the trained ML-DFT feed-forward neural network makes its prediction for the LDOS vector (c).The primary output of the ML inference is the LDOS prediction at each grid point (d).Based on the predicted LDOS, quantities such as the electronic density based on Eq. ( 10), the DOS based on Eq. ( 13), and the total energy based on Eq. ( 9) are computed (e).

A. Born-Oppenheimer Density Functional Theory at Finite Electronic Temperature
A suitable theoretical framework for computing thermodynamic materials properties from first principles is within the scope of non-relativistic quantum mechanics in the Born-Oppenheimer approximation [54].Additionally we work within a formalism that takes the electronic temperature into account.This becomes relevant when the scale of the electronic temperature becomes comparable to the first excitation energy of the electronic system which is particularly relevant for liquid metals and plasmas.In the given framework we, hence, assume that the electrons find their thermal equilibrium on a time scale that is small compared to that of ionic motion.The formal development and implementation of such methodologies that couple electron and ion dynamics is an area of active research [55][56][57][58][59][60][61].
More formally, consider a system of N e electrons and N i ions with collective coordinates r = {r 1 , . . ., r Ne } and R = {R 1 , . . ., R Ni }, where r j ∈ R 3 refers to the position of the jth electron, while R α ∈ R 3 denotes the position of the αth ion of mass M α and charge Z α .The physics in this framework is governed by the Born-Oppenheimer Hamiltonian where T e (r) = Ne j −∇ 2 j /2 denotes the kinetic energy of the electrons, V ee (r) = ionion interaction.Note that we work within atomic units throughout, where = m e = e 2 = 1, such that energies are expressed in Hartrees and length in Bohr radii.
Using the standard framework of quantum statistical mechanics [62], all thermodynamic equilibrium properties are formally given in terms of the grand potential which is defined as a statistical average over the grand canonical operator Ω = ĤBO − µ N − Ŝ/β, where N denotes the particle number operator, Ŝ the entropy operator, µ the chemical potential, and where k B is the Boltzmann constant and T is the electronic temperature.Statistical averages as in Eq. ( 2) are computed via the statistical density operator Γ = Ne,j w Ne,j |Ψ Ne,j Ψ Ne,j | which is a weighted sum of projection operators on the underlying Hilbert space spanned by the N e -particle eigenstates Ψ Ne,j of ĤBO with energies E Ne,j and statistical weights w Ne,j = exp[−β(E Ne,j −µN e )]/ Ne,j exp[−β(E Ne,j −µN e )].The thermal equilibrium of a grand canonical ensemble is then defined as that statistical density operator which minimizes the grand potential.Further details on the grand canonical ensemble formulation of a thermal electronic system can be found in Refs.[63,64].
In practice, the electron-electron interaction complicates finding solutions to Ψ Ne,j and, hence, evaluating the grand potential as defined in Eq. (2).Instead, a solution is found in a computationally feasible manner within DFT [34] at finite electronic temperature [35,63,65].Here, a formally exact representation of the grand potential is given by where T S denotes the Kohn-Sham kinetic energy, S S the Kohn-Sham entropy, U [n] the classical electrostatic interaction energy, E XC the exchange-correlation free energy which is the sum of the temperature-dependent exchange-correlation energy and the interacting entropy, v ei (r; R) = − α Z α /|r − R α | the electron-ion interaction, and V ii the ion-ion interaction.The grand potential in Eq. ( 3) is evaluated in terms of the electronic density defined by where the sum runs over the Kohn-Sham orbitals φ j and eigenvalues j that are obtained from the solving the Kohn-Sham equations where f β ( j ) = (1 + exp[β( j − µ)]) −1 denotes the Fermi-Dirac distribution.The Kohn-Sham framework is constructed such that a noninteracting system of fermions yields the same electronic density as the interacting many-body system of electrons defined by ĤBO .This is achieved by the Kohn-Sham potential v S (r; R) = δU [n]/δn(r; R) + δE XC [n]/δn(r; R) + v ei (r; R) which includes all electron-electron interactions on a mean-field level.While formally exact, approximations to E XC are applied in practice either at the electronic ground state [39] or at finite temperature [66][67][68].
The key quantity connecting the electronic and ionic degrees of freedoms is the total energy This expression yields the Born-Oppenheimer potential energy surface at finite electronic temperature, when it is evaluated as a function of R. It provides the forces F α = −∂A BO (R)/∂R α on the ions which are obtained from solving for the instantaneous, grand potential in thermal equilibrium Ω 0 [n] for a given atomic configuration R via Eq.( 5).In doing so, Eq. ( 6) enables us to time-propagate the dynamics of the atomic configuration R, in its simplest realization, by solving a Newtonian equation of motion This can be further extended, for example, to take into account the energy transfer between the coupled system of electrons and ions by introducing thermostats [69,70].Despite its powerful utility for predicting thermodynamic and material properties on the atomic scale [37,38] and providing benchmark data for the parameterization of IAPs for use in MD simulations [29], applying DFT becomes computationally infeasible in applications towards the mesoscopic scale due to the computational scaling of Eq. (5).The computational cost for data generation explodes exponentially with the number of chemical elements, thermodynamic states, phases, and interfaces [40] needed for the construction of IAPs.

B. Machine-learning Model of the Total Energy
We now reformulate DFT in terms of the LDOS which is granted by virtue of the first Hohenberg-Kohn theorem [33].We then replace the DFT evaluation of the LDOS with a ML model for the LDOS in order to obtain all of the quantities required to evaluate the total energy while avoiding the computationally expensive solution of Eq. ( 5).The LDOS is defined as a sum over the Kohn-Sham orbitals where denotes the energy as a continuous variable, and δ( ) is the Dirac delta function.This enables us to rewrite Eq. ( 6) as where denotes the potential energy component of the XC energy [71].
The central advantage of this reformulation is that, by virtue of Eq. ( 9), the total energy is expressed solely in terms of the LDOS.We emphasize this by explicitly denoting the functional dependence of each term on the LDOS.Now let us turn to the individual terms in this expression.The ion-ion interaction V ii (R) is given trivially by the atomic configuration.The Hartree energy and the potential XC energy ] are all determined implicitly by the LDOS via their explicit dependence on the electronic density Likewise, the band energy and the Kohn-Sham entropy are determined implicitly by the LDOS via their explicit dependence on the DOS Eq. ( 9) enables us to evaluate the total energy in terms of a single ML model that is trained on the LDOS and can predict the LDOS for snapshots unseen in training.This is advantageous, as it has been shown that learning energies from electronic properties, such as the density, is significantly more accurate [50] than learning them directly from atomic descriptors [46,72,73].Furthermore, developing an ML model based on the LDOS instead of the density avoids complications of prior models.For example, accuracy was lost due to an additional ML model needed to map densities to kinetic energies or noise was introduced by taking gradients of the ML kinetic energy models [47,48,74,75].The two formulations in Eq. ( 6) and Eq. ( 9) are mathematically equivalent.However, Eq. ( 9) is more convenient in our ML-DFT workflow where we use the LDOS as the central quantity.To our knowledge, this work is the very first implementation of DFT at finite electronic temperature in terms of an ML model.

C. Data Generation
Our initial efforts have focused on developing an ML model for aluminum at ambient density (2.699 g/cc) and temperatures up to the melting point (933 K).The training data for this model was generated by calculating the LDOS for a set of atomic configurations using the Quantum Espresso electronic structure code [76][77][78].The structures were generated from snapshots of equilibrated DFT-MD trajectories for 256 atom supercells of aluminum.We calculated LDOS training data for 10 snapshots at room temperature, 10 snapshots of the crystalline phase at the melting point, and 10 snapshots of the liquid phase at the melting point.
Our DFT calculations used a scalar-relativistic, optimized norm-conserving Vanderbilt pseudopotential (Al.sr-pbesol.upf)[79] and the PBEsol exchangecorrelation functional [80].We used a 100 Rydberg planewave cutoff to represent the Kohn-Sham orbitals and a 400 Rydberg cutoff to represent densities and potentials.These cutoffs resulted in a 200 × 200 × 200 realspace grid for the densities and potentials, and we used this same grid to represent the LDOS.The electronic occupations were generated using Fermi-Dirac smearing with the electronic temperature corresponding to the ionic temperature.Monkhorst-Pack [81] k-point sampling grids, shifted to include the gamma point when necessary, were used to sample the band structure over the Brillouin zone.The DFT-MD calculations that generated the snapshots used 2 × 2 × 2 k-point sampling for the crystalline phase and 1 × 1 × 1 k-point sampling for the liquid phase.
In contrast, the generation of the LDOS used 8 × 8 × 8 k-point sampling.The reason for this unusually high k-point sampling (for a 256 atom supercell) was to help overcome a challenge that arises when using the LDOS as the fundamental parameter in a ML surrogate for DFT.For a periodic supercell, the LDOS is given as where the index j now labels the Kohn-Sham orbitals at a particular k value, and the vector index k categorizes the Kohn-Sham orbitals by their transformation properties under the group of translation symmetries of the periodic supercell.In particular, where u jk (r; R) is some function with the same periodicity as the periodically repeated atomic positions R.
In Eq. ( 14), the integral with respect to k is taken over the first Brillouin zone (BZ), which has volume Ω BZ .If this integral could be evaluated exactly, the resulting LDOS would be a continuous function of with Van Hove singularities [82] of the form √ − or √ − at energies that correspond to critical points of the band structure jk .However, when this integral is represented as a summation over a finite number of k-points k 1 , . . ., k N k , as in the Monkhorst-Pack [81] approach used in our calculations, it is necessary to replace the Dirac delta function δ( − jk ) appearing in Eq. ( 14) with a finite-width representation δ( − jk ).The LDOS used to train our ML model is then obtained by evaluating on a finite grid of values.In our calculations, we use an evenly spaced grid with a spacing of 0.1 eV, a minimum value of -10 eV, and a maximum value of +15 eV.Hence, the data for each grid point consists of a vector of 250 scalar values.Kohn-Sham orbitals with eigenvalues significantly outside the energy range where we evaluate the LDOS make negligible contributions to the calculated LDOS values, and ideally we would include a sufficient number N s of the lowest Kohn-Sham orbitals in Eq. ( 14) and Eq. ( 16) to span this energy range.However, due to memory constraints on our 8 × 8 × 8 k-point DFT calculations, we were only able to calculate N s = 576 orbitals.At ambient density, this provides an accurate LDOS up to ≈ 10 eV, and the relevant Fermi levels are at µ ≈ 7.7 eV.The tails of f β ( ) should be negligible above ≈ 10 eV for T 2500 K.At higher electronic temperatures, it would become necessary to increase N S when generating the training data in order to get an accurate LDOS over a wider energy range.The challenge mentioned above arises in the choice of the delta function representation δ and the k-point sampling scheme k 1 , . . ., k N k for a given grid.If the function δ is too wide, it leads to systematic errors in the LDOS and derived quantities such as the density, band energy and total energy.However, if δ is too narrow or too few k-points are used to sample the BZ, then only a few values of jk k make substantial contributions to the LDOS at each point on the grid, and there is a large amount of noise in the resulting LDOS.
Figure 2 demonstrates these trade-offs for a Gaussian representation δ(x) = exp −x 2 /σ 2 / √ πσ 2 by plotting the DOS D[D]( ; R) in the upper panel and the error in the band energy calculated from the DOS in the lower panel.The top panel shows that using fewer k-points or a narrower δ leads to unphysical noise in the calculated DOS, while the bottom panel shows that using a wider δ leads to an increasing error in the band energy.The lower panel also shows that the noise in the DOS that arises with a narrower δ is reflected in an increasing variability in the band energy error.
Since a larger k-point sampling allows a narrower δ to be used without introducing excessive noise into the LDOS, we can minimize the errors in quantities calculated from the LDOS by using as many k-points as possible.In the LDOS calculations used to train our ML model, we used 8 × 8 × 8 k-point sampling, which was the largest k-point sampling that was computationally tractable for our 256 atom cells.We then chose a Gaussian δ with σ = 0.2 eV (indicated in the lower panel of Figure 2), which gives a relatively small and very consistent error in derived quantities.For example, the average error in the band energies calculated from the resulting LDOS (relative to fully converged DFT) is -4.483 meV/atom when the average is taken over all of our snapshots.The average error becomes -4.466, -4.500, and -4.483 meV/atom when the average is restricted to the 298K, 933K solid, and 933K liquid snapshots, respectively.These results show that the error changes very little between different temperatures and structures.These errors are also very consistent for different structures at the same temperature and phase.In particular, the root mean square variations in the errors are 0.004, 0.011, and 0.017 meV/atom for the 298K, 933K solid, and 933K liquid snapshots respectively.Our choices for the grid spacing and the width of the Gaussian broadening turn out to be similar to those made in Ref. [52].
In electronic structure calculations, consistent errors in calculated energies (for example, the errors due to incomplete convergence with respect to the plane-wave cutoffs in traditional DFT calculations) are generally not a problem.Instead, quantities of interest almost always involve the difference between two energies, and thus it is the variation of the errors between different structures that affects actual results.With the k-point sampling and the choice of δ described above, the errors differ by only about one-thousandth of the room-temperature thermal energy, and such consistent errors are very unlikely to impact quantities of interest.Furthermore, these errors are independent of the errors introduced by using ML to approximate the LDOS.In particular, a perfect ML approximation would reproduce the results calculated from the DFT LDOS (defined in the same manner as in the generation of our training data) rather than the results calculated by exact DFT.Finally, we believe that further work will be able to identify methods that use DFT to calculate even more accurate LDOS training data with less computational effort than our 8×8 ×8 k-point calculations.Some ideas include Fourier interpolation of the Kohn-Sham bands and orbitals, or an approach analogous to the tetrahedron method [83].The remainder of this paper will focus on the effects of approximating the LDOS with ML.In light of the above considerations, errors will be quoted relative to results calculated from the DFT LDOS rather than exact DFT.

D. Evaluation of Energies from the Local Density of States
A key step in calculating the total energy defined in Eq. ( 9) from the LDOS is the evaluation of several integrals in Eqs.(10), (11), and ( 12) that have the form for some function g( ).The evaluation of Eq. ( 17) poses a challenge, because g( ) changes rapidly compared to D( ) in several cases of interest.We provide a solution in terms of an analytical integration, as illustrated in Appendix A. Given our analytical integration technique, it is straightforward to evaluate the total energy using Eq. ( 9) and the standard methods of DFT in an accurate and numerically stable manner.

E. Fingerprint Generation
We assume that the LDOS at any point in space can be approximated by a function that depends only on the positions and chemical identities of atoms within some finite neighborhood of the point.In order to approximate this function using ML, we need to construct a fingerprint that maps the neighborhood of any point to a set of scalar values called descriptors.On physical grounds, good descriptors must satisfy certain minimum requirements: (i) invariance under permutation, translation, and rotation of the atoms in the neighborhood and (ii) continuous differentiable mapping from atomic positions to descriptors, especially at the boundary of the neighborhood.While these requirements exclude some otherwise appealing choices, such as the Coulomb matrix [84], the space of physically valid descriptors is still vast.In the context of ML-IAPs, the construction of good atomic neighborhood descriptors has been the subject of intense research.Recent work by Drautz [85] has shown that many prominent descriptors [86][87][88] belong to a larger family that are obtained from successively higher order terms in an expansion of the local atomic density in cluster integrals.The SNAP bispectrum descriptors [32] that we use in this work correspond to clusters of three neighbor atoms yielding four-body descriptors.
In contrast to prior work on ML-IAPs in which descriptors are evaluated on atom-centered neighborhoods, here we evaluate the SNAP descriptors on the same 200 × 200 × 200 Cartesian grid points at which we evaluated the LDOS training data (see Eq. ( 16) above).For each grid point, we position it at the origin and define local atom positions in the neighborhood relative to it.
The total density of neighbor atoms is represented as a sum of δ-functions in a three-dimensional space: where r k is the position of the neighbor atom k of element ν k relative to the grid point.The w ν coefficients are dimensionless weights that are chosen to distinguish atoms of different chemical elements ν, while a weight of unity is assigned to the location of the LDOS point at the origin.The sum is over all atoms k within some cutoff distance R ν k cut that depends on the chemical identity of the neighbor atom.The switching function f c (r; R ν k cut ) ensures that the contribution of each neighbor atom goes smoothly to zero at R ν k cut .Following Bartók et al. [87], the radial distance r k is mapped to a third polar angle θ 0 defined by The additional angle θ 0 allows the set of points r k in the 3D ball of possible neighbor positions to be mapped on to the set of points (θ, φ, θ 0 ) on the unit 3-sphere.The neighbor density function can be expanded in the basis of 4D hyperspherical harmonic functions where u j and U j are rank (2j + 1) complex square matrices, u j are Fourier expansion coefficients given by the inner product of the neighbor density with the basis functions U j of degree j, and the • symbol indicates the scalar product of the two matrices.Because the neighbor density is a weighted sum of δ-functions, each expansion coefficient can be written as a sum over discrete values of the corresponding basis function The expansion coefficients u j are complex-valued and they are not directly useful as descriptors because they are not invariant under rotation of the polar coordinate frame.However, scalar triple products of the expansion coefficients are real-valued and invariant under rotation [87].The symbol ⊗ j1j2j indicates a Clebsch-Gordan product of matrices of rank j 1 and j 2 that produces a matrix of rank j, as defined in our original formulation of SNAP [88].These invariants are the components of the bispectrum.They characterize the strength of density correlations at three points on the 3-sphere.The lowest-order components describe the coarsest features of the density function, while higher-order components reflect finer detail.The bispectrum components defined here have been shown to be closely related to the 4-body basis functions of the Atomic Cluster Expansion introduced by Drautz [85].For computational convenience, the LAMMPS implementation of the SNAP descriptors on the grid includes all unique bispectrum components B j1j2j with indices no greater than J max .For the current study we chose J max = 5, yielding a fingerprint vector consisting of 91 scalar values for each grid point.The cutoff distance and element weight for the aluminum atoms were set to 4.676 Å and 1.0, respectively.

F. Machine-learning Workflow and Neural Network Model
Neural network models, especially deep neural networks, are profoundly successful in numerous tasks, such as parameter estimation [89,90], image classification [91,92], and natural language processing [93,94].Moreover, neural networks may act as function approximators [95,96].A typical feed-forward neural network is constructed as a sequence of transformations, or layers, with the form: where v (m,1) is the intermediate column-vector of length m at layer , W (n,m) is a matrix of size n × m, b (n,1) is a bias vector of length n, and v +1 (n,1) is the next columnvector of length n in the sequence.The first and the last layers accept input vectors, v 0 , and produce output predictions, v L , respectively.There can be any arbitrary number L − 1 of hidden layers constructed between the input layer and the output layer.After each layer, a non-linear activation unit ϕ transforms its input.The composition of these layers is expected to learn complex, non-linear mappings.
Given a dataset of N s input vectors and target output vectors, (v 0 , vL ), we seek to find the optimal W and b , ∀ ∈ [0, L − 1], such that the neural network minimizes the root mean squared loss between the neural network predictions v L and targets vL .In this work, the network predicts LDOS at each grid point using the SNAP fingerprint as input.The specific feed-forward networks we consider have input fingerprint vectors of length 91 scalar values and output LDOS vectors of length 250 scalar values.The grid points are defined on a uniform 200 × 200 × 200 Cartesian grid, or 8 million total points per snapshot.
Neural network models are created in two stages: training and validation, using separate training and validation datasets.The training phase performs gradientbased updates, or backpropagation [96], to W and b using a small subset, or mini-batch, of column vectors in the training dataset.Steps taken in the direction of the gradient are weighted by a user provided learning rate (LR).Successive updates incrementally optimize the neural network.An epoch is one full pass through the training dataset for the training phase.After each epoch, we perform the validation phase, which is an evaluation of the current model's accuracy on a validation dataset.The training process continues alternating between training and validation phases until an early stopping termination criterion [97] is satisfied.Early stopping terminates the learning process when the root mean squared loss on the validation dataset does not decrease for a successive number of epochs, or patience count.Increasing validation loss indicates that the model is starting to over-fit the training dataset and will have worse generalization performance if the training continues.
Next, the hyperparameters, which determine the neural network architecture and optimizer specifications, must be optimized to maximize model accuracy (and minimize loss).Due to the limited throughput, we make the assumption that all hyperparameters are independent.Independence allows the hyperparameter tuning to optimize one parameter at a time, greatly reducing the total rounds of training.We then ensure local optimality of the hyperparameter minimizer by direct search [98].
The ML-DFT model uses as input vectors the SNAP fingerprints defined at real-space grid points.The target output vectors of the ML-DFT model are the Quantum Espresso generated LDOS vectors defined at the corresponding real-space grid points.Importantly, each fingerprint and LDOS vector is treated as independent of any other grid point.Further, the training data may include multiple aluminum snapshots, in which case the number of grid points is equal to the number of snapshots times the number of grid points per snapshot.No distinction is made in the model by grid points belonging to separate snapshots.The validation snapshot is always a full set of grid points from a single snapshot.
For inference, all remaining snapshots are used to test the generalization of the ML-DFT model.Consider a single snapshot where we wish to predict the total energy.At each grid point, a SNAP fingerprint vector is used to predict an LDOS vector.Using the steps in Section II D and Appendix A, the full set of ML predicted LDOS vectors produce a total energy.
In summary, the ML-DFT training workflow is provided in Table I.The ML-DFT models are built on top of the PyTorch [99] framework and the Numpy [100] and Horovod [101] packages.PyTorch provides the necessary infrastructure for neural network construction and efficiently running the training calculations on GPU machines.The Numpy package contains many performant tools for manipulating tensor data, and the Horovod package enables data-parallel training for MPI-based scalability.

III. RESULTS
The difficulty of predicting the electronic structure of materials is strongly dependent on the diversity of local atomic configurations considered in training and testing.During the development stage of this work we focused on the relatively simple case of solid aluminum at ambient temperature and density (298 K, 2.699 g/cm 3 ).We then turned our attention to the more challenging case of solid and liquid aluminum near the melting point (933 K, 2.699 g/cm 3 ) and we present results for that case here.
Specifically, we train and tune a collection of unique ML-DFT models for each temperature.In the 298 K case, we train a single model using 1 snapshot, validate on 1 snapshot, and test on 8 snapshots.At 933 K, we train 8 models with different combinations of liquid and solid snapshots as the training set.The first 3 cases use 4 training snapshots with either 4 liquid snapshots, 4 solid snapshots, or 2 liquid and 2 solid snapshots.The second 3 cases, similarly, use 8 training snapshots with either 8 liquid, 8 solid, or 4 liquid and 4 solid snapshots.Finally, there are two models that use 8 or 12 training snapshots and a validation set that include both liquid and solid snapshots.Each 933 K model created is then tested on the remaining 933 K snapshots.In order to study the variability in LDOS between the three groups (298 K solid, 933 K solid, and 933 K liquid), we reduce the dimensionality of the LDOS datasets and study them using principal component analysis (PCA).
Further technical details on fingerprint generation, neural network architecture, ML training, optimization of hyper-parameters, and the variability in LDOS outputs are given in Appendix B. We provide results for the ambient case in Appendix C.

A. Local Density of States Predictions for Solid and Liquid Aluminum from ML-DFT
First we assess the accuracy of ML-DFT for spatially resolved and energy resolved quantities.To this end, Figure 3 illustrates the DOS and its errors in the relevant energy range from -5 to 10 eV, where the Fermi energy is at 7.689 eV in the liquid and 7.750 eV in the solid phase.The DOS prediction is computed from Eq. ( 13) using the ML-DFT predicted LDOS.In both liquid (left) and solid (right) phases, the ML-DFT model reproduces the DOS to very high accuracy.The illustration of the errors (bottom panels) confirms the ML-DFT model's accuracy and shows the absence of any unwanted error cancellation over the energy grid.Figure 3 also shows that the Van Hove singularities [82] at and above ≈ 5 eV, which show up strongly in the 298K DOS in Figure 2, are noticeably smeared out by thermal fluctuations in the atomic positions in the 933 K solid-phase DOS and essentially gone in the 933 K liquid.Figure 4 shows the DFT target electronic density versus the predicted electronic density for both liquid (left) and solid (right) phases at each point on the Cartesian grid.The frequency of the predicted results is indicated by the marker color.The ML-DFT prediction is computed from Eq. ( 10) using the ML-DFT predicted LDOS.The alignments of the predictions along the diagonal exhibit small standard deviations for the liquid and solid test snapshot.It illustrates how well the ML-DFT model reproduces the electronic density in a systematic manner with high accuracy, despite the fact that the electronic structure is qualitatively very distinct in the solid and liquid phases of aluminum.Furthermore, maximum absolute errors for the liquid and solid test snapshot are 0.00601 e − / Å3 and 0.00254 e − / Å3 , respectively.The mean absolute errors are an order of magnitude lower at 0.000284 e − / Å3 and 0.000167 e − / Å3 , respectively.The mean absolute percentage errors are 1.12% and 0.66%, respectively.We demonstrate the errors in total energy and band energy on models trained on data from the same phase, either solid or liquid.Figure 5 and Table II show these results.When training, validation, and test sets are all from the same phase, either solid or liquid, the overall ML-DFT accuracy is very high.The mean absolute error in the total energy for the 8 snapshot liquid model is 9.31 meV/atom.Moreover, the 8 snapshot solid model has a mean absolute error of 15.71 meV/atom.In both cases, the main contribution to the error is due to the band energy error which is 5.21 meV/atom in the liquid test set and 13.37 meV/atom in the solid.Both models have a similar mean absolute percentage error, 0.01% in the liquid model and 0.03% in the solid model.
However, training on a single phase results in poor predictive accuracy for the phase unseen in training, as is shown clearly in Figure 5 and Table II.For the ML-DFT model trained on 8 liquid snapshots, the mean absolute total energy error in the solid test set rises to 111.41 meV/atom.For the model trained on 8 solid snapshots, the mean absolute total energy error in the liquid test set rises to 123.29 meV/atom.Again, the major contribution to the error stems from the band energy which is 113.40 meV/atom in the solid test set and 139.96 meV/atom in the liquid.

C. Hybrid Liquid-Solid Aluminum ML-DFT Models
In Figure 6 and Table II we demonstrate our central result, a single ML-DFT model that is capable of yielding accurate results for both liquid and solid aluminum at a temperature of 933 K with accuracy sufficient for meaningful atomistic simulations of electronic, thermodynamic, and kinetic behavior of materials.Using the hyperparameters listed in Table III, the most accurate dual-phase hybrid ML-DFT model was trained using 6 liquid and 6 solid snapshots (snapshots 0-5 and 10-15) as the training dataset and 1 solid and 1 liquid snapshot (snapshots 6 and 16) as a validation set.This ML-DFT model achieves state-of-the-art ML accuracy for all test snapshots considered.
Overall, Table II summarizes the total energy (Eq.9) and band energy (Eq.11) errors for each of the eight ML-DFT models.Training and validation sets determine the specific ML-DFT model and the test sets are used to measure the model's ability to generalize to liquid or solid phase snapshots.
Most notably, our overall most accurate hybrid ML-DFT model predicts the total energy over both solid and liquid phases with a mean absolute error of 13.04 meV/atom in the solid test set and 31.60 meV/atom in the liquid test set, as illustrated in Figure 6 and listed in the last row of Table II.In comparison with the singlephase ML-DFT models in Figure 5, the hybrid ML-DFT model is able to generalize to both liquid and solid snapshots.This is adequate to resolve the solid-liquid total energy difference (110 meV/atom) and is approaching the 5 meV/atom accuracy of the best ML-IAPs used in large-scale MD simulations [102].
Furthermore, the sensitivity of ML-DFT towards the volume of training data is also assessed in Table II.Reducing the training dataset to only 2 liquid and 2 solid snapshots degrades the accuracy for both the band and total energies as expected.While the mean absolute error of 29.50 meV/atom in the solid phase is about the same as in the larger training set, the error in the liquid phase rises to 48.47 meV/atom.This emphasizes the fact that a desired accuracy in the ML-DFT model can be approached systematically by increasing the volume of training data.
Interestingly, in Figure 6, the training snapshots do not necessarily have the lowest error as might be expected.The band and total energy errors for training snapshot 0, for example, are slightly larger than the errors for test snapshots 8 and 9.For the solid snapshots, both training snapshots (10)(11)(12)(13)(14)(15) and test snapshots (17)(18)(19) have very similar errors in both the band energy and total energy.Similar observations can be made for Figure 5.A greater number of liquid and solid test snapshots will only further refine the statistics of the ML-DFT model accu-

IV. DISCUSSION
In the present work, we establish an ML framework that eliminates the most computationally demanding and poorly scaling components of DFT calculations.Once trained with DFT results for similar systems, it can predict many of the key results of a DFT calculation, includ-ing the LDOS, DOS, density, band energy, and total energy, with good accuracy.After training, the only input required by the model is the atomic configuration, i.e., the positions and chemical identities of atoms.We have been able to train a single model that accurately represents both solid and liquid properties of aluminum.The ML-DFT model is distinct from previous ML methodologies, as it is centered on materials modeling in support of accurate MD-based predictions for the evolution of both electronic and atomic properties.We represent materials descriptors (such as the LDOS) on a regular grid as opposed to basis functions [50] and use a specific formula-tion in terms of the LDOS to reconstruct the total energy.Our approach also generalizes to non-zero electronic temperatures in a straightforward way, which allows it to be easily applied to materials at extreme conditions.Furthermore, the very recently published work [53] stopped short of demonstrating that the ML-predicted LDOS can be utilized to accurately calculate energies and forces needed to enable MD calculations.Further work on computing accurate forces of materials under elevated temperatures and pressures will follow, as well as a detailed exploration of ML-DFT's accuracy when scaled to O(10 4 ) atoms, well beyond the reach of standard DFT.
When D( ) = D[D]( ; R), I 0 = F 0 and I 1 = F 1 , we obtain I = N e , the total number of electrons in the system.The Fermi level µ is then adjusted until N e is equal to the total number of valence electrons in the system (i.e., the system is charge neutral).When I 0 = F 1 and I 1 = F 2 , we obtain I = E b [D] − µN e , and we can easily obtain the band energy E b [D] by adding µN e .When I 0 = S 0 and I 1 = S 1 , we obtain I = −β −1 S S [D], the entropy contribution to the energy.Finally, when D( ) = D( , r; R), I 0 = F 0 and I 1 = F 1 , we obtain n[D](r), the electronic density.Given these quantities, it is straightforward to evaluate the total (free) energy using Eq. ( 9) and the standard methods of DFT.The 298 K systems include 10 snapshots of SNAP fingerprint and Quantum Espresso LDOS data.At 933 K, there are 10 snapshots of liquid aluminum and 10 snapshots of solid aluminum.For the fingerprints and LDOS data, each snapshot uses the same uniform Cartesian real-space grid, (200 × 200 × 200 = 8 × 10 6 points).Each snapshot, including both SNAP fingerprint and LDOS data, is 20.7 gigabytes.
We observe that the 933 K ML-DFT models require a greater amount of training data.Furthermore, the optimized 298 K model requires approximately 1.6 million weights, whereas the 933 K models require 33.4 million weights.The larger training set and greater model complexity can be attributed to greater LDOS variability at the higher temperature.
To facilitate comparing LDOS variability between the three groups (298 K solid, 933 K solid, and 933 K liquid), the dimensionality of the LDOS datasets was reduced using principal component analysis (PCA).Because of the large amount of data (30 × 8 × 10 6 samples by 250 energy levels), PCA was performed incrementally using  We perform a sequence of hyperparameter tunings once for each temperature using the optimization strategy described in Section II F. The final 933 K models are trained using the same hyperparameters.Further accuracy may be obtainable once training throughput is increased and multiple, independent hyperparameter optimizations are made possible.The optimized ML-DFT model parameters are given in Table III.Scaling of the input and output data prior to training is critical to ML-DFT model accuracy.We perform an element-wise standardization to mean 0 and standard deviation 1 for the fingerprint inputs.For example, standardize the first scalar of all fingerprint vectors in the training set, then standardize the second scalar of all fingerprints, and so on.For the LDOS outputs, we perform a normalization using the maximum LDOS value of the entire training dataset.
The hyperparameter convergence for the 298 K and 933 K models required 48 and 23 training iterations, re-    approximately 76 minutes.With a sufficient number of nodes, we are able to stage each training snapshot on an independent node to alleviate the lazy loading data movement penalty completely.Future performance improvements to the workflow will target this bottleneck chiefly.
Finally, ML-DFT inference requires only 54 seconds to obtain a full grid of LDOS predictions for one snapshot.A full overview of the computational time required for each individual step of inference is given in Tab.V. Apart from neural network inference, the major time contributions to overall inference time stem from the calculation of fingerprint vectors, LDOS integration and evaluation of density dependent terms.Parallelization and/or optimization is possible for these calculation steps, with drastic speedups to be expected.Once the cost of creating a ML-DFT model is sufficiently amortized, the computational speed up relative to standard DFT becomes quite stark.
Appendix C: Solid Aluminum at Ambient Temperature and Pressure We consider modeling of the ambient temperature and pressure aluminum systems as a base case for all future work and experiments.Future investigations of the gridbased ML-DFT model, such as an extensive exploration of alternative ML models, fingerprints, LDOS formulations, and non-aluminum systems, should use this baseline in terms of ML accuracy and computational performance.
In Table IV and Figure 8, we demonstrate that the 298 K model is able to achieve the highest and most consistent accuracy of all models considered in this work.Using a single training snapshot and a single validation snapshot, the ML-DFT model is able to achieve mean absolute band energy error of 2.48 meV/atom and mean absolute total energy error of 3.07 meV/atom.Furthermore, the 298 K model surpasses the 5 meV/atom accuracy for the most accurate ML-IAPs in large-scale MD simulations [102].The low variability, seen in Figure 7, is more easily exploited by ML-DFT's grid-based approach relative to the 933 K cases.

FIG. 2 .
FIG. 2. The DOS calculated for one of the 256-atom aluminum snapshots at ambient temperature (298 K) using different Gaussian smearing widths and k-point sampling schemes (above) and the error in the band energy as a function of the Gaussian smearing width for three different snapshots (below).In the upper panel, "1G Gaussian" indicates results obtained from a Gaussian δ(x) with σ = 0.1 eV (equal to the grid spacing), while "2G Gaussian" indicates that σ = 0.2 eV (twice the grid spacing) was used.Results are shown for 4 × 4 × 4 and 8 × 8 × 8 Monkhorst-Pack k-point sampling in the upper panel, and 8 × 8 × 8 sampling in the lower panel.

FIG. 3 .
FIG. 3. Liquid (left) and solid (right) test snapshot DOS comparisons for the ML-DFT model trained on 6 liquid and 6 solid snapshots at 933 K.The vertical lines indicate the Fermi energy which is 7.689 eV for the liquid snapshot and 7.750 eV for the solid snapshot.

FIG. 5 .
FIG. 5. Band energy comparisons for ML-DFT models trained on either 8 liquid or 8 solid snapshots at 933 K.The red line divides the 10 liquid (left) and 10 solid (right) snapshots considered.The ML-Liquid training snapshots are snapshots 0-7 and the validation snapshot is snapshot 8.The ML-Solid training snapshots are snapshots 10-17 and the validation snapshot is snapshot 18.

FIG. 6 .
FIG. 6. Band energy (left) and total energy (right) comparisons for the ML-DFT model trained on 6 liquid and 6 solid snapshots at 933 K.The red line in each figure divides the 10 liquid (left) and 10 solid (right) snapshots considered.The ML-Hybrid training snapshots are snapshots 0-5 and 10-15 and the validation snapshots are snapshots 6 and 16.
Appendix B: Machine-learning Model Details

FIG. 7 .
FIG. 7. Variance of the first and second principal component scores for each group of snapshots, 298 K solids, 933 K solids, and 933 K liquids.Each group contains 10 snapshots.The boxes enclose the second and third quartiles, and the whiskers indicate the minimum and maximum values.
the scikit-learn Python package[103].The resulting first principal component explained approximately 81.7% of the variance in the LDOS, and the second explained an additional 7.9%.The variance of the first two principal component scores was calculated for each of the 30 snapshots.Figure7shows these variances grouped by snapshot temperature and phase.933 K liquid snapshots are seen to possess the highest variance, followed by the 933 K solid snapshots.The 298 K solid snapshots have the lowest variance.

FIG. 8 .
FIG. 8. Band energy (left) and total energy (right) comparisons for the ML-DFT model trained on 1 solid snapshot at 298 K.The training and validation snapshot are snapshot 0 and 1, respectively.

FIG. 9 .
FIG.9.DOS comparison for a test snapshot using the ML-DFT model trained on snapshots at 298 K.The vertical line indicates the Fermi energy which is 7.797 eV.

TABLE I .
ML-DFT model training workflow.

TABLE II .
Band energy and total energy test errors from different ML-DFT models inferred for aluminum snapshots at 933 K.The training set consists of snapshots in liquid or solid states.When trained, validated, and tested on the same phase, either liquid (highlighted in blue) or solid (highlighted in yellow), the overall ML-DFT accuracy is very high.The training on a single phase results in poor predictive accuracy for the phase unseen in training, as expected.In contrast, our main result, the hybrid ML-DFT model predicts the total energy over both solid and liquid phases with accuracy sufficient for meaningful atomistic simulations (highlighted in green).

TABLE III .
Optimized hyper-parameters for the 298K and 933K models.

TABLE IV .
Band energy and total energy test errors from the ML-DFT models inferred for aluminum snapshots at 298 K.

TABLE V .
Timings in different steps contributing to the total evaluation of KS-DFT quantities based on neural network inference.The reported times are averaged over the inference of 10 snapshots.Once the 298 K case was optimized, using those optimized hyperparameters as the 933 K optimization's initial iterate greatly accelerated convergence resulting in fewer training iterations.Within an iteration, the 298 K model requires 93 epochs.For the 8 snapshot 933 K cases, the single phase models, either liquid or solid, required just 39-50 epochs to converge.Comparatively, the final hybrid liquid-solid model required 101 epochs to converge.Due to the large quantity of training data in the 933 K case (12 training + 2 validation snapshot × 20.7 gi-gabytes), the required training strategy for single node machines is to lazily load the snapshots into memory each epoch.This strategy is needed for machines where the CPU memory is insufficient and allows ML-DFT to train on any arbitrary number of snapshots.All ML-DFT models have been trained on a single node with 1 NVIDIA Tesla GPU.The 298 K network, with only 1 training and 1 validation snapshot, does not need to lazily load snapshots, and so each epoch requires approximately 151 seconds.For the 933 K networks, the total runtime of a single epoch with 12 training snapshots is