Machine Learning Energies of 2 M Elpasolite (ABC$_2$D$_6$) Crystals

Elpasolite is the predominant quaternary crystal structure (AlNaK$_2$F$_6$ prototype) reported in the Inorganic Crystal Structure Database. We have developed a machine learning model to calculate density functional theory quality formation energies of all $\sim$2 M pristine ABC$_2$D$_6$ elpasolite crystals which can be made up from main-group elements (up to bismuth). Our model's accuracy can be improved systematically, reaching 0.1 eV/atom for a training set consisting of 10 k crystals. Important bonding trends are revealed, fluoride is best suited to fit the coordination of the D site which lowers the formation energy whereas the opposite is found for carbon. The bonding contribution of elements A and B is very small on average. Low formation energies result from A and B being late elements from group (II), C being a late (I) element, and D being fluoride. Out of 2 M crystals, 90 unique structures are predicted to be on the convex hull---among which NFAl$_2$Ca$_6$, with peculiar stoichiometry and a negative atomic oxidation state for Al.

Elpasolite (AlNaK 2 F 6 ) is a glassy, transparent, luster, colorless, and soft quaternary crystal in the Fm3m space group which can be found in the Rocky Mountains, Virginia, or the Apennines. The elpasolite crystal structure (See Fig. 1) is not uncommon, it is the most abundant prototype in the Inorganic Crystal Structure Database [1,2]. Some elpasolites emit light when exposed to ionic radiation, which makes them interesting material candidates for scintillator devices [3,4]. One could use first-principle methods such as density functional theory (DFT) [5,6] to computationally predict the existence and basic properties of every elpasolite. Unfortunately, even when considering crystals composed of only main group elements (columns I to VIII) the sheer number of the ∼2 M possible combinations makes DFT based screening challenging-if not prohibitive. Recently, computationally efficient machine learning (ML) models were introduced for predicting molecular properties with the same accuracy as DFT [7,8]. Requiring only milliseconds per prediction, they represent an attractive alternative when it comes to the combinatorial screening of millions of crystals. While some ML model variants have already been proposed for solids [9][10][11], a generally applicable ML-scheme with DFT accuracy of formation energies is still amiss.
In this Letter we introduce a newly developed ML model which we use to investigate the formation energies of all ∼2 M elpasolites made from all main-group elements up to Bi. Resulting estimates enable the identification of new elemental order of descending elpasolite formation energy, crystals with peculiar atomic charges, 250 elpasolites with lowest formation energies, as well as 128 new crystal structures predicted to lie on the convex hull among which NFAl 2 Ca 6 , an elpasolite with unusual composition and atomic charge. The ML model achieves the same, or better, accuracy to DFT as DFT to experimental data and can be generalized to any crystalline material.
The ML-model is based on kernel ridge regression [12][13][14] which maps the non-linear energy difference between the actual DFT energy and an inexpensive approximate baseline model into a linear feature space [15]. More specifically, we construct a ML model of the energy difference to the sum of static, atom-type dependent, atomic energy contributions It , obtained through fitting of each atom type t in all main group elements up to Bi. The energy-predicting function is a sum of weighted exponentials in similarity d between query and training crystal, where N is the number of atoms/unit cell (10 in the case of elpasolites), and the second sum runs over all N training instances. α i are the weights obtained through linear regression, and σ is the global exponential width, regulating the length scale of the problem. The similarity d i is the Manhattan distance, i.e., d i = x − x i 1 . While various crystal structure representations have previously been proposed [9][10][11]16], we have found the following representation to yield superior performance: x is a n × 2 tuple that encodes any stoichiometry within a given crystal prototype. For quaternary (n = 4) elpasolites, each x 1−4 refers to the 4 representative sites, the atom type for each site is represented by its row (principal quantum number 2 to 6) and column (number of valence electrons) I to VIII in the periodic table, and sites are ordered according to the Wyckoff sequence of the crystal. As such, x implicitly represents the energy minimum structure for a system restricted to this prototype-without explicitly encoding precise coordinates, lattice constants, or other (approximate) solutions to Schrödinger's equation. This representation is not restricted to the elpasolite structure, it can be used for any crystalline configuration: Below we also briefly discuss test results for small size ML models applied to ternary crystals.
For training and evaluation, we have generated DFT formation energies for two data sets of elpasolites i.e. 0.5% of the total number of 2 M possible crystals. The (I−VIII) data set has been generated through random selection of elpasolites while ensuring an unbiased composition. To verify that the ML model is general and not only restricted to elpasolites, we have also included a materials project [18] dataset (MPD) consisting of ∼0.5k ternary crystals in ThCr 2 Si 2 (I4/mmm) prototype and made up of 84 different atom types. The distribution of the chemical elements in the data sets are shown in Fig. 1(b). Numerical results on display in Fig. 1(c) indicate systematic improvement of the predictive accuracy of the ML model with increasing training set size, for all three datasets. The inset details normally distributed errors and scatter plots which systematically improve with training set size for the models trained on the (I−VIII) datasets. For a 10k training set, the ML model reaches a MAE of 0.1 eV/atom compared to reference, i.e. semilocal DFT. DFT, in turn, has an estimated MAE of ∼0.19 eV/atom compared to experiments on heats of formation for general chemistries with filled d-shells [19]. For transition metal oxides and elemental solids other groups report DFT errors on the order of 0.1 eV/atom [20,21]. The converging performance for training on nearly all crystals of the (III−VI) data set suggests that our crystal representation of elpasolite structures Fig. 1(a) accounts for all necessary degrees of freedom. While errors decay systematically and linearly on a log-log plot, the learning rate levels off as N approaches the 100%, i.e. 10k. This is due to the employed relaxation convergence threshold of ±10 meV/atom in the DFT calculations. Any inductive model must fail to go below this level, and only numerically more precise reference numbers would mitigate this trend. In all validation tests dealing with energy predictions for random out-of-sample crystals, the ML model performance meets the expectations set in Fig. 1(c). For example, drawing 100 crystals at random from (III−VI) and (I−VIII) datasets ML models perform as expected when compared to the result from validating DFT calculations (cf. Fig. S3 [17]). (III−VI) and (I−VIII) reaches a MAE of 0.1 eV/atom at roughly 2.5 % and 0.5 % of the total number of crystals respectively, suggesting that the machine "efficiency" increases with number of possible combinations. We note however that two observations of the same structure is not sufficient to see any trends on how much training data is needed.
Having established the performance of the ML model, we have subsequently used the 10 k training set model (I−VIII) for investigation of the elpasolite universe. Estimated formation energies for all 2 M elpasolites are featured in Fig. 2. The formation energies are clearly dominated by the chemical identity of position 4, followed by position 3 but according to a different pattern. Chemical identity at position 1 and 2 has the smallest influence and very similar impact (also illustrated in Fig. S1 of Ref. 17.) Due to the effective degeneracy of positions 1 and 2, all inner matrices in Fig. 2 appear largely symmetric. Figure  FIG. 2. Formation energies for all 2 M elpasolites made up of all main-group elements up to Bi predicted by the 10 k ML-model. The outer vertical and horizontal axis correspond to x4 and x3 symmetry position, respectively. Inner vertical and horizontal axis correspond to x2 and x1 symmetry position, respectively. Elemental sequence follows the elpasolite order of Fig. 1(d).
White pixels correspond to subspaces of ternary, binary, or elementary non-elpasolite crystals.
1(d) shows the average contribution of each element to the formation energies estimated by the 10 k ML model. These average contributions per element are used to order the elements in Fig. 2 to yield the smoothest elpasolite map. Arranging elements by their nuclear charge, or by their Pettifor order [22], results in a much more oscillatory map or stripe-like pattern due to underlying periodicities (cf. Ref. 17). This elpasolite error is dominated by the element identity in position 4 (compare  Fluorine and carbon are at the respective ends of the global scale of low and high formation energies. But also alkaline metals, alkaline earth metals, and oxygen contribute to lowering the formation energy. On average, the formation energies of elpasolites involving halogens, alkaline metals, noble gases increase as the periodic table is descended. The opposite holds for all other elements, except oxygen, boron, carbon and nitrogen, which all have a noticeably higher average formation energy than any other element. A saddle point can also be observed in the midst of the periodic table table as well as two valleys along the halogen and alkaline earth rows. Sitespecific resolution indicates that fluorine fits best with the bond coordination of sites 1, 2, and 4, whereas the same does not apply to later halogens (not shown in the paper, see supplementary materials 17). In contrast, as the element on site 3 goes down column II in the periodic table, the formation energy is successively lowered, with Ca, Sr, and Ba contributing more than any halogen atom. On sites 1 and 2, the formation energy generally increases the most for heavy noble gases. On sites 3 and 4, it is carbon, followed by neighboring B and N that increase the formation energy the most. The accuracy of linear single atom energy models based on these scales, however, is not on par with the ML-model, and-maybe more importantly-cannot be improved systematically through increasing training set sizes but rather converges to a finite residual error.
In order to achieve satisfying accuracy of ±0.1 eV/atom for elpasolites, a relatively large training set of 10 k is needed. This is likely due to the sparsity of crystals at the opposite ends of the high and low formation energy spectrum; this results in a decreased predictive ML model accuracy for crystals in these regions, which is demonstrated in Fig S6 in Ref. 17. Nevertheless, the 10 k ML model readily identifies a larger set of lowest lying elpasolites for which the actual DFT minima can be obtained through subsequent DFT based screening. This is shown in Fig. 1(e) where the 250 crystals with the lowest ML predicted formation energies are shown in ascending order (with further details on these systems in the supplementary mateirals. 17.) Subsequent screening with DFT indicates the 26 th crystal CaSrCs 2 F 6 (out of 2M) to be the global formation energy minimum at −3.44 eV/atom, closely followed a neardegenerate isomer SrCaCs 2 F 6 . The DFT energies of the next two degenerate pairs CaSrRb 2 F 6 /SrCaRb 2 F 6 and CaBaCs 2 F 6 /BaCaCs 2 F 6 correspond to −3.41, and −3.39 eV/atom, respectively. Overall, the elpasolites with the most favorable formation energies, ABC 2 D 6 , correspond to A and B being late elements from group (II), and C and D being a late element from group (I) and fluoride, respectively. Populating the four sites with elements from groups (II),(II),(I), and (VIII), respectively, differs from the experimentally established stoichiometry AlNaK 2 F 6 . In fact, the lowest DFT energy crystal with a group-(III) element is CsAlRb 2 F 6 (in 69 th position) with −3.09 eV/atom (ML energy: −2.96 eV/atom, see supplementary materials).
We have also used our predictions to analyse atomic oxidation states in elpasolites. In particular, we have found that roughly 6 % of the crystals with formation energies below −1 eV/atom exhibit unusual atomic charges: They are low in energy despite the fact that no combination of conventional atomic charges would result in a neutral system. In order to identify these crystals, we have used the absolute value of the lowest possible total oxidation state (LPTOS) that could possibly be realized using the list of typical atomic oxidation states on display in Table III in Ref. 17. The lowest lying crystals have a LPTOS of 0 (−3 to −3.44 eV/atom formation energies). However, already at −3 eV/atom crystals with LPTOS of 2 or 1 start to occur. At formation energies of ∼ −1.25 eV/atom and higher, the number of crystals with non-zero LPTOS increases rapidly, with LPTOS as high as 12. Corresponding crystal frequency distributions are shown in Fig. 1(e), along with formulas for the mutually lowest lying crystals. Interestingly, the number of crystals with zero LPTOS increases monotonically with formation energy, while for nonzero LPTOS crystals the distribution is oscillatory.
To demonstrate the usefulness of our ML model we have applied it to identify thermodynamically stable elpasolites. To this end, we first selected all those 274,213 elpasolites with negative ML formation energies, and without rare gas elements. Since stability depends on the energy difference to any possible polymorph or competing segregated phases [23,24], we have queried available DFT formation energies stored in the Materials Project (MP) [18]. Some elpasolites, such as the archetypical AlNaK 2 F 6 , are already stored in the MP. Using DFT results stored in the MP for competing quaternary, ternary and binary phases, we have constructed phase diagrams [24] for all 274,213 crystals. For each crystal, there are on average ∼12 competing phases stored in the MP; there is only one combination of elements (Cs-Li-Na-Rb) for which no binary or ternary competing phases have been stored. For 2133 out of the 274,213 crystals, the resulting stabilization energy is below the known convex hull of stability (for more details, see Ref. 17). Subsequent validation using DFT instead of ML confirms 128 out of them to be stable. Out of these 38 are polymorphs (ABC 2 D 6 vs. BAC 2 D 6 ) resulting in 90 unique stoichiometries. Such a reduction (274,213 → 90) in number of crystal candidates is to be expected since sorting crystals by ML energies being lower than the convex hull systematically favors those with negative ML formation energy errors. We note that this does not amount to proof that the 90 crystals are stable: The MP database is not exhaustive. This implies that other new competing phases and materials, with even stronger stabilization, might still be discovered in the future. Also, the intrinsic error of the employed DFT method within the MP might still alter the outcome with respect to experiment. As such, the 90 new elpasolite DFT energies represent new upper bounds on the convex hull at the corresponding compositions. They have been submitted to the MP database, and most of them have been made available for further studies (See Table V  Among these elpasolites, metals, semiconductors and insulators are roughly distributed equally. All structures with an earth alkaline metal in crystal position 4 have a low or zero band-gap. We have noted an intriguing yet stable structure of a conductor, NFAl 2 Ca 6 (MP ID: mp-989399, # 20 in Table V in Ref. 17) with Ca at position 4, instead of F or Cl. Bader charge analysis [25][26][27] (Table I) indicates an exotic negative oxidation state for Al (-II), previously only reported for Al in substantially larger Zintl phase unit cells (Sr 14 [Al 4 ] 2 Ge 3 ) [28]. Since Bader charges sometimes yield non intuitive results [29,30], calculated Hirshfeld [31] and Voronoi deformation density [29,32] charges (Table I) confirm the negative oxidation state, albeit reduced by one unit (-I).
In conclusion, we have developed and used ML-models of formation energies to investigate all possible elpasolites made up of main-group elements. We have presented numerical results for ∼2 M formation energies. The MLmodel is only implicitly dependent on spatial coordinates, through reference data used for training. No spatial coordinates are needed for new queries, yet for a training set of 10 k crystals the model reaches ±0.1 eV/atomcomparable to DFT accuracy for solids. The results have been used to identify the most strongly bound elpasolites as well as to investigate energy and bonding trends at crystal structure sites, leading to a new "elpasolite order" of elements, consistent with the bonding physics in the elpasolite crystal structure. We identified and added 128 structures (90 unique stoichiometries) to the convex hull of the MP database. Charge analysis for the metallic elpasolite NFAl 2 Ca 6 indicates a negative atomic oxidation state of Al. This outcome directly demonstrates that our method can be used for the discovery of stable as well as unconventional chemistries. Due to the low computational cost of the ML model one can now also afford to remove human bias by considering also those structures which previously would have been excluded due to "chemical intuition". Our results suggest that ML models hold great promise for the computational screening of polymorphs, other crystal structure symmetries, solid mixtures, phase transitions, or defects at unprecedented rate and extent. Other crystal properties than energies could also be considered. The