Hierarchical symbolic regression for identifying key physical parameters correlated with bulk properties of perovskites

Symbolic regression identifies key physical parameters describing materials properties by uncovering correlations as nonlinear analytical expressions. However, the pool of expressions grows rapidly with complexity, compromising its efficiency. We tackle this challenge by a hierarchical approach: identified expressions are used as input parameters for obtaining more complex expressions. Crucially, this framework can transfer knowledge among properties, highlighting physical relationships. We demonstrate this strategy by using the Sure-Independence-Screening-and-Sparsifying-Operator (SISSO) approach to identify expressions correlated with the lattice constant and cohesive energy, which are then used to model the bulk modulus of ABO3 perovskites.

The identification of physical parameters that are correlated with materials properties or functions is a key step for understanding the underlying processes and for accelerating the discovery of improved or even novel materials [1]. Because these parameters reflect the different processes that trigger, facilitate, or hinder a certain property or function, they might be referred to as materials genes, in analogy to genes in biology. Ideally, one would use physical models to describe the materials properties of interest [2]. However, due to the intricate interplay of processes that might be responsible for a certain materials property, explicit physical modelling might be infeasible, or even inappropriate. An alternative approach is to use artificial intelligence (AI) to uncover complex relationships through a data-centric concept. Nevertheless, most widely used AI approaches require data sets that are much larger than those typically available in materials science, and only a few AI methods incorporate the small-data aspect [3-5]. Furthermore, conventional AI produces black-box models that make it difficult to disentangle the contributions of the various input parameters and to determine which underlying processes are the most important to optimize. These problems are exacerbated in the typical scenario in which one is interested in finding materials that exhibit an exceptional performance, for which only a few data points are available.
A possible avenue for linking physical reasoning and data-centric approaches is symbolic regression (SR) [6-8]. SR identifies nonlinear analytical expressions relating a target property to the key input parameters using only small data sets. Thus, it is ideally suited for materials-science problems. These input parameters are physically meaningful quantities that are possibly related to the underlying processes governing the property. In the (initial) absence of understanding of the underlying processes, it is wiser to offer an extensive set of input parameters, including ones that are later revealed to be irrelevant, rather than to risk missing the important ones. Traditionally, SR uses genetic-programming techniques to optimize the analytical expressions, which are combinations of the input parameters built with mathematical operators such as addition, multiplication, exponentiation, etc., for a given problem [6,7,9-11]. These approaches randomly generate an initial population of possible expressions and then stochastically apply genetic operators (e.g., mutation and crossover) until some optimal solution is found.
More recently, the sure-independence-screening-and-sparsifying-operator (SISSO) [12,13] approach was introduced for the identification of analytical expressions by applying the compressed-sensing (CS) methodology [14,15] to SR. The SISSO approach starts with a collection of physical input parameters, termed primary features. Then, a more expansive pool of expressions is iteratively built by exhaustively applying mathematical operators to both the primary features and previously generated expressions. The number of recursive applications of the operators used to construct the pool of expressions is called the rung (q) of the SR. Finally, CS is used to identify the D expressions which, combined by weighting coefficients, yield the best model for the property. In this step, an ℓ0-regularization is performed using only a subspace S of the immense pool of generated expressions. This subspace is selected by the sure-independence-screening (SIS) procedure. The outcome of the SISSO analysis is a low-dimensional descriptor vector containing, as components, the D expressions selected from the pool. A SISSO-derived model for a property P has the form

P = Σ_{i=1}^{D} c_i d_i, (1)

where the c_i are fitting coefficients, the d_i are the descriptor components, and D is the descriptor dimension. We also label the model components, α_i = c_i d_i, which will be used for the construction of more complex models. The primary features that appear in the (typically nonlinear) analytical expressions of the descriptor components reflect the materials genes of the property under investigation. SR has already been used to model several materials properties and functions [7,16-22]. However, the combinatorial growth of the pool of possible expressions with respect to the number of primary features and to the number of times the mathematical operators are applied can make an exhaustive search for the optimal descriptors impractical.
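As a minimal illustration of the structure of Eq. 1 (not the actual SISSO implementation), the final fitting step reduces to ordinary least squares once the D descriptor components have been selected. All data and variable names below are illustrative:

```python
import numpy as np

# Minimal sketch of the final SISSO fitting step: once the D descriptor
# components d_i are selected by SIS + l0-regularization, the model
# P = sum_i c_i d_i is an ordinary least-squares fit. Toy data, not the
# perovskite data set.
rng = np.random.default_rng(0)

n_samples, D = 50, 2
d = rng.normal(size=(n_samples, D))   # selected descriptor components
c_true = np.array([1.5, -0.7])        # "true" fitting coefficients
P = d @ c_true                        # noise-free toy property values

c_fit, *_ = np.linalg.lstsq(d, P, rcond=None)
alpha = c_fit * d                     # model components alpha_i = c_i * d_i
P_model = alpha.sum(axis=1)

print(np.allclose(c_fit, c_true), np.allclose(P_model, P))  # → True True
```

The components α_i computed here are exactly the quantities that hiSISSO later re-feeds as primary features.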
This is problematic because the relevant primary features are typically not known a priori, and one would like to initially offer as many as possible. To address this challenge, we introduce a hierarchical SR approach that enables an efficient identification of complex descriptors by keeping the number of expressions considered in the analysis at a manageable level. The foundation of this approach is the systematic re-feeding of expressions identified in one step as inputs for the identification of more complex expressions in subsequent steps. A crucial implication of this hierarchical framework is that it can be extended to transfer knowledge learned for one property to another one, thus also highlighting physical relationships between materials properties.
We demonstrate the hierarchical SR approach in the context of SISSO. Hierarchical SISSO (hiSISSO) starts with an initial set of primary features, which is used to obtain an initial model for the property of interest. Then, the obtained model and its components (P_SISSO and α_i, respectively), evaluated for all the materials in the data set, are added to the initial primary feature set. Finally, using this extended primary feature set, new, more complex models are obtained by applying SISSO a second time. Models and components obtained with SISSO for one or more properties can also be used to model a second, related property.
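The hierarchical step can be sketched as follows. Here `sisso_fit` is a crude correlation-ranking stand-in for the real SISSO procedure (which performs SIS plus ℓ0-regularization over a huge expression pool), and all names and data are illustrative:

```python
import numpy as np

# Schematic of one hiSISSO iteration: fit a model, then append its
# components alpha_i and the model itself to the primary feature set.
def sisso_fit(X, y, dim=2):
    # Rank features by |Pearson correlation| with the target (SIS-like
    # screening), then least-squares fit on the top `dim` features.
    Xc = X - X.mean(0)
    yc = y - y.mean()
    scores = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    sel = np.argsort(scores)[::-1][:dim]
    c, *_ = np.linalg.lstsq(X[:, sel], y, rcond=None)
    return c, sel

def hierarchical_step(X, y, dim=2):
    """Fit, then extend X with the model components and the model itself."""
    c, sel = sisso_fit(X, y, dim)
    alpha = X[:, sel] * c                   # components alpha_i = c_i * d_i
    P_model = alpha.sum(1, keepdims=True)   # the fitted model P_SISSO
    return np.hstack([X, alpha, P_model])   # extended primary feature set

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = 2.0 * X[:, 0] - X[:, 3]
X_ext = hierarchical_step(X, y)
print(X_ext.shape)  # 5 original + 2 components + 1 model = 8 columns
```

A second call of `sisso_fit` on `X_ext` would then search over both the original features and the re-fed expressions, mirroring the second SISSO application described above.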
In this work, we also introduce a new concept into SISSO, hereafter called "multiple residuals", which increases the algorithm's efficiency with respect to the size of the subspaces needed for the ℓ0-regularization, as shown in the ESI (Figure S2). In the original SISSO algorithm [12], the residual of the previously found model, Δ^0_{D−1}, i.e., the difference between the estimates of the (D−1)-dimensional model and the vector P storing the values of the property for each sample (Δ^0_{D−1} = P_{D−1} − P), is used to calculate the projection score of the candidate features during the SIS step for the best D-dimensional model:

s^0_j = |R(Δ^0_{D−1}, φ_j)|.

Here, R is the Pearson correlation coefficient and j indexes the elements of the feature space. In the new implementation of SISSO [23], we extend the residual definition and use the best r residuals to calculate the projection score: s_j = max(s^0_j, s^1_j, …, s^{r−1}_j). This score is used in the SIS procedure (see further details, including the choice of r, in the ESI). The multiple-residual concept generalizes the descriptor-identification step of SISSO by using information from an ensemble of models to determine which features to add to the selected subspace. By using the multiple-residual scheme with hiSISSO, we are able to expedite the search for the best models and considerably reduce (optimize) not only the overall size of the pool of expressions to be considered in the analysis, but also the size of the subspaces of expressions needed for the identification of the best descriptors.
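A sketch of the multiple-residual projection score, assuming the score is the absolute Pearson correlation between a candidate feature and a residual, maximized over the residuals of the r best lower-dimensional models (function names and data are illustrative):

```python
import numpy as np

# Each candidate feature phi_j is scored by its maximum |Pearson
# correlation| against the residuals of the r best (D-1)-dimensional
# models, rather than against a single residual.
def pearson(a, b):
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sis_scores(features, residuals):
    """features: (n, m) candidate features; residuals: (n, r) residuals."""
    return np.array([
        max(abs(pearson(features[:, j], residuals[:, k]))
            for k in range(residuals.shape[1]))
        for j in range(features.shape[1])
    ])

rng = np.random.default_rng(2)
n = 40
phi = rng.normal(size=(n, 3))
# Two residuals from an ensemble of models; the second one is strongly
# aligned with feature 2, so feature 2 should get the highest score.
res = np.column_stack([rng.normal(size=n),
                       phi[:, 2] + 0.05 * rng.normal(size=n)])
scores = sis_scores(phi, res)
print(scores.argmax())  # → 2 (selected via the second residual)
```

A feature that is irrelevant for the single best model but highly correlated with another ensemble member's residual would be missed by the original single-residual score, which is the motivation for the extension.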
We demonstrate the capabilities of hiSISSO with two examples, i.e., by modeling the lattice constant (a_0) and the bulk modulus (B_0) of ABO_3 cubic perovskites. First, we identify models for the lattice constant a_0 of each material. Then, we exploit the expressions identified for a_0 and for the cohesive energy (E_0) to improve the learning of the bulk moduli of the perovskites. We consider 504 ABO_3 materials formed by the A and B elements indicated in Fig. 1. The lattice constants, cohesive energies, and bulk moduli were calculated using density-functional theory (DFT) with the PBEsol exchange-correlation functional [24]. Further details and benchmarks of the calculation method are available in the ESI. The data set of calculated perovskite properties is available at the Novel-Materials Discovery (NOMAD) Repository & Archive [25].
Perovskites display a remarkable diversity of compositions and properties that makes them interesting for very different functions and devices (e.g., [26,27]). We focus here on perovskite mechanical properties, specifically the equilibrium lattice constant, a_0, and the bulk modulus, B_0, which is related to the second derivative of the cohesive energy E_0 with respect to volume at a_0. The two quantities are correlated [28-30], as described by Verma and Kumar (VK) for cubic perovskites:

B_0 = C_0 + C_1 (n_A n_B)^{C_2} a_0^{−3.5}. (2)

Here, C_0, C_1, and C_2 are fitted constants, and n_A and n_B are the expected oxidation states of the A and B species in the ABO_3 compound, as approximated by their group number in the periodic table. This approximation implies that all alkali and alkaline-earth metals have an oxidation state of one and two, respectively, and that all other A elements have an oxidation state of three. The oxidation state of the B atom is then set so that all materials are charge-neutral, i.e., n_B = 6 − n_A. The cohesive energy E_0 is defined as the energy per atom required to atomize the crystal.
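Fitting the constants C_0, C_1, and C_2 of a VK-type baseline to a training set can be sketched as follows; the functional form follows Eq. 2 as reconstructed above, and the data are synthetic, not the DFT data set:

```python
import numpy as np
from scipy.optimize import curve_fit

# VK-type baseline (Eq. 2): B0 = C0 + C1 * (nA*nB)^C2 * a0^(-3.5).
def vk_model(X, C0, C1, C2):
    a0, nA, nB = X
    return C0 + C1 * (nA * nB) ** C2 * a0 ** (-3.5)

rng = np.random.default_rng(3)
a0 = rng.uniform(3.5, 4.5, size=100)        # lattice constants (Angstrom)
nA = rng.choice([1.0, 2.0, 3.0], size=100)  # group-number oxidation states
nB = 6.0 - nA                               # charge neutrality: nB = 6 - nA
B0 = vk_model((a0, nA, nB), 0.05, 40.0, 0.5)  # toy "true" bulk moduli

# Nonlinear least-squares fit of the three constants.
popt, _ = curve_fit(vk_model, (a0, nA, nB), B0, p0=[0.0, 1.0, 1.0])
print(np.round(popt, 3))
```

In the paper's evaluation, this three-parameter fit on each training set provides the brown baseline distribution of Fig. 2D.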
As primary features, we use 23 properties related to the A and B elements of the ABO_3 perovskites, hereafter called atomic features. These atomic features depend only on the elements entering the composition of the materials, and not on their environment within the material: the radius of the valence-s orbital and the radius of the highest-occupied orbital of the neutral atom (r_s and r_val, respectively), which represent the largest and smallest radii of the valence shell of an atom, respectively; the Kohn-Sham single-particle eigenvalues of the highest-occupied and lowest-unoccupied states (ε_H and ε_L, respectively); the electron affinity (EA); the ionization potential (IP); and the electronegativity (EN). Because A and B are present as cations in the perovskite structure, we also included the radius of the valence-s orbital and the radius of the highest-occupied orbital of the positively charged (+1 cation) atoms, denoted r_s^cat and r_val^cat, respectively. All the above-mentioned atomic features were calculated for free (isolated) atoms with DFT-PBEsol [24] (see further details in the ESI) and are available in [31]. The nuclear charges Z and the parameters n_A, n_B, and n_A·n_B from Eq. 2 were also included in our primary feature set. Note that n_B depends on the formula of the perovskite and, in particular, on the A element. Let us also note that other choices of (elemental) primary features are possible (e.g., via pymatgen [32] or magpie [33]). We consider a set of mathematical operators acting on two arbitrary features φ_i and φ_j, such as addition, subtraction, multiplication, division, exponentiation, and logarithm. This list of operators can be modified, reduced, or extended. In all cases, the units of the primary features are respected, so that terms such as ln(r_s,A) and r_s,A + Z_A are not allowed.
In order to evaluate the performance of our models, we randomly split the data set of 504 materials into five subsets. Four subsets are combined and used to train the models (training set) and the remaining subset is used to assess the performance (test set). The training set is used to determine the optimal model complexity with respect to its predictability via a 5-fold cross-validation (CV) scheme (see ESI). Within SISSO, the model complexity is controlled by the rung q used to construct the pool of expressions and by the descriptor dimension D.
Here we consider descriptors with D = 1 up to D = 5. Once the model complexity is determined by CV, a model is trained at the optimal complexity using all the materials of the training set. This model is then used to predict the properties of the materials in the test set. Finally, the whole procedure is repeated five times, so that each of the five subsets is considered once as the test set. We discuss the performance of the SISSO-derived models based on the distribution of absolute test errors across the 504 materials.
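The evaluation protocol of the two paragraphs above can be sketched as follows; polynomial degree stands in here for the SISSO model complexity (rung q and dimension D), and the data are synthetic:

```python
import numpy as np

# Outer split into train/test folds, with model complexity chosen by an
# inner 5-fold cross-validation on the training portion only.
def kfold_indices(n, k, rng):
    idx = rng.permutation(n)
    return np.array_split(idx, k)

def cv_error(x, y, degree, k=5, seed=0):
    """Mean absolute validation error of a degree-`degree` polynomial."""
    rng = np.random.default_rng(seed)
    errs = []
    for fold in kfold_indices(len(x), k, rng):
        mask = np.ones(len(x), bool)
        mask[fold] = False
        coef = np.polyfit(x[mask], y[mask], degree)
        errs.append(np.mean(np.abs(np.polyval(coef, x[fold]) - y[fold])))
    return float(np.mean(errs))

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 200)
y = 1 + 2 * x + 3 * x**2 + 0.1 * rng.normal(size=200)  # noisy toy property

folds = kfold_indices(len(x), 5, rng)
test, train = folds[0], np.concatenate(folds[1:])
# Inner CV on the training set selects the complexity...
best_deg = min(range(1, 5), key=lambda d: cv_error(x[train], y[train], d))
# ...then the final model is trained on the full training set.
coef = np.polyfit(x[train], y[train], best_deg)
test_mae = float(np.mean(np.abs(np.polyval(coef, x[test]) - y[test])))
print(best_deg, round(test_mae, 3))
```

Repeating the last block once per fold yields the five test-error distributions pooled in the violin plots of Fig. 2.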
The distribution of a_0 values over the entire data set (504 materials) is shown in Fig. 2A. The absolute-test-error distributions associated with the models obtained with SISSO using rung q = 1 and q = 2 are shown as grey and red violin plots in Fig. 2B. The absolute-test-error distribution is shifted towards lower values when the rung increases from 1 to 2. This shows that the models become more accurate as the mathematical operators are applied a second time to generate more complex expressions. With our primary features and set of operators, the rung-1 and rung-2 pools of features contain on the order of thousands and millions of elements, respectively. To demonstrate how complex descriptors can be found while keeping the number of considered expressions small, we collect the a_0, q = 1 model and its components and use them as new primary features, along with the atomic features, in a second application of SISSO. In this second step, we also use q = 1. We refer to the resulting models as hiSISSO(a_0) in Fig. 2B, the parentheses indicating that the expressions describing a_0 identified in the first step are added to the primary feature set, along with the atomic features, in the second step. We note that some of the primary features might be correlated with each other. We do not exclude any of these features from our analysis because they could contain complementary information (e.g., the difference between two correlated features might not be correlated with either of the features individually).
The absolute test errors associated with the hiSISSO(a_0) models (Fig. 2B, in blue) are lower than the errors of the q = 1 models obtained with a one-step application of SISSO (Fig. 2B, in grey). Additionally, the performance of the hiSISSO(a_0) models is superior to that of the SISSO approach with q = 2 (Fig. 2B, in red). These results show that hiSISSO provides a tractable way of increasing the effective rung, and thus the complexity, of a model at a tiny fraction of the computational cost required for a higher rung, since the pool of expressions that needs to be treated is three orders of magnitude smaller. Moreover, the hiSISSO approach generates more accurate a_0 models than the standard SISSO strategy using either rung 1 or 2.
In Fig. 2, we note the presence of outliers for which the absolute test errors are high with respect to the distribution average. These data points are associated with materials whose A and/or B elements are significantly different from the A and B elements in the training sets. A detailed analysis of the test errors is presented in the ESI, along with a discussion of a test set composed of materials containing chemical elements unseen during training.
We next address the bulk modulus of the perovskites. The distribution of values for this property over the entire data set is shown in Fig. 2C. The distribution of absolute test errors associated with Eq. 2 (with C_0, C_1, and C_2 fitted to the training sets) is shown in brown in Fig. 2D as a baseline for evaluating the performance of the AI-derived models. For the SISSO analysis of the bulk modulus, we consider rung q = 2. The absolute-test-error distribution corresponding to the SISSO models obtained with the atomic features (Fig. 2D, in red) shows that this approach improves upon Eq. 2. By including the DFT-calculated lattice constant, a_0^DFT, as well as (a_0^DFT)^{−3.5} (as suggested by Eq. 2), as primary features along with the atomic features, the performance improves significantly (Fig. 2D, in magenta). If the DFT-calculated cohesive energy, E_0^DFT, is further included as a primary feature (Fig. 2D, in violet), the errors become even smaller. This shows that a_0 and E_0 are both key parameters for describing the bulk modulus.
The lattice constant and the cohesive energy thus provide information necessary to model the bulk modulus. However, the use of a_0^DFT, (a_0^DFT)^{−3.5}, and E_0^DFT as primary features is inconvenient: calculating a_0 and E_0 with DFT requires a geometry relaxation, which is already the majority of the work needed to calculate B_0 itself. To circumvent this issue, we offered, as primary features, the SISSO and hiSISSO models for a_0 and E_0 (see ESI), as well as their components and the rescaled quantity (a_0)^{−3.5}, instead of the DFT-calculated quantities. In this analysis, the atomic features are kept in the primary feature set. We denote the resulting B_0 models by hiSISSO(a_0, E_0) in Fig. 2D (dark green). With the hiSISSO(a_0, E_0) approach, the test errors are significantly reduced compared to the one-step application of SISSO to the atomic features. Indeed, the model performance approaches that of the models obtained using the DFT-calculated parameters a_0^DFT and E_0^DFT, even though the hiSISSO(a_0, E_0) models depend only on the atomic features, which makes them useful for the search for new materials. These results demonstrate the potential of hiSISSO to transfer information among materials properties, thus circumventing the use of resource-consuming primary features.
We then exploit a B_0 model obtained with the hiSISSO(a_0, E_0) approach, trained on the entire data set of 504 ABO_3 materials, for the screening of new materials (see details in the ESI). We evaluate 7 308 single (ABO_3) and double perovskite compositions of the type A_2BB'O_6 constructed from all the A and B elements in the initial data set (Fig. 1). Then, we look at the materials with the lowest predicted B_0 values, since these are scarce in the training set. This situation corresponds to the typical scenario in materials discovery, in which the behavior of interest is associated with only a few of the available observations. Among the 10 materials with the lowest B_0 predicted by the hiSISSO approach, we identify the double perovskites Cs_2ZnBiO_6, Cs_2CdBiO_6, Cs_2CdPbO_6, Cs_2ZnPbO_6, Cs_2ZnCdO_6, Rb_2ZnBiO_6, and Rb_2CdBiO_6, with predicted B_0 in the range 0.49-0.53 eV/Å^3. The properties of these materials were evaluated explicitly by further DFT calculations, and they were confirmed as highly compressible perovskites, with DFT-calculated B_0 of 0.45, 0.45, 0.46, 0.45, 0.46, 0.60, and 0.41 eV/Å^3, respectively. The root-mean-squared error calculated on the 10 materials with the lowest predicted B_0 is 0.081 eV/Å^3. Recalling that the model was trained on simpler single perovskites, its predictive ability beyond the training region is remarkable. Moreover, only 8 materials out of the 504 used for training present B_0 < 0.50 eV/Å^3. The minimum B_0 value in the training set is 0.37 eV/Å^3, for the material CsCdO_3. The hiSISSO-predicted B_0 for this material is 0.51 eV/Å^3.
Finally, we applied a logarithmic transformation to identify, with hiSISSO, a power-law-type expression for B_0. For the purpose of obtaining an equation in the spirit of Eq. 2, we offered the atomic features and the SISSO q = 2 models for a_0 and E_0, denoted a_0^SISSO(q=2) and E_0^SISSO(q=2), respectively, as primary features. The components of these models are also included in the primary feature set. Here, we use q = 1 with a reduced mathematical-operator set containing only addition and subtraction. The best model identified using the entire data set of 504 materials at the optimal dimensionality determined via CV (D = 3, see ESI) is the power law

B_0 = C (a_0^SISSO)^{c_1} (n_A)^{c_2} (E_0^SISSO)^{c_3}, (3)

with fitted constants C and c_i, containing a_0^SISSO, n_A, and E_0^SISSO as the key parameters correlated with B_0. Therefore, SISSO recovers the parameters a_0 and n_A, which also enter Eq. 2. However, the description of B_0 provided by Eq. 3 goes beyond the empirical model, since the log-regression model provides a significantly better performance (light green in Fig. 2D) than Eq. 2. This analysis illustrates the potential of hiSISSO to revisit and improve, through a data-centric approach, models derived from physical arguments.
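The log-transform trick behind this analysis can be illustrated as follows: taking logarithms turns a power law into a linear model, so restricting the operator set to addition and subtraction of log features suffices to identify it. Exponents and data below are illustrative:

```python
import numpy as np

# A power law B0 = C * a0^b1 * nA^b2 * E0^b3 becomes linear after a log
# transform: ln B0 = ln C + b1 ln a0 + b2 ln nA + b3 ln E0. Ordinary
# least squares in log space then recovers the exponents exactly for
# noise-free toy data.
rng = np.random.default_rng(5)
n = 80
a0 = rng.uniform(3.5, 4.5, n)      # toy lattice constants
nA = rng.choice([1.0, 2.0, 3.0], n)
E0 = rng.uniform(4.0, 8.0, n)      # toy cohesive energies
B0 = 2.0 * a0 ** (-3.5) * nA ** 0.4 * E0 ** 1.0  # illustrative exponents

X = np.column_stack([np.ones(n), np.log(a0), np.log(nA), np.log(E0)])
beta, *_ = np.linalg.lstsq(X, np.log(B0), rcond=None)
C = np.exp(beta[0])                # prefactor; beta[1:] are the exponents
print(np.round([C, *beta[1:]], 3))
```

This is the sense in which a D = 3 log-regression model over {ln a_0^SISSO, ln n_A, ln E_0^SISSO} is equivalent to the power law of Eq. 3.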
In this work, we introduced a hierarchical SR framework to efficiently address complex materials properties and functions. This approach provides the key physical parameters reflecting the underlying processes responsible for the behavior of interest, while increasing the performance of SR models. The analysis described in this publication can be found in a Jupyter notebook at the NOMAD AI Toolkit.