Materials informatics based on evolutionary algorithms: Application to search for superconducting hydrogen compounds

We present materials informatics approach to search for superconducting hydrogen compounds, which is based on a genetic algorithm and a genetic programming. This method consists of four stages: (i) search for stable crystal structures of materials by a genetic algorithm, (ii) collection of physical and chemical property data by first-principles calculations, (iii) development of superconductivity predictor based on the database by a genetic programming, and (iv) discovery of potential candidates by regression analysis. By repeatedly performing the process as (i) $\rightarrow$ (ii) $\rightarrow$ (iii) $\rightarrow$ (iv) $\rightarrow$ (i) $\rightarrow$ $\dots$, the superconductivity of the discovered candidates is validated by first-principles calculations, and the database and predictor are further improved, which leads to an efficient search for superconducting materials. We applied this method to hypothetical ternary hydrogen compounds and predicted KScH$_{12}$ with a modulated hydrogen cage showing the superconducting critical temperature of 122 K at 300 GPa and GaAsH$_{6}$ showing 98 K at 180 GPa.


I. INTRODUCTION
The use of an informatics approach to materials science, i.e. materials informatics (MI), has been expected to bring the acceleration for the exploration of new materials [1][2][3][4] . High-throughput screening and machine learning have been employed to discover hidden trends and create predictive models in databases of physical and chemical properties for crystalline compounds, e.g. search for cathode materials with a long cycle life for lithium-ion battery 5 and new superconducting materials 6 .
Search for new physical and chemical properties have been performed by varying compositions and external parameters such as pressure, temperature, and electromagnetic field with respect to already known materials. These approaches are considered as the exploration of optimal solutions in vast and complicated search space under given conditions. In such a case, evolutionary algorithm (EA), which is a heuristic-based approach to solving problems using mechanisms inspired by biological evolution, e.g. mating, mutation, selection, inheritance, etc., is effective for the discovery of the optimal solutions. EA has been used in a wide variety of scientific research fields and has been applied to search for stable or metastable crystal structures [7][8][9][10][11] , well-performed mathematical equations and computer programs in predefined task, and so on.
In this study, we propose materials informatics approach to search for new functional materials, which is based on a genetic algorithm (GA) and a genetic programming (GP) included in the EA techniques (Fig. 1). This method consists of four stages: (i) determination of stable crystal structures of materials by GA, (ii) collection of physical and chemical property data by preforming first-principles calculations and development of database, (iii) development of a functionality predictor based on the data by GP, and (iv) discovery of potential candidates by regression analysis. By repeatedly performing the process as (i) → (ii) → (iii) → (iv) → (i) → . . . , the candidates discovered at the stage (iv) are validated by first-principles calculations through (i) to (ii), and the database and predictor are gradually improved using the validation results. Thus, this method enables to accelerate the search for new functional materials. This paper is organized as follows: the details of GA for crystal structure search and GP for predictor de- velopment are presented in Sec. II, the application of our MI method to the search for superconducting hydrogen compounds is shown in Sec. III, and the discussion and conclusion are drawn in Sec. IV. Figure 2 shows a process on the search for high-quality optimal solutions by EA. First, we prepare for a population consisting of individuals (I), which are generated randomly and ranked according to a fitness parameter (0th generation). Then, few especially inferior individuals are eliminated from the population (elimination), and the other ones are used for the creation of new individuals for the next generation. The new individuals are created by randomly applying evolutionary operations, "mating" and "mutation". Few especially superior individuals are inherited to the next generation (inheritance), and all the individuals are ranked again. By repeatedly performing this process, the population is evolved to more superior one and finally high-quality optimal solutions are obtained. As mentioned above, in our MI approach, we use GA to search for stable crystal structures at the stage (i), and GP to develop a functionality predictor at the stage (iii).

II. EVOLUTIONARY ALGORITHMS
A. GA for crystal structure search Figure 3 shows a flowchart at each generation with respect to the structure search by GA. For the evolutionary operators, the mating is the operator to create a slab structure from randomly selected two individuals ( Fig. 4 (a)) and the mutation is the operator to give a lattice distortion or an atomic permutation for randomly selected an individual ( Fig. 4 (b)) 11 . All the structures generated randomly or created by the mating and mutation are optimized at a given pressure, and the enthalpies, i.e. H = E + P V , are calculated using total energy E, pressure P , and volume V . Then, the population for the next generation is constructed by inheriting the few  elite structures with especially lower enthalpy at the previous generation, ranking all the structures according to the enthalpy, and eliminating few inferior structures with especially higher enthalpy from the population. Once the stable structures are obtained, physical and chemical property data are collected using first-principles calculations (the stage (ii) in Fig. 1). The calculated results are aggregated in the database.
B. GP for predictor development Figure 5 shows a flowchart at each generation with respect to the development of the predictor by GP. GP is an expansion of GA where individuals are represented as computer programs or mathematical functions using tree structures as shown in Fig. 6. The tree structure consists of branch and leaf nodes, and in our GP search, a branch node depicts an element from arithmetic operators (+, −, ×, ÷) and a leaf node depicts an element from constants (1-10) and variables to develop mathematical functions. For example, the left tree of Fig. 6 (a) represents the function of f (A, which is written as f (A, B) = × ÷ + 3 A 4 + − 1 2 B in the Polish notation to deal with arithmetic expression in programming language. The mating is the operator to create new tree structures by exchanging branches between two individuals selected randomly ( Fig. 6 (a)) and the mutation is the operator to exchange branches within an individual ( Fig. 6 (b)). The individuals are ranked according to the strength of the correlation between the value obtained by substituting the datasets into the function (evaluation value: x) and physical property data which we will investigate (y). The correlation coefficient (r) between x and y is calculated by r = s xy /s x s y , using the data included in the database. The parameters s x , s y , and s xy are a standard deviation of data x, that of y, and a covariance between x and y, respectively. They are calculated as follows: and where m is the number of the datasets used for the calculation, and x (y) is the average value of x (y). Then, similarly to the case of the structure search, a high-quality function is developed by repeatedly performing the evolution, i.e. inheriting, ranking according to the absolute value of r (|r|), and eliminating. For a number of high-quality functions obtained by several GP searches, the predictive ability of the function is checked by a k-fold cross-validation 12 . The datasets are randomly divided into k subsets, and a single subset is retained as the validation data for testing the functions and the remaining k − 1 subsets are used for training the functions. In this study, we used R = |r ts | for the evaluation of the predictive ability, where r ts represents the correlation coefficient calculated from the testing data and takes the value from 0 to 1. If the function developed by GP gives the strong correlation not only for the training data but also the testing data, then R shows a large value. This process is repeated k times by rotation and the values of R are averaged (R). The function with the largest R is adopted as the predictor of the physical property which we will investigate. At the stage (iv), potential candidates are selected by the regression analysis using the predictor.

III. APPLICATION TO SEARCH FOR SUPERCONDUCTING HYDROGEN COMPOUNDS
We developed the calculation codes for the above GA and GP searches and applied our MI method to search for superconducting hydrogen compounds. In 2015, hightemperature superconductivity was discovered in hydrogen sulfide (H 2 S) under high pressure and the superconducting critical temperature T c reaches 203 K at pressure of 155 GPa 13 . In 2018-2019, compressed lanthanum hydrides broke the record and the superconductivity was observed at 250 K at 170 GPa by Drozdov et al. 14 and 260 K at 190 GPa by Somayazulu et al. 15 . Therefore, other hydrogen compounds also have a potential to become similar high-T c superconductors under some conditions. However, there are a huge number of combinations for hydrogen compounds in multicomponent system. For example, there exist 13572 combinations for ternary hydrogen compounds formed by all elements with the atomic numbers from 2 to 118, and the number is much further increased by taking stoichiometry and crystal structure into account. Therefore, the MI approach is of great help to discover potential candidates with the high-T c superconductivity.

A. Collection of superconductivity data: stage (ii) in MI cycle
First we developed the database by collecting the firstprinciples calculation data with respect to chemical composition, crystal structure, and superconducting property of binary hydrogen compounds under high pressure condition from literature  . The database includes 497 datasets for the compounds with 62 elements colored in the periodic table of Fig. 7. We used only the T c data obtained by the Allen-Dynes formula 79 in this study, which are all plotted in the lower panel of Fig. 7. The highest T c value for each binary system is represented as a histogram. The superconductivity of T c ≥ 200 K is predicted in the following compounds: T c = 265 K at 250 GPa in YH 10 67 , 263 K at 300 GPa in MgH 6 48 , 238 K at 210 GPa in LaH 10 67 , 206 K at 150 GPa in CaH 12 74 , 204 K at 200 GPa in AcH 10 75 , and 204 K at 200 GPa in H 3 S 32 . Figure 8 shows the relationships between T c and hydrogen concentration, mass per atom, pressure, and space group number for all the data included in the database. Although large hydrogen concentration, light mass, and high pressure favor the increase of the superconductivity, they are not a necessary and sufficient condition for high-T c superconductivity. Further, space group number, i.e. crystal symmetry, shows a weak correlation with T c . The data suggests that the search for hydrogen compounds with high T c is seemingly simple, but it is actually difficult just to investigate these parameters independently.  Next we developed the superconductivity predictor by GP using the datasets included in the database. The datasets were divided into ten subsets for the crossvalidation, i.e. k = 10, and 90% of the data for training and 10% for testing. We used five variables, space group (S), hydrogen concentration (h), mass per atom (M ), pressure (P ), and the effective screened Coulomb repulsion constant in the Allen-Dynes formula 79 (µ * ), to obtain the superconductivity evaluation value x; x = f (S, h, M, P, µ * ) (see Fig. 3). The functions are evolved for 5000 generations, based on the correlation to the superconducting T c (T ); y = T . The number of the individuals were set at 20, in which 8 functions are created by the mating, 8 by the mutation, 2 are randomly generated, and 2 are inherited. Then the 10-fold cross-validation was carried out and the R value was calculated. We performed this process 50 times in parallel and adopted the function with the largest R as the superconductivity predictor. Figure 9 shows the correlation between T c and x obtained by the predictor. |r tr | and |r ts | are 0.78 and 0.81, respectively, where r tr represents the correlation coefficient calculated from the training data and takes the value from 0 to 1. The compounds taking small x values are potential candidates for the high-T c superconductivity. See Ref. 80 with respect to the details of the predictor, in which the function is represented as the Polish notation.
C. Search for potential candidates: stage (iv) in MI cycle We searched for the potential candidates in ternary hydrogen compounds using this predictor. First, we created 497 datasets for ternary hydrogen compounds using all the datasets for the binary ones included in our database. That is to say, if in a binary compound the element paired with hydrogen has the atomic number of Z, then the corresponding ternary compound is created by replacing it with the elements having Z − 1 and Z + 1 as follows: H 6 P(Z =15) Cl(Z =17) for H 3 S(Z =16), Ba(Z =56) Ce(Z =58) H 20 for La(Z =57) H 10 , etc.. The values of S, h, P , and µ * were matched to those of the corresponding binary data, in other word, only M were varied. Then, we calculated the x values of the hypothetical ternary compounds using the created datasets. Table  I lists a part of the results on x and T c for the hypothetical ternary compounds. We estimated the T c values of the ternary compounds (T t c ) using the relationship of squares fitting of the data (see the line in Fig. 9). In the table, the results are classified into (a) the group 1-3 elements and (b) the group 13-16 elements. For the group 1-3 elements, in which polyhydrides with hydrogen cages are formed and T c exceeds 200 K, the superconductivity is slightly suppressed for YH 10 and MgH 6 and is less affected for LaH 10 and AcH 10 by changing to the hypothetical ternary compounds. Only CaH 12 shows a slight enhancement of the superconductivity. For the group 13-16 elements, in which covalent hydrides are formed and T c shows the values from 140 to 200 K, the superconductivity is largely suppressed for H 3 S and SiH 3 , whereas it is less affected for AsH 8 , AlH 5 , and GeH 3 .

D. Validation of superconductivity from first
principles: stages (i) and (ii) in MI cycle

KScH 12
We verified the superconductivity of one of the potential candidates, KScH 12 , which is a hypothetical ternary compound approximated to CaH 6 and is included in the group 1-3 elements. CaH 6 is predicted to form the hydrogen cage structure under high pressure and show T c of 220-235 K at 150 GPa, which is obtained by numerically solving the Eliashberg equations 80 . Therefore, KScH 12 is expected to show similar crystal structure and superconductivity under high pressure.
The variation of T c shown in Table I is estimated under the hypothesis that the space group S is invariant between binary and ternary compounds. However, S for the most stable structure of the ternary compound is considered to be different from that of the binary one. Therefore, first we searched for the most stable structure of KScH 12 by combining our GA code with the Quantum ESPRESSO (QE) code 81 and applying it to KScH 12 . In our GA search, 8 structures are created by "mating", 6 "mutation (distortion)", and 6 "mutation (permutation)" in each generation. We performed the structure search at pressures of 200 and 300 GPa, using calculation cells including 2 formula units. The generalized gradient approximation by Perdew, Burke and Ernzerhof 82 was used for the exchange-correlation functional, and the Rabe-Rappe-Kaxiras-Joannopoulos ultrasoft pseudopotential 83 was employed. The k-space integration over the Brillouin zone (BZ) was carried out on a 4 × 4 × 4 grid, and the energy cutoff was set at 80 Ry for the wave function and 640 Ry for the charge density. After we obtained the stable structures, we improved the calculation accuracy by increasing the number of k-points and checked the stability. As the results, we obtained a monoclinic C2/m (No. 12) structure at 300 GPa, where it shows no phonon instability. The structure takes a modulated hydrogen-cage structure, similar to the structure of CaH 6 (compare the structure between the left and right in  and C2/m KScH 12 (S = 12) are calculated to be −179.5 and −95.5 at 300 GPa, respectively (Table II). That is to say, T c decreases by about 68 K owing to the change from Im-3m CaH 6 to C2/m KScH 12 according to the slope of −0.8130 mentioned in III C. Next, we investigated the superconductivity of C2/m KScH 12 using first-principles calculations. T c was calculated using the Allen-Dynes formula 79 , In Eq. 4, the parameters of electron-phonon coupling constant λ and logarithmic-averaged phonon frequency ω log represent a set of characters for the phonon-mediated superconductivity. To obtain these parameters, we performed the phonon calculations implemented in the QE code. We used an 8 × 16 × 8 k-point grid for the electronphonon calculation, and 4 × 8 × 4 k-point and 4 × 4 × 4 q-point grids for the dynamical matrix calculation. We also calculated T c of Im-3m CaH 6 using a 24 × 24 × 24 k-point, and 12 × 12 × 12 k-point and 4 × 4 × 4 q-point grids. The effective screened Coulomb repulsion constant µ * was assumed to be 0.13, which has been considered to be a reasonable value for hydrides. The results are listed in Table III. T c of Im-3m CaH 6 at 150 GPa is 172 K, which is lower by about 50 K than that obtained by numerically solving the Eliashberg equations reported earlier 80 . T c of C2/m KScH 12 shows 122 K at 300 GPa, which is decreased by 38 K owing to the change from Im-3m CaH 6 . This result indicates that the decrease of T c is qualitatively consistent with that of estimated from the variation of the evaluation value x shown in Table II.

GaAsH 6
We verified the superconductivity of another candidate, GaAsH 6 , included in the group 13-16 elements. Gallium arsenide (GaAs) has been well known as a III-V direct semiconductor at ambient pressure, and the hydrogenation of the compound is expected to be achieved under high-pressure conditions, as well as group IV semiconductors. We searched for stable structures at pressures of 50, 100, 200, and 300 GPa, using calculation cells including 2 formula units. The k-space integration over the Brillouin zone (BZ) was performed on an 8 × 8 × 8 grid, and the energy cutoff was set at 80 Ry for the wave function and 640 Ry for the charge density. As the results, we obtained a cubic P m-3 (No. 200) structure at 180 GPa (Fig. 11). Although the structure is similar to the A15 structure with P m-3n (No. 223) of GeH 3 reported earlier 31 , the space group is different from that of the A15 structure. The evaluation value x of P m-3 GaAsH 6 is calculated to be −40.0 at 180 GPa, which gives the T c variations of −8 K from Cccm GeH 3 , −7 K from P 4 2 /mmc GeH 3 , and −26 K from P m-3n GeH 3 (Table IV). That is to say, T c is slightly varied according to the change from Cccm or P 4 2 /mmc GeH 3 to P m-3 GaAsH 6 , whereas it is more largely decreased according to the change from P m-3n GeH 3 .  We investigated the superconductivity of P m-3 GaAsH 6 using first-principles calculations and the Allen-Dynes formula. We used a 48 × 48 × 48 k-point grid for the electron-phonon calculation, and 24 × 24 × 24 k-point and 6 × 6 × 6 q-point grids for the dynamical matrix calculation. The effective screened Coulomb repulsion constant µ * was assumed to be 0.13. The results are listed in Table V. T c of P m-3 GaAsH 6 shows 98 K at 180 GPa, which is lower by 42 K than that of a metastable P m-3n phase of GeH 3 (T c = 140 K) but is comparable with that of the most stable Cccm phase (T c = 100 K). Thus, the T c variations are qualitatively consistent with those estimated from the evaluation value x.
These results suggest that the regression analysis based on the superconductivity predictor gives reliable results on potential candidates for the high-T c superconductivity in the ternary hydrogen compounds approximated to binary ones included in the database.

IV. DISCUSSION AND CONCLUSION
We developed the MI method consisting of the four stages, which is based on EA: (i) the stable structures are determined using GA, (ii) the physical and chemical property data are collected by performing first-principles calculations for the stable structures, (iii) the functionality predictor is developed from the datasets in the database using GP, and (iv) potential candidates are discovered by the regression analysis. Turning of the cycle enables to accelerate the search for novel functional materials.
We applied the MI method to search for the superconductivity in hydrogen compounds. First, we developed the database on the superconductivity of the binary hydrogen compounds by collecting the data from literature. Then, we developed the superconductivity predictor and explored the superconductivity in the hypothetical ternary hydrogen compounds, approximated to the binary compounds included in the database. For the group 1-3 elements, the superconductivity is less affected by changing to the ternary compounds, and high-T c superconductivity is expected in the hypothetical ternary compounds, as is the case with the binary ones. For the group 13-16 elements, the superconductivity is largely suppressed in the ternary compounds approximated to H 3 S and SiH 3 , whereas those approximated to AsH 8 , AlH 5 , and GeH 3 are less affected by the changing. We actually verified the superconductivity of KScH 12 created from CaH 6 and GaAsH 6 created from GeH 3 using the first-principles calculations. For KScH 12 , a modulated hydrogen-cage structure with a space group of C2/m is obtained at 300 GPa by GA, and T c shows 122 K for µ * = 0.13. For GaAsH 6 , we predicted a cubic P m-3 structure and obtained T c of 98 K at 180 GPa. These T c values are qualitatively consistent with those estimated from the variations of the superconductivity evaluation values. These results suggest that the regression analysis based on the superconductivity predictor is effective for the discovery of potential candidates in the ternary hydrogen compounds.
The prediction ability of the superconductivity in ternary hydrogen compounds is expected to be further improved by aggregating the superconductivity data of KScH 12 , GaAsH 6 , and other ternary compounds into the database and redeveloping the predictor by GP. At present, the superconductivity of T c ≥ 200 K is found only in the pressure above 150 GPa. Therefore, the achievement of more than 200 K superconductivity at low pressure is also significantly important, and MI approaches could be effective for the research.