Artificial intelligence for high-throughput discovery of topological insulators: The example of alloyed tetradymites

Guohua Cao,1,2,3 Runhai Ouyang,2 Luca M. Ghiringhelli,2 Matthias Scheffler,2 Huijun Liu,1,* Christian Carbogno,2,† and Zhenyu Zhang3,‡ 1Key Laboratory of Artificial Microand Nano-Structures of Ministry of Education and School of Physics and Technology, Wuhan University, Wuhan 430072, China 2Fritz-Haber-Institut der Max-Planck-Gesellschaft, Faradayweg 4–6, 14195 Berlin-Dahlem, Germany 3International Center for Quantum Design of Functional Materials (ICQD), Hefei National Laboratory for Physical Sciences at the Microscale, and Synergetic Innovation Center of Quantum Information and Quantum Physics, University of Science and Technology of China, Hefei, Anhui 230026, China


I. INTRODUCTION
Topological insulators (TIs) constitute a new class of quantum materials with insulating bulk but metallic boundary states. Protected by time-reversal symmetry [1][2][3], those boundary states possess a spin-momentum locked Dirac structure in which the backscattering channels are suppressed, favoring dissipationless electronic conduction. This salient property renders TIs immense application potential in many areas, including spintronics [4,5], catalysis [6,7], and thermoelectricity [8], propelling the field to actively search for new TIs that may meet such high technological expectations.
Theoretically, the TIs are characterized by a nonvanishing topological invariant Z 2 , which can be determined by parity considerations or by integration of the Berry curvature in momentum space through detailed electronic structure calculations [9][10][11]. Earlier search efforts for potential TIs were performed on a case-by-case basis, as exemplified by the successful predictions and experimental confirmations of the well-known strong TIs Bi 2 Te 3 , Sb 2 Te 3 , and Bi 2 Se 3 [12][13][14][15]. More recently, the field has witnessed rapid advances in high-throughput screening over many candidate topological materials using empirical descriptors or symmetry-based indicators [16][17][18][19][20][21][22]. Out of the tens of thousands of materials with well-defined symmetries in the Inorganic Crystal Structure Database, hundreds to thousands have been identified to be topologically nontrivial.
Beyond the systems listed in the existing databases, there exists a much larger materials space, e.g., compounds that can be created by alloying or fractionally varying the stoichiometric compositions. While these immensely vast material classes were out of the scope of the symmetry-based approaches, they are the focus of the present study. Using tetradymites as a prototypical class of examples, we establish an artificial intelligence (AI)-based descriptor for the prediction of TIs without prior determination of their specific symmetries and detailed band structures, covering a previously uncharted and much larger territory in the materials space. To this goal, we first investigate a moderate number (hundreds) of layered tetradymites by using accurate first-principles calculations that account for van der Waals (vdW) interactions [23][24][25] and many-body effects [26,27] at a perturbative level. We then employ the SISSO (Sure Independence Screening and Sparsifying Operator) approach [28,29] based on the compressedsensing technique to establish a simple and physically intuitive descriptor for the identification of the topological character. By leveraging this descriptor that contains only the atomic number and electronegativity of the constituent species, we have readily scanned a huge number of alloys in the tetradymite family. Strikingly, nearly half of them are identified to be topological insulators. The present work also attests to the increasingly important role of such AI-based approaches in modern materials discovery.

II. RESULTS AND DISCUSSION
In contrast to other AI-based approaches applied to the prediction of TIs [30][31][32][33][34][35], SISSO is built to determine descriptors that are elementary functions of key physical inputs, thus enabling human inspection into the underlying mechanisms. To construct a reliable training set, we have computed the topological characters of 243 (or 3 5 ) tetradymites by combining group-VA elements (As, Sb, and Bi) with group-VIA elements (S, Se, and Te) in a five-atom unit cell. These systems can be viewed as stackings of quintuple layers (QLs) along the c direction, where vdW interactions bind neighboring QLs to each other. As an example, Fig. 1 shows the crystal structure of the tetradymite AsSbSeTeS, where the atoms As, Sb, Se, Te, and S occupy the sites A, B, L, M, and N, respectively. To get reliable equilibrium structures of such layered systems, the vdW interaction in the form of the optB86b-vdW functional is included in our first-principles calculations (as cross-checked by using the optB88-vdW functional), and the lattice constants and all the internal degrees of freedom are fully relaxed. In addition, we have performed GW calculations to accurately predict the band structures and the topological invariants for those systems with small gap systems (usually less than 0.1 eV). Details are given in Sec. 1 of the Supplemental Material [36]. Based on the high-level electronic structure data, we utilize SISSO [28,29] to single out a simple and physically intuitive descriptor from a huge number of potential candidate forms. In practice, more than ten billion candidate descriptors are first constructed iteratively by combining elemental physical properties of the constituent atoms. More details can be found in Sec. 1 of the Supplemental Material [36]. Secondly, the optimized descriptor is obtained via the SISSO approach and is used to determine a low-dimensional representation of the materials space, in which the TIs and the normal insulators (NIs) belong to well-defined, nonoverlapping domains. Since the transparent functional form of the identified descriptor reflects the mechanism underlying the topological character, it is physically meaningful and suitable for extrapolation, as also demonstrated below.
By performing accurate first-principles calculations for the set of 243 teradymites, we identify 177 systems that are mechanically stable (namely, without pronounced negative phonons), while 66 are unstable. Here we also find that the topological characters of 230 systems do not depend on the choice of the vdW functional, while the remaining 13 systems give different Z 2 values for the optB86b-vdW and optB88-vdW functionals. Furthermore, among the remaining 166 stable systems, there are ten weak TIs (namely, those with small band gaps, and their topological surface states only exist on certain surface orientations), and four semimetals. Excluding those ambiguous systems, we finally obtain a set of 152 stable tetradymites with clear TI/NI classifications as the training set. As mentioned above, the topological characters for 55 small-band-gap systems have been also checked by GW calculations. Among these 152 tetradymites, 67 systems are TIs (including 4 binary, 27 ternary, 30 quaternary, and 6 quinary tetradymites), while the remaining 85 systems are NIs (containing 3 binary, 24 ternary, 46 quaternary, and 12 quinary tetradymites). The stability and topological properties of all the 243 tetradymites are summarized in Sec. 2 of the Supplemental Material [36]. It should be mentioned that we are dealing with the topological properties of intrinsic tetradymites. The effect of defects and doping, which may destroy the salient properties of TIs [14,37,38], is however beyond the focus of our current work.
It should be noted that the training data set naturally includes the TIs Bi 2 Te 3 , Bi 2 Se 3 , Sb 2 Te 3 , Sb 2 Te 2 S, Sb 2 Te 2 Se, Bi 2 Te 2 Se, Bi 2 Se 2 Te, and BiSbSeTe 2 that have already been previously identified [12][13][14][15][39][40][41]. Furthermore, the set also includes 59 potential TIs, and some of those feature relatively large band gaps, such as SbBiSeSeSe (0.22 eV), SbBiTeTeS (0.27 eV), and SbBiTeTeSe (0.23 eV), and are thus promising candidates for potential applications. On the other hand, we want to emphasize that for a certain tetradymite, the topological nature may be sensitive to the atomic configuration within the QL. Taking Sb 2 Te 2 S as an example, our additional first-principles calculations indicate that the structure with constituent elements arranged as SbSbTeTeS is identified to be TI, while SbSbSTeTe with the same stoichiometry is a NI. Even at this level, these findings are significant.
The SISSO training has been performed for the described data set of 152 tetradymites by using different combinations of key physical inputs: (1) the atomic number Z, the electronegativity χ , and the spin-orbit coupling (SOC) strength  [36]. The optimized two-dimensional (2D) descriptor is identified as follows: Here, the subscripts A, B, L, M, and N refer to the sites occupied by the atoms (see Fig. 1). We note that, even though λ has been widely recognized to induce band inversion and topological phase transition, it was not explicitly selected by the optimized descriptor. Using D 1 and D 2 as the x and y coordinates, we can plot a "topological phase diagram" of the 152 tetradymites as shown in Fig. 2, where the cyan and green regions are the convex envelopes of the 85 NIs and 67 TIs, respectively. The support vector machine technique [43] is further used to obtain the blue dividing line D 2 = −238.23 + 0.039D 1 between the TI and NI domains. We see from Fig. 2 that the resulting 2D descriptor gives a perfect classification of the training data, i.e., there is no overlap between the NI and TI domains. The robustness of the descriptor has been checked via the leave-one-out cross-validation approach, and the all-data descriptor is identified in 86% of all the 152 iterations (each descriptor is identified out of ∼10 23 possible candidates). Furthermore, it is remarkable that even the 66 mechanically unstable systems (open symbols) are perfectly classified by the phase diagram, despite the fact that the descriptor is obtained by using only the 152 stable ones (filled symbols) as the training set. Even more, exactly the same 2D descriptor (D 1 , D 2 ) is identified if the full data of 216 stable and unstable tetradymites are adopted. These striking observations further substantiate that the obtained 2D descriptor is able to capture the underlying physical mechanism determining the topological characters in this materials class. We can rewrite Eq. (1) as showing that larger values of Z A , Z B , Z L , and Z M lead to higher D 1 , making the system to be in the TI region. This is consistent with the common understanding that heavier atoms usually have larger SOC, which is a key factor in inducing topological band inversion [12]. For instance, the well-known TI Bi 2 Te 3 exhibits the largest D 1 value among all the 152 tetradymites and thus represents a vertex of the convex TI envelope, since it features the largest atomic numbers of the cations (Z A = Z B = 83) and anions (Z L = Z M = Z N = 52). It should also be noted that Z N does not appear in Eq. (3), because the anions at the N sites make minimal contributions to the highest valence band and lowest conduction band of the systems. More details can be found in Sec. 4 of the Supplemental Material [36].
Here we stress that, although D 1 is the decisive descriptor for a large majority of these tetradymites, it alone is not sufficient to predict the topological character in the region with 6806 < D 1 < 7854 (enclosed by two vertical lines in Fig. 2). In this case, D 2 of the 2D descriptor becomes crucial: among the 29 tetradymites located within this particular region, 14 compounds with smaller D 2 values are TIs, while the remaining 15 compounds with larger D 2 values are NIs. To better reveal the delicate physical mechanism, we rewrite Eq. (2) as Conceptually, D 2 is the relative electronegativity difference between the anions (at the L and M sites) and cations (at the A sites). As the electronegativity difference is approximately positively correlated with the band gap of an inorganic compound [44,45], a smaller D 2 value (namely, a smaller electronegativity difference) leads to a smaller band gap, making it easier to generate a band inversion, and the system is more likely a TI. Indeed, our first-principles calculations for the 29 tetradymites (with 6806 < D 1 < 7854) indicate that the TIs with smaller D 2 tend to exhibit smaller band gaps (before SOC) than the respective NIs at similar D 1 . Details can be found in Sec. 4 of the Supplemental Material [36]. Overall, although the proposed 2D descriptor only depends on the atomic number Z and the electronegativity χ of the constituent atoms, it correctly captures the essential physical factors of TIs, namely, the delicate competition between the SOC strength and band gap. Compared with previous empirical models [16,18], the present 2D descriptor of (D 1 , D 2 ) identified by SISSO features a much richer functional form, enabling quantitative and reliable predictions of TIs far beyond the training data, as further demonstrated below.
The 243 tetradymites discussed so far all feature integer stoichiometry. In fact, alloyed tetradymites with fractional stoichiometry such as Bi 2 Te 3−x S x and Bi 1.4 Sb 0.6 Te 1.8 S 1.2 [46] are promising TIs [47] for potential broader technological applications, since such variations allow further tuning 034204-3 of the electronic and topological properties. Unfortunately, reliable first-principles data are hard if not impossible to obtain for such systems, since prohibitively large supercells are needed to represent such fractional stoichiometries given by As x Sb y Bi 2−x−y S a Se b Te 3−a−b (0 x, y 2, and 0 a, b  3). Generalizing the 2D descriptor discovered above, however, allows one to predict the topological characters for the complete class of tetradymites, including alloyed compounds. For this purpose, we define site-specific atomic numbers as and site-specific electronegativities as In close analogy to the virtual crystal approximation [48] used for first-principles modeling of random alloys, these site-specific properties are weighted averages, i.e., the coefficients x 1 , x 2 , a 1 , a 2 , a 3 ,y 1 , y 2 , b 1 , b 2 , b 3 (each taking the value between 0 and 1) denote the fractional occupancy of each site. Collectively, these coefficients define the stoichiometry of an alloyed system via x = x 1 + x 2 , y = y 1 + y 2 , a = a 1 + a 2 + a 3 , Utilizing the weighted elemental properties to evaluate the descriptor in Eqs. (1) and (2) immediately allows one to map out the topological phase diagram for the tetradymites with any fractional stoichiometry. This is illustrated in Fig. 3 using 4 084 101 alloyed tetradymites as yet still only a subset of the example systems, where the coefficients x 1 , x 2 , a 1 , a 2 , a 3 ,y 1 , y 2 , b 1 , b 2 , b 3 are all multiples of 0.2. Even in this relatively small subset of alloyed tetradymites, we already obtain 1 965 047 systems located on the right side of the dividing line, i.e., they are predicted to be TIs.
Noteworthy enough, the obtained predictions are consistent with the few available experimental data, i.e., the topologically nontrivial alloyed tetradymites Bi 2 Te 1.6 S 1.4 , Bi 0.5 Sb 1.5 Te 3 , Bi 1.5 Sb 0.5 Te 3 , Bi 1.5 Sb 0.5 Se 1.3 Te 1.7 , and Bi 1.1 Sb 0.9 STe 2 [40,47,49]. Taking Bi 2 Te 1.6 S 1.4 as an example, the atomic arrangement within the unit cell is that are still reasonably tractable. Marked as the solid circles (NIs) and squares (TIs), these representative examples cover all the different areas shown in Fig. 3, including the regions close to the NI/TI boundary. Accordingly, four tetradymites sample the TI area (As 0. 33  as well as the NIs (As 0.33 SbBi 0.67 S 1.67 Te 1.33 and As 0.33 Sb 0.67 BiS 2 Te) are located in the region with 6806 < D 1 < 7854 so that their topological character is determined by the descriptor D 2 . For all the eight systems with fractional stoichiometry, the first-principles calculations of the Z 2 invariant fully confirm the predictions of the 2D descriptor. As an example, Fig. 4(a) shows the orbital-decomposed band structure of Sb 2 S 0.33 Se 0.67 Te 2 , in which a band inversion is present at the point between the valence (mainly occupied by the p z orbitals of the Sb atoms) and conduction bands (mainly occupied by the p z orbitals of the Te atoms). Conversely, the band structure of SbBiS 2.33 Se 0.67 shown in Fig. 4(b) exhibits a semiconducting behavior with normal band order. The crystal structures, band structures, and the corresponding evolution of Wannier centers for these eight tetradymites are summarized in Sec. 5 of the Supplemental Material [36]. Due to the moderate system sizes, these layered compounds necessarily exhibit pronounced short-(interlayer) and long-(intralayer) range orders, especially if compared with the much more disordered random alloys studied experimentally. Nonetheless, the (D 1 , D 2 ) descriptor achieves excellent agreement between the SISSO predictions and first-principles calculations as well as the experimental data, further substantiating its reliable and robust nature in determining the topological characters of tetradymites with either integer or fractional stoichiometry.

III. CONCLUSIONS
In summary, we have performed extensive, high-level firstprinciples calculations to obtain reliable information on the topological character of 243 tetradymites, of which many are discovered as TIs. From this training data set, we have obtained a simple predictive 2D descriptor for the TI/NI classification using the AI-based SISSO approach. The descriptor only depends on the atomic number and Pauling electronegativity of the constituent elements, and captures the essential physics governing the TIs. Accordingly, it exhibits perfect classification accuracy and strong predictive power, even for systems drastically beyond the training data, as explicitly demonstrated for alloyed tetradymites compounds, i.e., a material class that can be created by alloying and would be prohibitively expensive to investigate even using the lowest-level first-principles approaches. With this descriptor, we are able to identify all the potential TIs in this complex material class in a fast and reliable fashion, most of which are out of the scope of the recently proposed symmetrybased indicators. The approach established here should be applicable to the classification of the topological characters of many other classes of materials beyond the tetradymite family. The present study therefore offers a major step forward in exploration of the topological materials space, allowing us to investigate complex materials without prior determination of their specific symmetries and detailed band structures. Collectively, these findings also attest to the increasingly important role of such AI-based approaches in modern materials discovery.
The full input and output files for the calculations in this work are available in the NOMAD repository [62].