Engineering Topological Phases Guided by Statistical and Machine Learning Methods

The search for materials with topological properties is an ongoing effort. In this article we propose a systematic statistical method supported by machine learning techniques that is capable of constructing topological models for a generic lattice without prior knowledge of the phase diagram. By sampling tight-binding parameter vectors from a random distribution we obtain data sets that we label with the corresponding topological index. This labeled data is then analyzed to extract those parameters most relevant for the topological classification and to find their most likely values. We find that the marginal distributions of the parameters already define a topological model. Additional information is hidden in correlations between parameters. Here we present as a proof of concept the prediction of the Haldane model as the prototypical topological insulator for the honeycomb lattice in Altland-Zirnbauer (AZ) class A. The algorithm is straightforwardly applicable to any other AZ class or lattice and could be generalized to interacting systems.

The search for materials with topological properties is an ongoing effort. In this article we propose a systematic statistical method supported by machine learning techniques that is capable of constructing topological models for a generic lattice without prior knowledge of the phase diagram. By sampling tight-binding parameter vectors from a random distribution we obtain data sets that we label with the corresponding topological index. This labeled data is then analyzed to extract those parameters most relevant for the topological classification and to find their most likely values. We find that the marginal distributions of the parameters already define a topological model. Additional information is hidden in correlations between parameters. Here we present as a proof of concept the prediction of the Haldane model as the prototypical topological insulator for the honeycomb lattice in Altland-Zirnbauer (AZ) class A. The algorithm is straightforwardly applicable to any other AZ class or lattice and could be generalized to interacting systems.

I. INTRODUCTION
In recent years machine learning techniques have enjoyed growing attention among the physics community. Fueled by popular success in automation across a wide variety of industrial applications, implementations to fundamental research have been proposed. Apart from, for instance, the popularized computer vision application in black hole research [1], a lot of effort has been devoted to increase the efficiency of available algorithms, such as Monte Carlo [2][3][4][5][6] or Density Functional Theory [7][8][9][10]. Moreover, the concept of machine learning has been shown to be able to grasp even the very complex nature of topological phases, finding the correct order parameter by itself [11][12][13]. Successful reports of both, supervised and unsupervised paradigms have been published recently [14][15][16][17]. An overview in terms of an extensive review of machine learning applications to condensed matter physics is also available [18].
In this work, we are proposing a different scheme where we lay emphasis on minimal bias. Rather than speeding up a (in this case) manageable computational task, we aim at machine-assisted learning of previously unknown information using the toolkit of data science/statistics. Specifically we construct, following this scheme, topological models for honeycomb lattices. Dissecting first the well-known Haldane model [19] to benchmark and validate our findings, we then look at the most general model on a honeycomb lattice and use our analysis to extract a topological prototype model for each individual class label. These generated models turn out to be exactly of the Haldane type. This procedure can be generalized to any generic lattice and shows that topological models can be "learned" from the statistics of a randomized data set, not only by a machine since the result is readily comprehensible.
The paper is organized as follows. In Section II we discuss the generation of our data and features. Section III contains the motivation and definition of the quantities used to extract information from the data, which is then applied to the Haldane model in Section IV and a general honeycomb lattice in Section V.

II. DATA GENERATION
We first start by introducing some definitions of quantities that will be used throughout the paper. We define "data" as a set of feature vectors x i with dimension n f (number of features), which can be stacked into a data matrix X = (x 0 , x 1 , x 2 , . . .) T with dimensions n s × n f , where n s is the number of samples or data points. The corresponding labels are stored in variables y i ∈ Z, which can be written as a single vector Y . We denote a specific feature as x j := X ij = [x i ] j , where we omit the sample index if possible. The feature matrix X and the label vector Y are related by a non-linear transformation f , such that f (X) = Y .
Here, we compute the label from X by calculating the topological index (in this case the Chern number) from the model specified by x i (the i-th row of X) where H k (x i ) is the Bloch Hamiltonian of the model and f = C • H. We note that in contrast to many other approaches [20,21] we do not attempt to use machine learning techniques to "learn" the mapping f , which attracts a lot of interest due to the often increased performance of neural network classification compared to the corresponding classical algorithms. Instead, we aim to extract physical information from the structure of the data. Data points are generated by choosing a reference point x ref and subsequently sampling perturbations δ i to this point from suitable random distributions to create a cloud of data points around x ref . For each point we store both x i = δ i and the label y i .

Choice of features
A model describing a quantum material is typically represented in terms of tight-binding parameters, where symmetries are already accounted for. A general representation applicable to multiorbital materials is that of hopping matrix elements or overlap integrals of orbitals. By denoting every parameter t ij (R) with the displacement vector R between the different orbitals, in addition to the site-orbital indices i, j, we have more parameters at our disposal which allow us to break symmetries and potentially discover unknown topological phases. Our feature vector thus consists of all t ij (R) up to a cutoff distance |R|. We note that this choice would pose a great challenge to typical machine learning applications, since not only the computation of the Chern number, but also the diagonalization and construction of the Hamiltonian has to be learned, which would require an extremely complex model. By choosing this most general data set (model parameters, topological class label) we make sure that we can learn about the relation of the topological classification to the physical parameters of the system. In contrast to a similar approach, where machine learning was used to speed up the construction of a tight binding model [22], we are here only interested in extracting previously unknown information from the data that is not otherwise attainable.
We note that, concerning our study on topological phases, this description of quantum materials encloses both, non-interacting electron systems as well as interacting electron systems where the concept of topological Hamiltonian is applicable [23,24]. Since the validity of this topological Hamiltonian is restricted to the weak to intermediate regime of correlations, the self-energy is not strongly momentum-dependent [25]. The weak sensitivity of the topological invariants w.r.t. this momentumdependence [24] suggests that modifications of the local hopping parameters (R = 0) can also describe correlation effects.
For simplicity we work with real features x ∈ R n f . However, overlap integrals t ij are generally complex numbers, not necessarily real, therefore we impose a mapping g : C → R 2 to obtain a real feature vector. For complex parameters natural choices are either (Re(x i ), Im(x i )) or (|x i |, −i log(x i /|x i |)). Since we don't know a priori which is the better choice, we will use in what follows both mappings. For strictly real features we just take the real part of the definition above.
In order to be as unbiased as possible we choose a uniform probability distribution for sampling our features. However, since we do not want to generate too many extremely unphysical data points, we set the sample space independently for each feature x i as Ω α = where B r (x) denotes the solid sphere with radius r, centered at x. The external parameter α := r i /x i ref is the ratio between the spread of the data and the initial value, cf. Fig. 1. The probability density function (PDF) is then given by the uniform distribution on the sample space Ω α ρ α (x) = U (Ω α ). (2) This choice guarantees our two requirements, namely being unbiased and, preserving at least some amount of physicality of our model given a proper choice of the reference point x ref .
Re(x i )

III. STATISTICAL METHOD
After generating a reasonably large data set, we proceed with the analysis of the information contained within.
We find that clustering algorithms, although at first glance perfectly suited for the task, are not very helpful. The reason for that is the high density of data in the investigated domain, what causes the clusters to be infinitely close. For instance, it would be possible to train a logistic classifier to learn the metallic separation line between two insulating phases, however, this is an extremely complicated high-dimensional surface, which lacks interpretability and therefore its usefulness is rather doubtful.
We proceed in a different way, in the first step we extract the most characteristic features from the labeled data. We can define the relevance of a feature through the discrimination between different labels. Restricting the data set to a specific class label will reduce the entropy of certain features, which becomes clear if we interpret the feature data and the label data as separate random variables X and Y , respectively H(X|Y ) = H(X) − I(X; Y ). One expects the reduction in entropy, given by the mutual information I(X; Y ) (Eq. 5), to be a measure for the importance of a feature. Given our particular data at least, we find that this definition lacks robustness with respect to noise and is therefore inapplicable to a general case. We can nevertheless inspect the probability distributions, or rather the frequency/empirical probability, of the individual features.
We restrict our discussion to weakly correlated features and comment on possible treatment of correlations beyond that further below. Comparing probability distributions between different classes should thus yield a measure of importance for the individual features. An illustration of this motivation is provided in Fig. 2, where we show the difference between less important features (x 0 ) and important features (x 1 ). The projection onto the subspace corresponding to label L results in only a minor modification for the former, while the latter deviates substantially.
Illustration of a probability distribution function for two features x0 and x1. When restricted to the data subset with class label L the distribution of feature x1 deviates significantly from the base distribution, i.e. the feature is more important for the classification.
We quantify the difference between two probability distribution functions p(x), q(x) : R → [0, 1] in terms of the Bhattacharyya distance [26] which satisfies D B (p, q) ≥ 0 and D B (p, q) = 0 iff p = q. Thus, according to the argument above, larger values of D B represent a larger importance of the feature. This measure has several advantages over the use of divergences in signal selection [27] and is also used for feature extraction for image recognition [28,29]. We note that, mathematically speaking, D B is not a distance since it does not satisfy the triangle inequality. The related is a true distance function. In our calculations, though, the Bhattacharyya distance proved to be more effective.
By only considering those features with the highest importances we can perform a dimensional reduction on the data set. One could now introduce new features that have an e.g. polynomial dependence on the original features (x i0 , x i1 , ..., x iN , x i0 x i1 , ...). This can be repeated to find a more optimal representation of the data. Albeit conceptionally simple, an actual implementation is not straightforward, though feasible since all operations required in a single step are basically O(N ).
Without introducing the aforementioned features it is unclear how this approach performs if features are correlated, i.e. if phase separation lines do not lie along parameter axes. We employ a twofold analysis where we measure the statistical dependence in terms of a normalized variant of the mutual information, that we call redundancy where I is the mutual information and H(X, Y ) the joint entropy of random variables X, Y Alternatively, when features are dependent on one another we quantify the nature of correlations in terms of the Pearson correlation coefficient (PCC) which can differentiate uncorrelated and positively/negatively correlated features. Technically, the PCC is only good for a linear dependence, considering the limited window of parameter values, though, this method is still applicable and proves to be reliable enough.
We illustrate both the redundancy and the PCC in Fig. 3. Technically, statistical independence and correlations are two different quantities, here we usually use the term "correlations". This simplification is fine since we always look at statistical independence first and discuss statistical correlations only in case of dependent features.
We note that at this point we choose to simplify and only take into account correlations between pairs of features. Generalizations to higher order correlations exist, such as the total correlation [30], however, it is clear that the higher the order of the correlation function the more obvious the result will be in terms of a finite value, since a large number of random variables is less likely to be independent compared to a pair. At the same time the information content of such quantities decreases since one loses the fine granularity. Finding the right balance between complexity and information content is thus very difficult but necessary to fully understand the interplay between parameters.

IV. BENCHMARK CASE: HALDANE MODEL
The Haldane model [19] is defined as where φ ij = ±1 for counter-/clockwise hopping within a hexagon ensures a staggered flux pattern that results in a vanishing overall magnetic field. Since both timereversal and particle hole symmetry are broken, Eq. (8) is an example of a topological insulator in AZ class A [31,32]. One obtains a rich phase diagram, see Fig. 4 for φ = π/2, with a trivial insulator (C = 0) at m/|t 2 | > a, a Chern insulator with topological index C = +1 at 0 < |m|/t 2 < a and a Chern insulator with topological index C = −1 at a < |m|/t 2 < 0. The value of a ∈ R depends on φ and will approach 0 when reaching φ = nπ for n ∈ Z. Implicitly, Eq. (8) assumes a perfect honeycomb. If we relax this requirement we obtain a model with 11 independent parameters namely three nearest neighbor terms t 1 , six next-nearestneighbor terms t 2 and two onsite terms ε i with ε A −ε B = 2m. Due to the requirement that the Hamiltonian be hermitian, ε i must be real. All other parameters are sampled as complex values. Thus, we have nine complex and two real features or equivalently 20 real features. In order to fix the energy scale, one of the onsite terms should always be set to zero, which leaves a total of 19 real features. The order of the complex features is defined in the following way where the superscript index differentiates the three (six) different values of t 1 (t 2 ). The leading 0 corresponds to the onsite energy ε A . We first fix as a reference point the coordinates of the Haldane model with m/t 1 = 1.05, t 2 /t 1 = 0.2, which lies just barely inside the trivial phase region, cf. . The sign change of the next-nearest neighbor term is due to Haldane's requirement that the total flux be zero. We run a fully unbiased sweep, where we draw samples in this 19-dimensional space from the uniform probability density function Eq. (2) with α = 2, which, on the one hand, is large enough to allow for a sign change, but, on the other hand, is small enough not to require an unfeasible number of samples. For each sample the Chern number is computed and stored in the label vector. By using a binning analysis we extract the frequency of different values for all features within the different class labels.
We find a considerable number of non-trivial samples, cf. Fig. 5, even in our totally unbiased approach. This number is large enough to extract useful statistical information. With the given x ref we obtain two topological phases (1, -1), however, data with -1 is less abundant due to the larger distance of x ref from that phase region. The importance scores [Eq. (3)] computed from the distributions are shown in Fig. 6. Here, we show both mappings to the real axis (Re/Im, |.|/ϕ). The mass m is apparently most important, following behind are Re(t 1 ) and the phase of  6. Importance scores in terms of the Bhattacharyya distance for all (real) features. Here, we take into account only the topological class with Chern index C = 1. Most relevant are apparently the mass m, real part and phase of t1 and the phase of t2. We plot a separate bar for every individual hopping vector, equal colors indicate equal lengths.
ranks comparatively low the phase information must relate to the sign. Obviously the real part contains the information about the sign, so we choose here the real part. Therefore, we can restrict the following discussion to the reduced set of 10 out of the total 39 features. We have also trained a random forest classifier on the data and extracted importance scores via the permutation importance, cf. e.g. [33], which resulted in a very similar ranking. The advantage of the present method is that we skip the costly training phase entirely.
Given the importance scores we inspect the underlying distributions more closely. These are expected to show a certain symmetry such that e.g. nearest neighbors are interchangeable. While this is true, here, next-nearest neighbors are divided into two distinct groups, namely those that connect A and B sites, respectively. Thus, we end up with four distinct distributions, for which we show the measured values in Fig. 7.
Having extracted those features that show the clearest statistical response to the change of the topological label or vice versa, the question about the relationships between different features remains open. Due to the extremely unbiased approach and the large number of degrees of freedom therein it is clear that there will be no clearcut distinction between the different phases, since different features can balance one another out. Therefore, we aim here at only finding the characteristic behavior. As a consequence the correlations between any two features are rather small. This is interesting as it demonstrates the stability of the topological phase with respect to noise. Apparently, changing a single hopping parameter-even drastically-can leave the topological phase unchanged. This is also visible in the joint PDFs between any pair of features, which are all close to the independent PDF p(x i , x j ) = p(x i )p(x j ), resulting in small redundancy values. Correlations between many (if not all) features should be present and the corresponding joint PDF contains the complete information about the classification. Nevertheless, the joint PDFs are extremely difficult to interpret.
Finding a prototype feature set for a specific label can intuitively be done by taking the mean of the corresponding data points in case of a symmetric distribution or the peaks in case of an asymmetric distribution. However, this does not always lead to a correct classification, since correlations are neglected. Given the measured frequency of a particular set of features it is apparently more likely j i j i 0.00 0.02 R(ϕi, ϕj ) that for a single sample most values lie close to the respective peaks, while only few deviate significantly. Taking into account the correlation coefficient between the features we can distinguish between actual correlation and noise. We investigate the statistical dependence of the parameters in terms of the redundancy (Eq. 4) in Fig. 8(a), and the Pearson correlation coefficient (Eq. 7) in Fig. 8(c, d). In addition we illustrate the corresponding joint PDF between a pair of features in Fig. 8(b). We find that the nearest-neighbor hoppings are positively correlated in the topological class C = 1 [see Fig. 8(c)], which indicates that the three different values are similar. For the C = −1 class [see Fig. 8(d)], however, we find the opposite sign, i.e. the hopping values are negatively correlated. This means that one or two values have the opposite sign w.r.t. the mean.
Given this information we can construct effective models for the two classes C = 1 and C = −1. To this end we reduce the complexity further by assuming a symmetry between the t 1 and t 2 features. While this is not necessary, as shown by the statistical independence of the parameters in the data [ Fig. 8(a)], it greatly improves the interpretability of the data. Depending on the topological class label and the associated correlations, the hopping terms are either equal or have opposite signs. The t 2 values are split into two independent groups based on the distinct PDFs obtained in the unbiased run. This reduced set of parameters contains seven independent de- grees of freedom vs the original 19.
The improved model with reduced complexity is given by four distinct parameters, i.e. one real onsite term, one complex nearest-neighbor term and two complex nextnearest-neighbor terms. Due to the reduced complexity, a good statistics is obtained at lower sample sizes, allowing for a quicker evaluation. In Fig. 9 we show that the visibility of the non-trivial topological C = +1 phase in the data has greatly improved, which validates the choice of symmetries for our biased model. We use the data obtained from this run to finally settle exemplary values for the prototype model.
By measuring the frequency of the features, distinguished by class labels, cf. Fig. 10, we make an interesting observation. Apparently, choosing the symmetry in the particular way that we did, introduced a certain bias to our model. As a consequence, the nearest-neighbor hopping term is now completely irrelevant for the classification. The next-nearest neighbor terms, though, are showing improved contrast, since there is less possibility for noise, which is also apparent in the redundancy and joint PDF, cf. Fig. 8. While we are able to detect a redundancy in Fig. 8(a), the values are still rather small. As a consequence we can regard the parameters as mostly independent and consider their marginal distributions. The C = −1 phase was not produced in a statistically relevant sample size. We can relate this to the fact that we chose the correlations of the C = +1 phase when setting up symmetries and that the reference point is much closer to the C = +1 phase. Implementing the correlations between the nearest-neighbor hoppings via a sign change will result in a data set with a majority of samples belonging to the C = −1 class. Relative frequency (approximate probability density function) for the mass m, a nearest-neighbor term t1 and two next-nearest neighbor terms t2 for the biased data. Compared to the unbiased data (see Fig. 7) the nearest-neighbor term is suddenly completely indistinguishable between different phases, while the contrast of the next-nearest neighbor terms is increased. The y = −1 label can apparently only appear for a specific sign of the next-nearest neighbor phases.

V. GENERAL HONEYCOMB LATTICE
So far the reference point was carefully chosen to represent the Haldane model and located close to a non-trivial phase to make sure that both trivial and non-trivial samples are produced. In this section we want to test if our analysis also works for cases where no prior information is known. Therefore, we start from a very general honeycomb lattice, where we choose the reference point as where t i = 1/d i is chosen to be the inverse distance of the respective link. t A 0 , t B 0 are set to 0 and 1, respectively, which fixes the scale and units of energy. For the honeycomb lattice odd neighbors come in triplets and even neighbors come in sixtuplets. Therefore, we can write . .. This constitutes a rather generic but realistic a priori ansatz that is known to be topologically trivial. We run a fully unbiased sweep without assuming any symmetries and obtain the data presented in the top row of Fig. 11.
Despite the presumably large distance of the reference point to a topologically non-trivial phase we obtain a reasonable number of non-trivial samples [ Fig. 11(a)]. Apparently, regardless of the greatly increased number of degrees of freedom, the phases of the hopping terms are revealed to be distinctly important, second only to the mass term. We take a look at the PDFs of these features in Fig. 11(b) and observe that the phases for the next-nearest neighbor hoppings are split into two distinct categories. We note that the sign of the class index is reflected in the distribution of the next-nearest neighbor terms. In addition to the known phases from the Haldane model we observe also larger indices ±2 and ±3 (not shown). We compare the PDFs within the four different classes of hopping parameters in terms of D B (p t1,i , p t1,j ) etc., and observe that all distributions are very similar, except the ones of t 2A and t 2B . This observation lends itself as an argument for introducing a symmetry between the hoppings with equal PDFs.
Taking into account this symmetry of the probability density functions and the correlations between features we reduce the model to a six-parameter model with m, t 1 , t 2A , t 2B , t 3 , t 4 , which corresponds to 11 real features instead of the general 37.
Within this symmetrized ("biased") model (bottom row of Fig. 11) we then observe a large number of different class labels. The C = ±1 classes that also appeared in the Haldane model represent by far the largest group of the non-trivial data and show very similar statistics, compare Fig. 11(d) with Fig. 10. The phases of the next-nearest neighbor hoppings have a tendency towards opposite signs between A and B sublattices, which accounts for the vanishing net magnetic field. It is interesting how the added higher-order terms come into play. Statistically speaking, the added third and fourth nearest neighbor terms are irrelevant for the C = ±1 phase, which becomes apparent from the negligible deviation of their probability density functions from the base distribution and the absence of correlations. Obviously, samples of these two classes are continuously connected to the Haldane model. The new information here is that these phases are stable w.r.t. noise and added longer range hopping terms.
During the sampling, especially in the general honeycomb model, it is clear that some combinations of parameters will not produce an insulating phase. Especially among the non-trivial data points we find only a small fraction to be insulating, while the majority lacks a band gap, cf. Fig. 11(a,c). However, in all cases we find topological bands that are clearly separable, which guarantees that the Chern index is well-defined. Although these phases are not insulators at all, we chose to keep them in the initial unbiased run to reduce the amount of samples needed. In fact, comparing the distributions between the topological metals and the topological insulators reveals that the key features are the same, i.e. it is not strictly necessary to discard these data points, although the contrast, and therefore the amount of information, is higher for the insulating phases due to reduced noise. This is reflected in higher importance scores for all features in the topological insulator set compared to the topological metal set. It is possible to increase the insulating fraction by choosing the distribution observed for the topological insulator instead of the uniform distribution for the sampling process. This could be interpreted as learning the ideal distribution for generating topological insulators by  001000000110001101100001  011011100010000001110010  011001010110000101100100  001000000111010001101000  011010010111001100100000  011110010110111101110101   011001000010000001100010   001000000111001101101000  011011110111010101101100   011001010010000001110000  011100100110111101110101  011001000010000001100001  011011100110010000100000  011101000111001001100101  011000010111010000100000  011110010110111101110101  011100100111001101100101  011011000110011000100000  011101000110111100100000  011000010010000001100011  011011110110111101101011  011010010110010100101110  001000000011101000101001  001000000000101001001001  001000000111010001101000   001000000110011001100001   011000010110111001101011  001000000110110101111001   011011010110100101101100  011110010010000001100001  011011100110010000100000  011001100111001001101001  011001010110111001100100  011100110010000100100001  001000000011110000110011   011100110010000100100001  011100110010000100100001  001000000000101001001001  001000000000101001001001  001000000000101001001001  011110010110111101110101  looking at a completely unbiased data set, but performs less than ideal due to the assumption of independence during the sampling process.
In case the features are uncorrelated we can extract an effective model for each topological phase by looking at the peaks and average of the PDFs for each class label. More information, however, is encoded in the PDFs themselves and can be readily inspected due to the dimensional reduction. This information can be a guide to form a decision tree, i.e. understand which parameters must be taken to produce a topological insulator.
The effective model found by our algorithm is shown in Fig. 11(e). For both the unbiased and biased parameter selection we observe the characteristic features of the Haldane model with an added phase on the nearest neighbor hopping and real third-and fourth-neighbor hopping. The latter terms have already been found to be rather unimportant, i.e. the occurrence in our effective model is entirely due to the reference point. The beauty of this result is that by starting from a completely generic topologically trivial honeycomb model we reproduced the Haldane model as the characteristic topological Chern insulator by purely statistical means. Although we did introduce a bias to combat the noise in the data there are traces of the Haldane model already visible in the unbiased data set. The effective models for the C = +1 and C = −1 phase differ only in the sign of the phase in the next-nearest neighbor hopping as is known from Haldane's original work [19].

VI. CONCLUSION & OUTLOOK
We have presented a scheme to learn the characteristics of topological phases and extract minimal models for a specific lattice. Using methods from data science and statistics toolbox we performed dimensional reduction on an initially large feature space by extracting the most relevant features for the classification of each phase. Methods like these are essential to the construction of efficient machine learning models. Instead of training a classifier on the data we chose to inspect only the statistical distributions of the individual parameters and their correlations between one another, which comes at comparably low computational cost, and found that these quantities already contain enough information to extract a prototypical model for each topological phase. In particular, by starting from a generic (far from topological) honeycomb model, we recovered the prototypical Haldane model as the topological model in the Altland-Zirnbauer class A for the honeycomb lattice. It is expected that the method works even better for symmetry protected phases due the much lower potential for noise in models with fewer free parameters. While the presented results are valid only for the non-interacting regime one can use a similar approach to learn about possible topological phases in interacting systems [24].
Our method relies mainly on the inspection of integrated quantities, i.e. distribution functions where all but one features are integrated out. This raises the question if this can still be useful since more often than not phase boundaries are complicated functions of many if not all parameters of the model. However, we have observed that our approach captures the exact same physics as e.g. the permutation importance of random forests at much lower computational cost. In the present work correlations between pairs of features are taken into account, where we constrain the algorithm to features regarded as important in the first place.
By using the bare tight-binding parameters as features we maximize the potential of learning comprehensible information about the data itself, since these parameters carry a straight-forward meaning. The success of the method shows that this information can be easily extracted.
Engineering new features in the data processing phase would allow for a more quantitative description of the phase diagram. To this end one could make use of higherorder correlation functions and try to maximize the importance score of a proposed new feature in an iterative learning algorithm. The prospects of such a method highly depend on the complexity of the model, though.
Besides finding topological models for arbitrary lattices, as demonstrated, the method can be applied for data preparation and feature engineering for machine learning. In particular, by choosing fitted or ab-initio computed parameters as a starting point our method can easily predict the possibility of engineering a topological phase for that particular material as well as a guide to how one could achieve this goal. The task of extracting a prototypical model can be accomplished much easier than with a complicated machine learning model, which by construction is good at predicting but hard to understand. A way to combine both approaches would be to increase the interpretability of machine learning, which has been a highly active field of research in recent years [34][35][36][37]. By performing feature optimization to reduce the complexity of the model we have applied one possible step in this direction in the present work.