Machine learning for complete intersection Calabi-Yau manifolds: a methodological study

We revisit the question of predicting both Hodge numbers $h^{1,1}$ and $h^{2,1}$ of complete intersection Calabi-Yau (CICY) 3-folds using machine learning (ML), considering both the old and new datasets built respectively by Candelas-Dale-Lutken-Schimmrigk / Green-H\"ubsch-Lutken and by Anderson-Gao-Gray-Lee. In real world applications, implementing a ML system rarely reduces to feed the brute data to the algorithm. Instead, the typical workflow starts with an exploratory data analysis (EDA) which aims at understanding better the input data and finding an optimal representation. It is followed by the design of a validation procedure and a baseline model. Finally, several ML models are compared and combined, often involving neural networks with a topology more complicated than the sequential models typically used in physics. By following this procedure, we improve the accuracy of ML computations for Hodge numbers with respect to the existing literature. First, we obtain 97% (resp. 99%) accuracy for $h^{1,1}$ using a neural network inspired by the Inception model for the old dataset, using only 30% (resp. 70%) of the data for training. For the new one, a simple linear regression leads to almost 100% accuracy with 30% of the data for training. The computation of $h^{2,1}$ is less successful as we manage to reach only 50% accuracy for both datasets, but this is still better than the 16% obtained with a simple neural network (SVM with Gaussian kernel and feature engineering and sequential convolutional network reach at best 36%). This serves as a proof of concept that neural networks can be valuable to study the properties of geometries appearing in string theory.


Introduction
The last few years have seen a major uprising of machine learning (ML), and more particularly of neural networks [1][2][3]. This technology is extremely efficient at discovering and predicting patterns and now pervades most fields of applied sciences and of the industry. In view of its versatility, it is likely that ML will find its way towards high-energy and theoretical physics (see [4][5][6][7][8][9] for selected reviews). One of the most critical places where progress can be expected is in understanding the geometries used to describe string compactifications.
String theory is the most developed candidate for a theory of quantum gravity together with the unification of matter and interactions. However, it predicts ten spacetime dimensions: to recover our four-dimensional Universe, it is necessary to compactify six dimensions. For string theory to be a fundamental theory of reality, a single compactification should describe the current Universe (obviously, other compactifications may enter at early or later stages since spacetime is dynamical). Unfortunately, the number of possibilities -forming the so-called string landscape -is huge (numbers as high as 10 272 000 have been suggested for some models) [10][11][12][13][14][15][16][17][18][19], the mathematical objects entering the compactifications are complex and typical problems are often NP-complete, NP-hard, or even undecidable [9,20,21], making an exhaustive classification impossible. Additionally, there is no single framework to describe all the possible (flux) compactifications. As a consequence, each class of models must be studied with different methods. This has prevented any precise connection to the existing and tested theories (in particular, the Standard Model of particle physics) or the proposal of a sharply defined and doable experiment.
Until recently, the string landscape has been studied using different methods: 1) analytic computations for simple examples, 2) general statistics, 3) random scans, 4) algorithmic enumerations of possibilities. This has been a large endeavor of the string community, and we refer to the reviews [9,[22][23][24][25][26] and to references therein for more details. The main objective of such studies is to understand what are the generic predictions of string theory: even if "the" correct compactification has not been found, this helps to narrow down what to look for experimentally. The first conclusion of these studies is that compactifications giving an effective theory close to the Standard Model are scarce. 1 Each of the four approaches display different limitations: 1) lacks of genericity, 2) is too much general, 3) ignores the structure of the landscape and has few chances to discover rare compactifications, 4) requires too much computational power to move beyond "simple" examples. As a result, no major phenomenological progress has been seen in the last decade and finding a physical compactification looks still as remote. In reaction to these difficulties and starting with the seminal paper [27], new investigations based on ML appeared in the recent years, focusing on different aspects of the string landscape and of the geometries used in compactifications  (see also [52][53][54][55][56][57][58][59][60][61][62][63][64][65] for related works). For more context and a summary of the state of the art, the reader is referred to the excellent review [9]. ML is extremely adequate when it comes to pattern search, which motivates two main applications to string theory: 1) explore systematically a space of possibilities (if they are not random, ML should be able to find a pattern, even if it is too complicated to be formulated explicitly), 2) obtain approximate results on distributions from which mathematical formulas can be deduced.
We want to address the question of computing the Hodge numbers h 1,1 and h 2,1 (positive integers) for complete intersection Calabi-Yau (CICY) 3-folds [66] using different machine learning algorithms. A CICY is completely specified by its configuration matrix (with entries being positive integers), which is the basic input of the algorithms. The CICY 3-folds are the simplest Calabi-Yau and they have been well studied. In particular, they have been completely classified and their topological properties computed [67][68][69] (see [70][71][72][73] for reviews). For these reasons, they provide an excellent sandbox to test ML algorithms in a controlled environment. More particularly, simple tests show that the task is difficult for simple ML algorithms -even neural networks -such that this is an interesting challenge to solve before moving to more difficult problems.
The goal is to predict two positive integers from a matrix of positive integers. This task is complicated by various redundancies in the description (such as an independence in the permutations of lines and columns). A simple sequential network taking only the matrix as input performs badly, especially for h 2,1 . As a consequence, more advanced methods are needed. While usual physics application of ML reduces to feeding a (big) sequential neural network with raw data, real-world applications are built following a more general workflow [3,74,75]: 1) understanding of the problem, 2) exploratory data analysis (EDA), 3) design of a baseline, 4) definition of a validation strategy, 5) feature engineering and selection, 6) design of ML models, 7) ensembling.
While the first step is straightforward, it is still interesting to notice that computations involved in string geometries (using algebraic topology) are far from standard applications of ML algorithms, which makes the problem even more interesting. EDA aims at understanding better the dataset, in particular, by finding how the variables are distributed, correlated, determining if there are outliers, etc. This analysis naturally leads to designing new features from the existing ones, which is called feature engineering. Indeed, putting derived features by hand may make the data more easily understandable by the ML algorithms, for example by emphasizing important properties. 2 This phase is followed by feature selection, where different set of features are chosen according to the need of each algorithm from step 6). In between, one needs to set up a validation strategy to ensure that the predictions appropriately reflect the real values, together with a baseline model, which gives a lower bound on the accuracy together with a working pipeline. 3 For instance, we find that a simple linear regression using the configuration matrix as input gives 43.6-48.8 % for h 1,1 and 9.6-10.4 % for h 2,1 using from 20% to 80% of data for training. Hence, any algorithm must do better than this to be worth considering. Finally, we can build different models in step 6), in particular, by considering different topologies of neural networks beyond the simplest sequential models. The last optional step consists in combining different models together in order to improve the results. With respect to the whole process, the purpose of this paper is also pedagogical and aims at exemplifying how these steps are performed in an applied ML project.
There is a finite number of 7890 CICY 3-folds. Due to the freedom in representing the configuration matrix, two datasets have been constructed: the "original dataset" [67,68] and the "favourable dataset" [69]. A configuration matrix is said to be favorable if its second cohomology descends completely from the second cohomology of the ambient space: this implies that h 1,1 equals the number of projective spaces in the ambient space [69,76]. In the "favourable dataset", all configuration matrices are favorable whenever possible (99.1 %), whereas in the "original dataset" only 61.8 % of the matrices are favorable. Both datasets will be described in more details in Section 2.2.
Our analysis continues and generalizes [30,33] at different levels. We compute h 2,1 , which has been ignored in [30,33], where the authors argue that it can be computed from h 1,1 and from the Euler characteristics (a simple formula exists for the latter). In our case, we want to push the idea of using ML to learn about the physics (or the mathematics) of CY to its very end: we assume that we do not know anything about the mathematics of the CICY, except that the configuration matrix is sufficient to derive all quantities. Moreover, we have already mentioned that ML algorithms have rarely been used to derive data in algebraic topology, which can be a difficult task. For this reason, obtaining also h 2,1 from ML techniques is an important first step towards using ML for more general problems in string geometries. In particular, this helps to prepare the study of CICY 4-folds (classified in [77]) for which there are four Hodge numbers which are expected to be even more difficult to compute. Finally, regression is also more useful for extrapolating results: a classification approach assumes that we already know all the possible values of the Hodge numbers and has difficulties to predict labels which do not appear in the training set. This is necessary when we move to a dataset for which not all topological quantities have been computed, for instance CY constructed from the Kreuzer-Skarke list of polytopes [78].
In this paper, we compare the performances of the following algorithms: linear regression, support vector machines (SVM) with linear and Gaussian kernels, decision trees and ensemble thereof -random forests and gradient boosting -, and deep neural networks. The best results obtained with and withou feature engineering are displayed in Figure 1 for the old dataset. We find that, in all cases except neural networks, using engineered features greatly enhance the performances. The EDA reveals that the number of projective spaces forming the ambient space (equal to the number of rows) is a particularly distinguished feature. In fact, all algorithms yield an accuracy of 99-100 % for h 1,1 in the favorable dataset. For the linear regression, this directly gives the well-known results [69] that h 1,1 equals the number of projective spaces for favorable configuration matrix. In the case of the original dataset, the best model is a neural network inspired by Google's Inception model [79][80][81], which allows to reach nearly 100 % accuracy. This neural network is further studied in [82]. The algorithms are not as successful for h 2,1 , with the Inception model giving again the best result, close to 50 % accuracy -which is still much better that what the baseline or simple models do. We leave improving the computation of h 2,1 and interpreting what the different algorithms learn for a future work.
This paper is organized as follows. In Section 2, we first recall the definition of Calabi-Yau manifolds (Section 2.1) and describe the two existing CICY datasets (Section 2.2). We then engineer new features before performing an EDA for both datasets (Section 2.3), reproducing some well-known figures from the literature. Then, in Section 3, we implement the different ML algorithms. Our paper culminates in the description of the Inception-like neural network in Section 3.6.3 where we reach the highest accuracy. Finally, we discuss our results in Section 4. Appendix A contains details on the different algorithms used in this paper.

Data Analysis
In this section, we introduce Calabi-Yau (CY) manifolds before describing the two datasets of CICY manifolds (Section 2.2). Since the CICY have been completely classified, they provide a good opportunity for testing ideas from ML in a controlled setting. In order to select the most appropriate learning algorithm, we perform a preliminary exploratory data analysis (EDA) in Section 2.3.   The plots show the best accuracy reached by the models considered in this paper for the old dataset. The models are trained to predict separately h 1,1 and h 2,1 and using 30% and 80% of the data for training.

Calabi-Yau Manifolds
A CY n-fold is a n-dimensional complex manifold X with SU(n) holonomy (they have 2n real dimensions). An equivalent definition is the vanishing of its first Chern class. A standard reference for the physicist is [71] (see also [72,73] for useful references). The most relevant case for superstring compactifications are CY 3-folds. Indeed, superstrings are well-defined only in 10 dimensions: in order to recover a 4-dimensional theory, it is necessary to compactify 6 dimensions [71]. Importantly, the compactification on a CY leads to the breaking of a large part of the supersymmetry, which is phenomenologically more realistic.
Calabi-Yau manifolds are characterized by a certain number of topological properties, the most salient being the Hodge numbers h 1,1 and h 2,1 , counting respectively the Kähler and complex structure deformations, and the Euler characteristics 4 Interestingly, topological properties of the manifold directly translates into features of the 4-dimensional effective action (in particular, the number of fields, the representations and the gauge symmetry) [71,90]. 5 In particular, the Hodge numbers count the number of chiral multiplets (in heterotic compactifications) and the number of hyper-and vector multiplets (in type II compactifications): these are related to the number of fermion generations (3 in the Standard Model) and is thus an important measure of the distance to the Standard Model.
The simplest CYs are constructed by considering the complete intersection of hypersurfaces in a product A of projective spaces P ni (called the ambient space) [66-69, 72, 91]: Such hypersurfaces are defined by homogeneous polynomial equations: a Calabi-Yau X is described by the solution to the system of equations, i.e. by the intersection of all these surfaces (that the intersection is "complete" means that the hypersurface is non-degenerate).
To gain some intuition, consider the case of a single projective space P n with (homogeneous) coordinates Z I , I = 0, . . . , n. In this case, a codimension 1 subspace is obtained by imposing a single homogeneous polynomial equation of degree a on the coordinates p a (Z 0 , . . . , Z n ) = P I1···Ia Z I1 · · · Z Ia = 0, p a (λZ 0 , . . . , λZ n ) = λ a p a (Z 0 , . . . , Z n ). ( This construction can be generalized to include m projective spaces and k equations, which can mix the coordinates of the different spaces. A CICY 3-fold X as a topological manifold is completely specified by a configuration matrix denoted by the same symbol as the manifold: where the coefficients a r α are positive integers and satisfy the following constraints The first relation states that the dimension of the ambient space minus the number of equations equals the dimension of the CY 3-fold. The second set of constraints arise from the vanishing of its first Chern class; they imply that the n i can be recovered from the matrix elements. In this case also, two manifolds described by the same configuration matrix but different polynomials are equivalent as real manifold (they are diffeomorphic) -and thus as topological manifolds -, but they are different as complex manifolds. Hence, it makes sense to write only the configuration matrix.
A given topological manifold is not described by a unique configuration matrix. First, any permutation of the lines and columns leave the intersection unchanged (it amounts to relabelling the projective spaces and equations). Secondly, two intersections can define the same manifold. The ambiguity in the line and column permutations is often fixed by imposing some ordering of the coefficients. Moreover, in most cases, there is an optimal representation of the manifold X, called favourable [69]: in such a form, topological properties of X can be more easily derived from the ambient space A.

Datasets
Simple arguments [66,67,70] show that the number of CICY is necessarily finite due to the constraints (2.6) together with identities between complete intersection manifolds. The classification of the CICY 3-folds has been tackled in [67], which established a dataset of 7890 CICY. 6 The topological properties of each of these manifolds have been computed in [68]. More recently, a new classification has been performed [69] in order to find the favourable representation of each manifold whenever it is possible.
Below we show a list of the CICY properties and of their configuration matrices: • general properties • "original dataset" [67,68] maximal size of the configuration matrices: 12 × 15 6 However, there are redundancies in this set [67,69,92]; this fact will be ignored in this paper. The configuration matrix completely encodes the information of the CICY and all topological quantities can be derived from it. However, the computations are involved and there is often no closed-form expression. This situation is typical in algebraic geometry, and it can be even worse for some problems, in the sense that it is not even known how to compute the desired quantity (think to the metric of CYs). For these reasons, it is interesting to study how we can retrieve these properties using ML algorithms. In the current paper, following [30,33], we focus on the computation of the Hodge numbers with the initial scheme: Input: configuration matrix −→ Output: Hodge numbers (2.7) To provide a good test case for the use of ML in context where the mathematical theory is not completely understood, we will make no use of known formulas.

Exploratory Data Analysis
A typical ML project does not consist of feeding the raw data -here, the configuration matrix -to the algorithm. It is instead preceded by a phase of exploration in order to better understand the data, which in turn can help to design the learning algorithms. We call features properties given as inputs, and labels the targets of the predictions. There are several phases in the exploratory data analysis (EDA): 1. feature engineering: new features are derived from the inputs; 2. feature selection: the most relevant features are chosen to explain the targets; 3. data augmentation: new training data is generated from the existing ones; 4. data diminution: part of the training data is not used.
For pragmatical introductions, the reader is refereed to [74,75]. Engineered features are redundant, by definition, but can help the algorithm learn more efficiently by providing an alternative formulation and by drawing attention on salient characteristics. A simple example is the following: given a series of numbers, one can compute different statistics -median, mean, variance, etc. -and add them to the inputs. It may happen that the initial series becomes then irrelevant once this new information is introduced.
Another approach to improve the learning process is to augment or decrease the number of training samples artificially. For example, one can use invariances of the inputs to generate more training data. This does not help in our case because the entries of the configuration matrices are partially ordered. Another possibility is to remove outliers which can damage the learning process by driving the algorithm far from the best solution. If there are few of them, it is better to ignore them altogether during training since an algorithm which is not robust to outliers will in any case make bad predictions (a standard illustration is given by the Pearson and Spearman correlation coefficients, with the first not being robust to outliers [75]).
Finding good features and selecting those to keep requires trials and errors. In general, it is not necessary to keep track of all steps, but we feel that it is useful to do so in this paper for a pedagogical purpose.
Before starting the EDA, the first step should be to split the data into training and validation sets to avoid biasing the choices of the algorithm and the strategy: the EDA should be performed only on the training set. However, the dataset we consider is complete and quite uniform: a subset of it would display the same characteristics as the entire set. To give a general overview of the properties -which can be useful for the reader interested in understanding the statistics of the CICY and for applications to string compactifications -we work with the full dataset.

Engineering
Any transformation of the input data which has some mathematical meaning can be a useful feature. We have established the following list of possibly useful quantities (most of them are already used to characterise CICY in the literature [71]): • the number of projective spaces (rows), m = num_cp; • the number of equations (columns), k = num_eqs; • the number of P 1 , f = num_cp_1; • the number of P 2 , num_cp_2; • the number of P n with n = 1, F = num_cp_neq1; • the excess number N ex = F r=1 (n r + f + m − 2k) = num_ex; • the dimension of the cohomology group H 0 of the ambient space, dim_h0_amb; • the Frobenius norm of the matrix, norm_matrix; • the list of the projective space dimensions dim_cp and statistics thereof (min, max, median, mean); • the list of the equation degrees deg_eqs and statistics thereof (min, max, median, mean); • k-means clustering on the components of the configuration matrix (with a number of clusters going from 2 to 15); 7 • principal components of the configuration matrix derived using a principal components analysis (PCA) with 99% of the variance retained (see Figure 3).

Selection
Correlations To get a first general idea, it is useful to take a look at the correlation matrix of the features and the labels. 8 The correlation matrices for the scalar variables are displayed in Figure 4 for the original and favourable datasets (this excludes the configuration matrix). As we can see, some engineered features are strongly correlated, especially in the favourable dataset. In particular h 1,1 (respectively h 2,1 ) correlates (respectively anti-correlates) strongly with the number of projective spaces m and with the norm and rank of the matrix. This gives a first hint that these variables could help improve predictions by feeding them to the algorithm along with the matrix. On the other hand, finer information on the number of projective spaces and equations do not correlate with the Hodge numbers.
From this analysis, in particular from Figure 4, we find that the values of h 1,1 and h 2,1 are also correlated. This motivates the simultaneous learning of both Hodge numbers since it can increase chances for the neural network to learn more universal features. In fact, this is something that often happens in practice: counter-intuitively, it has been found that multi-tasking enhances the ability to generalize [93][94][95][96][97]. 7 The algorithm determines the centroids of conglomerates of data called clusters using an iterative process which computes the distance of each sample from the center of the cluster. It then assigns the label of the cluster to the nearest samples. We used the class cluster.KMeans in scikit-learn. 8 The correlation is defined as the ratio between the covariance of two variables σ(x, y) = i (x i −x)(y i − y) and the product of the standard deviations σ(x)σ(y) (in this casex andȳ are the sample means).  Feature importance A second option is to sort the features by order of importance. This can be done using a decision tree which is capable to determine the weight of each variable towards making a prediction. One advantage over correlations is that the algorithm is non-linear and can thus determine subtler relations between the features and labels. To avoid biasing the results using only one decision tree, we trained a random forest of trees (using ensemble.RandomForestRegressor in scikit-learn). It consists in a large number of decision trees which are trained on different random subsets of the training dataset and averaged over the outputs (see also Section 3.5 and appendix A.3). The algorithm determines the importance of the different features to make predictions as a by-product of the learning process, because the most relevant features tend to be found at the first branches since they are the most important to make the prediction. The importance of a variable is a number between 0 and 1, and the sum over all of them must be 1. Since a random forest contains many trees, the robustness of the variable ranking usually improves with respect to a single tree (Appendix A.3). Moreover, as the main objective is to obtain a qualitative preliminary understanding of the features, there is no need for fine tuning at this stage and we use the default parameters (in particular, 100 decision trees). We computed feature importance for both datasets and for two different set of variables: one containing the engineered features and the configuration matrix, and one with the engineered features and the PCA components. In the following figures, we show several comparisons of the importance of the features, dividing the figures into scalars, vectors and configuration matrix (or its PCA), and clusters. The sum of importance of all features equals 1.
In Figure 5, we show the ranking of the scalar features in the two datasets (differences between the set using the configuration matrix and the other using the PCA are marginal and are not shown to avoid redundant plots). As already mentioned, we find again that the number of projective spaces is the most important feature by far. It is followed by the matrix norm in the original dataset, and by the matrix rank for h 2,1 in the favourable dataset, but in a lesser measure. Finally, it points out that the other features have a negligible impact on the determination of the labels and may as well be ignored during training.
The same analysis can be repeated for the vector features and the configuration matrix component by component. In Figure 6, we show the cumulative importance of the features (i.e. the sum of the importance of each component). We can appreciate that the list of the projective space dimensions plays a major role in the determination of the labels in both datasets. In the case of h 2,1 , we also have a large contribution from the dimensions of the cohomology group dim_h0_amb, as can be expected from algebraic topology [71].
In Figure 7, we finally show the importance associated to the number of clusters used during the EDA: no matter how many clusters we use, their relevance is definitely marginal compared to all other features used in the variable ranking (scalars, vectors, and the configuration matrix or its PCA) for both datasets.  Figure 5: Importance of the scalar features in the datasets. The same computation involving the PCA of the configuration matrix shows very marginal differences in this case: the importance of the scalar features is mostly unchanged, especially for the higher ranked variables.
Conclusion It seems therefore that the number of projective spaces plays a relevant role in the determination of h 1,1 and h 2,1 as well as the list of dimensions of the projective spaces. In order to validate this observation, in Figure 8 we present a scatter plot of the Hodge number distributions versus the number of projective spaces: it shows that there is indeed a linear dependence in m for h 1,1 , especially in the favourable dataset. In fact, the only exceptions to this pattern in the latter case are the manifolds which do not have a favourable embedding [69]. Hence, a simple data analysis hints naturally towards this mathematical result. Finally, we found other features which may be relevant and are worth to be included in the algorithm: the matrix rank and norm, the list of projective space dimensions and of the associated cohomology dimensions. However, we want to emphasize one caveat to this analysis: correlations look only for linear relations, and the random forest has not been optimized or could just be not powerful enough to make good predictions. This means that feature selection just gives a hint but it may be necessary to adapt.

Removing Outliers
The Hodge number distributions (Figures 2 and 9) display few outliers which lie outside the tail of the main distributions. Such outliers may negatively impact the learning process and drive down the accuracy: it makes sense to remove them from the training set.
It is easy to see that the 22 outlying manifolds with h 1,1 = h 2,1 = 0 are product spaces, recognisable from their block-diagonal matrix. Moreover, we will also remove outliers with h 1,1 = 19 and h 2,1 > 86, which represent 15 and 2 samples. In total, this represents 39 samples, or 0.49 % of the total data.
To simplify the overall presentation and because the dataset is complete, we will mainly focus on the pruned subset of the data obtained by removing outliers, even from the test  Figure 6: Importance of the vector features the configuration matrix (or its principal components) in the datasets: notice how the PCA plays a much more important role with respect to the full configuration matrix.  Figure 7: Incidence of the numbers of clusters on the variable ranking. Also in this case the difference between using the configuration matrix or its PCA is marginal and actually the clusters have even lower importance when using the latter. We therefore avoid presenting nugatory information and show only the importance of clusters when using the configuration matrix.
set. 9 This implies that Hodge numbers lie in the ranges 1 ≤ h 1,1 ≤ 16 and 15 ≤ h 2,1 ≤ 86. Except when stated otherwise, accuracy is indicated for this pruned dataset. Obviously, the very small percentage of outliers makes the effect of removing them from the test set negligible when stating accuracy.

Machine Learning Analysis
In this section, we compare the performances of different ML algorithms: linear regression, SVM, random forests, gradient boosted trees and neural networks. Before reporting the results for each algorithm, we detail the feature selection (Section 3.1) and the evaluation strategy (Section 3.2). We obtain the best results in Section 3.6.3 where we present a neural network inspired by the Inception model [79][80][81]. We provide some details on the different algorithms in Appendix A and refer the reader to the literature [1-3, 5, 7, 9, 74, 75] for more details.

Feature Extraction
In Section 2, the EDA showed that several engineered features are promising for predicting the Hodge numbers. In what follows, we will compare the performances of various algorithms using different subsets of features: • only the configuration matrix (no feature engineering); • only the number of projective spaces m; • only a subset of engineered features and not the configuration matrix nor its PCA; • a subset of engineered features and the PCA of the matrix. 9 There is no obligation to use a ML algorithm to label outliers in the training set, it is perfectly fine to decide which data to include or not, even based on targets. However, for a real-world application, outliers in the test set should be labeled by some process based only on the input features. Flagging possible outliers may improve the predictions by helping the machine understand that such samples require more caution. Following the EDA and feature engineering, we finally select the features we use in the analysis by choosing the highest ranked features. We will therefore keep the number of projective spaces (num_cp in the dataset) and the list of the dimension of the projective spaces (dim_cp) for both h 1,1 and h 2,1 ). We will also include the dimension of the cohomology group of the ambient space dim_h0_amb but only for h 2,1 .

Analysis Strategy
For the ML analysis, we split the dataset into training and test sets: we fit the algorithms on the first and then show the predictions on the test set, which will not be touched until the algorithms are ready.

Test split and validation
The training set is made of 90 % of the samples for training, which leaves the remaining 10 % in the test set (i.e. 785 manifolds out of the 7851 in the set). 10 For most algorithms, we use leave-one-out cross-validation on the training set as evaluation of the algorithm: we subdivide the training set in 9 subsets, each of them containing 10 % of the total amount of samples, then, we train the algorithm on 8 of them and evaluate it on the 9th. We then repeat the procedure changing the evaluation fold until the algorithm has been trained and evaluated on all of them. The performance measure in validation is given by the average over all the left out folds.
When training neural networks, we will however use a single holdout validation set made of 10 % of the total samples.

Predictions and metrics
Since we are interested in predicting exactly the Hodge numbers, the appropriate metric measuring the success of the predictions is the accuracy (for each Hodge number separately): where N is the number of samples. In the paper, accuracy of the predictions on the test set is rounded to the nearest integer.
Since the Hodge numbers are integers, the problem of predicting them looks like a classification task. However, as argued in the introduction, we prefer to use a regression approach. Indeed, regression does not require to specify the data boundaries and allows to extrapolate beyond them, contrary to a classification approach where the categories are fixed at the beginning. 11 Most algorithms need a differentiable loss function since the optimization of parameters (such as neural networks weights) uses some variant of gradient descent. For this reason, the accuracy cannot be used and the models are trained by minimizing the mean squared error (MSE), which is simply the squared 2 -norm between of the difference between the predictions and the real values. There will however be also a restricted number of cases in which we will use either the mean absolute error (MAE), which is the 1 -norm of the same difference, or a weighted linear combination of MSE and MAE (also known as Huber loss): we will point them out at the right time. When predicting both Hodge numbers together, the total loss is the sum of each individual loss with equal weight: h 1,1 is simpler to learn so it is useful to put emphasis on learning h 2,1 , but the magnitudes of the latter are higher, such that the associated loss is naturally bigger (since we did not normalize the data).
Since predictions are real numbers, we need to turn them into integers. In general, rounding to the nearest integer gives the best result, but we found algorithms (such as linear regression) for which flooring to the integer below works better. The optimal choice of the integer function is found for each algorithm as part of the hyperparameter optimization (described below). The accuracy is computed after the rounding stage.
Learning curves for some models are displayed. They show how the performances of a model improves by using more training data, for fixed hyperparameters. To obtain it, we train models using from 10 % to 90 % of all the data ("training ratio") and evaluate the accuracy on the remaining data. 12 To avoid redundant information and to avoid cluttering the paper with graphs, the results for models predicting separately the Hodge numbers for the test set are reported in tables, while the results for the models predicting both numbers together are reported in the learning curves. For the same reason, the latter are not displayed for the favourable dataset.
Visualisation of the performance Complementary to the predictions and the accuracy results, we also provide different visualisations of the performance of the models in the form of univariate plots (histograms) and multivariate distributions (scatter plots).
The usual assumption behind the statistical inference of a distribution is that the difference between the observed data and the predicted values can be modelled by a random variable called residual [98,99]. 13 As such we expect that its values can be sampled from 11 A natural way to transform the problem in a regression task is to normalize the Hodge numbers, for example by shifting by the mean value and diving by the standard deviation. Under this transformation, the Hodge numbers are mapped to real numbers. While normalizing often improve ML algorithms, we found that the impact was mild or even negative. 12 Statistics are not provided due to the limitations of our available computational resources. However, we check manually on few examples that the reported results are typical. 13 The difference between the non observable true value of the model and the observed data is known as statistical error. The difference between residuals and errors is subtle but the two definitions have different interpretations in the context of the regression analysis: in a sense, residuals are an estimate of the errors. a normal distribution with a constant variance (i.e. constant width), since it should not depend on specific observations, and centered around zero, since the regression algorithm tries to minimise the squared difference between observed and predicted values. Histograms of the residual errors should therefore exhibit such properties graphically.
Another interesting kind of visual realisation of the residuals is to show their distribution against the variables used for the regression model: in the case of a simple regression model in one variable, it is customary to plot the residuals as a function of the independent variable, but in a multivariable regression analysis (such as the case at hand) the choice usually falls on the values predicted by the fit (not the observed data). We shall therefore plot the residuals as functions of the predicted values. 14 Given the assumption of the random distribution of the residuals, they should not present strong correlations with the predictions and should not exhibit trends. In general the presence of correlated residuals is an indication of an incomplete or incorrect model which cannot explain the variance of the predicted data, meaning that the model is either not suitable for predictions or that we should add information (that is, add features) to it.
Hyperparameter optimisation One of the key steps in a ML analysis is the optimisation of the hyperparameters of the algorithm. These are internal parameters of each estimator (such as the number of trees in a random forest or the amount of regularisation in a linear model): they are not modified during the training of the model, but they directly influence it in terms of performance and outcome.
Hyperparameter optimization is performed by training many models with different hyperparameters, and keeping those which perform best according to some metric on the validation set(s). As it does not need to be differentiable, we use the accuracy as a scoring function to evaluate the models. There is however subtle issue because it is not clear how to combine the accuracy of h 1,1 and h 2,1 to get a single metric. For this reason, we will perform the analysis on both Hodge numbers separately. Then, we can design a single model computing both Hodge numbers simultaneously by making a compromise by hand between the hyperparameters found for the two models computing the Hodge numbers separately.
The optimization is implemented using the API from scikit-learn, using the function metrics.make_scorer and the accuracy as a custom scoring function. There are several approaches to perform this search automatically, in particular: grid search, random search, genetic evolution, and Bayes optimization.
Grid and random search are natively implemented in scikit-learn. The first takes a list of possible discrete values of the hyperparameters and will evaluate the algorithm over all possible combinations. The second samples values in both discrete sets and continuous intervals according to some probability distributions, repeating the process a fixed number of times. The grid search method is particularly useful for discrete hyperparameters, less refined searches or for a small number of combinations, while the second method can be used to explore the hyperparameter space on a larger scale [100]. Genetic algorithms are based on improving the choice of hyperparameters over generations that successively select only the most promising values: in general, they require a lot of tuning and are easily influenced by the fact that the replication process can also lead to worse results totally at random [101]. They are however effective when dealing with very deep or complex neural networks.
Bayes optimisation [102,103] is a very well established mathematical procedure to find the stationary points of a function without knowing its analytical form [104]. It relies on assigning a prior probability to a given parameter and then multiply it by the probability distribution (or likelihood) of the scoring function to compute the probability of finding a better results given a set of hyperparameters. This has proven to be very effective in our case and we adopted this solution as it does not require fine tuning and leads to better results for models which are not deep neural networks. We choose to use scikit-optimize [105] whose method BayesSearchCV has a very well implemented Python interface compatible with scikit-learn. We will in general perform 50 iterations of the Bayes search algorithm, unless otherwise specified.

Linear Models
Linear models attempt to describe the labels as a linear combinations of the input features while keeping the coefficients at order one (Appendix A.1). However, non-linearity can still be introduced by engineering features which are non-linear in terms of the original data.
From the results of Section 2.3, we made a hypothesis on the linear dependence of h 1,1 on the number of projective spaces m. As a first approach, we can try to fit a linear model to the data as a baseline computation and to test whether there is actual linear correlation between the two quantities. We will consider different linear models, including their regularised versions.
Parameters The linear regression is performed with the class linear_model.ElasticNet from scikit-learn. The hyperparameters involved in this case are: the amount of regularisation α, the relative ratio (l1_ratio) between the 1 and 2 regularization losses, and the fit of the intercept.
By performing the hyperparameter optimization, we found that 2 regularization has a minor impact and can be removed, which corresponds to setting the relative ratio to 1 (this is equivalent to using linear_model.Lasso).
In Table 1 we show the choices of the hyperparameters for the different models we built using the 1 regularised linear regression.
For the original dataset, we floored the predictions to the integers below, while in the favourable we rounded to the next integer. This choice for the original dataset makes sense: the majority of the samples lie on the line h 1,1 = m, but there are still many samples with h 1,1 > m (see Figure 8). As a consequence, the ML prediction pulls the line up, which can only damage the accuracy. Choosing the floor function is a way to counteract this effect. Note that accuracy for h 2,1 is only slightly affected by the choice of rounding, so we just choose the same one as h 1,1 for simplification.  Table 1: Hyperparameter choices of the 1 regression model used. In addition to the known hyperparameters α and fit_intercept, we also include the normalize parameter which indicates whether the samples have been centered and scaled by their 2 norm before the fit: it is ignored when the intercept is ignored.

Results
In Table 2, we show the accuracy for the best hyperparameters. For h 1,1 , the most precise predictions are given by the number of projective spaces which actually confirms the hypothesis of a strong linear dependence of h 1,1 on the number of projective spaces. In fact, this gives close to 100% accuracy for the favourable dataset, which shows that there is no need for more advanced ML algorithms. Moreover, adding more engineered features decreases the accuracy in most cases where regularization is not appropriate. The accuracy for h 2,1 remains low but including engineered features definitely improves it.
In Figure 10, we show the plots of the residual errors of the model on the original dataset. For the 1 regularised linear model, the univariate plots show that the errors seem to follow normal distributions peaked at 0 as they generally should: in the case of h 1,1 , the width is also quite contained. The scatter plots instead show that, in general, there is no correlation between a particular sector of the predictions and the error made by the model, thus the variance of the residuals is in general randomly distributed over the predictions. Only the case of the fit of the number of projective spaces seems to show a slight correlation for h 2,1 , signalling that the model using only one feature might be actually incomplete: in fact it is better to include also other engineered features.
The learning curves ( Figure 11) clearly shows that the model underfits. Moreover, we also noticed that the models are only marginally affected by the number of samples used for training. In particular, this provides a very strong baseline for h 1,1 . For comparison, we also give the learning curve for the favourable dataset in Figure 12: this shows that a linear regression is completely sufficient to determine h 1,1 in that case.

Support Vector Machines
Support Vector Machines (SVM) are a family of algorithms which use a kernel trick to map the space of input data vectors into a higher dimensional space where samples can be accurately separated and fitted to an appropriate curve (Appendix A.2).
In this analysis, we show two such kernels, namely a linear kernel (also known as no kernel since no transformations are involved) and a Gaussian kernel (known as rbf in ML literature, from radial basis function).

Linear Kernel
For this model we use the class svm.LinearSVR in scikit-learn. Table 3 we show the choices of the hyperparameters used for the model. As we show in Appendix A.2 parameters C and are related to the penalty assigned to the samples lying outside the no-penalty boundary (the loss in this case is computed according to the 1 or 2 norm of the distance from the boundary as specified by the loss hyperparameter). Other parameters are related to the use of the intercept to improve the prediction.

Parameters In
We rounded the predictions to the floor for the original dataset and to the next integer for the favourable dataset.   The parameter intercept_scaling is clearly only relevant when the intercept is used. The different losses used simply distinguish between the 1 norm of the -dependent boundary where no penalty is assigned and its 2 norm.

Results
In Table 4, we show the accuracy on the test set for the linear kernel. As we can see the performance of the algorithm strongly resembles a linear model in terms of the accuracy reached.
It is interesting to notice that the contributions of the PCA do not improve the predictions using just the engineered features: it seems that the latter work better than using the configuration matrix or its principal components.
The residual plots in Figure 13 confirm what we already said about the linear models with regularisation: the model with only the number of projective spaces shows a tendency to heteroscedasticity 15 which can be balanced by adding more engineered feature, also helping in having more precise predictions (translated into peaked univariate distributions). In all cases, we notice that the model slightly overestimates the real values (residuals are computed as the difference between the prediction and the real value) as the second, small peaks in the histograms for h 1,1 suggest: this may also explain why flooring the predictions produces the highest accuracy.
As in general for linear models, the influence of the number of samples used for training is marginal also in this case: we only noticed a decrease in accuracy when also including the PCA or directly the matrix.

Gaussian Kernel
We then consider SVM using a Gaussian function as kernel. The choice of the function can heavily influence the outcome of the predictions since they map the samples into a much higher dimensional space and create highly non-linear combinations of the features before fitting the algorithm. In general, this can help in the presence of "obscure" features which badly correlate one another. In our case, we can hope to leverage the already good correlations we found in the EDA with the kernel trick. The implementation is done with the class svm.SVR from scikit-learn.
Parameters As we show in Appendix A.2, this particular choice of kernel leads to profoundly different behaviour with respect to linear models: we will round the predictions to the next integer in both datasets since the loss function strongly penalises unaligned samples.
In Table 5, we show the choices of the hyperparameters for the models using the Gaussian kernel. As usual the hyperparameter C is connected to the penalty assigned to the samples outside the soft margin boundary (see Appendix A.2) delimited by the . Given the presence of a non linear kernel we have to introduce an additional hyperparameter γ which controls the width of the Gaussian function used for the support vectors.  Table 5: Hyperparameter choices of the SVR regression with Gaussian kernel.

Results
In Table 6, we show the accuracy of the predictions on the test sets. In the favourable dataset, we can immediately appreciate the strong linear dependence of h 1,1 on the number of projective spaces: even though there are a few non favourable embeddings in the dataset, the kernel trick is able to map them in a better representation and improve the accuracy. The predictions for the original dataset have also improved and are the best results we found using shallow learning. The predictions using only the configuration matrix matches [33], but we can slightly improve the accuracy by using a combination of engineered features and PCA.
In Figure 14, we show the residual plots and their histograms for the original dataset: residuals follow peaked distributions which, in this case, do not present a second smaller peak (thus we need to round to the next integer the predictions) and good variate distribution over the predictions.
The Gaussian kernel is also more influenced by the size of the training set. Using 50% of the samples as training set we witnessed a drop in accuracy of 3% while using engineered features and the PCA, and around 1% to 2% in all other cases. The learning curves ( Figure 15) show that the accuracy improves by using more data. Interestingly, it shows that using all engineered features leads to an overfit on the training data since both Hodge numbers reach almost 100%, while this is not the case for h 2,1 . For comparison, we also display in Figure 16 the learning curve for the favourable dataset: this shows that predicting h 1,1 accurately works out-of-the-box.

Decision Trees
We now consider two algorithms based on decision trees: random forests and gradient boosted trees. Decision trees are powerful algorithms which implement a simple decision rule (in the style of an if. . . then. . . else. . . statement) to classify or assign a value to the predictions. However, they have a tendency to adapt too well to the training set and to not be robust enough against small changes in the training data. We consider a generalisation of this algorithm used for ensemble learning: this is a technique in ML which uses multiple estimators (they can be the same or different) to improve the performances. We will present the results of random forests of trees which increase the bias compared to a single decision tree, and gradient boosted decision trees, which can use smaller trees to decrease the variance and learn better representations of the input data by iterating their decision functions and use information on the previous runs to improve (see Appendix A.3 for a more in-depth description).

Random Forests
The random forest algorithm is implemented with Scikit's ensemble.RandomForestRegressor.

Parameters
Hyperparameter tuning for decision trees can in general be quite challenging. From the general theory on random forests (Appendix A.3), we can try and look for partic-   Table 7, we show the hyperparameters used for the predictions. As we can see from n_estimator, random forests are usually built with a small number of fully grown (specified by max_depth and max_leaf_nodes) trees (not always the case, though). In order to avoid overfit we also tried to increase the number of samples necessary to split a branch or create a leaf node using min_samples_leaf and min_samples_split (introducing also a weight on the samples in the leaf nodes specified by min_weight_fraction_leaf to balance the tree). Finally the criterion chosen by the optimisation reflects the choice of the trees to measure the impurity of the predictions using either the mean squared error (mse) or the mean absolute error (mae) of the predictions (see Appendix A.3).   Table 7: Hyperparameter choices of the random forest regression.

Results
In Table 8, we summarise the accuracy reached using random forests of decision trees as estimators. As we already expected, the contribution of the number of projective spaces helps the algorithm to generate better predictions. In general, it seems that the engineered features alone can already provide a good basis for predictions. In the case of h 2,1 , the introduction of the principal components of the configuration matrix also increases the prediction capabilities. As in most other cases, we used the floor function for the predictions on the original dataset and the rounding to next integer for the favourable one. As usual, in Figure 17 we show the histograms of the distribution of the residual errors and the scatter plots of the residuals. While the distributions of the errors are slightly wider than the SVM algorithms, the scatter plots of the residual show a strong heteroscedasticity in the case of the fit using the number of projective spaces: though quite accurate, the model is strongly incomplete. The inclusion of the other engineered features definitely helps and also leads to better predictions. Learning curves are displayed in Figure 18.

Gradient Boosted Trees
We used the class ensemble.GradientBoostingRegressor from Scikit in order to implement the gradient boosted trees.
Parameters Hyperparameter optimisation has been performed using 25 iterations of the Bayes search algorithm since by comparison the gradient boosting algorithms took the longest learning time. We show the chosen hyperparameters in Table 9.
With respect to the random forests, for the gradient boosting we also need to introduce the learning_rate (or shrinking parameter) which controls the gradient descent of the optimisation which is driven by the choice of the loss parameters (ls is the ordinary least squares loss, lad is the least absolute deviation and huber is a combination of the previous two losses weighted by the hyperparameter α). We also introduce the subsample hyperparameter which chooses a fraction of the samples to be fed into the algorithm at each iteration. This procedure has both a regularisation effect on the trees, which should not adapt too much to the training set, and speeds up the training (at least by a very small amount).

Results
We show the results of gradient boosting in Table 10. As usual, the linear dependence of h 1,1 on the number of projective spaces is evident and in this case also produces the best accuracy result (using the floor function for the original dataset and rounding to the next integer for the favourable dataset) for h 1,1 . h 2,1 is once again strongly helped by the presence of the redundant features.
In Figure 19, we finally show the histograms and the scatter plots of the residual errors for the original dataset showing that also in this case the choice of the floor function can be justified and that the addition of the engineered features certainly improves the overall variance of the residuals.

Neural Networks
In this section we approach the problem of predicting the Hodge numbers using artificial neural networks (ANN), which we briefly review in Appendix A.4. We use Google's Tensorflow framework and Keras, its high-level API, to implement the architectures and train the networks [88,89]. We explore different architectures and discuss the results.
Differently from the previous algorithms, we do not perform a cross-validation scoring but we simply retain 10 % of the total set as a holdout validation set (also referred to as    development set) due to the computation power available. Thus, we use 80 % of the samples for training, 10 % for evaluation and 10 % as a test set. For the same reason, the optimisation of the algorithm has been performed manually. We always use the Adam optimiser with default learning rate 10 −3 to perform the gradient descent and a fix batch size of 32. The network is trained for a large number of epochs to avoid missing possible local optima. In order to avoid overshooting the minimum, we dynamically reduce the learning rate both using the Adam optimiser, which implements learning rate decay, and through the callback callbacks.ReduceLROnPlateau in Keras, which scales the learning rate by a given factor when the monitored quantity (e.g. the validation loss) does not decrease): we choose to reduce it by 0.3 when the validation loss does not improve for at least 75 epochs. Moreover, we stop training when the validation loss does not improve during 200 epochs. Clearly, we then keep only the weights of the neural networks which gave the best results. Batch normalization layers are used with a momentum of 0.99.
Training and evaluation were performed on a NVidia GeForce 940MX laptop GPU with 2 GB of RAM memory.

Fully Connected Network
First, we reproduce the analysis from [33] for the prediction of h 1,1 .
Model The neural network presented in [33] for the regression task contains 5 hidden layers with 876, 461, 437, 929 and 404 units ( Figure 20). All layers (including the output layer) are followed by a ReLU activation and by a dropout layer with a rate of 0.2072. This network contains roughly 1.58 × 10 6 parameters.
The other hyperparameters (like the optimiser, batch size, number of epochs, regularisation, etc.) are not mentioned. In order to reproduce the results, we have filled the gap as follows: • Adam optimiser with batch size of 32; • maximal number epochs of 2000 without early stopping; 16 • we implement learning rate reduction by 0.3 after 75 epochs without improvement of the validation loss; • no 1 or 2 regularisation; • a batch normalization layer [106] after each fully connected layer.

Results
We have first reproduced the results from [33], which are summarized in Table 11. The training process was very quick and the loss function is reported in Figure 21. We obtain an accuracy of 77% both on the development and the test set of the original dataset with 80% of training data (see Table 12). Using the same network, we also achieved 97% of accuracy in the favourable dataset.

Convolutional Network
We then present a new purely convolutional network to predict h 1,1 and h 2,1 , separately or together. The advantage of such networks is that it requires a smaller number of parameters and is insensitive to the size of the inputs. The latter point can be helpful to work without padding the matrices (of the same or different representations), but the use of a flatten layer removes this benefit.

Model
The neural network has 4 convolutional layers. They are connected to the output layer with a intermediate flatten layer. After each convolutional layer, we use the ReLU activation function and a batch normalisation layer (with momentum 0.99). Convolutional layers use the padding option same and a kernel of size (5,5) to be able to extract more meaningful representations of the input, treating the configuration matrix somewhat similarly to an object segmentation task [107]. The output layer is also followed by a ReLU activation in order to force the prediction to be a positive number. We use a dropout layer only after the convolutional network (before the flatten layer), but we introduced a combination of 2 and 1 regularisation to reduce the variance. The dropout rate is 0.2 in the original dataset and 0.4 for the favourable dataset, while 1 and 2 regularisation are set to 10 −5 . We train the model using the Adam optimiser with a starting learning rate of 10 −3 and a mini-batch size of 32.
The architecture is more similar in style to the old LeNet presented for the first time in 1998 by Y. LeCun during the ImageNet competition. In our implementation, however, we do not include the pooling operations and swap the usual order of batch normalisation and activation function by first putting the ReLU activation.
In Figure 22, we show the model architecture in the case of the original dataset and of predicting h 1,1 alone. The convolution layers have 180, 100, 40 and 20 units each.

Results
With this setup, we were able to achieve an accuracy of 94% on both the development and the test sets for the "old" database and 99% for the favourable dataset in both validation and test sets (results are briefly summarised in Table 12). We thus improved the results of the densely connected network and proved that convolutional networks can be valuable assets when dealing with the extraction of a good representation of the input data: not only are CNNs very good at recognising patterns and rotationally invariant objects inside pictures or general matrices of data, but deep architectures are also capable of transforming the input using non linear transformations [108] to create new patterns which can then be used for predictions.
Even though the convolution operation is very time consuming, another advantage of CNN is the extremely reduced number of parameters with respect to FC networks. 17 The architectures we used were in fact made of approximately 5.8 × 10 5 parameters: way less than half the number of parameters used in the FC network. Ultimately, this leads to a smaller number of training epochs necessary to achieve good predictions (see Figure 23).
Using this classic setup, we tried different architectures. The network for the original dataset seems to work best in the presence of larger kernels, dropping by roughly 5% in accuracy when a more "classical" 3 × 3 kernel is used. We also tried to use to set the padding to valid, reducing the input from a 12 × 15 matrix to a 1 × 1 feature map over the course of 5 layers with 180, 100, 75, 40 and 20 filters. The advantage is the reduction of the number of parameters (namely ∼ 4.9 × 10 5 ) mainly due to the small FC network at the end, but accuracy dropped to 87%. The favourable dataset seems instead to be more independent of the specific architecture, retaining accuracy also with smaller kernels.
The analysis for h 2,1 follows the same prescriptions. For both the original and favourable dataset, we opted for 4 convolutional layers with 250, 150, 100 and 50 filters and no FC network for a total amount of 2.1 × 10 6 parameters.
In this scenario we were able to achieve 36% of accuracy in the development set and 40% on the test set for h 2,1 in the "old" dataset and 31% in both development and test sets in the favourable set (see Table 12).
The learning curves for both Hodge numbers are given in Figure 24. This model uses the same architecture as the one for predicting h 1,1 only, which explains why it is less accurate as it needs to also adapt to compute h 2,1 -a difficult task, as we have seen (see for example Figure 28).

Inception-like Neural Network
In the effort to find a better architecture, we took inspiration from Google's winning CNN in the annual ImageNet challenge in 2014 [79][80][81]. The architecture presented uses inception modules in which separate 3×3, 5×5 convolutions are performed side by side (together with max pooling operations) before recombining the outputs. The modules are then repeated until the output layer is reached. This has two evident advantages: users can avoid taking a completely arbitrary decision on the type of convolution to use since the network will take care of it tuning the weights, and the number of parameters is extremely restricted as the network can learn complicated functions using fewer layers. As a consequence the architecture of such models can be made very deep while keeping the number of parameters contained, thus being able to learn very difficult representations of the input and producing accurate predictions. Moreover, while the training phase might become very long due to the complicated convolutional operations, the small number of parameters is such that predictions can be generated in a very small amount of time, making inception-like models extremely appropriate whenever quick predictions are necessary. Another advantage of the architecture is the presence of different kernel sizes inside each module: the network automatically learns features at different scales and different positions, thus leveraging the advantages of a deep architecture with the ability to learn different representations at the same time and compare them.
Model In Figure 25, we show a schematic of our implementation. Differently from the image classification task, we drop the pooling operation and implement two side-by-side convolution over rows (12 × 1 kernel for the original dataset, 15 × 1 for the favourable) and one over columns (1 × 15 and 1 × 18 respectively). 18 We use same as padding option. The output of the convolutions are then concatenated in the filter dimensions before repeating the "inception" module. The results from the last module are directly connected to the output layer through a flatten layer. In both datasets, we use batch normalisation layers (with momentum 0.99) after each concatenation layer and a dropout layer (with rate 0.2) before the FC network. 19 For both h 1,1 and h 2,1 (in both datasets), we used 3 modules made by 32, 64 and 32 filters for the first Hodge number, and 128, 128 and 64 filters for the second. We also included 1 and 2 regularisation of magnitude 10 −4 in all cases. The number of parameters was thus restricted to 2.3 × 10 5 parameters for h 1,1 in the original dataset and 2.9 × 10 5 in the favourable set, and 1.1 × 10 6 parameters for h 2,1 in the original dataset and 1.4 × 10 6 in the favourable dataset. In all cases, the number of parameters has decreased by a significant amount: in the case of h 1,1 they are roughly 1 3 of the parameters used in the classical CNN and around 1 6 of those used in the FC network. For training we used the Adam gradient descent with an initial learning rate of 10 −3 and a batch size of 32. The callbacks helped to contain the training time (without optimisation) under 5 hours for each Hodge number in each dataset.

Results
With these architectures, we were able to achieve more than 99 % of accuracy for h 1,1 in the test set (same for the development set) and 50 % of accuracy for h 2,1 (a slightly smaller value for the development set). We report the results in Table 12. 18 Pooling operations are used to shrink the size of the input. Similar to convolutions, they use a window of a given size to scan the input and select particular values inside. For instance, we could select the average value inside the small portion selected, performing an average pooling operation, or the maximum value, a max pooling operation. This usually improves image classification and object detection tasks as it can be used to sharpen edges and borders. 19 The position of the batch normalisation is extremely important as the parameters computed by such layer directly influence the following batch. We however opted to wait for the scan over rows and columns to finish before normalising the outcome to avoid biasing the resulting activation function.   Figure 25: In each concatenation module (here shown for the "old" dataset) we operate with separate convolution operations over rows and columns, then concatenate the results. The overall architecture is composed of 3 "inception" modules made by two separate convolutions, a concatenation layer and a batch normalisation layer (strictly in this order), followed by a dropout layer, a flatten layer and the output layer with ReLU activation (in this order).
We therefore increased the accuracy for both Hodge numbers (especially h 2,1 ) compared to what can achieve a simple sequential network, while at the same time reducing significantly the number of parameters of the network. 20 This increases the robustness of the method and its generalisation properties.
In Figure 27, we show the distribution of the residuals and their scatter plot, showing that the distribution of the errors does not present pathological behaviour and the variance of the residuals is well distributed over the predictions.
In fact, this neural network is much more powerful than the previous networks we considered, as can be seen by studying the learning curves ( Figure 28). When predicting only h 1,1 , it surpasses 97% accuracy using only 30% of the data for training. While it seems that the predictions suffer when using a single network for both Hodge numbers, this remains much better than any other algorithm. It may seem counter-intuitive that convolutions work well on this data since they are not translation or rotation invariant, but only permutation invariant. However, convolution alone is not sufficient to ensure invariances under these transformations but it must be supplemented with pooling operations [1], which we do not use. Moreover, convolution layers do more than just taking translation properties into account: they allow to make highly complicated combinations of the inputs and to share weights among components, which allow to find subtler patterns than standard fully connected layers. This network is more studied in more details in [82].

Boosting the Inception-like Model
To improve further the accuracy of h 2,1 , we have tried to modify the network by adding engineered features as auxiliary inputs. This can be done by adding inputs to the inception neural network and merging the different branches at different stages. There are two possibilities to train such a network: 1) train all the network directly, or 2) train the inception network alone, then freeze its weights and connect it to the additional inputs, training only the new layer. We found that the architectures we tried did not improve the accuracy, but we briefly describe our attempts for completeness. We focused in particular on the number of projective spaces, the vector of dimensions of the projective spaces and the vector of dimensions of the principal cohomology group) and predicting h 1,1 and h 2,1 at the same time. The core of the neural network is the Inception network described in Section 3.6.3. Then, the engineered features are processed using fully connected layers and merged to the predictions from the Inception branch using a concatenation layer. Obviously, output layers for h 1,1 and h 2,1 can be located on different branches, which allow for different processing of the features.
As mentioned earlier, a possible approach is to first train the Inception branch alone, before freezing its weights and connecting it to the rest of the network. This can prevent spoiling the already good predictions and speed up the new learning process. This is a common technique called transfer learning: we can use a model previously trained on a slightly different task and use its weights as part of the new architecture.
Our trials involved shallow fully connected layers (1-3 layers with 10 to 150 units) between the engineered features and after the concatenation layer. Since the EDA analysis (Section 2.3) shows a correlation between both Hodge numbers, we tried architectures where the result for h 1,1 is used to predict h 2,1 .
For the training phase, we also tried an alternative to the canonical choice of optimising the sum of the losses. We first train the network and stop the process when the validation loss for h 1,1 does not longer improve, load back the best weights and save the results, keep training and stop when the loss for h 2,1 reaches a plateau.
With this setup we were able to slightly improve the predictions of h 1,1 in the original dataset, reaching almost 100 % of accuracy in the predictions, while the favourable dataset stayed at around 99 % of accuracy. The only few missed predictions (4 manifolds out of 786 in the test set) are in very peculiar regions of the distribution of the Hodge number. For h 2,1 no improvement has been noticed.

Ensemble Learning: Stacking
We conclude the ML analysis by describing a method very popular in ML competitions [74]: ensembling. This consists in taking several ML algorithms and combining together the predictions of each individual model to obtain a more precise predictions. Using this technique it is possible to decrease the variance and improve generalization by compensating weaknesses of algorithms with strengths of others. Indeed, the idea is to put together algorithms which perform best in different zones of the label distribution in order to combine them to build an algorithm better than any individual component.
The simplest such algorithm is stacking whose principle is summarised in Figure 29. First, the original training set is split in two parts (not necessarily even). Second, a certain number of first-level learners is trained over the first split and used to generate predictions over the second split. Third, a "meta learner" is trained of the second split to combine the predictions from the first-level learners. Predictions for the test set are obtained by applying both level of models one after the other.
We have selected the following models for the first level: linear regression, SVR with the Gaussian kernel, the random forest and the "inception" neural network. The meta-learner is a simple linear regression with 1 regularisation (Lasso). The motivations for the first-level algorithms is that stacking works best with a group of algorithms which work in the most diverse way among them.
Also in this case, we use a cross-validation strategy with 5 splits for each level of the training: from 90 % of total training set, we split into two halves containing each 45 % of the total samples and then use 5 splits to grade the algorithm, thus using 9 % of each split for cross correlation at each iteration) and the Bayes optimisation for all algorithms but the ANN (50 iterations for elastic net, SVR and lasso and 25 for the random forests). The ANN was trained using a holdout validation set containing the same number of samples as each cross-validation fold, namely 9 % of the total set. The accuracy is then computed as usual using numpy.rint for SVR, neural networks, the meta learner and h 1,1 in the original dataset in general, and numpy.floor in the other cases.
In Table 13, we show the accuracy of the ensemble learning. We notice that accuracy improves slightly only for h 2,1 (original dataset) compared to the first-level learners. However, this is much lower than what has been achieved in Section 3.6.3. The reason is that the learning suffers from the reduced size of the training set. Another reason is that the different algorithms may perform similarly well in the same regions.

Discussion
In this paper, we have proved how a proper data analysis can lead to improvements in predictions of Hodge numbers h 1,1 and h 2,1 for CICY 3-folds. Moreover, considering more complex neural networks -in particular, architectures inspired by the Inception model [79][80][81] -allowed us to reach close to 100% accuracy for h 1,1 with much less data and less parameters than in previous works.
While our analysis improved the accuracy for h 2 Figure 29: Stacking ensemble learning with two level learning. The original training set is split into two training folds and the first level learners are trained on the first. The trained models are then used to generate a new training set (here the "1st level labels") using the second split as input features. The same also applies to the test set. Finally a "meta-learner" uses the newly generated training set to produce the final predictions on the test set.
push further our study to improve the accuracy. Possible solutions would be to use a deeper Inception network, find a better architecture including engineered features, and refine the ensembling (for example using StackNet [109]).
Another interesting question to probe is related to representation learning, i.e. finding a better description of the Calabi-Yau. Indeed, one of the main difficulty in making predictions is the redundancy of the possible descriptions of a single manifold. For example, one could try to set up a map from any matrix to its favourable representation (if it exists). Or, on the contrary, one could generate more matrices for the same manifold in order to increase the size of the training set. Another possibility is to use the graph representation of the configuration matrix to which is automatically invariant under permutations [71] (another graph representation has been decisive in [48] to get a good accuracy). Techniques such as (variational) autoencoder [110][111][112], cycle GAN [113], invertible neural networks [114] Table 13: Accuracy of the first and second level predictions of the stacking ensemble for elastic net regression (EN), support vector with rbf kernel (SVR), random forest (RF) and the artificial neural network (ANN) as first level learners and lasso regression as meta learner. neural networks [115,116] or more generally techniques from geometric deep learning [117] could be helpful.
Finally, our techniques apply directly to CICY 4-folds [76,77]. However, there are much more manifolds in this case, such that one can expect to reach a better accuracy for the different Hodge numbers (the different learning curves for the 3-folds indicate that the model training would benefit from more data). We hope to report soon on these issues. Another interesting class of manifolds to explore with our techniques are generalized CICY 3-folds [118].
We leave these questions to future explorations. where w and b are the weights and intercept of the fit. One of the key assumptions behind a linear fit is the independence of the residual error between the predicted point and the value of the model, which can therefore be assumed to be sampled from a normal distribution peaked at the average value [98,99]. The parameters of the fit are then chosen to maximise their likelihood function, or conversely to minimise its logarithm with a reversed sign (the χ 2 function). A related task is to minimise the mean squared error, without assuming a statistical distribution of the residual error: ML for regression usually implements this as loss function of the estimators. In this sense, loss functions for regression are more general than a likelihood approach, but are nonetheless related. For plain linear regression, the associated loss is

A Machine Learning Algorithms
where N is the number of samples and x There are however different versions of possible regularisation which we might add to constrain the parameters of the fit and avoid adapting too well to the training set. In particular we may be interested in adding a 1 regularisation: or the 2 version: Notice that in general we do not regularise the intercept. These terms can be added to the plain loss function to try and avoid large parameters to influence the predictions and to keep better generalisation properties: • add both 1 and 2 regularisation (this is called elastic net): • keep only 1 regularisation (i.e. the lasso regression): • keep only 2 regularisation (ridge regression): The role of the hyperparameter L is to balance the contribution of the additional terms.
For larger values of the hyperparameter α, w (and b) assume smaller values and adapt less to the particular training set.

A.2 Support Vector Machines for Regression
This family of supervised ML algorithms were created with classification tasks in mind [119] but have proven to be effective also for regression problems [120]. Differently from the linear regression, instead of minimising the squared distance of each sample, the algorithm assigns a penalty to predictions of samples x (i) ∈ R F (for i = 1, 2, . . . , N ) which are further away than a certain hyperparameter ε from their true value y, allowing however a soft margin of tolerance represented by the penalties ζ above and ξ below. This is achieved by minimising w, b, ζ and ξ in the function 21 where α (i) , β (i) , ρ (i) , σ (i) ≥ 0 such that the previous expression encodes the constraints and where φ(x (i) ) ∈ R F is a function mapping the feature vector x (i) ∈ R F in a higher dimensional space (F > F ), whose interpretation will become clear in an instant. The minimisation problem leads to such that 0 ≤ α (i) , β (i) ≤ C, ∀ i = 1, 2, . . . , N . This can be reformulated as a dual problem in finding the extrema of α (i) and β (i) in where θ = α − β are called dual coefficients (accessible through the attribute dual_coef_ of svm.SVR in scikit-learn) and K( Notice that the Lagrange multipliers α (i) and β (i) are non vanishing only for particular sets of vectors l (i) which lie outside the ε dependent bounds of (A.10) and operate as landmarks for the others. They are called support vectors (accessible using the attribute support_vectors_ in svm.SVR), hence the name of the algorithm. There can be at most N when ε → 0 + . As a consequence any sum involving α (i) or β (i) can be restricted to the subset of support vectors. Using the kernel notation, the predictions will therefore be where A ⊂ {1, 2, . . . , N } is the subset of labels of the support vectors. In Section 3.4 we consider two different implementations of the SVM algorithm: • the linear kernel, namely the case when K ≡ id and the loss, in the scikit-learn implementation of svm.LinearSVR, can be simplified to without resolving to the dual formulation of the problem.
• the Gaussian kernel (called rbf, from radial basis function) in which From the definition of the loss function (A.9) and the kernels, we can appreciate the role of the main hyperparameters of the algorithm. While the interpretation of ε is straightforward as the margin allowed without penalty for the prediction, γ represents the width of the normal distribution used to map the features in the higher dimensional space. Furthermore, C plays a similar role to the l 2 additional term in (A.8) by controlling the entity of the penalty for samples outside the ε-dependent bound, however its relation to the linear regularisation is α ridge = C −1 , thus C > 0 by definition. Given the nature of the algorithm, SVMs are powerful tools which usually grant better results in both classification and regression tasks with respect to logistic and linear regression, but they scale poorly with the number of samples used during training. In particular the time complexity is at worst 22 O(F × N 3 ) due to the quadratic nature of (A.12) and the computation of the kernel function for all samples: for large datasets (N 10 4 ) they are usually outperformed by ANNs.

A.3 Decision Trees, Random Forests and Gradient Boosting
Decision trees are supervised ML algorithms which model simple decision rules based on the input data [121,122]. They are informally referred to with the acronym CART (from Classification And Regression Trees) and their name descends from the binary tree structure coming from such decision functions separating the input data at each iteration (node), thus creating a bifurcating structure with branches (the different paths, or decisions made) and leaves (the samples in each branch): the basic idea behind them is an if. . . then. . . else structure. In scikit-learn this is implemented in the classes tree.DecisionTreeClassifier and tree.DecisionTreeRegressor.
The idea behind it is to take input samples x (i) ∈ R F (for i = 1, 2, . . . , N ) and partition the space in such a way that data with the same label y (i) ∈ R is on the same subset of samples (while for classification this may be natural to visualise, for regression this amounts to approximate the input data with a step function whose value is constant inside the partition). Let in fact j = 1, 2, . . . , F be a feature and x (i) j the corresponding value for the sample i, at each node n of the tree we partition the set of input data M n into two subsets: n (t j, n ) = M n \ M [1] n (t j, n ), where A n is the full set of labels of the data samples in the node n and t j, n ∈ R is a threshold value for the feature j at node n. The measure of the ability of the split to reach the objective (classifying or creating a regression model to predict the labels) is modelled through an impurity function (i.e. the measure of how often a random data point would be badly classified or how much it would be badly predicted). Common choices in classification tasks are the Gini impurity, a special quadratic case of the Tsallis entropy (which in turn is a generalisation of the Boltzmann-Gibbs entropy, recovered as the first power of the Tsallis entropy) and the information theoretic definition of the entropy. In regression tasks it is usually given by the l 1 and l 2 norms of the deviation from different estimators (mean and median) for each node n: • mean absolute error pred, n (x) , (x (i) , y (i) ) ∈ M n (t j, n ), (A.17) • mean squared error: n (t j, n ) is the cardinality of the set M n (t j, n ) for l = 1, 2 and n ⊂ A n are the subset of labels in the left and right splits (l = 1 and l = 2, that is) of the node n.
The full measure of the impurity of the node n and for a feature j is then: n (t j, n ) |M n | H [1] n (x; t j, n ) + M [2] n (t j, n ) |M n | H [2] n (x; t j, n ), (A.20) from which we select the parameterŝ t j, n = argmin tj, n G n (M n ; t j, n ). (A.21) We then recurse over all M n (t j, n ) (for l = 1, 2) until we reach the maximum allowed depth of the tree (at most |M n | = 1).
Other than just predicting a class or a numeric value, decision trees provide a criterion to assign the importance of each feature appearing in the nodes. The implementation of the procedure can however vary between different libraries: in scikit-learn the importance of a feature is computed by the total reduction in the objective function due to the presence of the feature, normalised over all nodes. Namely it is defined as the difference between the total impurity normalised by the total amount of samples in the node and the sum of the separate impurities of the left and right split normalised over the number of samples in the respective splits, summed over all the nodes. Thus features with a high variable ranking (or variable importance) are those with a higher impact in reducing the loss of the algorithm and can be expected to be seen in the initial branches of the tree. A measure of the variable importance is in general extremely useful for feature engineering and feature selection since it gives a natural way to pick features with a higher chance to provide a good prediction of the labels.
By nature decision trees have a query time complexity of O(log(N )) as most binary search algorithms. However their definition requires running over all F features to find the best split for each sample thus increasing the time complexity to O(F × N log(N )). Summing over all samples in the whole node structure leads to the worst case scenario of a time complexity O(F × N 2 log(N )). Well balanced trees (that is, nodes are approximately symmetric with the same amount of data samples inside) can usually reduce that time by a factor N , but it may not always be the case.
Decision trees have the advantage to be very good at classifying or creating regression relations in the presence of "well separable" data samples and they usually provide very good predictions in a reasonable amount of time (especially when balanced). However if F is very large, a small variation of the data will almost always lead to a huge change in the decision thresholds and they are usually prone to overfit. There are however smart ways to compensate this behaviour based on ensemble learning such as bagging 23 and boosting as well as pruning methods such as limiting the depth of the tree or the number of splits and introducing a dropout parameter to remove certain nodes of the tree. Also random forests of trees provide a variable ranking system by averaging the importance of each feature across all base estimators in the bagging aggregator.
As a reference, random forests of decision trees (ensemble.RandomForestRegressor in scikit-learn) are ensemble learning algorithms based on fully grown (deep) decision trees. They were created to overcome the issues related to overfitting and variability of the input data and are based on random sampling of the training data [123]. The idea is to take K random partitions of the training data and train a different decision tree for each of them and combine the results: for a classification task this would resort to averaging the a posteriori (or conditional) probability of predicting the class c given an input x (i.e. the Bayesan probability P (c|x)) over the K trees, while for regression this amount to averaging the predictions of the trees y (i) {k} pred,n where k = 1, 2, . . . , K andn is the final node (i.e. the node containing the final predictions). This defines what has been called a random forest of trees which can usually help in improving the predictions by reducing the variance due to trees adapting too much to training sets.
Boosting methods are another implementation of ensemble learning algorithms in which more weak learners, in this case shallow decision trees, are trained over the training dataset [124,125]. In general parameterst j, n in (A.21) can be approximated by an expansion are enough to specify the value of t j, n (x) and can be compute by iterating 23 The term bagging comes from the contraction of bootstrap and aggregating: predictions are in fact made over randomly sampled partitions of the training set with substitution (i.e. samples can appear in different partitions, known as bootstrap approach) and then averaged together (aggregating). Random forests are an improvement to this simple idea and work best for decision trees: while it is possible to bag simple trees and take their predictions, using the random subsampling as described usually leads to better performance and results. 24 Different implementations of the algorithm refer to the number of iterations in different way. For instance scikit-learn calls them n_estimators in the class ensemble.GradientBoostingRegressor in analogy to the random forest where the same name is given to the number of trained decision trees, while XGBoost prefers num_boost_rounds and num_parallel_tree to name the number of boosting rounds (the iterations) and the number of trees trained in parallel in a forest. where 0 ≤ ν ≤ 1 is the learning rate which controls the magnitude of the update. Through this procedure, boosted trees can usually vastly improve the predictions of very small decision trees by increasing variance over bias. Another way to prevent overfitting the training set is to randomly subsample the features vector by taking a subset of them (in scikit-learn it is represented as a percentage of the total number of features). Moreover scikit-learn introduces various ways to control the loss of gradient boosting: apart from the aforementioned least squares and least absolute deviation, we can have hybrid versions of these such as the huber loss which combines the two previous losses with an additional hyperparameter α [126]. While more implementations are present, also the boosted trees provide a way to measure the importance of the variables as any decision tree algorithm.

A.4 Artificial Neural Networks
ANNs are a state of the art algorithm in ML. They usually outperform any other algorithm in very large datasets (the size of our dataset is roughly at the threshold) and can learn very complicated decision boundaries and functions 25 . In the main text we used two types of neural networks: fully connected (FC) networks and convolutional neural networks (CNN). They both rely on being built in a layered structure, starting from the input layers (e.g. the configuration matrix of CY manifolds or an RGB image or several engineered features) going towards the output layers (e.g. the Hodge numbers or the classification class of the image).
In FC networks the input of layer l is a feature vector a (i) {l} ∈ R n l (for i = 1, 2, . . . , N ) and, as shown in Figure 20, each layer is densely connected to the following 26 . In other words, each entry of the vectors a (i) {l} j (for j = 1, 2, . . . , n l ) is mapped through a function ψ to all the components of the following layer a {l+1} ∈ R n l+1 : where I ∈ R n l+1 is an identity vector. The matrix W {l} is weight matrix and b {l} is the bias term. The function φ is a non linear function and plays a fundamental role: without it the successive application of the linear map a {l} · W {l} + b I would prevent the network from learning more complicated decision boundaries or functions as the ANN would only be capable of reproducing linear relations. φ is known as activation function and can assume different forms, as long as its non linearity is preserved (e.g. a sigmoid function in the output layer of a network squeezes the results in the interval [0, 1] thus reproducing the probabilities of of a classification). A common choice is the rectified linear unit (ReLU) function φ(z) = ReLU(z) = max(0, z), (A. 28) which has been proven to be better at training deep learning architectures [127], or its modified version LeakyReLU(z) = max(αz, z) which introduces a slope α > 0 to improve the computational performance near the non differentiable point in the origin. CNN architectures were born in the context of computer vision and object localisation [128]. As one can suspect looking at Figure 22 for instance, the fundamental difference with FC networks is that they use a convolution operation K {l} * a (i) {l} instead of a linear map to transform the output of the layers, before applying the activation function 27 . This way the network is no longer densely connected, as the results of the convolution (feature map) depends only on a restricted neighbourhood of the original feature, depending on the size of the kernel window K {l} used and the shape of the input a (i){l} , which is no longer limited to flattened vectors. In turn, its size influences the convolution operator which we can compute: one way to see this is to visualise an image being scanned by a smaller window function over all pixels or by skipping some a certain number of them (the length of the stride of the kernel). In general the output will therefore be different than the input, unless the latter is padded (with zeros usually) before the convolution. The size of the output is therefore: O n = I n − k n + 2p n S n + 1, n = 1, 2, . . . , (A.29) where O is the output size, I the input size, k the size of the kernel used, p the amount of padding (symmetric at the start and end of the axis considered) and S the stride. In the formula, n runs over the number of components of the input tensor. While any padding is possible, we are usually interested in two kinds of possible convolutions: • "same" convolutions for which O n = I n , thus p n = In(Sn−1)−Sn+kn 2 , • "valid" convolutions for which O n < I n and p n = 0.
In both cases the learning process aims to minimise the loss function defined for the task: in our regression implementation of the architecture we used the mean squared error of the predictions. The objective is to find best possible values of weight and bias terms W {l} and b {l} ) or to build the best filter kernel K {l} through backpropagation [129], that is by reconstructing the gradient of the loss function climbing back the network from the output layer to the input and then using the usual gradient descent procedure to select the optimal parameters. For instance, in the case of FC networks we need to find where L is the total number of layers in the network. A similar relation holds in the case of CNN architectures. In the main text we use the Adam [130] implementation of gradient descent and add batch normalisation layers to improve the convergence of the algorithm.
As we can see from their definition, ANNs are capable of learning very complex structures at the cost of having a large number of parameters to tune. The risk of overfitting the training set is therefore quite evident. There are in general several techniques to counteract the tendency to adapt the training set, one of them being the introduction of regularisation (l 2 and l 1 ) in the same fashion of a linear model (we show it in Appendix A.1). Another successful way is to introduce dropout layers [131] where connections are randomly switched off according to a certain retention probability (or its complementary, the dropout rate): this regularisation technique allows to keep good generalisation properties since the prediction can rely in a less incisive way on the particular architecture since which is randomly modified during training (dropout layers however act as the identity during predictions to avoid producing random results).