Calabi-Yau Four/Five/Six-folds as $\mathbb{P}^n_w$ Hypersurfaces: Machine Learning, Approximation, and Generation

Calabi-Yau four-folds may be constructed as hypersurfaces in weighted projective spaces of complex dimension 5 defined via weight systems of 6 weights. In this work, neural networks were implemented to learn the Calabi-Yau Hodge numbers from the weight systems, where gradient saliency and symbolic regression then inspired a truncation of the Landau-Ginzburg model formula for the Hodge numbers of a Calabi-Yau of any dimension constructed in this way. The approximation always provides a tight lower bound, is shown to be dramatically quicker to compute (with computation times reduced by up to four orders of magnitude), and gives remarkably accurate results for systems with large weights. Additionally, complementary datasets of weight systems satisfying the necessary but insufficient conditions for transversality were constructed, including considerations of the interior point, reflexivity, and intradivisibility properties. Overall, this produces a classification of this weight system landscape, further confirmed with machine learning methods. Using the knowledge of this classification, and the properties of the presented approximation, a novel dataset of transverse weight systems consisting of 7 weights was generated for a sum of weights $\leq 200$, producing a new database of Calabi-Yau five-folds, with their respective topological properties computed. Further to this, an equivalent database of candidate Calabi-Yau six-folds was generated with approximated Hodge numbers.


Introduction
Calabi-Yau manifolds have been an epicentre for academic breakthroughs since their conception by the late great Professor Eugenio Calabi [1], some 80 years ago. Amplified by the awarding of a Fields medal for the proof of their existence by Professor Shing-Tung Yau [2], their importance within mathematics and to the mathematical community has since been firmly substantiated. However, beyond their interest in mathematics, these geometries have received notable acclaim within the physics community as well. For self-consistency, in superstring theory, the space-time within which we live must be 10-dimensional in nature; to ensure compatibility with the 4-dimensional space-time we observe, the remaining 6 dimensions must form some compact geometry, of which Calabi-Yau manifolds and their orbifolds are the most prudent and popular candidates [3].
Several of the defining features of Calabi-Yau manifolds are what make them so appropriate for string compactification. Beyond being compact, their Kähler SU(n) holonomy allows them to support the appropriate fields, as fluxes, which can reduce to those seen in the standard model. Moreover, being Ricci-flat in nature, they manifestly satisfy the vacuum Einstein equations desired to incorporate gravity. Under dimensional reduction of a string theory via Calabi-Yau compactification, many properties of the subsequent 4-dimensional theory become directly dependent on the geometry of the Calabi-Yau used, and thus choosing the correct Calabi-Yau becomes paramount to producing a theory that models the universe well.
Unfortunately, the landscape of these geometries is enormous, and its structure largely unknown [4]. Through a variety of construction methods, billions of these geometries have so far been enumerated [5,6], and with numbers at this scale brute-force analysis of the corresponding theories becomes computationally infeasible [7]. Databases of this size hence require statistical methods of analysis to extract meaningful insight, and, inspired by a multitude of successes in other fields, academics have recently been experimenting with the application of techniques from machine learning.
Machine learning is a broadly-used umbrella term for techniques in computational statistics, loosely separated into 3 subfields: supervised, unsupervised, and reinforcement learning [8][9][10]. The first subfield of supervised learning can be considered as advanced techniques in function fitting, requiring both input and output data to fit. The second of unsupervised learning includes more general feature analysis and dimensional reduction, looking at input data on its own. The final subfield is reinforcement learning, which trains an agent to search a space of potential solutions for an optimum.
In particular, supervised methods have been especially amenable to the prediction of Hodge numbers, where expensive and difficult computations can be avoided if statistically confident predictions indicate a candidate geometry is highly unlikely to be relevant for one's desired application. This has been shown in the Calabi-Yau three-fold (i.e. 3 complex dimensional) construction cases of weighted projective spaces [11,28], complete intersections [11,29-34], their generalised cases [35], and via toric varieties [36,37].
Whilst Calabi-Yau three-folds are excellent candidates for superstring compactification from 10 dimensions, superstring theory also has interpretations within its parent theories of M-theory and F-theory, which are 11- and 12-dimensional, respectively. Therefore, to compactify these higher-dimensional theories down to 4 dimensions, higher-dimensional geometries are needed. M-theory compactification requires 7-dimensional manifolds [38], notably $G_2$-manifolds, which exhibit ever more elusive constructions; machine learning in this area has been initiated by recent work considering the related $G_2$-structure geometries, with success in predicting their equivalent Hodge numbers [39]. Alternatively, F-theory compactification requires Calabi-Yau four-folds, where machine learning methods have been effective for the complete intersection construction$^1$ [41,42], for which an exhaustive list has been determined [43,44]; however, machine learning methods have not yet been tested for the other constructions.
Whilst the database of Calabi-Yau four-folds from weighted projective spaces has been constructed [45][46][47], the set of four-folds from the toric variety construction is too large to be enumerated in full [48], despite new work showing machine learning methods can help search this intractable space [49]. Therefore, inspired by an array of successes in machine learning Calabi-Yau three-folds, this work looks to examine the suitability of these methods for the as yet untouched database of Calabi-Yau four-folds built from weighted projective spaces, whilst building on techniques introduced in [28].
This paper begins by detailing the Calabi-Yau construction of interest in §2, followed by analysis of the weight system data and respective topological invariants in §3, with detail on generated complementary datasets. In §4, the machine learning methods used are introduced, followed by their application, results, and interpretation. Central to this work, in §5, is the presentation of an approximation formula for computation of Hodge numbers of Calabi-Yau manifolds constructed via weighted projective spaces, providing a tight lower bound, and enormous improvements in computation time. In §6 this approximation, as well as the other related properties, is used to construct candidate transverse weight systems of 7 and 8 weights with sum of weights up to 200, along with the topological properties of the subsequent Calabi-Yau five- and six-folds respectively. Finally, in §7, results are summarised and outlook applications discussed.
The code for this work was completed in Python, with use of the machine learning libraries scikit-learn [50] and tensorflow [51]; datasets and scripts are made available at this paper's respective repository on GitHub$^2$.

Background
The most natural appearance of Calabi-Yau four-folds is in the context of N = 1 compactification of F-theory to four dimensions, which was studied in the seminal works of [85][86][87], among others.
F-theory emerges upon geometrisation of the axio-dilaton present in Type IIB superstring theory, resulting in a 12-dimensional theory. It was first developed in the seminal work of Vafa [85]. A Calabi-Yau four-fold, which is elliptically fibred, serves as the internal space for the compactification of F-theory to an $N = 1$ supersymmetric theory in four dimensions (see [88], for instance). As usual, the moduli space is determined by the possible deformation families, encoded in the cohomological data, which are the main subject of this work. Moreover, Calabi-Yau four-folds can also appear in the compactification of M-theory to three dimensions, leading to an $N = 2$ supersymmetric theory. The two reductions are linked when the four-fold $X$ is elliptically fibred, as shown in [86]. For a given fibration $E \to X \xrightarrow{\pi} B$, a compactification of M-theory on $X$ coincides with a compactification of F-theory on $X \times S^1$. The set of Calabi-Yau four-folds studied in this paper also includes spaces with negative Euler number, a feature that allows for supersymmetry breaking in M-theory compactification.

The Construction
The Calabi-Yau four-folds considered in this work are constructed as codimension-1 hypersurfaces in compact complex 5-dimensional weighted projective spaces $\mathbb{P}^5_w$. A general $n$-dimensional weighted projective space $\mathbb{P}^n_w$ is defined by considering $\mathbb{C}^{n+1}$, spanned by coordinates $\{z_0, \ldots, z_n\}$, removing the origin to form $\mathbb{C}^{n+1} \setminus \{0\}$, then subjecting it to an identification given by:
$$(z_0, z_1, \ldots, z_n) \sim (\lambda^{w_0} z_0, \lambda^{w_1} z_1, \ldots, \lambda^{w_n} z_n), \tag{2.1}$$
for all non-zero complex numbers $\lambda \in \mathbb{C}^*$. The integers $w_i$ are called weights (hence the name of the construction), and the vector of weights $(w_0, w_1, \ldots, w_n)$ is known as a weight system of $n + 1$ weights. For weight systems to uniquely define weighted projective spaces, the set of weights needs to be coprime, removing the redundancy introduced by rescaling of the identification parameter $\lambda$. There are infinitely many coprime weight systems, each of which thus uniquely defines a weighted projective space. However, not all of these weighted projective spaces will admit Calabi-Yau hypersurfaces; in the cases where they do, the weight system is defined as transverse. For transverse weight systems these Calabi-Yau hypersurfaces are then defined by homogeneous functions of a specific degree, as required to satisfy the defining vanishing first Chern class property necessary to produce a Calabi-Yau.
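As a minimal illustration of the coprimality requirement, the following Python sketch divides a weight system by its greatest common divisor. The function names and the ascending sort are assumptions made for illustration; they are not taken from this paper's repository.

```python
from functools import reduce
from math import gcd


def is_coprime(weights):
    """True if the weights share no common factor greater than 1."""
    return reduce(gcd, weights) == 1


def normalise_weights(weights):
    """Divide out the overall gcd, then sort ascending.

    The sort is a convention assumed here so that the largest weight
    is the final entry of the returned tuple.
    """
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive integers")
    common = reduce(gcd, weights)
    return tuple(sorted(w // common for w in weights))


print(normalise_weights((2, 2, 4, 6, 6, 10)))  # (1, 1, 2, 3, 3, 5)
```

Two weight systems related by an overall rescaling are mapped to the same tuple, so the normalised tuple can serve as a unique label for the ambient space.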
To briefly introduce this, start by assuming we have a transverse weight system such that the hypersurface can avoid the singularities of the ambient $\mathbb{P}^n_w$. This defining hypersurface equation $p = 0$ therefore has no common solutions with its derivative $dp = 0$. Defining $T_{\mathbb{P}}$ as the tangent bundle of the ambient $\mathbb{P}^n_w$, such that the hypersurface submanifold $M$ has respective normal bundle $N$, then $T_{\mathbb{P}} = T_M \oplus N$, allowing the computation of the Chern polynomial for the hypersurface submanifold from that of the ambient space and normal bundle [89]. A tangent space in $T_{\mathbb{P}}$ is a space of vectors $v = v^i \frac{\partial}{\partial z_i}$ which act on functions in the ambient space (i.e. functions of the ambient space's homogeneous coordinates $z_i$). The homogeneous nature of the functions however leads to an identification in this space of vectors $v^i \sim v^i + w_i z_i$, since $\sum_i w_i z_i \frac{\partial}{\partial z_i} f = m f$ for a generic homogeneous function $f$ of degree $m$, reducing the space dimension by 1 as required. The independence of the vectors (except for this identification) leads to a decomposition of the tangent bundle into line bundles $T_{\mathbb{P}} = (\mathcal{O}(w_0) \oplus \mathcal{O}(w_1) \oplus \ldots \oplus \mathcal{O}(w_n))/\mathcal{O}$, with the trivial bundle in the denominator.
The Chern polynomial of these 1-dimensional line bundles is then $c(\mathcal{O}(w_i)) = 1 + w_i \omega$, for $\omega$ the Kähler form of the ambient space, leading to
$$c(T_{\mathbb{P}}) = \prod_i (1 + w_i \omega). \tag{2.2}$$
Whereas, since the degree $d$ hypersurface equation is codimension 1, it can be viewed as a fibre coordinate for $N$, such that $N = \mathcal{O}(d)$, and hence $c(\mathcal{O}(d)) = 1 + d\omega$. Therefore the overall Chern polynomial for the hypersurface submanifold tangent space is
$$c(T_M) = \frac{\prod_i (1 + w_i \omega)}{1 + d\omega},$$
where the first Chern class, $c_1$, can be extracted by expansion of the above, leading to the condition:
$$c_1 = \Big(\sum_i w_i - d\Big)\,\omega,$$
which for the hypersurface to define a Calabi-Yau manifold requires $c_1 = 0$, causing
$$d = \sum_i w_i.$$
Therefore Calabi-Yau manifolds can be constructed as hypersurfaces in weighted projective spaces, where the weights form a transverse weight system and the hypersurface is defined by a homogeneous equation of degree equal to the sum of the weight system weights, $w_{tot} := \sum_i w_i$.
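As a worked check of this degree condition, consider the simplest weight system of 6 weights, $(1,1,1,1,1,1)$, for which the ambient space is ordinary $\mathbb{P}^5$. The expansion below is an illustrative example rather than part of the original derivation; it assumes only the quotient form of the Chern polynomial, $\prod_i (1 + w_i\omega)/(1 + d\omega)$, and the geometric series for the denominator.

```latex
\begin{align*}
c(T_M) &= \frac{(1+\omega)^6}{1+d\,\omega}
        = \left(1 + 6\omega + 15\omega^2 + \dots\right)
          \left(1 - d\,\omega + d^2\omega^2 - \dots\right) \\
       &= 1 + (6 - d)\,\omega + (15 - 6d + d^2)\,\omega^2 + \dots \\[4pt]
c_1 = 0 \;&\Longrightarrow\; d = 6 = \sum_i w_i ,
\qquad
c_2\big|_{d=6} = (15 - 36 + 36)\,\omega^2 = 15\,\omega^2 .
\end{align*}
```

The vanishing of the linear term therefore selects the sextic hypersurface in $\mathbb{P}^5$, in agreement with the general statement that the degree equals the sum of the weights.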

Weight System Properties
Consequently, for the central focus of this work, which is Calabi-Yau four-folds, we are interested in weight systems consisting of 6 weights. For the hypersurface to be Calabi-Yau, the weight system must be transverse, which synonymously in the mathematics literature may also be referred to as quasi-smooth, in that the hypersurface has no additional singularities other than those inherited from the ambient space. The complete list of transverse weight systems of 6 weights was classified in [46], totalling 1100055. Transverse weight systems were first bulk-generated in [89] for the three-fold case of 5-weight weight systems, later extended to the full finite list of 7555 in [90,91]. In general, the number of transverse weight systems was proved to be finite for any dimension in [92]. These constructions, as well as those for four-folds in [46], relied heavily on the use of these hypersurfaces as potentials in Landau-Ginzburg theories [93,94], with limited direct interpretation in terms of the weights. However, in the original construction in [89], a necessary but insufficient condition for transversality was introduced in terms of the weights exclusively. This necessary but insufficient condition for a general dimensional weight system to be transverse is based on divisibility between these weights. We thus dub this property intradivisibility. A weight system is intradivisible iff condition (2.5) holds, namely that each weight can be subtracted from the sum of the weights and the result will be divisible by a weight in the weight system$^3$. This property can be computed from the weights alone, and provides a means of identifying weight systems which are certainly not transverse, namely those where this condition does not hold.
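A divisibility check of this kind is straightforward to sketch in Python. The verbal statement of the condition can be read with the roles of the two weights in either order; the sketch below assumes the reading in which, for every weight $w_i$, there is some weight $w_j$ (possibly $w_i$ itself) such that $w_i$ divides $w_{tot} - w_j$, which is the form that guarantees a monomial $z_i^a z_j$ of degree $w_{tot}$ exists. This choice of reading and the function name are assumptions, and should be checked against the condition as stated in (2.5).

```python
def is_intradivisible(weights):
    """Check a divisibility condition necessary for transversality.

    Assumed reading: for every weight w_i there must exist a weight w_j
    (j may equal i) such that (w_tot - w_j) is a multiple of w_i.
    """
    w_tot = sum(weights)
    return all(
        any((w_tot - w_j) % w_i == 0 for w_j in weights)
        for w_i in weights
    )


print(is_intradivisible((1, 1, 1, 1, 1, 1)))  # True
print(is_intradivisible((1, 1, 1, 1, 7)))     # False: nothing works for w_i = 7
```

Failing the check rules a weight system out immediately, whereas passing it only makes the system a candidate, since the condition is necessary but not sufficient.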
In addition to intradivisibility, another property of a weight system is required for it to be transverse, and this property comes from the more general toric interpretation of the weighted projective spaces$^4$. In this interpretation the ambient $\mathbb{P}^n_w$ are toric varieties, defined by fans in $\mathbb{R}^n$, which can be built from convex lattice polytopes in $\mathbb{Z}^n$ centred on the origin.
A polytope [78] is itself defined by a collection of $d$ hyperplane inequalities such that all points $x \in \mathbb{R}^n$ are in the polytope if $H \cdot x \geq b$ for some defining $d \times n$ matrix $H$ and constant $d$-vector $b$. The constituent parts of the polytope as defined by the intersection of the hyperplanes are the polytope faces, where 0-faces are vertices, 1-faces are edges, and so on up to $(n-1)$-faces which are facets. In the case where the vertices' coordinates are all integers across the polytope, the polytope is a lattice polytope$^5$. If the polytope contains a single lattice point in its strict interior, then the polytope is called interior point, or IP; due to the affine symmetry of the lattice this point can always be shifted to be the origin$^6$. The respective toric fan for a lattice polytope is defined by constructing 1-cones which are lines connecting the origin to each vertex, then extending each line infinitely. The remaining higher cones of the fan are then defined by the intersections of the polytope hyperplanes$^7$. The toric variety [98] is then constructed from the fan through consideration of its respective dual fan: each cone in the fan has a dual cone, which is the set of all points whose inner product with points in the cone produces a non-negative number. The union of all dual cones is the dual fan. Finally, the toric variety is defined as the maximal spectrum of the generators of the dual fan's 1-cones, i.e. taking the dual fan 1-cone generators (vectors in $\mathbb{Z}^n$) and treating their entries as exponents of the coordinates in some $\mathbb{C}^n$, each generator provides a condition on the coordinate ring of $\mathbb{C}^n$, and the resulting spectrum of maximal ideals of this quotient ring defines the toric variety.
From this construction it has been shown how polytopes lead to toric varieties. Despite the series of steps needed to go from the polytope to the variety, a surprising amount about the variety can be deduced from the polytope information alone. An example of this is that for the variety to be compact, the dimension of the polytope must equal the dimension of the lattice it is defined on. In fact, in this vein there is a more direct construction method for the toric variety from the polytope, and one more similar to the weighted projective space construction. Whereas weighted projective spaces are defined through one identification of $\mathbb{C}^{n+1}$ using one weight system as in (2.1), this can be generalised to $k$ identifications of $\mathbb{C}^{n+k}$ using $k$ weight systems. If these weights are selected to be vectors spanning the kernel of the lattice polytope's vertex matrix$^8$ then the generated variety is the same toric variety as that constructed via the dual fan method above. In this way, the weight systems of consideration for weighted projective spaces can be generalised to include combined weight systems (with many weight systems) for toric varieties. The hypersurface equation defining the potential Calabi-Yau in the weighted projective space then becomes a generic hypersurface in the toric variety's anticanonical divisor class [99], with alternative interpretations as non-transverse hypersurfaces in weighted projective spaces [100]. Inverting the process of extracting weights from a polytope, polytopes can also be constructed from (combined) weight systems. However, before introducing this, the definition of a polytope's dual is needed.

$^3$ We note a nomenclature subtlety in [28] where 'transverse' was used to depict a weight system satisfying this property, and 'Calabi-Yau' was used to depict a weight system which can admit a Calabi-Yau hypersurface. In this work, we reserve 'transverse' for the weight systems with Calabi-Yau hypersurfaces where the solutions to the hypersurface equation and its derivative are transverse, and introduce 'intradivisibility' for weight systems satisfying the property of (2.5).
$^4$ These can be alternatively generalised to fake weighted projective spaces [95], but this is another story.
$^5$ Lattice polytopes can also be physically interpreted as toric diagrams of quiver gauge theories [96].
$^6$ In the mathematics literature lattice polytopes with exclusively the origin in the strict interior are called Fano, since they lead to Fano varieties. Where the boundary lattice points are only the polytope's vertices the polytope is terminal Fano, whilst where there are extra boundary lattice points the polytope is canonical Fano [97].
$^7$ To ensure a smooth toric variety the polytope can first be triangulated to resolve the singularities arising from the interior points of the facets.
In a similar way to how a polytope's fan has a dual fan, a polytope has a dual polytope defined as the set of points such that the inner product between any point in the polytope and any point in the dual polytope is $\geq -1$. By definition, the dual of an IP polytope is hence also IP [101]. However, the dual of a lattice polytope is not necessarily lattice, and in the special cases where both a polytope and its dual are lattice the polytopes are denoted as a reflexive pair, both satisfying the reflexivity property. In fact, it is one of the astounding beauties of the toric construction of Calabi-Yau manifolds that the hypersurfaces in toric varieties from dual lattice polytopes are in fact mirror symmetric [99]. A weight system is thus reflexive if the lattice polytope constructed from it is reflexive.
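The definition of the dual can be made concrete in two dimensions, where each edge of a polygon containing the origin corresponds to one vertex $u$ of the dual, fixed by $u \cdot v = -1$ at both endpoints $v$ of the edge. The following Python sketch is a two-dimensional toy only, under the assumptions that the vertices are supplied in cyclic order and that the origin lies in the strict interior; the function names are hypothetical.

```python
from fractions import Fraction


def dual_vertices(polygon):
    """Vertices of the dual of a 2D polygon with the origin in its interior.

    `polygon` is a list of (x, y) integer vertices in cyclic order. Each
    edge (v1, v2) gives the dual vertex u solving u.v1 = -1 and u.v2 = -1,
    obtained here with Cramer's rule in exact rational arithmetic.
    """
    duals = []
    n = len(polygon)
    for k in range(n):
        (x1, y1), (x2, y2) = polygon[k], polygon[(k + 1) % n]
        det = x1 * y2 - y1 * x2
        if det == 0:
            raise ValueError("edge is collinear with the origin")
        duals.append((Fraction(y1 - y2, det), Fraction(x2 - x1, det)))
    return duals


def is_reflexive(polygon):
    """A lattice polygon is reflexive if its dual is also a lattice polygon."""
    return all(c.denominator == 1 for u in dual_vertices(polygon) for c in u)


triangle = [(1, 0), (0, 1), (-1, -1)]  # polygon of the weight system (1, 1, 1)
print(dual_vertices(triangle))
print(is_reflexive(triangle))                     # True
print(is_reflexive([(1, 0), (0, 1), (-3, -4)]))   # False
```

In higher dimensions the same idea applies facet by facet, but enumerating the facets requires a convex hull computation that is beyond this sketch.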
Let us return to the construction of an IP polytope from an IP weight system. Here, the hyperplane equations defining the polytope include an equality for each weight system such that $\sum_i w_i x_i = w_{tot}$, and inequalities defined by $x_i \geq 0 \ \forall i$ [102]. Through this construction, the point $x_i = 1 \ \forall i$ is manifestly contained within the polytope, since it naturally satisfies the equalities and inequalities; note that an affine transformation can set this point to be the origin. This general polytope is hence always IP; however, it may be rational, and we are often more interested in its restriction to a lattice.
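The equality and inequalities above can be explored directly by enumerating the integer points of the real simplex. The Python sketch below is an illustration under stated assumptions: it lists the non-negative integer solutions of $\sum_i w_i x_i = w_{tot}$ by brute-force recursion and picks out those with every coordinate strictly positive. It works with the real simplex only and does not perform the restriction to the coarsest lattice discussed in the text; the function names are hypothetical.

```python
def simplex_lattice_points(weights):
    """Non-negative integer solutions x of sum_i w_i * x_i = w_tot."""
    w_tot = sum(weights)
    points = []

    def extend(prefix, remaining, idx):
        w = weights[idx]
        if idx == len(weights) - 1:
            # The last coordinate is fixed by the equality, if it is integral.
            if remaining % w == 0:
                points.append(tuple(prefix + [remaining // w]))
            return
        for x in range(remaining // w + 1):
            extend(prefix + [x], remaining - x * w, idx + 1)

    extend([], w_tot, 0)
    return points


def strictly_interior(points):
    """Points with all coordinates positive, i.e. off every face x_i = 0."""
    return [p for p in points if all(x > 0 for x in p)]


for ws in [(1, 1, 1), (1, 1, 2), (1, 2, 3)]:
    pts = simplex_lattice_points(ws)
    print(ws, len(pts), strictly_interior(pts))
```

For each of the three weight systems of 3 weights the only strictly interior point returned is $(1,1,1)$, consistent with the statement that the point $x_i = 1$ always lies inside the simplex.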
To be able to then define and check the IP property of a given weight system it suffices to construct the respective polytope, and consider it as existing on the crudest lattice it can (that generated by the polytope's vertices). The dual polytope can then be generated from this, and respectively the dual lattice (all real points whose dot product with every point in the polytope's lattice is an integer); however, the vertices of the dual polytope may not lie on the dual lattice, and thus a restriction is required by taking the convex hull of the dual lattice points that lie within or on this dual polytope. This restriction may slice parts of the dual polytope off, producing a smaller dual polytope, which when taking the dual again will produce a new version of the original polytope, which we define to be the integer polytope of interest, and in doing this the new boundaries may intersect the origin. Therefore the new restricted polytope may no longer contain the origin in its interior, and would thus not be IP [48]. In the cases where the origin does remain in the strict interior the respective lattice polytope is IP, and we define the weight system to be IP too. All real polytopes constructed from weight systems are simplices, since there are as many intersections of the single defining equality with the inequalities as there are lattice dimensions (and also weights); equivalently, those from the larger combined weight systems are the union of simplices. However, the restriction to the relevant lattice as described above may generalise the polytope, causing the lattice polytopes to be unions of simplices also. For the weight system to be transverse there must be no further unavoidable singularities than the origin, and this translates to having no interior points on the polytope facets. It is where this occurs that weight systems can be IP but not transverse.
The importance of the IP property for weight systems comes from [103], where it was shown that any transverse weight system is by necessity IP for any size weight system. However, the converse is not true, and thus overall we have 2 independent necessary but insufficient conditions for a weight system to be transverse: intradivisibility and IP. Beyond this we have another weight system property, reflexivity, whose interrelation with transversality depends on the construction dimension in question [45,102]. Denoting the sets of IP, reflexive, and transverse weight systems of $n$ weights by $IP(n)$, $R(n)$, and $T(n)$ respectively, with their respective sizes as $|IP(n)|$, $|R(n)|$, and $|T(n)|$, the relations between, and frequencies of, weight systems with each property are shown in Table 2.1.
Due to the need for an identification, weight systems are not defined for 1 weight, and since the transverse property requires taking a codimension-1 hypersurface this property is also not defined for weight systems of 2 weights. For 2 weights the single weight system is $(1,1)$, which is equivalent to the single 1-dimensional IP and reflexive polytope with vertices a distance 1 either side of the origin. For 3 weights there are 3 IP weight systems $\{(1,1,1), (1,1,2), (1,2,3)\}$ which are all both reflexive and transverse, corresponding to 3 of the 5 reflexive triangles. Stepping to 4 weights the number grows to 95 [103]. For 5 weights, by contrast, there is no longer an equality between all these sets of weights, with set sizes computed in [90,91,103]. The weight systems of central focus in this work have 6 weights, and it is at this stage that each set of weight systems becomes distinct, where the $IP(6)$ and $R(6)$ sets were computed in [48], and the $T(6)$ set in [46]. Beyond systems with 6 weights the set sizes are unknown, as constructions have not yet been attempted (until this work, as detailed in §6). The intradivisibility property is not believed to have a finiteness bound, which is why it is not included in these count considerations. The interrelation of this property with the others is discussed in more detail in §3.2.

Topological Properties
As previously mentioned, the cohomological data of the Calabi-Yau used in compactification determines the moduli space of the resulting compactified supersymmetric theory. Since Calabi-Yau manifolds are manifestly complex and Kähler [101], their complex nature allows use of the decomposition of the complexified cotangent bundle into holomorphic and antiholomorphic parts via eigenspaces of the complex structure:
$$\Lambda^k T^*_{\mathbb{C}} M = \bigoplus_{p+q=k} \Lambda^p T^{*(1,0)} M \otimes \Lambda^q T^{*(0,1)} M,$$
such that the $(p,q)$-forms are sections of each sum component with vector space $\Omega^{p,q}(M)$. From here the cohomology arises using the decomposition of the exterior derivative operator $d = \partial + \bar{\partial}$, allowing definition of the Dolbeault cohomology groups
$$H^{p,q}_{\bar{\partial}}(M) = \frac{\ker\big(\bar{\partial} : \Omega^{p,q}(M) \to \Omega^{p,q+1}(M)\big)}{\mathrm{im}\big(\bar{\partial} : \Omega^{p,q-1}(M) \to \Omega^{p,q}(M)\big)}.$$
These are defined for all $p$ (arbitrarily they may instead be defined for all $q$ using $\partial$), and the dimension of these groups defines the Hodge numbers
$$h^{p,q} = \dim_{\mathbb{C}} H^{p,q}_{\bar{\partial}}(M).$$
These Hodge numbers may be arranged into a Hodge diamond, where the symmetries of complex conjugation ($h^{p,q} = h^{q,p}$) and Serre duality ($h^{p,q} = h^{n-p,n-q}$ for $\dim_{\mathbb{C}} M = n$) become clear. The Kählerity of the Calabi-Yau manifolds relates these Hodge numbers to the Betti numbers, $b_k$, of the complexified de Rham cohomology, since
$$b_k = \sum_{p+q=k} h^{p,q},$$
then also allowing for the manifold's Euler number, $\chi$, to be computed from these Hodge numbers via $\chi = \sum_k (-1)^k b_k$. Furthermore, for the specialised Kähler case of Calabi-Yau manifolds, there are even further restrictions on these Hodge numbers. One of the defining properties of a Calabi-Yau manifold is a unique holomorphic top form, which then sets $h^{n,0} = 1$, hence also setting to 1 the other Hodge diamond corners (via the conjugation and duality) [104]. Additionally, as the Calabi-Yau manifolds are simply connected, they have trivial fundamental group and therefore also trivial first homology group, setting $h^{1,0} = 0$ and respectively the remaining boundary components of the Hodge diamond [105]. The final Hodge diamond, written here for a four-fold, therefore takes the form:
$$\begin{array}{ccccccccc}
 & & & & 1 & & & & \\
 & & & 0 & & 0 & & & \\
 & & 0 & & h^{1,1} & & 0 & & \\
 & 0 & & h^{1,2} & & h^{1,2} & & 0 & \\
1 & & h^{1,3} & & h^{2,2} & & h^{1,3} & & 1 \\
 & 0 & & h^{1,2} & & h^{1,2} & & 0 & \\
 & & 0 & & h^{1,1} & & 0 & & \\
 & & & 0 & & 0 & & & \\
 & & & & 1 & & & &
\end{array}$$
showing that there remain few non-trivial components for consideration in the subsequent string compactification. For Calabi-Yau four-folds, the non-trivial Hodge numbers are $\{h^{1,1}, h^{1,2}, h^{1,3}, h^{2,2}\}$. In this dimension there exists a further constraint on the Hodge numbers [106], which reads
$$h^{2,2} = 2\,(22 + 2h^{1,1} + 2h^{1,3} - h^{1,2}), \tag{2.10}$$
allowing $h^{2,2}$ to be eliminated from the above list. These Hodge numbers, as well as the Euler number, are of particular interest to physicists; and work characterising and classifying Calabi-Yau manifolds beyond these topological properties has seen insightful early progress [107,108]. For the weighted projective space construction of Calabi-Yau manifolds there are direct formulas for these topological properties from the weights alone [109,110]. Specifically, these formulas (2.11) are expressed in terms of the weights $w_i$, normalised weights $q_i = w_i/w_{tot}$, and $u, v$ as dummy variables of the Poincaré polynomial $Q(u,v) := \sum_{p,q} h^{p,q} u^p v^q$. For $Q(u,v)$, $\theta_i(l)$ is the canonical representative of $l q_i$ in $(\mathbb{R}/\mathbb{Z})^5$, $\mathrm{age}(l) = \sum_{i=0}^{4} \theta_i(l)$, and $\mathrm{size}(l) = \mathrm{age}(l) + \mathrm{age}(w_{tot} - l)$. Note also for $\chi$ that where, $\forall i$, $l q_i$ or $r q_i \notin \mathbb{Z}$, the product takes value 1. These components are reintroduced and explained in more detail in §5.
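To make the bookkeeping for four-folds concrete, the following Python sketch eliminates $h^{2,2}$ and assembles the Betti and Euler numbers from the remaining three Hodge numbers. It assumes the standard four-fold relation $h^{2,2} = 2(22 + 2h^{1,1} + 2h^{1,3} - h^{1,2})$ and the four-fold Hodge diamond with unit corners and vanishing boundary; the function names are hypothetical, and the sextic Hodge numbers used in the example are the well-known literature values rather than values quoted in this paper.

```python
def h22_from_constraint(h11, h12, h13):
    """Eliminate h^{2,2} via h22 = 2(22 + 2*h11 + 2*h13 - h12)."""
    return 2 * (22 + 2 * h11 + 2 * h13 - h12)


def betti_numbers(h11, h12, h13):
    """Betti numbers b_0..b_8 from b_k = sum over p+q=k of h^{p,q}."""
    h22 = h22_from_constraint(h11, h12, h13)
    return [1, 0, h11, 2 * h12, 2 + 2 * h13 + h22, 2 * h12, h11, 0, 1]


def euler_number_from_hodge(h11, h12, h13):
    """Alternating sum of the Betti numbers."""
    betti = betti_numbers(h11, h12, h13)
    return sum((-1) ** k * b for k, b in enumerate(betti))


sextic = (1, 0, 426)  # (h11, h12, h13) of the sextic four-fold in P^5
print(h22_from_constraint(*sextic))      # 1752
print(euler_number_from_hodge(*sextic))  # 2610
```

Substituting the constraint into the alternating sum gives the closed form $\chi = 6\,(8 + h^{1,1} + h^{1,3} - h^{1,2})$, so the Euler number of a four-fold in this class is always a multiple of 6.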
These formulas, as can be seen from the equations (2.11), are especially complicated, requiring factorially many integer divisibility checks as the weights in the weight system increase in value, as well as numerous extremely expensive polynomial divisions. It is with this in mind that this work is motivated to investigate the efficacy of machine learning methods at approximating these formulas in §4, with the aim of distilling physical insight to form a suitable approximation, as discussed in §5; the focus is on the more demanding Poincaré polynomial formula for Hodge numbers, from which the Euler number can be computed.
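The cost of evaluating such formulas can be appreciated from a direct implementation of the Euler number. The Python sketch below assumes the standard orbifold form $\chi = \frac{1}{w_{tot}} \sum_{l,r=0}^{w_{tot}-1} \prod_{i \,:\, l q_i,\, r q_i \in \mathbb{Z}} \left(1 - \frac{1}{q_i}\right)$, with an empty product equal to 1; this explicit form is an assumption and should be checked against (2.11). The function name is hypothetical, and the double loop makes the quadratic growth in $w_{tot}$ explicit.

```python
from fractions import Fraction


def euler_number_from_weights(weights):
    """Euler number from the weights alone (assumed orbifold formula).

    The test l*q_i in Z is done in integers as (l * w_i) % w_tot == 0,
    and each factor (1 - 1/q_i) is the exact rational 1 - w_tot / w_i.
    """
    w_tot = sum(weights)
    total = Fraction(0)
    for l in range(w_tot):
        for r in range(w_tot):
            term = Fraction(1)  # value of the empty product
            for w in weights:
                if (l * w) % w_tot == 0 and (r * w) % w_tot == 0:
                    term *= 1 - Fraction(w_tot, w)
            total += term
    chi = total / w_tot
    return int(chi) if chi.denominator == 1 else chi


print(euler_number_from_weights((1, 1, 1, 1, 1)))     # -200, the quintic three-fold
print(euler_number_from_weights((1, 1, 1, 1, 1, 1)))  # 2610, the sextic four-fold
```

For weights of order $10^5$ this loop already involves of order $10^{10}$ terms, which illustrates why a cheaper approximation is attractive.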

Data Analysis
In this section, the database of transverse 6-vector weight systems, used to construct Calabi-Yau (CY) four-folds via $\mathbb{P}^5_w$ spaces, is analysed from a general data science perspective. Databases of weight systems satisfying different combinations of the considered properties for transversality are then generated and discussed, with further data analysis.

The Four-folds Dataset
Here, the global properties of the primary dataset under investigation are summarised. It was first presented in [46], where some patterns in the arrangement of the Hodge numbers were discussed and illustrated by scatter plots, with further preliminary plots available in [111]. Our work, on the other hand, is a natural extension of the investigations on the analogous manifolds in three complex dimensions, performed in [28]. As such, we focus on the features that are most relevant for machine learning purposes, and we start from the distribution of the invariants, which is shown in Figure 3.1. We observe that, by using the logarithmic scale for the frequency, all histograms display a similar behaviour. The majority of samples is always concentrated around low values, and the ranges span several orders of magnitude.
The key features of the distributions in Figure 3.1 can be summarised in the notation $\mathrm{mean}^{\max}_{\min}$, which we borrow from [44]. The same range and similar mean values of $h^{1,1}$ and $h^{1,3}$ are a hint of mirror symmetry, which is indeed present in this dataset, as noted in [46]; Figure 3.2a provides an illustration of it. This plot should be compared with the famous three-fold version, where $h^{1,1} + h^{1,2}$ is plotted against the Euler number $\chi$ [89]. Quantitatively, the degree of mirror symmetry is around 70%, as reported in [46]. This feature was discovered in generating the set of Calabi-Yau manifolds constructed as hypersurfaces in weighted projective spaces, derived from their embedding within toric varieties, and notably does not apply to the CICY construction [44,112].
Figure 3.2b shows the relation between the two highest Hodge numbers (note that, due to mirror symmetry, this would look almost identical if we were to plot $h^{1,1}$ instead of $h^{1,3}$). The orange line corresponds to $h^{1,1}, h^{1,2} \ll h^{1,3}$, as can be seen from the relation (2.10), and a good amount of data clusters along this line. This feature was noted in [44], where they analysed the less symmetric set of complete intersection Calabi-Yau four-fold Hodge numbers, and found that the data only showed the linear behaviour depicted in the plot (in orange). The distribution of the non-trivial Hodge numbers $h^{1,\bullet}$ is shown in Figure 3.2c. By virtue of (2.10), this contains all of the cohomological information of the manifold. As we expect, the $h^{1,1}$-$h^{1,3}$ plane at $h^{1,2} \approx 0$ displays the mirror-symmetric behaviour shown in Figure 3.2a (with a $45^\circ$ rotation).
Another interesting feature of this dataset is that, analogously to what was observed in [28], an evident linear forking behaviour can be observed in the plot of $h^{1,1}$ against the highest weight of the system. It is shown in Figure 3.3, where the dataset was also partitioned into reflexive and non-reflexive weight systems. This partitioning is discussed in more detail and put in a broader context in §3.2. For now, we just note that for highest weights larger than $\sim 5 \times 10^4$, the $h^{1,1}$ values fall neatly into linear clusters. Motivated by the findings presented in [28], we explore this behaviour of the dataset at hand with similar techniques. As is evident from Figure 3.3, for large weights there are eight peaks in the $h^{1,1}/w_{\max}$ distribution (where the largest weight is the final weight in the weight system, such that $w_{\max} = w_5$). These are linear clusters in the $h^{1,1}$ vs $w_{\max}$ plane, as shown in Figure 3.4a. To neatly illustrate the clusters, we only considered systems with largest weights $w_{\max} \gtrsim 3 \times 10^5$, which can be seen from Figure 3.4b. The gray lines are the clusters obtained via the K-Means algorithm as used in [28], which is now described.

The statistical confidence of a clustering behaviour can be quantified by the inertia measure. Running the K-Means algorithm on an input dataset with a prespecified number of clusters, the cluster centres/means $\mu_C$ are randomly initialised and the datapoints are allocated to the clusters they are closest to. In each cluster, the centre is then updated to the mean of the datapoints allocated to it, after which all datapoints are reallocated to the clusters they are closest to with respect to these new means. This process is iterated until convergence. Given a final set of clusters $C$, with associated means $\mu_C$, across the clustered dataset, which here has inputs $r_i = h^{1,1}/w_{\max}$, the inertia is defined as
$$\mathcal{I} = \sum_{C} \sum_{r_i \in C} (\mu_C - r_i)^2 . \tag{3.2}$$
We are implicitly assuming that any $r_i$ belongs to the cluster whose mean it is closest to. The number of clusters for the problem at hand was found to be 8, deduced by eye from Figure 3.4. Furthermore, two normalised versions of (3.2) may also be introduced, which have a nice statistical interpretation: they are normalised with respect to the number of samples, and with respect to both the number of samples and the range of the samples, respectively. For the clustering analysis of the $h^{1,1}$, $w_{\max}$ data reported in Figure 3.4, the resulting values show that, on average, the ratios $h^{1,1}/w_{\max}$ that we considered lie within 0.0038% of the total range from their nearest cluster centre (for the range shown in Figure 3.4a, which is $\approx 0.25$). These results strongly corroborate the linear clustering behaviour observed.
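The K-Means procedure and the inertia measure described above can be sketched in a few lines. This is a minimal illustration on scalars, not the implementation used for Figure 3.4: the function name `kmeans_1d`, the seed, and the toy ratios are all assumptions made for the example.

```python
import random

def kmeans_1d(values, k, n_iter=100, seed=0):
    """Lloyd's algorithm on scalars; returns (sorted centres, inertia)."""
    rng = random.Random(seed)
    centres = rng.sample(values, k)  # random initialisation from the data
    for _ in range(n_iter):
        # Allocation step: each point joins the cluster with the nearest centre.
        clusters = [[] for _ in range(k)]
        for r in values:
            nearest = min(range(k), key=lambda c: abs(r - centres[c]))
            clusters[nearest].append(r)
        # Update step: each centre moves to the mean of its points
        # (an empty cluster keeps its previous centre).
        new_centres = [sum(c) / len(c) if c else centres[i]
                       for i, c in enumerate(clusters)]
        if new_centres == centres:  # converged: allocations no longer change
            break
        centres = new_centres
    # Inertia: squared distance of every point to its nearest centre.
    inertia = sum(min((r - mu) ** 2 for mu in centres) for r in values)
    return sorted(centres), inertia

# Hypothetical ratios h11 / w_max forming three well-separated groups.
ratios = [0.10, 0.11, 0.12, 0.30, 0.31, 0.32, 0.50, 0.51, 0.52]
centres, inertia = kmeans_1d(ratios, 3)
_, inertia_one = kmeans_1d(ratios, 1)

# Normalised variants: per sample, and per sample and unit of range.
inertia_per_sample = inertia / len(ratios)
inertia_per_sample_range = inertia_per_sample / (max(ratios) - min(ratios))
```

Because the initialisation is random, different seeds can converge to different local optima; comparing the inertia across several restarts is the usual way to select the best clustering.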

Additional Weight Datasets
The exact conditions for transversality of a weight system are derived from the use of the transverse polynomials as potentials of Landau-Ginzburg string vacua [90].These conditions arise from the necessity for the central charge of these theories to be 9 and a subtle application of Bertini's theorem allowing deformation of polynomials to reduce the singularity structure to exclusively an isolated singularity.
The direct combinatoric interpretation of these conditions in terms of exclusively the weights is unclear, and as demonstrated in [89] a first step towards a complete list of necessary and sufficient conditions is provided by the property we dub intradivisibility.Due to the necessary but insufficient nature of this property, whilst all transverse weight systems will satisfy it, there are many examples of weight systems satisfying it which induce further singularities on their subsequent hypersurfaces preventing them from being Calabi-Yau in nature and hence the weight system transverse.
As well as the intradivisibility property, via works in [99,113], there is a further necessary but insufficient property required for a weight system to exhibit a Calabi-Yau hypersurface.This property comes from the interpretation of a weight system as a lattice polytope and respectively the weighted projective space as a compact toric variety.As described in §2, the respective lattice polytope must hence have a single interior point (denoted as the IP property) for the subsequent toric variety to exhibit a Calabi-Yau hypersurface.Additionally, at this dimensionality (n > 4) the IP polytope no longer needs to be reflexive to exhibit a Calabi-Yau hypersurface, relaxing the necessity for this condition which is essential for the Calabi-Yau construction in lower dimensions.
As demonstrated, for higher-dimensional Calabi-Yau constructions, the relative importance of the previous essential properties becomes less clear; as well as their interrelations.Therefore, to graphically represent the dependencies of these properties, a Venn diagram is presented in Figure 3.5a.This in essence classifies the relevant ambient weighted projective spaces, which are defined uniquely by coprime weight systems 9 .Principally, for a 6-vector weight system to be transverse and hence exhibit a Calabi-Yau four-fold hypersurface, it must be both IP and intradivisible 10 .
It is therefore interesting to probe this relative importance amongst the necessary conditions using equivalent datasets of weight systems satisfying different combinations of these properties. Specifically, a dataset of weight systems is constructed for every combination of these properties. The partition of these weight systems is described by starting with coprime weight systems satisfying neither IP nor intradivisibility (CnIPnD); then weight systems satisfying either intradivisibility (DnIP), or IP, which can then be non-reflexive (IPnRnD) or reflexive (IPRnD); then weight systems satisfying both intradivisibility and IP, but still not transverse, hence still split into non-reflexive (DnR) and reflexive (DR); and then, finally, the transverse weight systems exhibiting Calabi-Yau hypersurfaces, which are again either non-reflexive (CYnR) or reflexive (CYR). This gives a full partition of weight systems into subdatasets with respect to these properties. The subdataset labels are chosen as acronyms to reflect the satisfied properties: coprime → C, IP → IP, intradivisible → D, reflexive → R; where relevant, the absence of a property is denoted with an 'n' before the respective property notation (i.e. non-reflexive → nR). This partition is represented on the property interrelation Venn diagram in Figure 3.5b, spanning all unique parts of it.
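The labelling scheme of this partition can be made concrete with a small lookup. The sketch below assumes the four properties have already been computed as boolean flags for a coprime weight system; the function name `partition_label` is hypothetical and is not taken from the paper's code.

```python
from itertools import product

def partition_label(ip, reflexive, intradivisible, transverse):
    """Map the property flags of a coprime weight system to its subdataset label."""
    if reflexive and not ip:
        raise ValueError("a reflexive polytope is necessarily IP")
    if transverse and not (ip and intradivisible):
        raise ValueError("transverse systems must be both IP and intradivisible")
    suffix = "R" if reflexive else "nR"
    if transverse:
        return "CY" + suffix           # CYR / CYnR
    if ip and intradivisible:
        return "D" + suffix            # DR / DnR (IP and D, but not transverse)
    if ip:
        return "IP" + suffix + "nD"    # IPRnD / IPnRnD
    if intradivisible:
        return "DnIP"
    return "CnIPnD"

# Enumerate every consistent combination of flags and collect the labels.
labels = set()
for flags in product([False, True], repeat=4):
    try:
        labels.add(partition_label(*flags))
    except ValueError:
        pass  # inconsistent combination, excluded by the Venn diagram
```

Enumerating the consistent flag combinations recovers exactly the eight subdatasets named in the text.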
In generating these datasets, firstly the database of transverse weight systems of [46] was partitioned into reflexive and non-reflexive to produce the CYR and CYnR subdatasets, respectively. Then the publicly available sample of 1,000,000 5-dimensional 6-vector IP weight systems [48] was partitioned into reflexive and non-reflexive, as well as intradivisible and non-intradivisible, to initiate the respective DR, DnR, IPRnD, IPnRnD subdatasets, explicitly ensuring no overlap with the CYR and CYnR datasets. New to this work, we generate all intradivisible 6-vector weight systems with sum of weights at most 400 (using functionality in this paper's GitHub), partitioning off those which are not in the CY property subdatasets and then checking the IP and reflexivity properties (using PALP functionality [114]) to supplement the DR, DnR, DnIP subdatasets. Finally, coprime weight systems were generated stochastically 11 and checked for intradivisibility and IP. These coprime weight systems were then partitioned into IPRnD, IPnRnD, CnIPnD subdatasets, omitting any which were intradivisible to keep the maximum sum of weights fixed for the DR, DnR, DnIP datasets. These datasets were then combined with the above, and any repeated weight systems removed. This substantially increased the subdataset sizes, producing a final partition with class sizes as shown in Table 3.1.
As can be seen in Table 3.1, the subdatasets are not balanced in size. In some cases this is particularly natural: the CY subdatasets are exhaustive in their partition between CYnR and CYR, including the entire finite list of possibilities. There are likewise finitely many IP weight systems, which can be split amongst the appropriate properties; of these only a sample is publicly available, which we supplement with statistical searches. Conversely, the intradivisibility property is not expected to enforce finiteness on the dataset of satisfying weight systems. Therefore this set has not been generated exhaustively in previous work, and is completed exhaustively here for a sum of weights up to 400 12 . These class sizes are hence well motivated from a viewpoint of exhaustive consideration and analysis, as well as due to computational limitations. Finally, there are infinitely many coprime weight systems satisfying neither intradivisibility nor IP, which we hence sample stochastically until a suitable order of magnitude matching the other class sizes was achieved.
The difference in subdataset sizes provides concrete stochastic information about the overlap of these weight system properties, and one could then crudely infer probabilities of a generic coprime weight system satisfying each property combination using these dataset sizes. The machine learning architectures implemented later can accommodate variable class sizes, as described in §4.2, and appropriate performance measures are used to avoid biased interpretations of the learning.

Principal Component Analysis
Linear behaviour in distributions can be analysed through principal component analysis (PCA).This unsupervised machine learning technique extracts an orthonormal basis for the dataset in question, with basis vectors ranked according to their degree of contribution towards the variance in the data's distribution.The basis is computed as eigenvectors of the dataset's covariance matrix, where the symmetric nature of the matrix ensures real eigenvalues that can be ordered decreasingly and then used to rank the basis.The normalised eigenvalues are named the explained variance, and provide a measure of relative importance of each eigenvector (the larger the explained variance the more important the respective eigenvector).For a prespecified desired degree of representation, a dataset can be projected onto the first i eigenvectors in the ranked basis such that the sum of the respective first i normalised eigenvalues exceeds the desired proportion of representation.In this sense, PCA is often used as a dimensionality reduction technique.
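The construction just described (covariance matrix, leading eigenvector, explained variance) can be sketched without external libraries by using power iteration for the dominant component. The weight vectors below are hypothetical placeholders rather than entries of the datasets, and normalising the eigenvalue by the trace is one common convention, assumed here for illustration.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
            for a in range(d)]

def leading_component(cov, n_iter=1000):
    """Dominant eigenpair of a symmetric matrix via power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # uniform unit start vector
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the converged direction.
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
    return eigenvalue, v

# Hypothetical sorted 6-vectors standing in for weight systems.
weights = [
    [1, 1, 2, 3, 5, 12],
    [1, 2, 3, 4, 10, 20],
    [1, 1, 1, 2, 7, 30],
    [2, 3, 4, 5, 14, 28],
    [1, 2, 2, 5, 9, 40],
]
cov = covariance_matrix(weights)
eigenvalue, direction = leading_component(cov)
explained = eigenvalue / sum(cov[j][j] for j in range(len(cov)))
dominant_index = max(range(len(direction)), key=lambda j: abs(direction[j]))
```

For sorted vectors of this kind the last entry carries most of the variance, so the leading direction is dominated by its final component; a full PCA would repeat the procedure on the deflated matrix or use a standard eigensolver.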
In this work, the union of all subdatasets of 6-vector weight systems was analysed with PCA, as one large dataset, to probe the capacity of linear structure being used for simple classification between the subdatasets of the partition. In this PCA, the explained variances were: (0.999999498, 0.000885369, 0.000405971, 0.000233533, 0.000010095, 0.000001597), demonstrating a clear dominance of the first principal component. Due to the nature of the representation of the weight systems, where the entries are sorted in increasing size, it is expected that the latter parts of the vector will dominate the most significant principal components 13 . This is the case, as shown by the components of the eigenvector for the first principal component: (0.000220159, −0.002178435, 0.007423314, −0.006743665, −0.342499831, 0.939461808).
However, the final two components are still of the same order of magnitude, such that the projection is not trivial. The dominance of the first principal component motivates a 1-dimensional projection of the data, using the above eigenvector. The mean, maximum, and minimum values of the 1-dimensional projections of each subdataset, given to the nearest integer, are reported in (3.5). They display similar lower bounds throughout, with higher mean and maximum values for the non-reflexive subdatasets, and substantially larger ranges for the CY data.
To explore the distributions of these projections, the 1-dimensional projections of each subdataset in the partition were rounded to the nearest integer and plotted according to their frequencies of occurrence in the histogram of Figure 3.6a.
Plotted according to a log-log scale, the distributions show a surprising approximate continuity of the lines.The projection distributions all experience significant overlap in the values they can take, and the overlap of the distribution lines shows that each subdataset experiences regions of the data space where the constituent vectors are distributed similarly to another subdataset (since frequencies are the same in that range of the projections), making classification difficult in each case of comparison.
Interestingly, the distribution of the CY subdataset projections interpolates between those of the IP and D datasets in the region of highest frequency, respecting the overlap behaviour in the full weight system generation, where CY weight systems must be both D and IP. Moreover, in each case where a property's subdataset is split into R and nR, the resulting subdatasets exhibit distributions of a similar shape. The non-reflexive cases then have a higher density of high frequencies, matching their usually more populous subdatasets distributed over smaller ranges, as demonstrated in (3.5).
The similarity between the R and nR subdatasets' PCA projections indicates that classification architectures will likely find identification of this property harder. By contrast, the more highly skewed values of the PCA projection for the CY datasets, as well as their far larger maximum values, may be used by the architectures to aid learning.
In addition, PCA is performed independently for the dataset of transverse CY weight systems (i.e. the union of the CYR and CYnR subdatasets), exhibiting comparable explained variances and dominant normalised eigenvector. The 2-dimensional projection of this data is presented in Figure 3.6b, and shows a similar forking structure to Figure 3.3, equivalently seen for CY three-folds in [28]. Furthermore, as seen in the plots against $h^{1,1}$, the reflexive weight systems dominate the tails of the forks. All these comparisons corroborate the suggested intimately linear relationship between the weights and $h^{1,1}$, priming the data for machine learning application.

Machine Learning
In this section we present the results of various investigations performed through supervised machine learning (ML).Neural networks (NNs) are employed to predict the Hodge numbers of Calabi-Yau four-folds and to identify weight systems with specific properties.These two applications are different in nature, and for this reason, despite using the same NN architecture, some of the meta-data choices differ.
NNs are high-dimensional non-linear function fitters. They are built from constituent neurons which receive a vector input, act linearly on that vector to produce a number, then act non-linearly on that number with an activation function: $\mathbf{x} \mapsto \mathrm{act}(\mathbf{w} \cdot \mathbf{x} + b)$, for NN weights $\mathbf{w}$ (not to be confused with weight system weights $w_i$), bias $b$, and activation $\mathrm{act}(\cdot)$. The neurons are organised into layers, such that the output numbers of each neuron in a layer are concatenated into a vector to pass to all the neurons in the next layer. Over training, the optimiser compares output predictions of the NN function to true values of training data through a loss, updating the $(\mathbf{w}, b)$ parameters to optimise the fitting. After training is complete, the trained NN is used to predict output values on independent test data, from which performance measures can then be calculated [8].
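The action of a single neuron and of a dense layer can be written out directly. The following is a minimal sketch with hand-picked, hypothetical parameters; it illustrates only the forward pass and not the training loop.

```python
def relu(z):
    """Rectified linear unit activation."""
    return max(0.0, z)

def neuron(x, w, b, act=relu):
    """Single neuron: linear action w.x + b followed by the activation."""
    return act(sum(wi * xi for wi, xi in zip(w, x)) + b)

def dense_layer(x, weights, biases, act=relu):
    """A layer concatenates the outputs of its neurons into one vector."""
    return [neuron(x, w, b, act) for w, b in zip(weights, biases)]

# Hypothetical input and parameters for a layer of two neurons.
x = [1.0, 2.0, 3.0]
layer_weights = [[0.5, -1.0, 0.25], [1.0, 0.0, 1.0]]
layer_biases = [0.1, -1.0]
output = dense_layer(x, layer_weights, layer_biases)
```

The first neuron has a negative pre-activation and is switched off by the ReLU, while the second passes its value through unchanged; stacking several such layers gives the full network function.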
For the prediction of cohomological data, which has a very wide range of possible values, a NN regressor was used. Since the input data are small (just six integers), a simple architecture with few layers was enough for this problem. Specifically, we used the built-in multi-layer perceptron regressor from scikit-learn, with the following features: (16, 32, 16) layer structure, ReLU activation, MSE loss, and Adam optimiser. We chose a training-test split of 80:20, and performed a 5-fold cross-validation for each investigation. The batch size was set to 200, and we imposed an upper bound of 250 epochs (the network could stop before that if it reached convergence). Regarding the performance measures, we focused on the following three:
$$R^2 = 1 - \frac{\sum_i (y_{\mathrm{true},i} - y_{\mathrm{pred},i})^2}{\sum_i (y_{\mathrm{true},i} - \bar{y}_{\mathrm{true}})^2} \in (-\infty, \mathbf{1}], \qquad \mathrm{MAE} = \frac{1}{N} \sum_i \left| y_{\mathrm{true},i} - y_{\mathrm{pred},i} \right| \in [\mathbf{0}, \infty), \qquad \mathrm{MAPE} = \frac{1}{N} \sum_i \left| \frac{y_{\mathrm{true},i} - y_{\mathrm{pred},i}}{y_{\mathrm{true},i}} \right| \in [\mathbf{0}, \infty),$$
for outputs $y$, where the bold numbers indicate the optimal values, i.e. those corresponding to perfect prediction.

Conversely, for the identification of weight system properties a NN classifier was employed. This was built and implemented with the same architecture as the regressor (using tensorflow [51]), however changing the loss and performance measures to match the classification problem style. The loss function was categorical cross-entropy, and the performance measures were functions of the confusion matrix. A confusion matrix, $CM_{ij}$, counts the number of test data inputs in class $i$ that the trained NN classifies into class $j$. It can then be normalised, and the performance measures used are the accuracy, $\sum_i CM_{ii} \in [0, \mathbf{1}]$ for the normalised matrix, and the Matthews correlation coefficient, $\mathrm{MCC} \in [-1, \mathbf{1}]$, again where bold values indicate optimal values for perfect learning. Note that whilst accuracy is very interpretable as the proportion of correctly classified inputs, the Matthews correlation coefficient (MCC) is in general a more representative measure, as it accounts for off-diagonal terms and hence generalised Type I and II errors.
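For concreteness, the regression and classification measures named above can be computed as follows. This sketch uses the standard textbook definitions, including the usual multiclass form of the MCC written in terms of row and column sums of the confusion matrix; the function names and toy numbers are assumptions, and the paper's own implementation may differ in detail.

```python
import math

def r2_score(y_true, y_pred):
    """Coefficient of determination; 1 for perfect prediction."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mae(y_true, y_pred):
    """Mean absolute error; 0 for perfect prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error; undefined if any true value is 0."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy(cm):
    """Proportion of correctly classified inputs (rows true, columns predicted)."""
    return sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

def mcc(cm):
    """Multiclass Matthews correlation coefficient from a confusion matrix."""
    k = len(cm)
    total = sum(map(sum, cm))
    correct = sum(cm[i][i] for i in range(k))
    true_counts = [sum(cm[i]) for i in range(k)]
    pred_counts = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    numerator = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    denominator = math.sqrt((total ** 2 - sum(p * p for p in pred_counts))
                            * (total ** 2 - sum(t * t for t in true_counts)))
    return numerator / denominator if denominator else 0.0

# Hypothetical toy data.
y_true, y_pred = [2.0, 4.0, 6.0], [2.0, 5.0, 5.0]
toy_cm = [[5, 1], [2, 4]]
```

The MAPE divides by the true values, which is why it cannot be used for outputs that may vanish.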

Regressing Hodge Numbers
Following the promising performances presented in [28], we employ a supervised ML technique on the Calabi-Yau weight system dataset under investigation. While for the three-folds case both $h^{1,1}$ and $h^{1,2}$ could be learned to high levels of precision, we find that the same is not true for four-folds. This does not come as a surprise, since the underlying geometric structure becomes richer and more complicated by going up in complex dimensions.
In fact, only $h^{1,1}$ is learnt to high precision; the architecture described above proves less adequate for $h^{1,2}$ and $h^{1,3}$. This is partially shown in Table 4.1, where we also observe a trend that is common to all of our findings. We note that the small-weights regime is essentially different from the large-weights regime in terms of ML performance. The neural networks yield consistently better results when restricted to the first half of the dataset, compared to the second half. This suggests that there are some features, associated to the large-weights behaviours, which are harder to learn with our architecture 14 . Moreover, we observe another drop in accuracy when investigating the whole dataset, showing that the NN struggles to deal with these two regimes at once. For reference, we report the properties of the two halves in Table 4.2. In order to probe the performance of ML on the full problem, i.e. determining the complete Hodge diamond, we also focused on learning $h^{2,2}$, both on its own and together with the two Hodge numbers above. As shown in §2, such a triple is enough to contain all the cohomological information. The results of these investigations are shown in Table 4.3. We again see that the accuracy drops from left to right, according to the chosen subset. Finally, for completeness, we also present our results on $h^{1,2}$ and $\chi$ in Table 4.4. Since both of them can be zero, the MAPE measure does not apply to these cases, and therefore we omit it. The fact that higher cohomologies in Calabi-Yau four-folds are harder to learn with neural networks has already appeared in the literature, in [41]. Although they analysed a different construction of four-folds, i.e. complete intersection Calabi-Yau's (CICY), their results also indicate that $h^{1,1}$ is the only Hodge number that can be successfully learnt to high levels of precision with fully connected networks. Convolutional neural network variants have exhibited the highest accuracies on the CICY matrix inputs [32]. However, due to the permutation symmetry of the configuration matrices, as well as of the weight system vectors for the construction considered here, the benefit of the convolutional architecture's focus on local properties is lost; we therefore stick to the more general dense feed-forward architectures.

14 One might think that this is motivated by the fact that the second half of the dataset contains a wider range of cohomological numbers, since it has a wider range of weights. However, this is not the case.

NN Gradient Saliency
Some first steps towards interpretability of these NN results start with gradient saliency analysis. The trained NNs are (highly non-linear) functions from inputs to outputs; differentiating these functions with respect to each of the inputs can give some indication of the dependency of the output on each part of the weight system.
In the saliency analysis performed here, each NN is differentiated with respect to each of the inputs, and the derivative evaluated at each of the test data inputs. The absolute values of these gradient components are then averaged over the test dataset, as well as over the run repetitions, here repeating the investigation with randomised 80:20 train:test splits for 100 independent NNs of the same architecture. Since function scales can vary through the NN layers, the relative saliency values are the features of interest; they are represented, for the NNs predicting $h^{1,1}$, in Figure 4.1. The 6 weights of the input weight systems are represented by 6 boxes, where lighter colours indicate higher saliency values and larger relative importance. These results show that the NNs focus on the weights in each system according to their size. They prioritise the information encoded in the lower weights, while the largest weights seem to play a less important role. This implies that the networks are not exploiting the clustering behaviour shown in Figure 3.4, previously discussed. This is perhaps to be expected if we consider that the vast majority of weights actually lie in the "bulk" of the scatter plots, while the linear behaviour is only evident for systems with extremely large weights.
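The averaging step of this saliency analysis can be illustrated on a toy function. In the sketch below, central finite differences stand in for the automatic differentiation a NN library would provide, and `toy_model` is a hypothetical stand-in for a trained network; only a single model is used, so the additional average over independently trained NNs is not shown.

```python
def saliency(model, inputs, eps=1e-4):
    """Mean absolute partial derivative per input component, rescaled to max 1."""
    d = len(inputs[0])
    totals = [0.0] * d
    for x in inputs:
        for j in range(d):
            up, down = list(x), list(x)
            up[j] += eps
            down[j] -= eps
            # Central finite difference approximating d(model)/d(x_j) at x.
            totals[j] += abs(model(up) - model(down)) / (2 * eps)
    mean = [t / len(inputs) for t in totals]
    top = max(mean)
    return [m / top for m in mean]  # only relative values are meaningful

def toy_model(x):
    """Hypothetical one-neuron 'network' that ignores its third input."""
    return max(0.0, 2.0 * x[0] + 0.5 * x[1])

test_inputs = [[1.0, 2.0, 3.0], [2.0, 1.0, 0.5]]
relative_saliency = saliency(toy_model, test_inputs)
```

The third input has zero saliency because the toy function does not depend on it, mirroring how low saliency values flag inputs that a trained network effectively ignores.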

Symbolic Regression
Whilst NNs have limited interpretability due to the large number of constituent functions being concatenated, there are other methods of supervised learning that are more directly interpretable for extracting mathematical insight.
With the knowledge that NNs can predict well the $h^{1,1}$ values of Calabi-Yau four-folds from the ambient weighted projective space weights alone, there is experimental evidence for approximate formulas directly connecting these integers. Motivated by this, in this section techniques of symbolic regression are implemented via the gplearn library to search for candidate approximation formulas.
Symbolic regression is a method of supervised learning implemented via a genetic algorithm. Initially a basis of functions is provided to the agent; here we restrict ourselves to the standard normal division algebra basis: {+, −, ×, ÷}. Then a population of candidate expressions is randomly initialised as expression trees, where expression trees diagrammatically represent formulas as demonstrated in Figure 4.2. The population of expressions is then evaluated on the training data, with a parsimony factor rewarding simpler expressions, and many of the best performing expressions are selected for breeding by swapping randomly selected subtrees. The output of the breeding is a new population of expressions, which are then randomly mutated in a variety of ways to produce the next generation. This process of evaluation, breeding, and mutation is then iterated for a fixed number of generations, and the best expression is selected from the final generation's population as the output. This output expression is then tested on the test data to produce the final performance measures, here using the same as for the NNs. After 50 generations of 1000 expressions, with the gplearn recommended breeding and mutation factors and a parsimony of 0.8, the final output candidate expressions as well as performance measures for 3 independent runs were as given in Table 4.5.
With the goal of extracting just the first order behaviour for the NN approximate formula, the high parsimony and simple function basis used have limited this regression performance, leading to the expected lower performance scores relative to the NNs. Generalisation to a broader basis with lower parsimony can provide expressions with higher performance, as demonstrated by a run using the full gplearn basis producing an expression with $R^2$ score 0.896. However, given the far higher equation complexity and the relatively minimal increase in performance, the resulting lack of interpretability motivates consideration of the initially specified simple basis, with candidate expressions quoted in Table 4.5. These three independent expressions have similar performance, and some similar structure.
The first thing to note is that in each equation there are three summed terms, which are each positive functions of weights. More specifically, each has a term equal to $w_2$, and another term either equal or proportional to $w_1$. The occurrence of these earlier weights somewhat corroborates the importance of earlier parts of the weight system seen in §4.1.1, however without the $w_0$ factor, which may be related to $w_0$ having a significantly smaller range. In each case there is one further term involving higher weights, and across these expressions all additional weights do occur in this term. Overall, it is quite surprising how well such simple expressions can perform at predicting the $h^{1,1}$ values, and the simple linear sum behaviour does support there being an approximate linear relationship as observed in Figure 3.3.
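To make the notions of expression tree and parsimony-penalised fitness concrete, the sketch below evaluates a hand-written tree over the basis {+, −, ×, ÷}. It does not reproduce gplearn's internals: the nested-tuple encoding, the protected-division threshold, the use of mean absolute error as raw fitness, and the candidate expression itself (which is not one of the expressions of Table 4.5) are all assumptions for illustration.

```python
def evaluate(tree, w):
    """Evaluate an expression tree on a weight system w."""
    if isinstance(tree, int):    # leaf: index i, standing for the weight w_i
        return float(w[tree])
    if isinstance(tree, float):  # leaf: numerical constant
        return tree
    op, left, right = tree       # internal node: binary operation
    a, b = evaluate(left, w), evaluate(right, w)
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    if op == "*":
        return a * b
    if op == "/":
        # Protected division avoids blow-ups for near-zero denominators.
        return a / b if abs(b) > 1e-3 else 1.0
    raise ValueError("unknown operation: " + op)

def length(tree):
    """Number of nodes in the tree, used as the complexity measure."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + length(tree[1]) + length(tree[2])

def fitness(tree, systems, targets, parsimony=0.8):
    """Mean absolute error plus a penalty on tree size (lower is better)."""
    errors = [abs(evaluate(tree, w) - t) for w, t in zip(systems, targets)]
    return sum(errors) / len(errors) + parsimony * length(tree)

# Hypothetical candidate: w_2 + 2 * w_1.
candidate = ("+", 2, ("*", 2.0, 1))
value = evaluate(candidate, [1, 2, 3, 4, 5, 6])
```

A genetic algorithm would maintain a population of such trees, rank them by this penalised fitness, and produce the next generation by swapping and mutating subtrees.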

Classifying CY Property
The generation of weight system subdatasets for each property combination, as described in §3.2, enables the design of ML experiments to distinguish these properties in weight systems. In these cases, the problem is set up as supervised classification, again using the same NN architecture throughout these subinvestigations for consistency and ease of comparison 15 .
To investigate the stability of the partition, a multiclassification investigation is carried out between all 8 subdatasets. Subsequently, a binary classification investigation is carried out to probe the ability of ML architectures to identify each considered property: IP, Intradivisibility, Reflexivity, Transversality (i.e. CY); for each of these, the datasets in each of the 2 classes were formed by taking appropriate unions of the partition subdatasets. To avoid problems caused by unbalanced datasets, during training class weights were fed into the NN such that it is proportionally more rewarded for correctly classifying weight systems in smaller classes; furthermore the MCC performance measure was used, which is known to be unaffected by unbalanced class sizes, and in this sense the MCC is the more appropriate measure of learning. The investigations, with the appropriate unions of the partition subdatasets, as well as class sizes, and finally the averaged learning results over the 5-fold cross-validation, are presented in Table 4.6.

Table 4.6. The first investigation is multiclassification between all 8 partitions of the weight system data: {CnIPnD, DnIP, IPnRnD, IPRnD, DnR, DR, CYnR, CYR}; the remaining investigations are binary classifications between unions of these non-overlapping datasets, as labelled by the index in the stated list of weight system partitions. The class sizes are also given for reference (where the second class exhibits the investigated property); many are approximately balanced classifications, but where they are not the MCC is a more appropriate unbiased measure.
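One common way to build class weights of the kind described is the inverse-frequency heuristic, in which each class contributes equally to the total weighted count. Whether this exact formula was used for the results of Table 4.6 is not stated in the text, so the sketch below, including the function name and the class sizes, is an assumption.

```python
def balanced_class_weights(class_sizes):
    """Inverse-frequency weights: total / (number of classes * class size)."""
    total = sum(class_sizes.values())
    k = len(class_sizes)
    return {label: total / (k * n) for label, n in class_sizes.items()}

# Hypothetical unbalanced binary problem.
sizes = {"without property": 900, "with property": 100}
class_weights = balanced_class_weights(sizes)

# Each class now carries the same total weight in the loss.
weighted_totals = {label: class_weights[label] * n for label, n in sizes.items()}
```

A dictionary of this form is the kind of input typically passed to a training routine as per-class loss weights, so that errors on the smaller class are penalised proportionally more.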
These classification results are all considerably strong. For the multiclassification problem, an untrained NN would have null performance expressed by an accuracy $\sim 0.125$ and MCC $\sim 0$; however, both performance measures are substantially higher than these scores. Therefore, despite the weight systems being generally indistinguishable by eye, the NNs can learn to extract the appropriate property information sufficiently well to classify successfully. Examining further the classification output, the averaged normalised confusion matrix for this multiclassification investigation is given by
$$\begin{pmatrix}
0.084 & 0.000 & 0.012 & 0.004 & 0.000 & 0.000 & 0.009 & 0.000 \\
0.000 & 0.000 & 0.000 & 0.001 & 0.001 & 0.000 & 0.000 & 0.000 \\
0.003 & 0.000 & 0.263 & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 \\
0.001 & 0.000 & 0.000 & 0.249 & 0.006 & 0.000 & 0.005 & 0.000 \\
0.000 & 0.000 & 0.000 & 0.032 & 0.012 & 0.000 & 0.001 & 0.000 \\
0.000 & 0.000 & 0.000 & 0.015 & 0.006 & 0.000 & 0.001 & 0.000 \\
0.009 & 0.000 & 0.001 & 0.024 & 0.005 & 0.000 & 0.185 & 0.002 \\
0.003 & 0.000 & 0.000 & 0.007 & 0.001 & 0.000 & 0.054 & 0.002
\end{pmatrix}$$
to 3 decimal places, where the row is the true class and the column the predicted class. The matrix diagonal represents correctly classified weight systems, and as can be seen the NNs prioritise the first, third, fourth, and seventh classes, such that the coprime weight systems with none of the properties are well distinguished from the IP and CY subdatasets; these are also the most populous classes. The off-diagonal terms are mostly zero, indicating good learning. However, the demands of this multiclassification problem are high: the architectures must learn to identify many quite different properties simultaneously. Despite this, the surprising success motivates the binary classification of each property individually. Each of the binary classification investigations exhibits higher performance measures than multiclassification, indicating that the architectures unsurprisingly perform better when learning one weight system property at a time. Theoretically, these trained NNs could then each be used in turn to identify the properties of a new candidate weight system, and which part of the partition it probably lies in. The benefit of this is that the computation of, particularly, IP and reflexivity becomes especially expensive for larger weight systems, where the respective polytope is large and calculating the dual polytope to check these properties takes increasingly more memory and time. With the trained NNs, candidate weight systems could be fed into these NNs, allowing quick elimination of weight systems unlikely to satisfy these properties. Then the expensive analytic checks can be performed for the filtered weight system database, producing a far higher proportion of weight systems with the desired properties.
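The confusion matrix quoted above can be read more quantitatively by extracting its trace and the per-class fraction of correctly classified inputs. The sketch below uses the entries exactly as printed, in the class order {CnIPnD, DnIP, IPnRnD, IPRnD, DnR, DR, CYnR, CYR}; since those entries are rounded to 3 decimal places, the derived numbers are only approximate, and the variable names are assumptions.

```python
labels = ["CnIPnD", "DnIP", "IPnRnD", "IPRnD", "DnR", "DR", "CYnR", "CYR"]

# Averaged normalised confusion matrix (rows: true class, columns: predicted).
confusion = [
    [0.084, 0.000, 0.012, 0.004, 0.000, 0.000, 0.009, 0.000],
    [0.000, 0.000, 0.000, 0.001, 0.001, 0.000, 0.000, 0.000],
    [0.003, 0.000, 0.263, 0.000, 0.000, 0.000, 0.000, 0.000],
    [0.001, 0.000, 0.000, 0.249, 0.006, 0.000, 0.005, 0.000],
    [0.000, 0.000, 0.000, 0.032, 0.012, 0.000, 0.001, 0.000],
    [0.000, 0.000, 0.000, 0.015, 0.006, 0.000, 0.001, 0.000],
    [0.009, 0.000, 0.001, 0.024, 0.005, 0.000, 0.185, 0.002],
    [0.003, 0.000, 0.000, 0.007, 0.001, 0.000, 0.054, 0.002],
]

# Trace of the normalised matrix: overall proportion classified correctly.
trace = sum(confusion[i][i] for i in range(len(labels)))

# Per-class recall: diagonal entry divided by the row sum (true class share).
recall = {labels[i]: confusion[i][i] / sum(confusion[i])
          for i in range(len(labels))}

# Classes ranked from best to worst recognised.
ranking = sorted(labels, key=lambda name: recall[name], reverse=True)
```

Dividing by the row sums removes the effect of class size, which makes it easier to see which of the less populous classes are absorbed into their neighbours.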
Focusing on the MCC scores, the architectures struggle most with identifying reflexivity, especially in the CY case. Considering the number of steps required to compute this analytically, via construction of the respective lattice polytope, taking the dual polytope, and then performing many integer checks of the corresponding vertices, this is perhaps not surprising. Conversely, the performance for identifying the IP property is surprisingly high, even though this still requires generating the polytope. Therefore it is likely that the NNs can approximate the polytopes within their architectures but struggle with the integer checks of vertices, a property notoriously evasive for ML [115]. Similarly, the NNs can learn the intradivisibility property well; this is a more direct computation with the weight data, although still one with a number of necessary checks. Finally, and most pleasingly, the CY property can be learnt well for four-fold weight systems, a result observed for three-folds in [28]. The ability to so successfully predict the existence of further singularity structure in the respective hypersurfaces, beyond that of the ambient weighted projective space, remains astounding for a task which it is still unclear how to perform directly without using the Landau-Ginzburg string interpretation.

NN Gradient Saliency
The relative saliency values for each of the classification tasks are represented in Figure 4.3.For these investigations the saliency scores only show significant dependence on the input features for the reflexivity identification, where the earlier weights in the sorted weight systems (and hence smaller weights) are more important in determining the classification.This is accentuated for the CY reflexivity investigation.This is likely related to the distribution in Figure 3.3, where reflexive weight systems appear to be skewed towards lower weights.

The Approximation
The computation of the Hodge numbers for Calabi-Yau four-folds as hypersurfaces in weighted projective spaces was performed in [46] via the Landau-Ginzburg model.Such a calculation involves constructing a number of Poincaré-type polynomials and summing their contributions.To do so, one has to perform polynomial multiplications and divisions, which become computationally expensive when the sum of the weights takes large values.This is also the regime where the linear clustering behaviour is more manifest.In the present section, we introduce an approximation for the Hodge numbers, which is well-defined for all Calabi-Yau weight systems, always provides a lower bound, and is significantly faster to compute.
Given these ingredients, the full formula for the Hodge numbers $h^{p,q}$ is the one labelled (5.4). It is evident that the polynomial products and divisions within the square brackets are what take most of the computational resources. Just for reference, we note that for high weights, the software sagemath cannot perform the calculation due to the high number of terms. For this reason, the algorithm had to be hardcoded directly. One way to go about simplifying this formula is to identify for which values of $l$ the terms in the sum contribute the most. Just by empirical observation, we find that the main contributions come from the element of zero age and the ones with maximal size 16 . As we are about to argue in detail, it turns out that including only terms of these types provides a very efficient approximation for the Hodge numbers. It is both very accurate and much faster to implement. Moreover, it bounds the exact results from below. Explicitly, the approximated Hodge numbers $h^{p,q}_A$ are computed via the truncated sum labelled (5.5). We tested this approximation against the two relevant datasets: the Calabi-Yau four-folds considered in this article and the smaller set of Calabi-Yau three-folds. We start by presenting our findings for the latter case 17 .

Table 5.1. This table reports performance measures for the approximation applied to Calabi-Yau three-folds. Specifically, the $R^2$ score, the mean absolute percentage error, the mean absolute error, and the percentage of exact results where the approximation matched the true value.

Application to Calabi-Yau Three-folds
As just mentioned, (5.4) becomes more and more involved to compute as the weights in the system become larger. Since the datasets of Calabi-Yau manifolds in weighted projective spaces are already ordered according to the sum of the weights, we conveniently divided the 7555 three-folds' weight systems (of 5 weights) into 11 groups, from lowest to highest. The plot in Figure 5.1a shows the mean percentage error of the approximation formula (5.5) for each of those groups, both for $h^{1,1}$ and $h^{1,2}$, showing that the approximation becomes more precise as we go to higher weights. For reference, we also include a measure of the mean sum of weights in each of the groups, $\langle 5/w_{tot} \rangle$. We observe that the average percentage error for large weights is remarkably small, lying between 3% and 5% for both Hodge numbers.
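The grouping and error measures described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the figures: the function names `approximation_metrics` and `grouped_mape` are hypothetical, and the numbers in the usage lines are synthetic toy values rather than entries of the Calabi-Yau datasets.

```python
def approximation_metrics(exact, approx):
    """Return MAPE (%), MAE, exact-match rate (%) and a lower-bound flag."""
    pairs = list(zip(exact, approx))
    n = len(pairs)
    mae = sum(abs(e - a) for e, a in pairs) / n
    # The percentage error is only defined where the exact value is non-zero.
    nonzero = [(e, a) for e, a in pairs if e != 0]
    mape = 100 * sum(abs(e - a) / e for e, a in nonzero) / len(nonzero) if nonzero else None
    exact_rate = 100 * sum(e == a for e, a in pairs) / n
    lower_bound = all(a <= e for e, a in pairs)
    return {"mape": mape, "mae": mae, "exact_rate": exact_rate, "lower_bound": lower_bound}


def grouped_mape(w_tot, exact, approx, n_groups):
    """Order systems by sum of weights, split into consecutive groups, return the MAPE of each."""
    order = sorted(range(len(w_tot)), key=lambda i: w_tot[i])
    size = -(-len(order) // n_groups)  # ceiling division: group size
    result = []
    for start in range(0, len(order), size):
        idx = order[start:start + size]
        metrics = approximation_metrics([exact[i] for i in idx], [approx[i] for i in idx])
        result.append(metrics["mape"])
    return result


# Synthetic toy values, for illustration only.
toy_w_tot = [5, 10, 20, 40]
toy_exact = [10, 20, 100, 200]
toy_approx = [8, 18, 99, 200]
print(grouped_mape(toy_w_tot, toy_exact, toy_approx, 2))
print(approximation_metrics(toy_exact, toy_approx))
```

With 7555 systems and `n_groups = 11`, this splitting gives ten groups of 687 systems and a final group of 685.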
The plot of Figure 5.1b shows the ratio of the computational times against the sum of weights $w_{tot}$. As anticipated, the computational time needed to evaluate (5.5) is considerably smaller than the time taken by the full version (5.4). In fact, their ratio reaches values of order $10^{-2}$ for the largest weight systems in the dataset$^{18}$. Some other useful figures for this approximation are reported in Table 5.1. These results show that the approximation, even though it excludes the vast majority of the terms that appear in (5.4), is still able to match the exact values a significant number of times. Moreover, we find another crucial feature of the truncated sum (5.5): as stated in (5.6), it bounds the exact Hodge numbers from below. Thus, since this is also the case for four-folds, the approximation presented in this work offers a quickly accessible tool for extracting tight lower bounds of the Hodge numbers.

As a final feature, we note that the dataset built from the approximated Hodge numbers $h^{1,1}_A$ and $h^{1,2}_A$ correctly reproduces the clustering behaviour observed in [28]. This is best shown with a histogram plot, in Figure 5.2, where we can clearly see various peaks in $h^{1,1}/w_{max}$, corresponding to the slopes of the clustering lines. They overlap almost completely, showing that the clusters are essentially the same for both datasets: the exact Hodge numbers and the ones obtained via our approximation. Consistently with (5.6), the peaks associated to $h^{1,1}_A$ are slightly shifted to the left. Thus, by narrowing down the full formula from the Landau-Ginzburg model (5.4) to a small number of terms, we obtained a much simpler expression, which still reproduces the same behaviour for large weights. This might be a step forward towards the understanding of the linear clustering that characterises the cohomological numbers of Calabi-Yau manifolds in weighted projective spaces.

$^{18}$ One might argue that our implementation of the formula (5.4) could be further optimised, reducing the time needed to compute the Hodge numbers exactly. However, such an optimisation would lead to a quicker implementation of (5.5) as well, so that we do not expect the ratio to change significantly.

Application to Calabi-Yau Four-folds
We now move to the case of four-folds in weighted projective spaces. This dataset is considerably bigger than that of three-folds, with 1100055 spaces, and it contains systems with very large weights. A natural consequence is that the computational times are much longer, which makes the advantages of the approximation even more evident. To give a concrete example, we focused on the millionth weight system in the dataset, which reads [45, 74, 2460, 12792, 17876, 33173]. Our implementation of (5.4) takes roughly 40 hours, while the approximated version (5.5) is computed in 49 seconds. The ratio between the two is 0.00034. Moreover, the approximated results are very accurate: the exact ones are ($h^{1,1} = 10718$, $h^{1,2} = 0$, $h^{1,3} = 986$, $h^{2,2} = 46860$), while the approximated ones read (10683, 0, 986, 46500)$^{19}$.
Regarding the precision of the approximation, we illustrate it in Figure 5.3a, with a similar plot to the one used for three-folds. We omit $h^{1,2}$ from the picture because it can take the value of zero, making the percentage error not well-defined. We sampled randomly and uniformly 20% of the dataset, and then divided it into groups of 2000 samples. As before, all samples were ordered according to the sum of weights before being divided into groups, as shown by the gray points. We observe once again that the percentage error gets smaller as the weights become larger, reaching roughly 1% for the samples with largest weights, for all three Hodge numbers.

The plot in Figure 5.3b, on the other hand, shows the comparison between the computational resources employed by the exact expression from the Landau-Ginzburg model and by our approximation. It is evident from the example just discussed that systems with large weights necessitate a very long computation time for the full formula (5.4). Thus, we collected data within the first half of the dataset, and then extrapolated our findings to the second half, containing very large weights. The red curve provides a good interpolation of the data, and it turns out to give very accurate predictions as well. This can be confirmed by plugging $w_{tot} = 66420$, which corresponds to the example weight system discussed above, into the expression for the best fit function. The result is $t_{app}/t = 0.00035$, which is remarkably close to the actual ratio (0.00034) obtained from the explicit computation.
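The figures quoted for the example weight system can be re-derived with a few lines of arithmetic. The short Python check below uses only numbers stated in the text; the one added ingredient is the standard Calabi-Yau four-fold relation $h^{2,2} = 2(22 + 2h^{1,1} + 2h^{1,3} - h^{1,2})$, which is included here as an independent sanity check on the exact values and is an assumption of this sketch rather than part of the approximation.

```python
# Example weight system and Hodge numbers, as quoted in the text.
weights = [45, 74, 2460, 12792, 17876, 33173]
exact = {"h11": 10718, "h12": 0, "h13": 986, "h22": 46860}
approx = {"h11": 10683, "h12": 0, "h13": 986, "h22": 46500}

w_tot = sum(weights)  # sum of weights, quoted as 66420

# Timing ratio: 49 seconds for the approximation against roughly 40 hours.
ratio = 49 / (40 * 3600)

# The approximation should never exceed the exact value.
lower_bound_holds = all(approx[k] <= exact[k] for k in exact)

# Percentage errors, skipping h12, for which the exact value is zero.
pct_error = {k: 100 * (exact[k] - approx[k]) / exact[k] for k in exact if exact[k] != 0}

# Standard four-fold relation between the Hodge numbers (assumption of this sketch).
h22_from_relation = 2 * (22 + 2 * exact["h11"] + 2 * exact["h13"] - exact["h12"])

print(w_tot, round(ratio, 5), lower_bound_holds)
print({k: round(v, 2) for k, v in pct_error.items()})
print(h22_from_relation == exact["h22"])
```

The percentage errors come out at roughly 0.33% for $h^{1,1}$, 0% for $h^{1,3}$ and 0.77% for $h^{2,2}$, in line with the roughly 1% quoted for the largest weights.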
Finally, let us report the main properties of the approximation. They are shown in Table 5.2, and they are extracted from the same data plotted in Figure 5.3a, i.e. from a set of roughly 220000 random Calabi-Yau weight systems. We end this section with a final remark: analogously to the three-folds case, the approximation also provides a lower bound for four-folds. However, this is trivially satisfied for $h^{1,2}$, since our approximation always yields zero for this case. Summarising, the approximated Hodge numbers never exceed the exact ones.

Therefore, use of this approximation in practical computations of Hodge numbers not only provides a significant speed improvement, but will also always give a lower bound. In designing string effective theories, where the topology of the Calabi-Yau manifold chosen for compactification intrinsically sets many properties of the resulting theory, this approximation allows incompatible manifolds to be confidently and quickly discarded whenever any $h^{p,q}_A$ is larger than the value desired for the theory being built. Additionally, it provides a good approximation for the remaining candidates, allowing them to be sorted prior to a search with the full formula, such that far fewer manifolds need to be checked before finding the correct topology for the desired theory.
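The screening strategy just described can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `screen_candidates`, the labels and the Hodge values are all hypothetical, and ranking the survivors by the total gap to the target is only one possible ordering choice.

```python
def screen_candidates(candidates, target):
    """Discard candidates whose approximated Hodge numbers exceed the target,
    then rank the survivors by how close the approximation is to the target.

    candidates: list of (label, approx_hodge) with approx_hodge a dict of lower bounds.
    target:     dict of desired exact Hodge numbers, with the same keys.
    """
    kept = []
    for label, h_approx in candidates:
        # Since the approximation is a lower bound on the exact value,
        # a lower bound above the target rules the candidate out.
        if any(h_approx[k] > target[k] for k in target):
            continue
        gap = sum(target[k] - h_approx[k] for k in target)
        kept.append((gap, label))
    kept.sort(key=lambda item: item[0])  # smallest gap first
    return [label for _, label in kept]


# Hypothetical labels and values, for illustration only.
toy_candidates = [
    ("A", {"h11": 5, "h13": 10}),
    ("B", {"h11": 3, "h13": 9}),
    ("C", {"h11": 4, "h13": 10}),
]
print(screen_candidates(toy_candidates, {"h11": 4, "h13": 10}))  # ['C', 'B']
```

Only the surviving candidates, in this order, would then be passed to the full formula (5.4).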

Higher Weight systems
The approximation has been tested on both three-folds and four-folds. These are the only two existing datasets of Calabi-Yau manifolds built as hypersurfaces in weighted projective spaces. Here, we present the first efforts towards the understanding of these spaces in higher dimensions, by generating a partial dataset of candidate transverse weight systems of 7 weights, and the respective Calabi-Yau five-folds' Hodge numbers. We discuss how the approximation can be used to quickly extract information about such a dataset, and we additionally generate a first partial dataset of candidate transverse weight systems of 8 weights, then using (5.5) to construct an approximated list of six-folds' Hodge numbers.

Calabi-Yau Five-folds
Calabi-Yau five-folds appear in a number of dimensional reductions in the literature. For instance, it was found that M-theory compactified on a Calabi-Yau five-fold results in an exotic $N = 2$ supersymmetric quantum mechanics [116]. Moreover, Calabi-Yau five-folds play a role in F-theory, where upon compactification, they provide a way to systematically construct $N = (0, 2)$ CFTs [117,118], which may lead to their classification. Therefore, an extended dataset of such Calabi-Yau manifolds would make it possible to explore the landscape of such CFTs. Additionally, in [119], 3-dimensional string vacua with $N = 1$ supersymmetry have been found, which can be interpreted as a compactification of S-theory on a Calabi-Yau five-fold. Despite their role in the construction of low-dimensional theories, examples of five-folds have not been systematically constructed, until the recent effort in [120], which focuses on the CICY construction.

Here, we present a second effort in that direction, i.e. we generate, for the first time, a subset of Calabi-Yau five-folds obtained as hypersurfaces in $\mathbb{P}^6_w$. Specifically, we generate all 7-weight weight systems whose sum of weights satisfies $w_{tot} \leq 200$. To efficiently identify those which have the required property to describe a Calabi-Yau (for more details, see §2), we employ a two-step approach. We first systematically search all partitions of each sum of weights up to 200 as weight systems, extracting those which are coprime, IP$^{20}$, and intradivisible, using our own code. Then we use the approximation as a tool for establishing the Calabi-Yau property and select all the weight systems that are well-defined with respect to (5.5), or equivalently all those such that the polynomial division $\prod_i \frac{(uv)^{q_i} - uv}{1 - (uv)^{q_i}}$ gives no remainder. The candidate weight systems identified this way were then all checked with respect to the exact formula (5.4), and all turned out to be well-defined, yielding the full exact Hodge diamond.

To provide confidence in the generated data, two non-trivial checks on the cohomological data were performed. First, we computed the Euler number from the weights alone with (2.11), and verified that it agrees with the identity in terms of the computed Hodge numbers. Moreover, we also checked that the Hodge numbers satisfy the constraint derived from the Atiyah-Singer index theorem. Both checks were passed by all the weight systems, which describe new Calabi-Yau geometries in complex dimension five. For reference, let us present five examples of such spaces, shown in Table 6.1.

We point out a small difference in definitions between this section and the previous ones. In §5.2 and §5.3, we compared our results with the existing datasets (which can be found at [121]), whose conventions are slightly different from ours. Namely, for Calabi-Yau three-folds, $h^{1,1}$ and $h^{1,2}$ determined using (5.4) have to be exchanged in order to match [121]. Analogously, $h^{1,1}$ and $h^{1,3}$ should be swapped for four-folds to be consistent with the existing list. For the remainder of this paper, we present our results as they are obtained from (5.4).
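The two-step search described above (enumerating partitions, then testing the polynomial division) can be sketched in Python. This is only a rough sketch under stated assumptions, not the code released with this work: it assumes $q_i = w_i/w_{tot}$ and substitutes $t = (uv)^{1/w_{tot}}$, so that each factor becomes $t^{w_i}(1 - t^{w_{tot}-w_i})/(1 - t^{w_i})$ and the monomial prefactor can be dropped; the IP and intradivisibility filters of §2 are omitted, and all function names are hypothetical.

```python
from functools import reduce
from math import gcd


def partitions(total, parts, smallest=1):
    """Yield non-decreasing tuples of `parts` positive integers summing to `total`."""
    if parts == 1:
        if total >= smallest:
            yield (total,)
        return
    for first in range(smallest, total // parts + 1):
        for rest in partitions(total - first, parts - 1, first):
            yield (first,) + rest


def divides_exactly(weights):
    """True if prod_i (1 - t^(d - w_i)) / (1 - t^(w_i)) is a polynomial, with d = sum(weights)."""
    d = sum(weights)
    poly = [1]  # integer coefficients, lowest degree first
    for w in weights:  # multiply by (1 - t^(d - w))
        k = d - w
        shifted = [0] * k + [-c for c in poly]
        poly = [a + b for a, b in zip(poly + [0] * k, shifted)]
    for w in weights:  # divide by (1 - t^w) using q_j = p_j + q_(j - w)
        for j in range(w, len(poly)):
            poly[j] += poly[j - w]
        if any(poly[len(poly) - w:]):
            return False  # non-zero remainder
        poly = poly[:len(poly) - w]
    return True


def candidates(total, parts):
    """Coprime weight systems of the given size and sum that pass the division test."""
    return [w for w in partitions(total, parts)
            if reduce(gcd, w) == 1 and divides_exactly(w)]


print(candidates(5, 5))            # [(1, 1, 1, 1, 1)]
print(divides_exactly((1, 1, 3)))  # False
```

Passing this test is only a necessary condition in this sketch; as in the text, surviving systems would still be checked against the exact formula (5.4).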
Having mentioned this subtlety, we note that a quick consistency check of our results comes straightforwardly from considering the first entry in Table 6.1. This weighted projective space is trivial (i.e. it is not actually weighted), so that it gives rise to the simplest Calabi-Yau five-fold, defined by a degree seven polynomial in $\mathbb{P}^6$. The associated cohomology matches the result reported in the appendix of [122].
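The Euler number check can be illustrated on this simplest example. The sketch below assumes that the weights-only formula referred to as (2.11) is the standard Landau-Ginzburg orbifold expression $\chi = \frac{1}{d} \sum_{l,r=0}^{d-1} \prod_{i \,:\, l q_i,\, r q_i \in \mathbb{Z}} \left(1 - \frac{1}{q_i}\right)$ with $q_i = w_i/d$ and $d = w_{tot}$; the function name `euler_number` is hypothetical, and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction


def euler_number(weights):
    """Euler number from the weights alone, via the orbifold formula assumed above."""
    d = sum(weights)
    total = Fraction(0)
    for l in range(d):
        for r in range(d):
            term = Fraction(1)
            for w in weights:
                # The weight contributes only if both l*q_i and r*q_i are integers.
                if (l * w) % d == 0 and (r * w) % d == 0:
                    term *= 1 - Fraction(d, w)  # 1 - 1/q_i
            total += term
    return total / d


print(euler_number([1] * 5))  # -200, the quintic three-fold
print(euler_number([1] * 6))  # 2610, the sextic four-fold
print(euler_number([1] * 7))  # -39984, the degree seven five-fold in P^6
```

The double sum costs of order $d^2$ evaluations per weight system, which remains cheap for $w_{tot} \leq 200$.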
The global properties of Calabi-Yau five-folds as hypersurfaces in weighted projective spaces, with sum of weights up to 200, are summarised in (6.3). The dataset of the 274730 weight systems with their topological properties is made available on GitHub, presented in the format $[[w_i], w_{tot}, [h^{p,q}], \chi]$, where the $h^{p,q}$ are written in the same order as in (6.3). Some further analysis is shown in Figure 6.1.
As can be guessed by looking at $h^{1,1}$ and $h^{1,4}$, the subset of spaces considered here does not show mirror symmetry. This is because we restricted ourselves to small sums of weights, whose mirror partners lie in the large-weights regime. We expect the cohomological data in that regime to be practically inaccessible, due to the large computational times associated with (5.4). For this reason, we believe that the approximation presented in §5, which proved to be extremely accurate for large weights in the four-folds investigation, could be a key tool for attempting such a task. Moreover, we also expect the list of all possible Calabi-Yau 7-weight weight systems to be astronomical in size. Once again, the truncated formula (5.5) provides a quickly computable tight lower bound for the Hodge numbers of all those yet undiscovered manifolds.

Calabi-Yau Six-folds
While their role in physics is marginal (they could only be employed for compactifications of S-theory), Calabi-Yau six-folds have their own relevance directly within mathematics. These spaces could provide additional information about the (still very mysterious to this day) landscape of Calabi-Yau geometries, as they are the second non-trivial family of Calabi-Yau manifolds in even complex dimensions. Their construction as hypersurfaces of weighted projective spaces involves 8-weight weight systems, which are more numerous and for which bulk computation of exact topological parameters is effectively infeasible. For these reasons, we find the truncated approximation formula to be especially pertinent, allowing computation of approximated Hodge numbers for all the generated candidate transverse weight systems with $w_{tot} \leq 200$. Once again, we first identify the IP intradivisible weight systems, and then select the ones which are well-defined with respect to (5.5), numbering 1482022 candidate transverse weight systems of 8 weights (accessible at this work's GitHub repository) [...] invariants that we computed:

Summary & Outlook
This work was focused on, but not limited to, the analysis of Calabi-Yau four-folds obtained as hypersurfaces in weighted projective spaces. By restricting to systems with large weights, a linear clustering behaviour analogous to the one found for three-folds in [28] was observed and quantitatively corroborated through the K-Means clustering normalised inertia. By gradually relaxing the conditions on the weights, we were able to produce a partition of coprime weight systems according to the most relevant properties: IP, reflexivity, intradivisibility, and transversality (Calabi-Yau), generating datasets for each subset in such a partition. While all of the above was performed using concrete analytic algorithms, statistical machine learning techniques were also applied both to the dataset of Calabi-Yau four-folds and to the partitioned set of more general weight systems. Regarding the former, a fully connected regressor network was shown to predict the cohomological Hodge data, and the Euler number, from the system weights. We found particularly good results, with $R^2 \sim 0.91$, for $h^{1,1}$, on the whole dataset. For the other invariants, we observed very different results for systems with small weights as opposed to systems with large weights. For instance, $h^{1,3}$ and $h^{2,2}$ showed results with $R^2 > 0.90$ for the half of the dataset containing lower weights, while the accuracy dropped significantly for the other half. These three numbers provide sufficient information to determine the full Hodge diamond; however, results were also reported for $h^{1,2}$, which performed poorly since it is zero 48% of the time, and for $\chi$, which showed similar results to $h^{1,3}$ and $h^{2,2}$.
The partition of weight systems according to their respective properties within {IP, reflexive, intradivisible, transverse}, where transversality implies the existence of a Calabi-Yau hypersurface, was classified with the respective fully connected classification architecture. Multiclassification results were surprisingly high between all parts of the partition, reaching MCC scores of 0.740. Separately, binary classification investigations managed to identify each property well from unions of the partition subdatasets, struggling most with reflexivity.
Motivated by the strong performances of the neural networks, and inspired by the interpretability of the gradient saliency analysis and symbolic regression, we explored a simpler truncated version of the formula coming from the Landau-Ginzburg model used for calculating the Calabi-Yau Hodge numbers from the ambient $\mathbb{P}^n_w$'s weight system. This approximation drastically reduces the number of terms involved in the computation, making it easier to study analytically, and substantially faster to compute numerically. Its main features are: it provides a tight lower bound for the Hodge numbers; it is especially accurate for systems with large weights (average MAPE of < 1% for the 10000 systems with largest weights); it is dramatically faster than the exact formula (up to $10^4$ times quicker); it reproduces the observed linear clustering behaviour for large weights.
Finally, motivated by the speed improvements available from this approximation, transverse weight systems (satisfying the necessary intradivisible and IP properties, and well-defined with respect to both the approximation and the exact Landau-Ginzburg formula) were generated for a sum of weights $w_{tot} \leq 200$, for systems of 7 weights producing Calabi-Yau five-folds. Additionally, since the computation time of the exact Landau-Ginzburg formula was infeasible for systems of 8 weights, a complementary dataset of candidate transverse weight systems (satisfying the necessary intradivisible and IP properties, and well-defined with respect to just the approximation) was generated, again for a sum of weights $w_{tot} \leq 200$, leading to candidate Calabi-Yau six-folds. Some preliminary analysis of this data and of the respectively computed topological properties is provided, with a thorough analysis, and its full generation for $w_{tot} > 200$, left for future work.
These datasets, the respective code for analysis and ML, and an example notebook illustrating functionality to check intradivisibility, compute the Euler number, and compute exact and approximated Hodge numbers of an input weight system of any size, are all available at this work's GitHub repository.

Figure 3.1. These plots illustrate the distribution of each of the invariants for Calabi-Yau four-folds in weighted projective spaces. Each bin contains 500 samples.

Figure 3.2. Scatter plots of the Hodge numbers of Calabi-Yau four-folds in weighted projective spaces. Plot (a) illustrates that the spaces are mirror symmetric to a high degree. Plot (b) shows the relation between the two highest Hodge numbers, also compared with the constraint (2.10). Finally, the 3D plot (c) illustrates the relation between the three independent Hodge numbers.

Figure 3.3. This plot of $h^{1,1}$ as a function of the highest weight, $w_{max}$, shows that the linear clustering observed in [28] for Calabi-Yau three-folds in weighted projective spaces is also manifest in the four-folds dataset. Both reflexive and non-reflexive systems display the same behaviour, although the regime $h^{1,1} > 200000$ is dominated by reflexive ones. This is confirmed by the principal component analysis shown in Figure 3.6.

Figure 3.4. These plots focus on the clustering observed in Figure 3.3 for large weights, and show that the clustering analysis correctly reproduces the multi-linear behaviour. The clusters are shown as peaks in the $h^{1,1}/w_{max}$ histogram in (a) and as lines in the $h^{1,1}$ vs $w_{max}$ plane in (b).

Figure 3.5. Venn diagrams displaying (a) the conditional dependencies of the considered 6-weight weight system properties, and (b) the partition of the weight system data into non-overlapping subdatasets.

Table 3.1. The sizes of the subdatasets of weight systems of 6 weights in each part of the partition, along with the means and ranges across all weight values in each subdataset.

Figure 3.6. PCA of the classification partition of weight system data; the explained variance demonstrated a single dominant principal component. In (a) the frequency distributions for the 1-dimensional projections of the weight systems are shown for each part of the partition (on a log-log scale). Equivalently, (b) shows a 2-dimensional projection of the Calabi-Yau data, corroborating the forking behaviour observed.

Figure 4.1. NN gradient saliency scores for the $h^{1,1}$ supervised learning on input weight systems. The lighter colours indicate a larger normalised absolute gradient for that weight in the input 6-vector weight systems, where the saliency scores are averaged over the full test sets of each investigation and each of the 100 repetitions of the investigations.

Figure 4.3. NN gradient saliency scores for the property classification supervised learning on input weight systems. The lighter colours indicate a larger normalised absolute gradient for that weight in the input 6-vector weight systems, where the saliency scores are averaged over the full test sets of each investigation and each of the 100 repetitions of the investigations.

Figure 5.1. These plots summarise the main features of the approximation (5.5), both in terms of accuracy and in terms of computational efficiency, compared with the exact formula (5.4). The data refer to Calabi-Yau three-folds as hypersurfaces in weighted projective spaces.

Figure 5.2. This histogram plot shows that the approximated Hodge numbers also cluster for certain values of the ratio $h^{1,1}/w_{max}$. Only weight systems with $w_{max} > 250$ are plotted here, since this corresponds to the regime where the forking behaviour is most visible. This plot should be compared with the one appearing in [28].

Figure 5.3. These plots summarise the main features of the approximation (5.5), applied to the dataset of Calabi-Yau four-folds as hypersurfaces in weighted projective spaces. The left plot (a) shows the mean accuracy for groups of 2000 weight systems ordered according to their sum of weights. We used 20% of the dataset for this plot, sampled uniformly. The computational efficiency is analysed in the right plot (b), where the computational time of the approximation is compared to the one associated with the exact formula (5.4). We chose roughly 50000 samples randomly from the first half of the dataset (blue dots), and used these data to extrapolate the behaviour for the second half. As discussed in the text, the best fit function shown predicts very accurately the ratios for systems with larger weights.

Table 4.1. This table shows the performances of the fully connected neural network on $h^{1,1}$ and $h^{1,3}$ for the full dataset, the lower half (which contains smaller weights) and the upper half (containing larger weights).
$h^{1,1}$ is still learnt with very high precision and accuracy, while the architecture

Table 4.2. These tables show the ranges of the topological quantities under investigation for the two halves of the dataset, and their mean value. The dataset is ordered according to the sum of weights (see bottom of the right table), and not according to any of the cohomological properties. Hence, we see that the range of the invariants does not split into two disjoint sets among the two halves.

Table 4.3. This table shows the performances of the fully connected neural network on $h^{2,2}$ and on the triple ($h^{1,1}$, $h^{1,3}$, $h^{2,2}$), which specifies all the cohomological information. Again, we report results associated to the full dataset, to the lower half only (which contains smaller weights) and to the upper half only (containing larger weights).

Table 4.4. This table shows the performances of the fully connected neural network on $\chi$ and on $h^{1,2}$. Both invariants can be zero, so we omit the MAPE measure, which is not well-defined.

Table 4.5. Candidate expressions for $h^{1,1}$ as functions of the 6 weights $(w_0, w_1, w_2, w_3, w_4, w_5)$ in the input transverse weight systems from independent symbolic regression runs, with respective performance measures.

Table 4.6. Classification results for various partitions of the weight system data. The table shows the mean Accuracy and MCC scores, to 3 decimal places, with standard error, across the 5 cross-validation runs, for the respective investigations labelled by the property being distinguished.

Table 5.2. This table reports performance measures for the approximation applied to Calabi-Yau four-folds. Specifically, the $R^2$ score, the mean absolute percentage error, the mean absolute error, and the percentage of exact results where the approximation matched the true value.

Table 6.1. Examples of weight systems with 7 weights, describing Calabi-Yau five-folds, together with the associated invariants.