Non-classical nucleation pathways in stacking-disordered crystals

The nucleation of crystals from the liquid melt is often characterized by a competition between different crystalline structures or polymorphs, and can result in nuclei with heterogeneous compositions. These mixed-phase nuclei can display nontrivial spatial arrangements, such as layered and onion-like structures, whose composition varies according to the radial distance, and which so far have been explained on the basis of bulk and surface free-energy differences between the competing phases. Here we extend the generality of these non-classical nucleation processes, showing that layered and onion-like structures can emerge solely based on structural fluctuations even in absence of free-energy differences. We consider two examples of competing crystalline structures, hcp and fcc forming in hard spheres, relevant for repulsive colloids and dense liquids, and the cubic and hexagonal diamond forming in water, relevant also for other group 14 elements such as carbon and silicon. We introduce a novel structural order parameter that combined with a neural network classification scheme allows us to study the properties of the growing nucleus from the early stages of nucleation. We find that small nuclei have distinct size fluctuations and compositions from the nuclei that emerge from the growth stage. The transition between these two regimes is characterized by the formation of onion-like structures, in which the composition changes with the distance from the center of the nucleus, similarly to what seen in two-step nucleation process.


I. INTRODUCTION
Nucleation is a discontinuous transition in which clusters of molecules self assemble due to fluctuations that are very localized in space and time to form a growing nucleus. It is a crucial phenomenon in many fields of natural science [1][2][3], going from planetary-to nanoscale. During the nucleation process of many materials, including several metals, minerals and polymers, different crystalline phases, called polymorphs, can nucleate. The structure of the growing nucleus in such materials can depend on many, eventually size-dependent [4,5], effects, such as energy and entropy competition, or frustration. Understanding the selection mechanism of polymorphs is fundamental to predict the structure of the growing nucleus, with applications ranging from Earth's weather and climate forecast, especially in relation to the formation of nanometer-sized ice crystallites in clouds [6][7][8][9][10][11], to the pharmaceutic industry, where the physical and chemical properties of the drug molecules can change with the eventual crystallization of unwanted polymorph forms [12]. For example, the molecule for aspirin (acetylsalicylic acid), one of the most widely consumed medications, has two polytypic crystalline forms [13].
Here we study the nucleation of polytypes, a specific type of polymorphs where the crystalline structures have the same projection along a specific direction, and only differ in the way the planes perpendicular to that di- * fabio.leoni@uniroma1.it † john.russo@uniroma1.it rection are stacked onto each other. Some of the most common crystalline structures formed in metals are polytypic, notably the hcp (hexagonal close packed) and fcc (face centered cubic) crystalline structures, and the hexagonal and cubic diamond forms.
We consider the formation of polytypes in two important systems: the hard-sphere (HS) model and the coarse-grained mW model of water. They are representative of a wide class of materials, repulsive colloids and dense liquids [14,15] for HS, and tetrahedrally-bonded materials (like water and group 14 elements such as carbon and silicon) for mW [1,16]. They crystallize in two different polytypes: either fcc or hcp for HS, and either cubic ice (I c ) or hexagonal ice (I h ) for water. Importantly, in both cases the difference in all thermodynamic relevant quantities (such as free-energy difference, nucleation barrier, and solid/melt surface tension) between the competing polytypes are negligibly small (within 10 −3 k B T per particle for all cases) [1,5,[17][18][19][20][21]. For example, in the mW water model [16] the stacking fault between the ice I c and I h has been estimated as low as 0.16 ± 0.05 mJ m −2 at T = 218 K [4]. In this way, the nucleation mechanisms for both systems are determined not by bulk free energy properties, or by details of their interactions, but by general principles, which we aim at elucidating in the present work.
One of the main difficulties in studying polymorph composition is assigning the local environments surrounding a particle to a particular phase, distinguishing between amorphous (liquid) structures and crystalline ones [22]. Several methods for local structure identification have been developed so far. Contrary to common belief, the method employed to classify a single particle as belonging to a specific polymorph can sensibly alter the measured composition of the nucleus [23,24]. In the present work we compare some of the more representative methods found in the literature and introduce new methods which allow us to find fundamental properties of the nucleus growing during the early stage of homogeneous nucleation, and in particular to find evidences of a two-step nucleation (TSN) pathway.
Two-step nucleation mechanisms involve one disordered phase (the melt) and (at least) two ordered crystalline phases. Two-step nucleation often produces layered structures, where the composition changes radially within the nucleus, in such a way that the most stable polymorph form is closer to the centre, and is "wetted" by the metastable form on the surface. This structure, often referred to as the onion structure, is one of the hallmarks of two-step nucleation pathways, that have been theoretically predicted via classical nucleation theory (CNT) [25,38], density functional theory [39][40][41][42], phase-field models [43], two-dimensional lattice models [44], and molecular simulations [45].
These two-step nucleation pathways are generally explained via free energy differences between the two crystalline phases, and in particular with the different surface free energy of the crystals with respect to the melt [44]. Instead, in the systems under consideration the polytypes in competition have the same bulk free energy properties, and classical theory would predict in this case a homogeneous composition of the nuclei. We will observe that, due to finite-size fluctuation effects, onion-like structures are formed also under these conditions and that well separated free energy channels, corresponding to the competing polymorphs, can be distinguished, extending the phenomenology of structured nuclei to this large family of crystals.
The outline of the article is the following: in Sec. II we describe the methods for local structure identification employed in the present work to study the properties of nuclei forming during the homogeneous nucleation of HS and mW water. In Sec. III we describe the model systems we simulated, HS and mW water. In Sec. IV we compare the properties of nuclei as obtained by using the methods described in Sec. II. Finally, in Sec. V we present concluding remarks.

A. Order parameters
Common widespread methods used in the literature to identify local structures usually employ one-or two-dimensional order parameters (OP) maps, which involve the comparison of the local environment of a particle with different reference structures. For this reason, thresholds are usually introduced to establish which reference structure the particle under investigation belongs to. Steinhardt or bond orientational order (BOO) parameters in their averaged formq l [46] (see Appendix A) are the standard choice as OP, where four-fold (l = 4) and sixfold (l = 6) are often the only symmetries considered. Other methods involve the study of topological properties of the bond network, such as the Common neighbor Analysis (CNA) method.For water-like systems, the CNA method considers also second-nearest neighbours, and is named Ext-CNA. In Appendix A we describe some of the most representative low-dimensional OP employed in the present study for local structure identification (for a comprehensive review on common OP see Ref. [22]).We also present some tests aimed at determining the accuracy of the different methods in controlled situations (Appendix A 8). Since previous low-dimensional OP have produced different results when applied to the model systems studied here [5,23,24,[47][48][49][50][51], we consider a highdimensional OP based on 30 BOO (see Appendix A). In the following we drop the number 30 and use only the acronym BOO to refer to this method. However, the degeneracy of OP like BOO, for which the same OP value can correspond to different local environments [52], could result, in some specific application, in a suboptimal performance due to misidentification. For example, the use of Steinhardt OP, especially of those related to the spherical harmonic with angular momentum l = 3 and m = 2 (Y 32 ), which is the only one with tetrahedral geometry, to distinguish I c from I h in water has been already questioned in previous works [53]. In order to resolve also the issue related to the degeneracy of the OP, in the following section we introduce a novel lossless order parameter for the characterization of local environments.

B. Local inter-distance (LID)
Here we introduce a novel order parameter for the characterization of local environments that is built according to the following two principles. Firstly, the OP is highdimensional : increasing the dimensionality of the order parameter space allows to easily increase the separation between the different populations of local environments we want to discriminate between. Secondly, the OP is lossless: with this we mean that no information is lost by going from the real-space coordinates of the particles in the environment under consideration to its orderparameter representation; in other words, from the OP it is possible to reconstruct the original positions of the particles, except for translations, rotations, or particleindex permutations. Indeed, this method is based on the distances between all possible pairs obtained from a particle and its neighbors, and the problem to establish whether to the set of all possible inter-distances between a number of points corresponds only one points configuration dates back to the problem of the uniqueness in the x-ray analysis of crystal structures [54], in which case only very few specific exceptions are known.
The new order parameter is inspired by the permutation invariant vector of Ref. [55,56] and the Deep Potential Molecular Dynamics method of Ref. [57], and is constructed in the following way: for each particle i we make a list of its first (f j i ) and second (s k i ) nearest neighbors, with j = 1 · · · N and k = 1 · · · M , where N and M is the number of first and second nearest neighbors respectively. We then compute all the (N + M + 1)(N + M )/2 possible distances d pq = | r p − r q | between particle p and particle q with p, q = 1, 2, ...., N + M + 1 and p = q and subdivide them in the following groups. For HS we group the d pq in 5 categories: . In mW water we group the d pq in 6 categories where now f j i and s k i are the first and second energetic neighbors of particle i. The 6 categories are: , and (f j i , s k i ) (36 terms) where s kj i is a second neighbor of particle i which is also first neighbor of particle f j i . The number of terms in each category is obtained by considering N = 12 and M = 6 for HS, while N = 4 and M = 12 for mW water. These values for N and M are related to the number of first and second neighbors in the crystalline structures forming in these models.
The distances in each group are then sorted in ascending order. This makes the OP invariant under particleindex permutations. Since in the NN we use the Sigmoid as activation function (see Sec.II C), which works better with inputs between −1 and 1, we normalize the grouped and sorted distances d g,s pq for the average local environment radius r 0 (considering the first neighbors shell for HS and up to the second shell for mW water), and subtract from it the total normalized inter-distances d g,s pq /r 0 out (considering all outputs of the NN). Finally, the order parameter we introduce here, which we name LID (local inter-distance), is the vector obtained from the union of all the groups: d g,s,n pq = d g,s pq /r 0 − d g,s pq /r 0 out . To emphasize the advantages of LID we will compare its results with the ones obtained via either a lowdimensional method, i.e. Common neighbor Analysis (CNA), or via a high-dimensional (but not lossless [53]) order parameter constructed as an array of 30 different bond orientational order (BOO) parameters (built from spherical harmonic invariants of order up to l = 12, see Appendix A).

C. Neural Networks classification scheme
To partition a multidimensional OP space in different volumes, each one associated with the local environment of a crystalline structure or liquid phase, we use artificial neural networks (NN). In condensed matter NN have been used for potential energy surface calculations [57,58], to construct accurate molecular force fields [59], to improve potential energy of coarse grained models for water [49], or for identification and classification of local ordered or disordered structures using supervised [60][61][62] and unsupervised [63][64][65][66][67][68] methods. Ideally, unsupervised learning allows to cluster high-dimensional OP space into sets corresponding to different structures before they have been identified [63][64][65]67]. If all possible structures present in the system are known a priori, supervised learning is a powerful method to identify local structures not requiring any threshold chosen ad hoc and being less sensitive to hyper-parameters. We choose here supervised training, in which the NN is first trained against sample configurations of the phases we are interested in identifying. For the training we use bulk configurations prepared at coexistence conditions, where thermal fluctuations in the solid phases are maximized. For the HS system this corresponds to preparing bulk fcc, hcp, bcc, and fluid configurations at pressure P = 11.54 (in conventional reduced units) [69], and running event driven molecular dynamics simulations, and using each local environment as a training sample. In detail, the training set for HS is obtained by 10 different realization of fcc, hcp and bcc crystals at the melting point (φ = 0.545), and 20 different realization of the liquid phase at the freezing point (φ = 0.494), all composed of N ∼ 10000 particles. The training set for mW water is obtained by running Monte Carlo simulations at ambient pressure of 10 different realization of I c , I h at the melting temperature T m = 275K, and of ice 0 at its melting temperature T m = 244K (being metastable it has a lower melting temperature), and 20 different realization of the liquid phase at the melting temperature T m = 275K, all composed of N = 5376 particles. We notice here that we don't train the NN against surfaces, as these would require an external criteria in order to be defined (such as the Gibbs dividing surface), and we are only interested in bulk-like local environments. We choose a single-layer feed-forward network topology. As descriptors we consider the 30-dimensional (both for HS and mW water) BOO OP, and the 171-dimensional (for HS) and the 136dimensional (for mW water) LID OP. The hidden layer (HL) for BOO (both for HS and mW water) is composed of 8 nodes. We obtained the same performance varying the number of nodes in the HL from 4 to 20, indicating that the network is quite robust. The HL in LID is composed of 30 nodes (both for HS and mW water). Also in this case we observe the same performance of the network for a wide range of nodes in the HL. We initialize the weights following the Xavier method [70], consisting to set random weights from a normal distribution with zero mean and variance equal to 2 divided by the sum of the number of nodes in the input layer and the output layer. We consider the Sigmoid, or logistic function, as activation function for both IL-HL and HL-OL, where IL and OL are the input and output layer, respectively. The OL is composed of 4 nodes, which correspond to the 4 possible phases identified during the homogeneous nucleation of HS and mW water at the thermodynamic conditions considered here. As error or loss function we take the overall mean square error between the actual and the target output. We minimize the error using the stochastic gradient descent (for a critical discussion see [71]) and update the weights following the back-propagation approach [72]. The performance of the NN is higher than 98% in all cases. The absence of overfitting is verified by obtaining the same performance considering both the test and the training set. For all cases we set the learning rate to α = 0.01, while the number of epochs is 50 for BOO and 100 for LID for both HS and mW water.

D. Nucleus identification
After all particles in the system have been classified as belonging to a specific phase, in order to identify clusters of solid particles we use the same method employed in Ref. [74]: two solid particles are considered to belong to the same cluster if their distance is smaller than the value of the first minimum of the radial distribution function of the liquid (which results to be ∼ 1.5σ HS in HS, and ∼ 1.5σ mW in mW water). After a solid particle has been added to a cluster, the enumeration needed to distinguish the different clusters is obtained by using the Hoshen-Kopelman algorithm [75].
Other methods used to identify neighbors are the Voronoi construction, which is parameter free, but computationally expensive and sensitive to thermal fluctuations [76,77], and the solid-based nearest-neighbor (SANN) algorithm [76], which is parameter free and more robust against thermal fluctuations respect to the Voronoi construction, but occasionally can include second shell neighbors in the first shell neighbors [66].

A. Homogeneous nucleation of hard spheres
Here we consider non-overlapping hard spheres, the reference model for systems with excluded volume interactions [78]. We perform event driven molecular simulation of N=100000 hard spheres at constant volume V using the open-source event-driven particle simulator DynamO [79]. The phase diagram of hard spheres is a function of φ = N v/V , which is the fraction of the box volume V covered by the N spheres, each sphere having volume v = (π/6)σ HS . σ HS is the sphere diameter. We consider 100 different trajectories simulating different initial configuration of supersaturated fluids at volume fraction φ = 0.535, between the freezing φ = 0.494 and the melting φ = 0.545 value. Each configuration of supersaturated fluid is obtained using a Monte Carlo method which moves, consisting on expansion of spheres diameter, are rejected if at least two spheres volume overlap. φ = 0.535 is close enough to the melting value such for the supersaturated fluid to nucleate easily, but far enough such to avoid multiple critical nuclei growing and eventually merging together. Indeed, in all the 100 different trajectories simulated, we always observe one critical nucleus growing within the maximum number of collisions simulated, which is 10 10 , i.e. an average of 2 × 10 5 collisions per particle. We also performed simulations for φ = 0.54 and observed multiple nuclei growing and merging during the nucleation process.

B. Homogeneous nucleation of mW water
The mW model of water is a popular coarse grained representation of water, where the molecule is replaced by a single site having both two-body and three-body interactions [1,16]. We perform Monte Carlo simulations of N=4000 mW particles in the NPT ensemble at pressure P=0 Pa and temperature T=204 K. At these thermodynamic conditions the mW water model spontaneously nucleates within the maximum time simulated. We consider 100 different trajectories simulating different initial configuration of supercooled fluid. Each supercooled fluid configuration is obtained using the same Monte Carlo method employed to get supersaturated HS fluid.
A system with directional tetrahedral interaction has the potential to offer additional insights on nucleation pathways, as, in principle, it can involve many polymorphic structures [80]. We focus here on the stable ice I polytypes (the cubic form I c , and the hexagonal form I h ), and on the metastable ice 0 structure [81,82]. We choose this polymorph as it is currently the only known structure to satisfy all the following criteria: it has the lowest free energy outside the stable cubic and hexagonal (ice I) structures [83]; it is the simplest structure that can be built by deformation of the diamond crystal while preserving to a large degree a highly regular fourfold coordination for the sites [84]; it can stack coherently (without breaking of bonds between grains) with the diamond crystal [4]. These structures have never been observed as fully formed crystals, and instead we focus on clusters of molecules whose nearest-neighbor environment is close to those found in the bulk ice 0 crystal. It has been recently shown that these clusters have a lower energy than their stable ice I counterparts up to cluster sizes of around 40 water molecules [4]. Particle structures from the same configuration have been identified using the following methods: CNA (a), BOO (b), and LID (c). In all panels the colors associated to fcc, hcp, and bcc structures are blue, green, and red, respectively. We show average composition of the main cluster as identified by CNA (d), BOO (e), and LID (f). Insets in (d), (e) and (f) show a typical nucleus composed of 188, 398 and 502 particles, respectively. The inset to the right in (f) shows the average ratio r = n f cc /n hcp between the number of particles composing the nucleus in the fcc and hcp phase using LID for local structure identification. (g) Average radial fractional composition of the main cluster (for clusters of size 500 ≤ n ≤ 550) as identified by LID. dcm is the distance from the center of mass of the cluster and σHS the hard sphere diameter. Dashed fitting lines are a guide for the eyes. Snapshots obtained by using Ovito [73].

IV. RESULTS
A. Hard spheres

Nucleus composition
In Fig. 1 we show the results from the homogeneous nucleation of the HS system obtained from 100 independent event-driven molecular dynamics [79] trajectories of 10 5 particles at the volume fraction φ = 0.535. The snapshots in panels a, b, c compare the same simulation configuration of large-scale grains, which are colored according to the classification output of CNA, BOO, and LID order parameters respectively. The color indicates the detected phase: blue, green, red respectively for the fcc, hcp, and bcc local environments. Already from a quick visual inspection, we see that both the CNA and BOO methods have a lower resolution of grains respect to LID whenever there is a high degree of hcp and fcc stacking. On a quantitative level, panels d, e, f report, for the same order parameters, the average fraction of the different polymorphs within the largest nucleus as a function of the nucleus size n. All methods do not detect any bcc in the nucleus, as was already found in Ref. [85]. Both the fcc and hcp fractions instead grow linearly (volumescaling) with n. If we define the ratio r = n f cc /n hcp (where n f cc and n hcp are the number of particles in the fcc and hcp phase, respectively) we see that the lowdimensional method CNA gives a value (r = 1.31 ± 0.05) that is incompatible with both multidimensional methods: r = 1.07±0.05 for BOO and r = 1.00±0.05 for LID. A ratio r ∼ 1 is indeed expected during the growth stage given the low free energy difference between the fcc and hcp phases and the fact that the crystals are polytypes, i.e. they can stack onto each other with considerable entropy gain [5]. Both multidimensional methods agree within the error. The snapshots in panels d, e, f show a nucleus identified by the different order parameters from the same configuration. We note that the number of particles identified as crystalline varies considerably depending on the method: 188 particles for CNA, 398 for BOO, and 502 for LID. The multidimensional methods that use the NN detect larger nuclei as they have been trained with configurations of crystal structures at melting, thus including as much thermal fluctuations as possible without breaking the crystal order. LID, as we will confirm below for the mW model, is particularly effective even for distorted local environments.
In Fig. 1g we focus on the LID method and show both the composition (full symbols) and density (open symbols) profiles, as a function of the distance from the centre of mass of the nucleus, averaged over nuclei of size 500 ≤ n ≤ 550 . This size was chosen to be well above the critical nucleus size: from a mean-first passage time of the nucleating trajectories (see section IV B 1 for a theoretical description in the case of mW water) we estimate the critical size to be n c ∼ 180 for the LID order parameter, meaning that the profiles in Fig. 1g are for nuclei which are three times this size. The figure reveals two important characteristics of two-step nucleation pathways. The first is the decoupling between the density and structural order fields. The open symbols represent the normalized density ρ * = (ρ − ρ f )/(ρ x − ρ f ), such that the values 0 and 1 are assigned respectively to the bulk density of the fluid and crystal phases; ρ = 1/ v and v is the average specific volume computed via a Voronoi tesselation. As is seen here, and contrary to CNT assumptions, the nucleus reaches only about 80% of its bulk density close to the centre. Recently, for HS it has been confirmed that using CNT in combination with bulk quantities yields inaccurate results in the description of nucleation [86]. The second characteristic is the difference in profiles for the fcc (blue symbols) and hcp (green symbols) polytypes. While the fcc phase is found more abundantly near the centre of mass of the nucleus, hcp has a relative higher concentration towards the surface with the fluid. This is the onion structure mentioned before. In the next section we examine in more detail the nucleation pathway of these structures.

Onion-like structures
The imbalance between the two polytypes, fcc and hcp, is measured with the ratio r = n f cc /n hcp , which we plot in the right-inset of Fig.1(f) as a function of the cluster size n. The ratio is not constant: it shows a predominance of hcp for small values of n, which then converges towards an homogeneous composition as the size n increases. As we noted in Fig. 1g, at sizes above the critical value, nuclei are also not homogeneous, with the fcc phase being more abundant on average towards the centre of the nucleus.
To understand the appearance of onion-like structures, in Fig. 2 we plot the average radial fractional composition of crystalline clusters for different sizes, ranging from pre-critical nuclei to nuclei just above the critical size. The figure confirms that there is a transition between spatially uniform nuclei (n 200) where hcp is the majority component, to larger nuclei where the core becomes more abundant in fcc and the outer layers in hcp. Visual inspection of these nuclei reveals that the presence of an fcc-rich core surrounded by stacking faults. There are two reasons for the size-dependent stability of fcc cores. The first one is that fcc is a cubic crystal, and thus can form stacking disorder along 4 independent directions (along the 1,1,1 planes) instead of only one direction as in the case of the hcp crystal (which has hexagonal symmetry, and where the only stacking direc-tion is the one perpendicular to the basal plane). The inset of Fig. 2 shows a snapshot from the formation of these structures: an fcc core (blue particles) developing stacking faults in two directions (green hcp particles). The second reason is that the intersection of the stacking planes growing in different directions creates five-fold coherent grain boundaries from which the crystal can go radially maintaining an fcc-rich core. These grain boundaries were first observed in Ref. [87] for HS particles.
These observations are confirmed by looking at the radial fractional composition for large clusters, plotted in Fig. 3. With increasing size, the core of the nuclei retains the fcc-rich character, while the surface develops an intermediate plateau with equimolar composition. This region is due to random stacking along one or multiple 1,1,1 planes that emanate from the nucleus core. An example of this process is shown in the inset of Fig. 3. The preference for hcp in the outer-most part of the surface of nuclei shown by clusters of all sizes can be explained with the preference of clusters of tetrahedra in the liquid phase to coalesce via their faces in order to form locally dense aggregates [88], and this prevalent tetrahedral arrangement is compatible with the hcp phase.
We now investigate the transition between pre-critical homogeneous nuclei and onion-like structures. To characterize the change in structure we compute the gyration tensor S αβ where α, β = x, y, z, and r i α is the α component of the position vector of the particle i belonging to the cluster. The eigenvalues of S αβ are also called principal moments and can be written as the ordered elements λ 2 x ≤ λ 2 y ≤ λ 2 z , and the radius of gyration is defined as R g = T r(S) = λ 2 x + λ 2 y + λ 2 z . In Fig. 4 we plot the normalized histograms where P (n, x) is the reduced probability distribution function taken from our simulations data, with n being the size of the nucleus, and x = R g (radius of gyration) in the left panel and x = f = n c f cc /n c (see definition in the following) in the right panel.
We first examine x = R g (left panel). Up to the critical nucleus size, F (n, x) coincides with the potential of mean force for the two reaction coordinates n and x = R g . The dashed blue line indicates the critical nucleus size, while the red dashed lines are power laws with R g ∼ n 2/3 (surface scaling) and R g ∼ n 1/3 (volume scaling) respectively. The figure shows that there is a clear distinction between pre-critical clusters, with large surface fluctuations, and post-critical clusters. Large surface fluctuations for small nuclei are compatible with previous experimental observations on repulsive colloids [33,89]. The majority of precritical nuclei are not compact enough for barrier crossing, and the path with the smallest barrier selects nuclei from the population with small radius of gyration (compact nuclei). This transition occurs in correspondence of the nucleus size where onion-like structures start to appear. Indeed, stacking and defects like grain boundaries, which favour the formation of fcc in the inner part of nuclei, can take place only when they are compact enough.
In the right panel of Fig. 4 the reaction coordinate x = f = n c f cc /n c is given by the fraction of fcc particles in the core of a nucleus, where the core is defined by a sphere of radius 3σ HS centred in the barycentre. Results for different values of the sphere diameter are qualitatively similar. This choice in the computation of f allows us to better highlight the transitions in the core for the small cluster sizes we are considering here. From the plot we can see a distinction between the fcc-core-poor (f < 0.5) basin at small n, and the fcc-core-rich (f > 0.5) basin at large n. Lines represent contour lines. The saddle point is found at a value of n close to the estimated value of the critical nucleus.
Overall, Fig. 4 shows that crystal nuclei that pass the nucleation barrier are more compact and have a higher fcc content compared to pre-critical nuclei.
B. mW water

Critical nucleus
First of all, we estimate the size of the critical nucleus by using the mean first passage theory [90,91]. This theory allows to estimate the average time at which the growing nucleus overcomes the nucleation barrier, and then to estimate the critical nucleus size n c . The mean first passage time t f p (n), which gives the average time after which a nucleus of size n appears first in the system, is given by where k is the nucleation rate, erf is the error function, and Z = −∆G (n c )/(2πK B T ) is the Zeldovich factor. ∆G is the second derivative of the formation free energy of nuclei. n c corresponds to the value of n where the curvature of t f p (n) changes its sign. In Fig.5 we show t f p (n) versus the nucleus size n. From it we can see a big variation in the estimation of n c from the different methods compared here. To summarize these results, Ext-CNA, Ext-CNA-1st, Chill+ and BOO give a small value for n c going from 4 to 20.q 12 and LID give values for n c close to each other, that is 41 and 47, respectively, whileq 4q6 gives a value for n c which is very sensitive to the protocol employed to compute it (see Appendix A).

Composition ratio of the nucleus
After estimating n c we analyze the composition of the main cluster obtained from the different identification methods. In Fig. 6 we show the ratio r = n Ic /n Ih between the number of particles belonging to the main cluster which are associated to the cubic ice (n Ic ) and those associated to the hexagonal ice (n Ih ) versus the normalized nucleus size n/n c , where n c is the critical size of the nucleus given by the method under consideration. We do not show the ratio of particles of the nucleus identified as ice 0 because only some of the methods analyzed here include it between the possible crystal phases.
As shown by Prestipino in Ref. [24], theq 4q6 method can give predictions on the composition of the nucleus completely different by changing the protocol used to compute and partition this 2-dimensional OP. Here we consider different protocols obtaining different values for the ratio r and report the results separately in Appendix A.
Ext-CNA and Ext-CNA-1st give a preference to I c with a value of the ratio r between 1.3 and 1.4. BOO predicts a value r ∼ 1.4 for small normalized nucleus size, while for larger normalized nucleus size it drops to values closer to 1 (r ∼ 1.1). Chill+ has a strong imbalance towards ice I c for small sizes and reaches r ∼ 1 only for large nucleus size. Only LID measures r ∼ 1, except for small cluster size, where the hexagonal ice becomes predominant, a similar behavior to what we observed for hard spheres (see Sec.IV A 2). As shown in Sec.A 8, the ratio r ∼ 1 given by LID and Chill+ at large nucleus size is not observed in other methods. The larger value of r of these methods comes from the fact that they perform well only near the centre of the nucleus, which comprises a majority of cubic ice environments, and perform worse near the surface, where the hexagonal environments are more abundant than cubic ones. Fig. 7 shows the average radial composition for nuclei of size 150 ≤ n ≤ 200 obtained using LID. We find the same nucleation property that also characterized HS nuclei: while the overall average composition is the same between the stable ice I c and I h polytypes (r ∼ 1), the cubic diamond is more abundant than the hexagonal diamond near the centre of the nucleus. But the mW model also offers additional insights respect to the HS system. LID is the only method to detect the presence of ice 0 like structures (red symbols), whose growth is slower than the volume growth of both the ice I polytypes (See Fig. 20). Fig. 7 indeed confirms the presence of a small population of 0-like environments which peaks towards the surface of the nucleus. An independent confirmation of this unusual surface behaviour of mW water can be seen in the inset of Fig. 7 where we plot (orange symbols) the normalized density ρ * = (ρ − ρ f )/(ρ x − ρ f ), where ρ x = 0.985 g/cm 3 is the bulk density of the ice I phase and ρ f = 0.980 g/cm 3 is the density of the bulk liquid phase, at the thermodynamic conditions considered here. Importantly, the density of ice 0 (ρ = 0.953 g/cm 3 ) is lower than both the metastable liquid and ice I crystals at the same conditions. Indeed we observe that, instead of monotonically increasing from ρ f at the surface towards ρ x at the center of the nucleus, the density profile has a very pronounced density minimum towards the surface of the nucleus. The location of this minimum (which is computed independently from any structural order parameter, if not for the location of the center of mass) corresponds exactly to the location of the maximum in the ice 0 population (a grey vertical band is drawn in Fig. 7 to highlight the location of both). To further support the association between the density minimum and the presence of a population of low-density local structures, we have independently computed the local density of particles associated with each environment, and in the inset of Fig. 7 we plot the density ρ * c obtained by weighting these local densities with the fractional compositions obtained from LID (main panel). We see that ρ * c exactly mirrors ρ * , showing that we have obtained a good partial density decomposition. These results offer an even stronger case for the onion-like structure of growing nuclei, which in the case of water appears to be multi-step.

Radial composition of the nucleus
The presence of onion-like structures, and their radial composition is not explained by the small free energy differences between the bulk phases. In fact we observe that the cubic crystals (fcc and I c ) are found more abundantly near the centre of the nucleus, while their hexagonal counterparts (hcp and I h ) are found more abundantly towards the surface. In terms of bulk free energies instead the stable phases are fcc and ice I h in hard spheres and mW water respectively. To account for the ordering of the phases one needs to consider the free energy cost of structural fluctuations, which is size-dependent. It has been observed that small finite-size clusters of the cubic phase gain relative stability compared to the hexagonal phase thanks to the entropy associated with stacking disorder [5,87,92] and the low energetic cost of their grain boundaries [4].
We repeat here the analysis of small (pre-critical) clusters that we performed for HS (see Sec.IV A 2). In Fig. 8 we show the average radial fraction composition of the main cluster for two size range. For clusters of size in the range 20 ≤ n < 50 the nucleus is composed of a mixture of I c , I h and ice 0 with predominance of ice 0 and then of I h . Going from pre-critical to immediately critical clusters, that is for clusters of size in the range 50 ≤ n < 100, the onion-like structure starts to appear with ice 0 forming a peak which shifts towards outer layers for increasing cluster size. Also for mW water, as seen for HS, there is a selection of more compact clusters at the onset of the post-critical regime.

Equilibrium trajectories
To exclude the possibility that the nucleation pathway is due to the non-equilibrium nature of nucleation events at high supercooling, we apply the same analysis to trajectories obtained from Umbrella Sampling (US) simulations. Umbrella sampling, and other techniques such as metadynamics or forward flux sampling, are usually employed in homogeneous nucleation to enhance the sampling of crystalline cluster [4,[92][93][94][95]. In order to test the LID OP against homogeneous nucleation in mW water, which would confirm its ability to capture the local crystalline phases I c , I h and ice 0, we bias the umbrella sampling simulations by using LID as reaction coordinate. For performance reasons, here we construct LID by considering spatial first and second neighbors of a local particle, as done for hard spheres, instead of energetic neighbors. US simulations are performed with N = 10000 mW particles at ambient pressure and T = 218 K.
In Fig. 9 we show the average composition of the main cluster nucleated with US simulations for clusters of size 50 ≤ n ≤ 100, as identified by the LID OP with spatial neighbors. Differently from the spontaneous nucleation pathways analyzed before, US simulations allow us to study the structure of the nuclei in equilibrium. Moreover it allows us to study nucleation at higher temperatures (where spontaneous nucleation would not be observed). Despite these differences, we get a very similar result to that obtained by using LID from spontaneous nucleation (see Fig. 7): I c particles are more concentrated near the center of mass of nuclei, whereas I h particles are slightly more abundant near the surface (note that small differences in the fraction composition between phases are magnified when computing the number of particles in a crystalline phase composing the nucleus because it depends on the square of their distance from the center of mass), and ice 0 particles concentrated around the surface of nuclei. In the Inset of Fig. 9 we show, as in Fig. 7 for spontaneous nucleation simulations, the total fraction of crystalline particles (black diamonds), the normalized density ρ * = (ρ − ρ x )/(ρ f − ρ x ) (orange pointing downwards triangles), where ρ f = 0.995 g/cm 3 , ρ x = 0.983 g/cm 3 and ρ ice 0 = 0.952 g/cm 3 at the present thermodynamic conditions, and the normalized density ρ * c (magenta pointing rightwards triangles) computed by weighting the local densities of each phase with their fractional compositions obtained from LID (main panel). Note that here the linear transformation applied to ρ in order to get a normalized density ρ * differs from the one used in Fig. 7 for the swap of ρ f with ρ x because at T = 218 K ρ f > ρ x , while at T = 204 K it is the opposite (see Ref. [4]).
Similarly to the HS case (right panel of Fig. 4), in Fig. 10 we show the normalized histograms of F (n, f ) = −log[P (n, f )]. The reaction coordinate f is defined in the same way as for HS where the radius of the sphere defining the core is now 3σ mW . Panel (a) shows direct molecular simulations, while panels (b) and (c) are the result of US simulations, again at T = 218K, in which we initialized the configurations using the seeding technique [96] with nuclei in the I c and I sd phase respectively. I sd is the stacking disordered phase. For details on the simulation procedure see Ref. [4]. From panel (a) we can see the presence of the two basins, the I c -core-poor (f < 0.5) at small n, and the I c -core-rich (f > 0.5) at large n, separated by the saddle point located in correspondence of the critical nucleus (at T = 218K n c is close to ∼ 100). The US simulations (panels (b) and (c)) offer a view on the equilibrium landscape of the nucleation process for the formation of different nuclei: I sd nuclei in panel (b), and I c nuclei in panel (c). The potential of mean force for the I sd nucleation shows two channels: one corresponding to a I c -core-poor nuclei at small n, and one with I c -corerich nuclei at large n. The overall process in this case is similar to the one observed in direct simulations (panel (a)). The potential of mean force for the I c nuclei in panel (c) displays a process devoid of the I c -core-poor basin, showing the existence of well-separated nucleation channels [44].

Dynamical behavior
To study the dynamical behavior of the growing nucleus we compute how many particles attaching to the nucleus change their phase and how many do not during the entire dynamical process as a function of the nucleus size. In particular, we trace the evolution of particles in the main cluster in reverse time: for each trajectory we count how many particles of the main cluster which are in a specific phase at the end of the dynamics are still found to be in that phase at the time when they attached to the cluster as a function of the cluster size n at that time. In Fig. 11 we show the conditional probability that a particle in a cluster of size n will stay always in the I c (I h ) phase during the whole dynamics, indicated with blue circles (green squares), and the conditional probability that at the end of the dynamics it will be in the opposite phase, indicated with black diamonds (red triangles). In this case we use the LID method to identify the local structure around each particle.
From Fig. 11 we can see that for critical clusters (that is for n/n c > 1) on average a particle appearing in the main cluster of size n with phase I c (I h ) will stay in that phase for the whole dynamics with a conditional probability p(I c |I c ) 0.93 (p(I h |I h ) 0.77). Also the probability of starting with a phase and ending with the other phase is not symmetric: particles appearing in the main cluster of size n with phase I c (I h ) will end up to be in the I h (I c ) phase with a conditional probability p(I h |I c ) 0.07 (p(I c |I h ) 0.23).
We have seen that hexagonal local environments (more abundant on the surface) are more likely to change to cubic local environments as they get incorporated in the nucleus during the growth stage. To confirm that this transformation occurs on the surface, i.e. soon after local environments become crystalline, in Fig. 12 we compute the probability distribution of the time between the first appearance of the crystalline environment (black diamond symbols for I c and red triangles for I h ) and its last phase transformation. We see that transformations occur exponentially fast in time following the same curve for both transformations, and are thus surface events.

Precursors
Here we investigate the nature of the density decrease in proximity of the surface of the nucleus as found in the radial compositions of Figs. 7, 9. In the upper panel of Fig. 13 we show the size n of clusters identified by LID as a function of MC steps for a specific trajectory. The horizontal dashed red line corresponds to the critical nucleus size n c = 47. We define t * as the time when the nucleus has the critical size n = n c for the last time during the growth process (vertical dashed orange line in the figure). In the lower panel of Fig. 13 we show the system density ρ as a function of MC steps for the same trajectory considered in the upper panel. The horizontal dashed blue lines are obtained from density averages over a short time interval, and highlight that ρ decreases in correspondence of the formation of the critical nucleus. At the thermodynamic conditions we consider here (P = 0 Pa, T = 204 K) the density of ice I, ice 0 and liquid phase is ρ = 0.984, 0.953, 0.980 g/cm 3 , respectively (see Ref. [4]). Differently from classical predictions, for which the formation of a crystalline nucleus should correspond to an increase in density at the present thermodynamic conditions, here we see the opposite. As discussed in Sec.IV B 3 this density decrease can be explained by the formation of ice 0-like local structures in correspondence of the nucleus surface (see Fig. 7).
The same trend is observed in all nucleating trajectories i: in Fig. 14 we plot the densities as a function of the time from t * i . In Fig. 14 11. Conditional probability that a particle attaching to a nucleus of size n will -stay in the same phase until the end of the dynamics (black circles for Ic, and green diamonds for Ic) -change phase at the end of the dynamics (red squares for Ic which transforms into I h , and blue triangles for I h which transforms into Ic). each trajectory i, and the dashed green circle indicates pre-critical nuclei. From Fig. 14 we see that the average density steadily decreases from pre-critical precursor regions.

V. CONCLUSIONS
In two-step nucleation an intermediate phase is in a size-dependent competition with the stable phase (fcc vs hcp in hard spheres or cubic ice vs hexagonal ice in mW water) [44,97,98]. Here we consider this phenomenology in the particular case of polytype nucleation. We have studied the microscopic nucleation pathway in systems characterized by a competition of different polytypes, whose bulk free energy properties don't discriminate between them. Even in systems where no classical argument for a two-step process is expected, we find a selection of critical clusters with a compact structure that leads to the formation of onion-like structures, thus considerably extending the number of system showing this type of nucleation mechanism [30,[39][40][41][42][43]. In particular, our results highlight the role of structural fluctuations in nucleation phenomena [30,44].
Our results hinge on the development of a novel order parameter for local structure identification which is multidimensional and lossless, and is shown to successfully characterize these complex nucleation pathways and to identify local structures with high accuracy. A proper polymorph decomposition, for example, is essential in the determination of the nucleation rate [20]. We believe that the generality and flexibility of our method makes it suitable for the study of a large range of systems showing characteristic ordered or disordered signatures, such as defects or interfaces in crystalline or amorphous materials.

ACKNOWLEDGMENTS
We acknowledge support from the European Research Council Grant DLV-759187. We thank A. Attanasi and M. Mosayebi for useful discussions. The Common neighbor Analysis (CNA) method [99] assigns a structure type to every particle based on a nearest-neighbor graph accounting for the bond connectivity among neighbors of a given particle. Particles are considered to be neighbors if they are closer to each other than a specific cutoff. In the present work, for HS we employ the adaptive Common neighbor Analysis (a-CNA) method [100], in which an optimal cutoff radius is automatically computed for each individual particle. A major disadvantage of CNA is that no structure type is assigned to particles with unknown signatures, and it is sensitive to thermal fluctuations [101].

Extended common neighbor analysis (Ext-CNA)
In order to assign cubic or hexagonal diamond structure type to a water Oxygen, information on the position of his second nearest neighbors (i.e., second shell) are needed. In the diamond structure, nearest neighbor Oxygens don't have common neighbors, and the second and third shells are not well separated. In order to apply the CNA method to identify diamond structures, the extended CNA (Ext-CNA) has been introduced in Ref. [102]. In the software Ovito [73] it is available as Identify diamond structure function. In the Ext-CNA, the CNA method is applied to the 12 second nearest neighbors of a central particle, which are found as the first neighbors of the first 4 neighbors of the central particle under consideration. We refer to this CNA method to identify particles in the mW water model. We also consider the method we name Ext-CNA-1st (available as option in Ovito), which includes in the ice I structures also particles being first neighbors of a particle classified as ice I by the Ext-CNA method. These additional particles have four first neighbors positioned on the right lattice sites of the relative ice I structure, but at least one of its second nearest neighbors is off lattice.

Polyhedral template matching (PTM)
This method is based on the topology of the local particle environment [101]. It makes use of the convex hull formed by a fixed number of neighboring particles, which are identified using a Voronoi-based method. The planar graph representing the convex hull is used to classify structures. PTM is less sensitive to thermal fluctuations respect to a-CNA, but it still requires the definition of reference structures. In Fig. 15 we show the nucleus spanning the simulation box displayed in the main text in Fig.1, but here identified by using (from left to right) a-CNA, PTM, BOO up to second shell, BOO up to first shell, and LID. We notice that BOO up to the second shell shows problems in distinguishing parallel layers of alternating phases when they are close to each other, while BOO up to the first shell improves the identification of those parallel layers of alternating phases, even though it gives similar results on the composition of polytypes respect to BOO up to second shell.

Chill+
Chill+ [103] classifies cubic ice, hexagonal ice, and clathrate hydrate structures in water. It is based on the identification of staggered and eclipsed bonds: since an Oxygen atom in crystalline ice is 4-coordinated (first neighboring shell), if we consider two neighboring Oxygen atoms, we can look at the cluster of 8 atoms composed by these two and their first neighbors. Looking at the atoms along the axis of the bond between the first two atoms, if all the six neighboring atoms are visible we have a staggered bond, while if we see three neighboring atoms we have an eclipsed bond. Because the presence of thermal fluctuations and other effects distorting bonds, as for other methods comparing local environments to a reference structure, thresholds to establish if a bond is close enough to the perfect staggered or eclipsed bond and then being identified with it have to be introduced. In particular, if the bond order parameter q 3m is between 0.25 and -0.35 the bond is eclipsed, while if it is less than -0.8 the bond is staggered. The crystalline structure associated to an Oxygen atom depends on the number of eclipsed and staggered bonds. For example, hexagonal ice has 1 eclipsed and 3 staggered bonds, while cubic ice has all 4 bonds staggered. This method is specific for water.

Bond orientational order (BOO)
Steinhardt or bond orientational order (BOO) parameters q l (i) and w l (i) describe local order (as seen from particle i) in terms of spherical harmonics of order l. They are based on the following complex vector q lm (i) associated to the particle i where N b (i) is the number of neighbors of particle i, l is integer and m is an integer running from m = −l to FIG. 15. Homogeneous nucleation of hard spheres. Particles structure from the same configuration (snapshot) have been identified using the following methods (from left to right): a-CNA, PTM, BOO up to second shell, BOO up to first shell, and LID. In all panels the colors associated to fcc, hcp, and bcc structures are blue, green, and red, respectively. The calculation of a-CNA and PTM, and snapshots visualization have been obtained by using Ovito [73]. m = l, Y lm (r ij ) are the spherical harmonics and r ij is the position vector from particle i to j; and on the averaged q lm (i) defined as where the sum runs over the N b (i) plus the particle i. The local bond order, or Steinhardt, parameters q l (i) are defined as The parameter corresponding to a specific value of l capture a specific crystal symmetry. All q l (i) depend on the angles formed by neighboring particles and are independent of a reference frame. The averaged Steinhardt OP q l (i) are defined as Here we considerq 12 only for the identification of the solid nucleus without distinguishing polytypes, which has been used in other works [4,104], and two methods based on BOO OP for the identification of all phases: the standardq 4q6 map, in which case the choice of the protocol to compute and partition the map can strongly affect its application; and a group of 30 BOO, as described in the following, such to considerably increase the dimensionality of the order parameter space which allows to easily increase the separation between the different populations of local environments we want to discriminate between. The OP we use as input for the Neural Networks (NN) is composed of the following 30 BOO: q l (i) with l = 3, 4, ..., 12, q l (i) with l = 3, 4, ..., 12, w l (i) with l = 4, 6, 8, 10, 12, and w l (i) with l = 4, 6, 8, 10, 12. There are different ways to obtain first and second shells of neighbors in order to compute BOO, like SANN algorithm [76] or using a fixed cutoff (see Sec.II D). Here we consider the first neighbors shell as composed by the N particles closer to the particle under investigation, and the second neighbors shell as composed by the M particles closer to the particle under investigation, excluding the first N particles. As N and M we consider N = 12 and M = 6 for HS, while N = 4 and M = 12 for mW water. These values for N and M are related to the number of first and second neighbors in the crystalline structures forming in these models.

6.q4q6 sensitivity to protocols
This method for local structure identification is very popular, but, as discussed in the main text, it is very sensitive to the way in which it is computed and to the thresholds used to partition the map. Here we show that, when applied to the determination of the nucleus size and its composition of mW water, theq 4q6 method can give very different results.
First of all, in order to define the neighbors of a particle i, two approaches are usually employed: considering the n n particles closer to particle i or considering all the n cut particles found at a distance from particle i smaller than r cut . Once theq 4q6 map has been computed, it can be FIG. 16.q4q6 map calculation and partition following method LD-A. Each dot corresponds to theq4,q6 coordinates associated to a particle of the following systems composed of N=5376 mW particles at melting: Ic (blue), I h (green), ice 0 (red), and liquid water (black).
partitioned in different ways.
In Fig. 16 we show theq 4q6 map obtained by considering n cut neighbors with r cut = 1.43σ mW , and the particles phase is associated to fluid ifq 6 < 0.415, otherwise they are crystalline and in particular in the phase I c ifq 4 > 0.425, and I h otherwise (orange dashed lines correspond to these thresholds). This method, named LD-A in Ref. [24], doesn't discriminate between the liquid phase and ice 0 (black and red dots corresponding to the fluid phase and ice 0, respectively, overlap and then cannot be distinguished, see Fig. 16). In Fig. 16 we show also another choice of thresholds to partition theq 4q6 map, that we name LD-A2: particles are fluid ifq 6 < 0.385, otherwise they are crystalline and in particular associated to the phase I c if 2q 4 >q 6 + 0.35,  Average ratio r = nI c /nI h between the number of particles composing the nucleus in the cubic phase (Ic) and the hexagonal phase (I h ) using theq4q6 methods LD-A (orange squares), LD-A2 (dark-green triangles), and LD-B (magenta circles). r is plotted against the nucleus size n normalized by the critical nucleus size nc for each specific method. and I h otherwise (dark green dash-dotted lines correspond to these thresholds). This choice of thresholds allows to better partition theq 4q6 map at melting (not shown here) respect to LD-A.
In Fig. 17 we show another method to obtain theq 4q6 map, where the number of neighbors is fixed to n n = 16, and the threshold are the following: ifq 4 < 0.105 particles are fluid, while crystalline in the opposite case. Crystalline particles are classified as ice 0 ifq 6 < 0.11, and ice I in the opposite case. Ice I particles are classified as I c ifq 4 /0.36 +q 6 /0.45 > 1, and I h otherwise. This method, named LD-B in Ref. [24], allows to discriminate between the liquid phase and ice 0. In all cases theq 4q6 map is computed at the nucleation temperature T = 204 K and pressure P = 0 Pa.
In Fig. 18 we show the average first passage time t f p , described in Sec. IV B 1, as a function of the nucleus size n obtained applying the three different methods considered here to compute and partition theq 4q6 map. We can notice the big variation in the value of the critical nucleus size n c estimated from the different methods.
In Fig. 19 we show the ratio r between the number of particles n Ic in the cubic phase and the number of particles n I h in the hexagonal phase found in the nucleus as a function of its size n divided by the critical nucleus size n c applying the three different methods considered here to get theq 4q6 map. As found for the average first passage time, also in this case each method gives a different estimation of r (averaging only on the stationary part, that is excluding small cluster size): 0.94, 1.07, and 0.07 for LD-A, LD-A2, and LD-B, respectively. Even though LD-B is able to discriminate between the liquid phase and ice 0, it is strongly biased towards the hexagonal phase.

Composition of mW nuclei
Similarly to Fig. 1, we show in Fig. 20 the average fractional composition as a function of nucleus size for mW molecules at ambient pressure and temperature T = 204 K as identified by EXT-CNA-1st (panel a), BOO (panel b), and LID (panel c).

Benchmark
Considering the wide variation of results on the nucleus properties predicted by different methods adopted in the literature, some of which analyzed here, it would be desirable to find benchmarks to evaluate accuracy and reliability of each of them. Here we propose a simple test in which we know by construction the phase of each particle belonging to the nucleus and we use different methods to identify them. We consider a cluster composed of particles of both phases ice I c and I h obtained from a perfect lattice of stacking ice with alternated layers of I c and I h at a density ρ = 0.982 g/cm 3 corresponding to the temperature T = 235K at equilibrium conditions (see Ref. [4]). We obtain a cluster of size n = 200 following the minimum energy rule described in Ref. [4]. Then we let the cluster equilibrate in contact with a liquid phase of density ρ = 1.002 g/cm 3 , corresponding to equilibrium conditions at the temperature T = 235K, using fixed-topology MC simulations (see Ref. [4]) which allow for bonds elongation up to a maximum cutoff (set to 1.3Å), while keeping the topology fixed.
Since we know the phase (I c or I h ) of each particle composing the cluster, using different methods we identify each particle phase and compare this prediction with its true value. We distinguish particles of the cluster as belonging to different regions depending on the number of theirs first neighbors (f n) and the sum of first neighbors of first neighbors (f n2) in the following way: for all regions under consideration f n = 4, while f n2 = 16, 15, 14, 13 for regions 1,2,3, and 4, respectively. Only particles belonging to region 1 have a fully formed second shell.
In Table I we show (second column) the average percentage of particles of the cluster belonging to each region (first column), and the percentage of particles correctly identified as I c , or incorrectly identified as I h or as liquid phase, L, for the different methods (columns from third to seventeenth). In Table II we show the same results, but for the identification of I h . For example, for clusters of size n = 200 considered here, particles belonging to region 1 are in average only the 16.4% of the total. These results are obtained by averaging over 20 different clusters realized by using the minimum energy rule and 10 different evolution times. The identification methodq 4q6 is strongly affected by the choice of the protocol used to compute and partition it (see Ref. [24] and Appendix A), and then it is not shown in the tables.
From Tabs. I,II we can see for example that the method Ext-CNA identifies correctly the ∼96% of times cubic and hexagonal ice particles in region 1. When considering other regions the percentage of correct identification quickly goes to zero for increasing region label, that is for more and more incomplete second shells, in which case particles are more likely associated to a liquid phase. This is reflected in the very small value of the critical cluster size, n c = 4, obtained with this method (see Fig. 5). In the case of Ext-CNA-1st the performance in the region 1 is similar to the method Ext-CNA, while particles in other regions are mainly identified as crystalline. However, as we also noted by snapshots inspection, in regions 2,3 and 4 Ext-CNA-1st misidentifies crystalline particles, often associating I c phase to I h particles and vice versa. This is not surprising considering that Ext-CNA-1st associates to the first neighbors of a particle in the I c (I h ) phase the same I c (I h ) phase (see Appendix A), and nuclei tested in the present benchmark are composed of alternating layers of I c and I h phases. For this reason, when using the Identify diamond structure function of Ovito, it would be important to specify if also first neighbors or even second neighbors of crystalline particles are included in the method to compute quantities like for example the cubicity which gives a measure of the amount of I c respect to I h composing the nucleus. Finally, BOO shows a good identification rate with limited misidentifications, while LID and Chill+ give the best performance with extremely low misidentifications.
A conservative way to rate the performance of a method from these benchmarks is to evaluate the percentage of particles correctly identified in region 1 (particles with fully formed second shell), and considering the relevance of misidentification decreasing for increasing region label. From these considerations we conclude that  Ext-CNA is too conservative missing many crystalline particles of the nucleus, while Ext-CNA-1st is affected by an important misidentification of crystalline particles with incomplete second shells. BOO shows low misidentification of crystalline particles. On the other hand, LID and Chill+ are the two methods with the lowest misidentification, with LID showing the best performance for identification of crystalline phases in region 1.
In order to evaluate the influence of thermal fluctuations on the particle identification methods, we repeated the previous benchmark, but this time considering rigid clusters (no bonds elongation) equilibrated with the liquid phase. Also in this case we observe a similar behavior of the different methods.

Correlation between precursors and OP
For each particle i we compute the Euclidean distance d LID i between the vector LID at a specific time and the LID associated to the perfect crystalline structure. Here we consider as reference the LID signal associated to I c , as very similar results are obtained with respect to I h (not shown). In the following we show the value of d LID i associated to each particle of a sample at two specific times (see Fig. 13) at which the nucleus has a size of n = 70 (Fig. 21,22) and n = 197 (Fig. 22), (see the green square for n = 70 and the violet circle for n = 197 in Fig. 13). In Fig. 21 we show snapshots corresponding to the nucleus of size n = 70, where particles i with a distance d LID i smaller than 1.1, 1.2, 1.3, and 1.4 (from left to right and from top to bottom) are shown with a color code going from 1.0 (blue) to 1.5 (red). The field d LID i correlates with crystalline structures present in the system (see top-left snapshot in Fig. 21), and in particular with the main cluster as detected by other methods (see Fig. 22).
In Fig. 22, from top to bottom, we show: particles with the Euclidean distance d LID i < 1.1 (see Fig. 21 for color codes); particles belonging to the main cluster as identified by using LID (blue for I c , green for I h , and red for ice 0); and particles belonging to the main cluster as identified by using the Chill+ algorithm (blue for I c , and green for I h ). Left (right) column in Fig. 22 refers to a snapshot of the nucleation trajectory shown in Fig. 13 at the time t = 102 (t = 141) in 10 4 MC steps units. From Fig. 22 we can see that d LID i correlates very well with the nucleus identified by LID and Chill+, and that the latter method, apart from not providing ice 0 particles, finds a smaller nucleus, as expected from its ability to estimate a smaller value of the critical nucleus respect to LID (see Fig. 5).