Unsupervised learning universal critical behavior via the intrinsic dimension

The identification of universal properties from minimally processed data sets is one goal of machine learning techniques applied to statistical physics. Here, we study how the minimum number of variables needed to accurately describe the important features of a data set, the intrinsic dimension ($I_d$), behaves in the vicinity of phase transitions. We employ state-of-the-art nearest-neighbor-based $I_d$ estimators to compute the $I_d$ of raw Monte Carlo thermal configurations across different phase transitions: first-order, second-order, and Berezinskii-Kosterlitz-Thouless. For all the considered cases, we find that the $I_d$ uniquely characterizes the transition regime. The finite-size analysis of the $I_d$ allows not only identifying critical points with an accuracy comparable to methods that rely on a priori identification of order parameters, but also determining the class of the transitions and accessing the corresponding (critical) exponents. For the case of topological transitions, this analysis overcomes the reported limitations affecting other unsupervised learning methods. Our work reveals how raw data sets display unique signatures of universal behavior in the absence of any dimensional-reduction scheme, and suggests a direct parallelism between conventional order parameters in real space and the intrinsic dimension in data space.


I. INTRODUCTION
The growing field of machine learning (ML) is rapidly expanding our capabilities of analyzing and describing high-dimensional data sets [1][2][3][4]. With the increasing understanding of these methods, the community is becoming convinced that their outstanding performance is largely due to the fact that this "high dimensionality" applies only to the embedding space, while the data sets lie on a manifold that can be twisted and topologically complex, but whose intrinsic dimension, I_d, is typically much smaller than the large number of coordinates of the system [5,6] [see graphics in Fig. 1(A)]. The determination of this I_d is an active field of research [5,7,8] in unsupervised learning (UL), i.e., the branch of machine learning that aims to uncover the internal structure of a data set without the need for any labels.
Recently, ML ideas have found fruitful applications in the context of statistical physics [9][10][11]. Such applications have ranged from the determination of physical properties [12][13][14][15][16][17][18][19] to the formulation of novel classes of variational ansätze [20][21][22][23]. This progress has leveraged the results of dimensional reduction, employing a variety of tools to analyze (or exploit) the final representation (or truncation) obtained in this way. In various contexts, results obtained via these methods have remarkably proven competitive with more traditional approaches [9].
Here, we pursue an alternative approach: our main purpose is to show that, from an ML perspective, physically relevant and universal information can be gathered by analyzing the very same embedding procedure that carries out the dimensional reduction, rather than focusing on its final result. In particular, we show how the intrinsic dimension corresponding to the partition function of statistical mechanics models displays universal scaling behavior in the vicinity of phase transitions, and works as an order parameter for a corresponding structural transition in data space. Differently from previous works [12,[24][25][26][27][28][29][30], our approach is thus focused on mining the data set as a whole, and does not rely on any kind of projection. At the technical level, this is achieved by employing a cutting-edge nearest-neighbor estimator of the I_d, which is suitably designed to deal with non-linear data sets [8].
In order to access the complex data structure at phase boundaries, we study numerically instances of first-order, second-order (conformal), and Berezinskii-Kosterlitz-Thouless (BKT) transitions in two-dimensional (2D) classical spin systems. In all cases, I_d displays a universal scaling behavior consistent with the transition properties of the underlying lattice model. (i) For first-order transitions, I_d peaks at the critical point due to the coexistence of different orders, and the finite-size corrections to the transition temperature are dictated by trivial scaling exponents. (ii) For both second-order and topological transitions, we observe universal scaling collapse, with transition temperature and critical exponents determined to the percent level. (iii) Most importantly, we provide compelling evidence that the I_d is an ideal tool to pinpoint topological transitions in an unsupervised fashion: as an example, we extract the critical temperature of the 2D XY model with 1% accuracy even at modest system sizes.
Before diving into the main part of our manuscript, we provide a simplified picture that qualitatively captures how the intrinsic dimension is connected to the physical information obtained by sampling a partition function via Monte Carlo methods. The basic intuition behind the I_d of data sets generated in the low- and high-temperature regimes of a simple 3-site XY model is drawn in Fig. 1(B). At low temperature [left graphics of Fig. 1(B)], most of the spin configurations sampled during the Markov chain correspond to fully ferromagnetic spin arrangements (see cartoon). In the limiting case T = 0, the ground states are given by XY ferromagnetic configurations, i.e., θ_1 = θ_2 = θ_3, and the data set is described by a manifold that lies on a line (I_d = 1). Conversely, in the high-temperature regime, the data set is described by a manifold with I_d = 3: each new Monte Carlo configuration corresponds to an arbitrary arrangement of the three spins, so that the structure of the data set is that of a homogeneously occupied three-dimensional space.

FIG. 1. Panel (A): The important content of a data set typically lies on a manifold whose I_d is much lower than the number of coordinates. In the example, although the synthetic data set (Klein-bottle-shaped) is embedded in a three-dimensional space, it can be effectively described by a twisted manifold with I_d = 2. The key ingredients to compute the I_d are the first- and second-nearest-neighbor distances, r_1 and r_2, of each point of the data set. The computation of the I_d is based on fitting the empirical cumulative distribution function (CDF) of the ratio µ = r_2/r_1, P(µ) [see text and Eq. (1)]. Panels (B): Low- and high-temperature data sets of a 3-site model in configuration space. The points represent the 3-site XY model configurations θ = (θ_1, θ_2, θ_3). The high- and zero-temperature cases show simple data structures: for T = 100, I_d is equal to the number of spins, while for T = 0, I_d = 1. Panel (C): Intrinsic dimension in the vicinity of a phase transition. The I_d in the intermediate-temperature regime, representative of phase transitions in larger systems, is considerably more complex. The temperature dependence of I_d can be used to signal and characterize critical points. As an example, we show the universal data collapse of the I_d at the Berezinskii-Kosterlitz-Thouless transition of the 2D XY model.
This simple example demonstrates how transitions in parameter space are accompanied by structural transitions in data space. Due to its collective origin, the transition region requires the computation of I_d in a very high-dimensional data space: in Fig. 1(C), we show a sample of our results, illustrating the scaling collapse of the intrinsic dimension of the 2D XY model in the vicinity of its BKT transition point.

II. INTRINSIC DIMENSION
Before diving into the analysis of concrete statistical mechanics models, we present here a self-contained discussion on the intrinsic dimension and its state-of-the-art estimators. This section is propaedeutic to the critical identification of the best estimator to be used in our applications below.
The I_d represents the minimum number of variables needed to describe a data set [7,8]. Information about the I_d is important to determine whether dimensional reduction of high-dimensional data sets incurs information loss. Moreover, it can be used as a UL approach to characterize a system. To mention a few examples: in biological physics, the I_d can be used to determine the number of independent directions a protein can explore during sequence evolution [31]; in image analysis, to distinguish between different kinds of image structures [32]; in astrophysics, to estimate the amount of information available in spectropolarimetric data [33]; in theoretical machine learning, to understand the properties of deep neural networks [34]; and in ecology, to characterize the minimum number of independent axes of variation that adequately describes the functional variation among plants [35].
Different approaches have been developed to estimate the I_d; see Ref. [7] for a review. For example, dimensional-reduction techniques, such as principal component analysis (PCA) [36], Multidimensional Scaling [37], Isomap [38], Locally Linear Embedding [39], t-distributed stochastic neighbor embedding (t-SNE) [40], or Uniform Manifold Approximation and Projection (UMAP) [41], to mention some of them, search for a lower-dimensional space onto which to project the data set by minimizing a projection error. The dimension of the identified subspace is viewed as an estimate of I_d. However, identifying this dimension is far from trivial. For instance, in the PCA case, one should consider the spectrum of the eigenvalues of the covariance matrix and look either for a gap or decide on an ad hoc cut-off parameter. It is worth noting that, for PCA, this strategy fails if the lower-dimensional manifold is curved.
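A minimal synthetic illustration of this curved-manifold failure (our own sketch, not part of the original analysis): points on a circle form a one-dimensional manifold embedded in two dimensions, yet the PCA spectrum shows two comparable eigenvalues and no gap revealing I_d = 1.

```python
import numpy as np

# Sketch: PCA overestimates the I_d of a curved manifold.
# Points on a unit circle form a 1D manifold embedded in 2D,
# but the covariance spectrum shows no gap after one component.
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 2.0 * np.pi, size=2000)
X = np.column_stack([np.cos(t), np.sin(t)])
X -= X.mean(axis=0)

# principal variances from the singular values of the centered data
variances = np.linalg.svd(X, compute_uv=False) ** 2 / len(X)
print(variances)  # both close to 0.5: PCA suggests dimension 2, not 1
```

Any eigenvalue cut-off applied to this spectrum would report a dimension of 2 for a data set that is intrinsically one-dimensional.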
A closely related quantity is the fractal dimension [42], whose estimation relies on the scaling of the number of neighbors with the distance from a given point. This approach is largely employed in the study of percolation transitions [43], but it suffers from serious limitations when the density distribution of points is not uniform.
These limitations led to the development of nearest-neighbor methods, in which it is assumed that points in a small neighborhood can be considered as uniformly drawn from small enough I_d-dimensional hyperspheres (rather than from the full data set) [5,7]. Indeed, the avoidance of any projection step and the relaxation of the data-uniformity condition (from the full data set to a small neighborhood around each point) are key features for obtaining good results on highly non-uniform, non-linear data sets, even in very high dimensions (a regime in which all purely geometrical methods present a bias due to the curse of dimensionality).
The TWO-NN method employed in this work belongs to this class of methods, with the particularity that, by focusing only on the first two nearest neighbors [see Fig. 1(A)], the size of the I_d-dimensional hyperspheres over which the density is assumed constant is reduced to its minimum expression. The method is rooted in computing the distribution functions of neighborhood distances, which are functions of I_d. More specifically, for each point x in the data set, we consider its first and second nearest-neighbor distances, r_1(x) and r_2(x), respectively. Under the condition that the data set is locally uniform on the scale of the second nearest neighbors, it has been shown in Ref. [8] that the distribution function of the ratio µ = r_2/r_1 is

f(µ) = I_d µ^(−I_d − 1), µ ∈ [1, ∞),   (1)

or, in terms of the cumulative distribution, P(µ) = 1 − µ^(−I_d), which can be used to obtain I_d by fitting S = {(ln(µ), −ln[1 − P_emp(µ)])} with a straight line passing through the origin. The function P_emp defines the empirical cumulate and is computed by sorting the values of µ in an ascending order; see Appendix A for more details. In Fig. 1(A), the steps for computing the I_d of a highly non-linear manifold with complex topology (in this case, a Klein-bottle-shaped data set) are summarized: (a) compute the distances to the first and second neighbors; (b) compute µ for each point and its empirical cumulate; and (c) fit S with a straight line. We stress that this method is not free of drawbacks. As mentioned above, being a purely geometrical method, it is affected by the curse of dimensionality, since the number of points needed for an accurate measure of the I_d grows exponentially with the I_d. Moreover, Eq. (1) was derived assuming a continuous real support. Therefore, applying it to data sets with a different support implies some degree of approximation that can fail in some limiting cases. For instance, this happens when two or more configurations have the same coordinates.
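The fitting procedure just described can be condensed into a short brute-force sketch (assuming a Euclidean metric and a data set small enough for a dense distance matrix; function and variable names are ours):

```python
import numpy as np

def two_nn_id(X, discard_fraction=0.10):
    """TWO-NN estimate of the intrinsic dimension from the ratios
    mu = r2/r1 of second- to first-nearest-neighbor distances."""
    n = len(X)
    # squared pairwise distances via the Gram-matrix identity
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    d2.sort(axis=1)                 # column 0 is the self-distance
    mu = np.sort(np.sqrt(d2[:, 2] / d2[:, 1]))
    # empirical CDF of mu; drop the unstable upper tail (cf. Ref. [8])
    keep = int(n * (1.0 - discard_fraction))
    x = np.log(mu[:keep])
    y = -np.log(1.0 - np.arange(1, n + 1)[:keep] / n)
    # straight-line fit through the origin: the slope is I_d
    return float(x @ y / (x @ x))
```

For configuration spaces with a different natural metric, the distance computation would be replaced accordingly while the fit of the empirical cumulate stays unchanged.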
However, as we detail below, these drawbacks do not affect the results obtained in this work: in particular, these limitations do not kick in when investigating transitions, even when configuration spaces are composed of discrete variables such as Ising spins. These limitations only affect data sets corresponding to either very small system sizes, or phases at extremely low temperatures where, during the MC sampling, configurations may be repeated as the accessible configuration space is very simple.

III. MODELS
Our approach focuses on the high-dimensional data sets associated with the equilibrium configuration states of a partition function. Such states are sampled with Markov chain Monte Carlo simulations from the thermal weight ρ(x) ∼ e^(−E(x)/T), where E(x) is the energy of a configuration x and T is the temperature. We employ Wolff's cluster algorithm [44,45], and for each data set we consider N_r configurations.
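A single cluster update of Wolff's algorithm for the 2D Ising model can be sketched as follows (a textbook illustration with J = 1, k_B = 1, and periodic boundaries; not necessarily the production code behind our data sets):

```python
import numpy as np

def wolff_update(spins, T, rng):
    """One Wolff cluster flip for the 2D Ising model on an L x L
    periodic lattice; spins is an integer array of +/-1 values."""
    L = spins.shape[0]
    p_add = 1.0 - np.exp(-2.0 / T)          # bond-activation probability
    i, j = rng.integers(L, size=2)           # random seed site
    seed = spins[i, j]
    cluster = {(int(i), int(j))}
    stack = [(int(i), int(j))]
    while stack:                             # grow the cluster
        x, y = stack.pop()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = (x + dx) % L, (y + dy) % L
            if (nx, ny) not in cluster and spins[nx, ny] == seed \
                    and rng.random() < p_add:
                cluster.add((nx, ny))
                stack.append((nx, ny))
    for x, y in cluster:                     # flip the whole cluster
        spins[x, y] = -seed

# usage sketch: equilibrate a 16 x 16 lattice below T_c
rng = np.random.default_rng(0)
spins = rng.choice([-1, 1], size=(16, 16))
for _ in range(300):
    wolff_update(spins, 1.5, rng)
```

At low temperature the activated clusters span most of the lattice, which is why this algorithm avoids the critical slowing down of single-spin-flip updates.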
We consider partition-function data sets of several models in the vicinity of various types of phase transitions [46,47]. The first example is the well-known two-dimensional Ising model,

E(s) = −∑_⟨i,j⟩ s_i s_j,   (2)

where the spin degrees of freedom are s_i = ±1 and ⟨i, j⟩ runs over the nearest-neighbor bonds of a square lattice, with N_s = L × L spins and periodic boundary conditions. The Ising configuration states are defined as

s = (s_1, s_2, ..., s_{N_s}).   (3)

This model describes a second-order phase transition characterized by the breaking of a Z_2 symmetry at the critical temperature T_c = 2/ln(1 + √2). In the vicinity of T_c, the spin correlation length diverges as ξ ∼ (T − T_c)^(−ν), where the critical exponent is ν = 1.
We also consider the first- and second-order phase transitions described by the q-state Potts model (qPM),

E(σ) = −∑_⟨i,j⟩ δ_{σ_i,σ_j},   (4)

where the spins take values σ_i = 0, 1, 2, ..., q − 1, and δ_{σ_i,σ_j} is the Kronecker delta. In particular, the q = 2 Potts model can be mapped onto the Ising model. The Potts configuration states are defined by

σ = (σ_1, σ_2, ..., σ_{N_s}).   (5)

The qPM is characterized by a discrete Z_q symmetry that is broken at the critical temperature T_c = 1/ln(1 + √q). Importantly, this class of models displays a second-order phase transition for q ≤ 4, and a first-order one for q > 4. We examine both regimes: the second-order transition described by the q = 3 PM (with correlation-length critical exponent ν = 4/5), and the first-order transition described by the q = 8 PM [48,49]. Finally, as a representative of the BKT universality class, we investigate the two-dimensional XY model [50,51],

E(θ) = −∑_⟨i,j⟩ S_i · S_j = −∑_⟨i,j⟩ cos(θ_i − θ_j),   (6)

where S_i = (cos(θ_i), sin(θ_i)), cos(θ_i) and sin(θ_i) being the projections of the spin at site i in the x and y directions, respectively, and θ_i ∈ [0, 2π[. The XY configurations are defined as

θ = [cos(θ_1), sin(θ_1), ..., cos(θ_{N_s}), sin(θ_{N_s})].   (7)

This model is characterized by a continuous U(1) symmetry and describes a phase transition between a high-temperature phase with exponentially decaying spin correlations and a low-temperature quasi-ordered phase characterized by power-law decaying correlations. The BKT critical temperature, T_BKT, is not known exactly; state-of-the-art estimations based on the analysis of the spin stiffness of lattices of order O(10^6) spins give T_BKT = 0.8935(1) [52]. The detection of the BKT critical point is hindered by the fact that it cannot be characterized by conventional local order parameters, as in the examples discussed previously, and by the exponential growth of the correlation length near T_BKT. Hence, the BKT transition represents a key challenging test for any UL method.
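Assuming J = 1 and the lattice conventions above, the three energy functions can be sketched compactly (an illustrative snippet, not the simulation code; np.roll implements the periodic nearest-neighbor shifts):

```python
import numpy as np

def ising_energy(s):
    """E = -sum_<i,j> s_i s_j on a periodic L x L lattice."""
    return -float((s * np.roll(s, 1, axis=0)).sum()
                  + (s * np.roll(s, 1, axis=1)).sum())

def potts_energy(sigma):
    """E = -sum_<i,j> delta(sigma_i, sigma_j)."""
    return -float((sigma == np.roll(sigma, 1, axis=0)).sum()
                  + (sigma == np.roll(sigma, 1, axis=1)).sum())

def xy_energy(theta):
    """E = -sum_<i,j> cos(theta_i - theta_j)."""
    return -float(np.cos(theta - np.roll(theta, 1, axis=0)).sum()
                  + np.cos(theta - np.roll(theta, 1, axis=1)).sum())
```

Each roll counts every bond exactly once, so a fully aligned L × L configuration has energy −2L² in all three models.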
A. How to characterize partition functions as data sets

Before proceeding to the discussion of the results, we point out some important aspects of the Ising, Potts, and XY data sets [see Eqs. (3), (5) and (7), respectively]. First, a crucial step to obtain the I_d [cf. Eq. (1)] is to consider a proper metric; the distance r(x_i, x_j) between two configuration states x_i and x_j must be non-negative, equal to zero only for identical configurations, symmetric, and satisfy the triangle inequality.
For the XY data sets, the distance is defined as the Euclidean distance

r(θ^i, θ^j) = √(2 ∑_{k=1}^{N_s} (1 − S^i_k · S^j_k)).   (8)

This distance properly takes into account the periodicity of the configuration states in the interval θ_i ∈ [0, 2π[. For both Ising and Potts configuration states, we consider the Hamming distance, i.e., r(s_i, s_j) [or r(σ_i, σ_j)] is given by the number of positions in the state vectors (s_i and s_j) for which the corresponding coordinates differ. The choice of the Hamming distance is motivated by the fact that the energy difference between two spins in the models of interest is given by a delta function.
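In code, the two metrics amount to the following sketch (our own helper names; for the XY case the angles are first mapped to the (cos, sin) representation of Eq. (7), which makes the Euclidean distance periodic):

```python
import numpy as np

def hamming_distance(s_a, s_b):
    """Number of positions where two discrete configurations differ."""
    return int(np.count_nonzero(s_a != s_b))

def xy_distance(theta_a, theta_b):
    """Euclidean distance between two XY configurations stored as
    angle vectors; equivalent to sqrt(2 * sum_k (1 - S_a_k . S_b_k))
    and manifestly periodic in each angle."""
    va = np.column_stack([np.cos(theta_a), np.sin(theta_a)]).ravel()
    vb = np.column_stack([np.cos(theta_b), np.sin(theta_b)]).ravel()
    return float(np.linalg.norm(va - vb))
```

Because the comparison happens in the embedded (cos, sin) coordinates, shifting every angle by 2π leaves the distance unchanged, as required.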
As mentioned in the previous section, the TWO-NN method fails when two or more sampled configurations of the data set have identical coordinates. This issue typically occurs in the discrete-variable Ising and Potts data sets, when the total number of independent configuration states, N_c, is smaller than or of the same order as the number of configurations used in the data set, N_r. For instance, for both Ising and Potts data sets, identical ferromagnetic configurations are sampled in most of the Monte Carlo steps when T ≪ T_c. However, in the regime that we focus on here (i.e., T close to T_c and L > 10), where N_c ≫ N_r, this issue is irrelevant (we have explicitly checked this in our data sets). Finally, we mention that data sets generated by the XY configuration states are typically non-linear, which can be seen from the fact that linear dimension-reduction methods, such as PCA, fail to describe XY data sets, see Ref. [28]. In fact, even for the simple data set shown in Fig. 1(B), linear PCA fails to estimate the true I_d of the system when the proper distance between the configurations is taken into account, see Appendix A. This feature of the XY data sets reveals the necessity of using state-of-the-art I_d estimators (such as the TWO-NN method considered here) that properly take into account non-linearities.

IV. RESULTS

A. Second-order phase transitions
We start our discussion by considering the second-order phase transitions (2PTs) described by the Ising and the 3-state Potts (3PM) models, see Fig. 2. We consider data sets formed by N_r = 5 × 10^4 configuration states. Overall, far from the transition, I_d is an increasing function of T. At low T, the computation of I_d is affected by the discreteness of the Potts (Ising) configurations, which is reflected in larger error bars. However, this issue is mitigated close to the critical point, T_c. Remarkably, I_d exhibits a non-monotonic behavior in the vicinity of the critical point [see Figs. 2(a1) and (b1)], which can be used to locate and characterize the transition point itself.
As is conventionally done in the analysis of physical observables, e.g., the magnetic susceptibility and heat capacity, we now consider a finite-size scaling (FSS) theory for I_d. First, based on the FSS hypothesis and postulating that I_d behaves as a true order parameter for the transition, one has I_d = L^ζ f(ξ/L), where the correlation length diverges as ξ ∼ (T − T_c)^(−ν), ν is a critical exponent, and ζ is a scaling exponent associated with the divergence of I_d at T_c. Figures 2(a2) and (b2) show the universal data collapse for the Ising and 3PM models, respectively. The values obtained for T_c and ν have a discrepancy with the exact results of less than 0.5% and 4%, respectively. For the Ising model we obtain T_c = 2.283(2), ν = 1.02(2), and ζ = 0.410(5), while for the 3PM: T_c = 0.996(2), ν = 0.805(5), and ζ = 0.420(2). See Appendix C for details of the data-collapse procedure.
Further, we consider the size scaling of the shift of the local minimum of I_d(T), i.e., of the temperature T*(L),

T*(L) = T_c + a L^(−1/ν).   (9)

We note that T*(L) is reminiscent of the universal scaling behavior of singular features of physical observables close to T_c (e.g., the peak of the magnetic susceptibility) [53]. In order to compute T_c, we employ the following procedure: (i) we obtain T*(L) by fitting the results in an interval close to T*(L) with a cubic function; the fitting is performed with a jackknife procedure, which allows establishing an error bar for T*(L). We then (ii) use the FSS form of Eq. (9) to compute T_c; the fitting is performed considering different sets of points. This method provides a coherent error propagation for T_c, and yields estimates compatible with the exact values. The results of this analysis confirm the validity of our original assumption, namely, that the intrinsic dimension is a valid order parameter describing the transition in data space as a structural transition. We remark that this is validated in two steps: first, via the quality of the scaling collapse, and second, via the scaling of the transition temperature obtained by analyzing a single feature of the I_d dependence on temperature. These represent two fundamental tests that any valid order parameter must satisfy at transition points.
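The two fitting steps just described (locating the minimum with a cubic fit, then extrapolating T*(L) to T_c) can be sketched as follows (a simplified illustration without the jackknife resampling; function names are ours):

```python
import numpy as np

def t_star(T, Id):
    """Locate the local minimum of I_d(T) by fitting a cubic near the
    minimum and returning the stationary point inside the window."""
    T = np.asarray(T, float)
    Id = np.asarray(Id, float)
    c = np.polyfit(T, Id, 3)
    roots = np.roots(np.polyder(c))          # stationary points
    roots = roots[np.isreal(roots)].real
    roots = roots[(roots > T.min()) & (roots < T.max())]
    curv = np.polyval(np.polyder(c, 2), roots)
    return float(roots[curv > 0][0])         # keep the minimum

def extrapolate_tc(Ls, T_stars, nu):
    """Fit T*(L) = T_c + a * L**(-1/nu) by linear least squares."""
    L = np.asarray(Ls, float)
    A = np.column_stack([np.ones_like(L), L ** (-1.0 / nu)])
    coef, *_ = np.linalg.lstsq(A, np.asarray(T_stars, float), rcond=None)
    return float(coef[0])
```

In the actual analysis the cubic fit is repeated over jackknife resamplings of the data, which provides the error bars on T*(L) and, after the extrapolation, on T_c.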

B. Berezinskii-Kosterlitz-Thouless (BKT) phase transition
Unsupervised learning of phase transitions associated with the breaking of a discrete symmetry, as discussed in the previous section, can also be performed with PCA [12,26,54]. For example, the critical temperature of the Ising model can be obtained with an accuracy similar to the ones obtained here. PCA is based on a linear dimensional reduction, and thus differs in a fundamental way from our unsupervised approach, which is based solely on the analysis of the I_d. Furthermore, as we discuss in this section, our approach can be extended to cases where PCA does not work, such as the topological BKT phase transition.
The difficulties in learning the BKT transition from raw XY configurations occur in both supervised [55] and unsupervised [26] ML approaches. In the latter case, this is related to the fact that manifolds generated by XY configurations are usually curved. Recent progress based on diffusion maps [28,56] or topological data analysis [57] has been made to solve this problem, typically relying on problem-specific insights (such as the structure of topological excitations). These approaches have shown how considerable qualitative insight can be gathered on the nature of the BKT transition. However, it is presently unclear whether the raw data structure corresponding to topological transitions can exhibit universal features, and if so, whether unsupervised approaches can be used to detect the critical temperature with an accuracy comparable to conventional methods (which typically rely on a priori knowledge of the order parameter).
In Fig. 3(a), we show the temperature dependence of I_d in the transition region. The intrinsic dimension clearly distinguishes the low-T regime, characterized by bound vortex-antivortex pairs, from the high-T regime, where vortices unbind. In the vicinity of the BKT critical point, T_BKT, the behavior of I_d resembles the one observed for the second-order phase transitions, i.e., I_d exhibits a local minimum at T*(L) (observed for L > 30), which is a signature of the BKT transition. Note that the minimum is clearly visible already for lattices of order L = 50; at these sizes, the spin stiffness instead features a very smooth behavior, and considerably larger systems are required to appreciate a qualitative jump in the latter.
We consider the conventional FSS form for the BKT transition, I_d(T, L) = L^ζ f(ξ(T)/L), where the singular part of the correlation length diverges exponentially, i.e., ξ ∼ exp[a/√(T − T_BKT)]. In Fig. 1(C), we show the universal data collapse for different values of L, where a, T_BKT and ζ are treated as free parameters in the collapse procedure; see Appendix C. The value obtained, T_BKT = 0.92(1), is in good agreement with the estimations of T_BKT reported in Ref. [53].
A more accurate estimation of T_BKT is based on the finite-size scaling of T*(L). This approach relies on the computation of T*(L), performed with the same procedure described in the previous section, and on the finite-size scaling ansatz [53]

T*(L) = T_BKT + a / ln²(bL).   (10)

As discussed before, this procedure allows us to establish an error bar for the calculated T_BKT. We obtain T_BKT = 0.909 ± 0.015, which is compatible within error bars with Ref. [52], where simulations with up to O(10^6) spins were carried out. For comparison, the best alternative method [28] utilizing unsupervised learning techniques reported relative errors of the order of 5%. Conventionally, T_BKT is obtained with the aid of the so-called Nelson-Kosterlitz universal jump of the spin-wave stiffness [58], which allows one to determine the finite-size critical temperature, T*_BKT(L). The FSS of Eq. (10) is then used to determine the BKT critical point in the thermodynamic limit. Remarkably, here we observe that the intrinsic dimension of raw XY data sets exhibits a clear signature of the finite-size T_BKT even for the moderate system sizes we have considered.
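For illustration, a simplified version of this extrapolation fixes b = 1 in the standard BKT shift T*(L) = T_BKT + a/ln²(bL), which makes the fit linear in 1/ln²(L) (our simplification, for a sketch only; the analysis in the text treats the full ansatz):

```python
import numpy as np

def extrapolate_t_bkt(Ls, T_stars):
    """Fit T*(L) = T_BKT + a / ln(L)**2 (the b = 1 simplification of
    the BKT shift) by linear least squares and return T_BKT."""
    x = 1.0 / np.log(np.asarray(Ls, float)) ** 2
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, np.asarray(T_stars, float), rcond=None)
    return float(coef[0])
```

The logarithmic shift is the reason the BKT critical point is so hard to pin down: T*(L) approaches T_BKT only as 1/ln²(L), far slower than the power-law shift of Eq. (9) for second-order transitions.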

C. First-order phase transitions
Finally, we consider an example of a first-order phase transition (1PT): the 8-state Potts model (8PM). As is typical of 1PTs, the system exhibits a finite correlation length at T_c, ξ_8 = 23.9 [59]. For L > ξ_8, the transition can be described by trivial and generic critical exponents, e.g., ν = 1/d, d being the system dimension [49,60,61]. Furthermore, the finite-size shift of the critical temperature, T_c(L), conventionally detected, for example, by the maximum value of the magnetic susceptibility, scales as T_c(L) − T_c ∼ 1/L^d. Figure 4 shows that I_d also exhibits a clear signature of the 1PT, featuring a peak at T_c for L ≫ ξ_8. For L ≈ ξ_8, the temperature dependence of I_d resembles the one observed for 2PTs in Fig. 2, i.e., I_d exhibits a local minimum at a temperature T*. Interestingly, the FSS of T* is in agreement with first-order behavior, see Fig. 4(b) [60,61]; the discrepancy of the calculated T_c = 0.7448(1) with the exact value is less than 0.05%.

V. DISCUSSION
Our results so far support the fact that, in the vicinity of a phase transition, the intrinsic dimension behaves as an order parameter for first-order, second-order, and BKT transitions, and that this order parameter is tied to a transition between different data structures in configuration space. Within this framework, the position of the transition is always identified with the scaling of the minimum of the intrinsic dimension. We now provide a qualitative discussion in support of this picture.
In continuous phase transitions, collective behavior is captured by only a handful of parameters. This suggests that the amount of information required to describe the system is parametrically smaller at the critical point than in its vicinity, as the latter region requires additional information about the operators needed to perturb away from criticality. This emergent simplicity may have several consequences at the data-structure level. The simplest is that one expects a simplified data structure to be signaled by a minimum of the intrinsic dimension at the transition point. This is exactly what we have observed at both second-order and BKT transitions. We note that this expectation is not related to the number of states sampled by the partition function (this number is, in our case, fixed by N_r, as configurations are never repeated). A discussion of how our results change with N_r is reported in Appendix B.
For first-order transitions, the above reasoning is not applicable, as it relies on universal behavior and thus on the existence of a continuum limit. In these cases, one expects that the data space in the vicinity of the transition point features two separate regions, each composed of states representing one of the two phases meeting at T_c. Exactly at the transition point, one expects an abrupt change in the data structure: indeed, the MC sampling will access a large number of configurations corresponding to both phases (in analogy to metastability), and I_d thus displays a sharp increase [see Fig. 4(a)]. Approaching the transition point from the disordered phase, one instead observes a minimum that scales toward the transition point.
The arguments above serve as a qualitative guideline behind the basic picture we put forward: the simplified field theory description applicable at transition points reflects directly into the data structure of the problem. We believe this picture to be correct in view of our results above, and in particular, in view of the quality of the scaling collapses obtained. At the methodological level, it would be an interesting future work to develop an analytical understanding of this fact, based on the very close analogy between the intrinsic dimension and fractal dimensions in structural transitions.
We note that the system sizes at which the intrinsic dimension starts featuring a clear minimum structure are very similar for both second-order and BKT transitions (for first-order transitions, the minimum structure appears once sizes of the order of the correlation length are reached, in agreement with the discussion above). This behavior is quite different from that of conventional one- or two-point correlation functions, whose characteristic features typically appear at very different system sizes for transitions belonging to different universality classes. Indeed, by focusing on the geometrical and 'curvature' attributes of the data, the emergence of the features of the intrinsic dimension depends mostly on the characteristics of the data set of configurations (for instance, the number of data points or the dimension of the embedding space). Therefore, these features appear at similar system sizes when the data space is similar, independently of the class of transition.
We also note that, in comparison with previous applications in other fields [31,34,62], the values of the intrinsic dimension reported here are considerably larger. For future applications, like combining our analysis with clustering methods that do not rely on dimension reduction [63], it may be interesting to develop novel estimators that focus on large values of I d , potentially trading absolute accuracy with numerical efficiency.

VI. CONCLUSIONS
We have shown that phase transitions can be learned through a single property of raw data sets of configurations, the intrinsic dimension, without any need to perform dimensional reduction. The key observation made here is that, in analogy to physical observables, the intrinsic dimension exhibits universal scaling behavior close to different classes of transitions: first-order, second-order, and Berezinskii-Kosterlitz-Thouless (BKT). This indicates that the intrinsic dimension serves as an order parameter in data space, showing how the latter undergoes a structural transition that parallels the phase transition identified by conventional order parameters.
At the practical level, we have shown that the finite-size analysis of the intrinsic dimension allows one not only to detect, but also to characterize critical points in an unsupervised manner. In particular, we have shown that the intrinsic dimension allows one to estimate transition temperatures and (critical) exponents of both first- and second-order transitions with accuracies ranging from 1% to 0.1% at very modest system sizes. In addition, the method is equally applicable to topological transitions, where we have demonstrated an accurate (at the 1% level) estimation of the BKT transition temperature, competitive with more traditional methods at the same system sizes. This latter result suggests that avoiding dimensional reduction allows retaining topological information in the vicinity of the phase transition, which may otherwise be lost [26,28].
A fundamental aspect of our approach is that it is based on an $I_d$-estimation method suitable for learning complex manifolds, such as the twisted XY manifold emerging at the BKT critical point. The results demonstrate the potential of state-of-the-art $I_d$-estimators [8] to tackle many-body problems, and motivate an even stronger methodological connection between data-mining techniques and many-body physics.
Some of the methods presented here may be applied to quantum mechanical objects, such as quantum partition functions, density matrices, and wave functions. It is an open challenge to determine whether the data mining of quantum objects can provide an informative perspective on the latter, such as, e.g., accessing entanglement or other more challenging forms of quantum correlations. Finally, while we focused on configurations generated by Monte Carlo sampling, our approach is equally applicable to experimentally generated data; it may be interesting to apply it to settings where raw data configurations are available, such as, e.g., quantum gas microscope experiments [18,64,65].

an ascending order through a permutation, i.e., $(\mu_1, \mu_2, \ldots, \mu_{N_r})$, with $\mu_i < \mu_j$ for $i < j$ (see Fig. 1 (B)). It is worth mentioning that, while we depict the configurations $\vec\theta = (\theta_1, \theta_2, \theta_3)$ for clarity of illustration in Fig. 1 (B), in our calculations $\vec\theta$ is defined as in Eq. (7). In this way, the distance between two configurations $\vec\theta_i$ and $\vec\theta_j$, $r(\vec\theta_i, \vec\theta_j) = \sqrt{2 \sum_{k=1}^{N_s} (1 - \vec S^i_k \cdot \vec S^j_k)}$, properly takes into account the periodicity of the variables $\theta^i_k$. Another important technical aspect is that the fit of $S$ is unstable for the largest values of $\mu$. As in Ref. [8], we discard the 10% of points characterized by the highest values of $\mu$. Based on this approach, we obtain $I_d \approx 1$ and $I_d \approx 3$ for the zero- and high-temperature regimes, respectively, consistent with the values expected from physically reasonable assumptions (see Figs. 5 (a1) and (a2)).
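For concreteness, the periodicity-aware distance above can be written in a few lines; this is a minimal sketch, with random angles standing in for actual XY configurations. Since the spins are unit vectors, $\vec S^i_k \cdot \vec S^j_k = \cos(\theta^i_k - \theta^j_k)$:

```python
import numpy as np

def periodic_distance(theta_i, theta_j):
    """Distance r between two angle configurations, as defined in the text.

    Each angle is mapped to a unit spin S_k = (cos theta_k, sin theta_k),
    so S^i_k . S^j_k = cos(theta^i_k - theta^j_k) and
    r = sqrt(2 * sum_k (1 - cos(theta^i_k - theta^j_k))).
    """
    return np.sqrt(2.0 * np.sum(1.0 - np.cos(theta_i - theta_j)))

# The metric respects the periodicity of the angles: shifting every
# angle by 2*pi yields an identical configuration at zero distance.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=100)
print(periodic_distance(theta, theta + 2.0 * np.pi))  # 0 up to rounding
```

A naive Euclidean distance on the raw angles would instead assign a large separation to these physically identical configurations.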
In contrast, simple linear dimensional reduction methods, such as Principal Component Analysis (PCA), fail to capture the $I_d$ of the XY data sets. To illustrate this point, we apply linear PCA to the same collection of configurations considered in the last paragraph. As can be seen from Figs. 5 (b1) and (b2), even for this simple example PCA fails to obtain the true $I_d$: for $T = 0$, $I_d^{\rm PCA} = 2$, while for $T = 100$, $I_d^{\rm PCA} = 6$. This failure is related to the fact that the XY manifolds are curved [28].
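The following sketch makes the comparison concrete on a synthetic stand-in for the $T=0$ data set of the 3-site model (all spins aligned at a common random angle, embedded as $(\cos\theta, \sin\theta)$ per site; sample sizes and thresholds are illustrative choices). The true $I_d$ is 1, yet linear PCA needs two components to account for the variance, since a circle is a curved one-dimensional manifold:

```python
import numpy as np

rng = np.random.default_rng(1)

# T=0-like data for the 3-site XY model: all three spins share one angle,
# so the configurations trace a circle (a curved 1D manifold) in the
# 6-dimensional embedding space of spin components.
phi = rng.uniform(0.0, 2.0 * np.pi, size=5000)
X = np.column_stack([np.cos(phi), np.sin(phi),
                     np.cos(phi), np.sin(phi),
                     np.cos(phi), np.sin(phi)])

# Linear PCA: count the components needed to capture the variance.
Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(np.cov(Xc.T))[::-1]   # descending order
explained = eigvals / eigvals.sum()
n_components = np.sum(explained > 1e-3)
print(n_components)  # 2, although the manifold is one-dimensional
```

The two retained components correspond to the $\cos\phi$ and $\sin\phi$ directions: no single linear projection can flatten the circle, which is the curvature effect referred to in the text.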
We also show some examples of the plot of $S$ for data sets generated in the vicinity of the critical points of the Ising and 2D XY models, see Figs. 6 (a) and (b), respectively. In both cases, the points $S$ are well fitted by a straight line passing through the origin. We obtain similar results for the other system sizes and values of $T$ considered in this work.

In this section we discuss the scaling of the $I_d$ with the number of configurations, $N_r$, in the data set; for all the results shown in the main text, $N_r = 5 \times 10^4$. The first important aspect to consider is that TWO-NN is a scale-dependent method: the estimation of the $I_d$ is performed on a length scale set by the first- and second-neighbor distances of each point. Thus, by varying $N_r$, one probes a different neighborhood size, i.e., estimates the $I_d$ at a different scale [8].
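The TWO-NN fit of Ref. [8] (a straight line through the origin in the $(\log\mu, S)$ plane, with the top 10% of $\mu$ values discarded) can be sketched as follows; the sanity-check data here is an illustrative flat 2D manifold, not the paper's spin configurations:

```python
import numpy as np

def two_nn_id(X, discard_fraction=0.1):
    """TWO-NN estimate of the intrinsic dimension (Ref. [8]), minimal sketch.

    For each point, mu = r2 / r1 (second- over first-nearest-neighbor
    distance).  The empirical points S = -log(1 - F(mu)) vs log(mu) fall
    on a straight line through the origin whose slope is I_d; the highest
    `discard_fraction` of mu values is discarded to stabilise the fit.
    """
    # Brute-force pairwise distances (fine for small data sets).
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)
    d2.sort(axis=1)
    mu = np.sort(np.sqrt(d2[:, 1] / d2[:, 0]))   # ascending permutation

    n = len(mu)
    keep = int(n * (1.0 - discard_fraction))
    F = np.arange(1, keep + 1) / n               # empirical CDF of mu
    x, S = np.log(mu[:keep]), -np.log(1.0 - F)
    return x @ S / (x @ x)                       # slope through the origin

# Sanity check: a flat 2D manifold linearly embedded in 5 dimensions.
rng = np.random.default_rng(2)
Z = rng.uniform(size=(1000, 2))
X = np.column_stack([Z, Z @ rng.normal(size=(2, 3))])
print(two_nn_id(X))  # close to 2
```

Only first- and second-neighbor distances enter the estimate, which is what makes the method local and, as discussed next, scale dependent.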
To illustrate how this change of scale affects the $I_d$ of the thermal data sets considered here, we first consider the 3-site XY model in Fig. 7 (a). For $T = 1$, the $I_d$ converges to 3, as expected for the high-temperature regime of this model. In the low-temperature regime $T \approx 10^{-6}$, however, $I_d(N_r)$ exhibits a plateau at $I_d = 1$ for $N_r \in [10^0, 10^3)$. As illustrated in Fig. 1 (B), this simple data set is well described by a one-dimensional manifold, and the plateau in $I_d(N_r)$ is a signature of this soft direction [8]. Nevertheless, by further increasing $N_r$, the $I_d$ increases ($I_d \to 3$ in this case), as an effect of the decrease of the scale at which the $I_d$ is estimated. In this scale regime, the number of soft directions cannot be determined. We stress that the computation of the $I_d$ of the high-dimensional data sets considered here is always performed in this regime, where the $I_d$ exhibits an exponential scaling with $N_r$, as exemplified in Fig. 7 (b).
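This crossover can be reproduced on a toy data set: points on a circle (a one-dimensional manifold) embedded in three dimensions with a small amount of noise. The sketch below uses a TWO-NN estimator in the spirit of Ref. [8], written with scipy's k-d tree; the noise amplitude and sample sizes are illustrative choices, not the paper's parameters:

```python
import numpy as np
from scipy.spatial import cKDTree

def two_nn_id(X, discard_fraction=0.1):
    """Minimal TWO-NN sketch: slope of -log(1 - F) vs log(r2 / r1)."""
    r, _ = cKDTree(X).query(X, k=3)        # r[:, 0] is the point itself
    mu = np.sort(r[:, 2] / r[:, 1])
    n = len(mu)
    keep = int(n * (1.0 - discard_fraction))
    F = np.arange(1, keep + 1) / n
    x, S = np.log(mu[:keep]), -np.log(1.0 - F)
    return x @ S / (x @ x)

rng = np.random.default_rng(3)

def noisy_circle(n, sigma=0.005):
    """n points on a unit circle in 3D, blurred by Gaussian noise."""
    phi = rng.uniform(0.0, 2.0 * np.pi, size=n)
    X = np.column_stack([np.cos(phi), np.sin(phi), np.zeros(n)])
    return X + sigma * rng.normal(size=(n, 3))

# Few samples: neighbor distances far exceed the noise scale, so only
# the soft direction along the circle is resolved (I_d close to 1).
print(two_nn_id(noisy_circle(100)))
# Many samples: the neighbor scale drops below the noise scale and the
# estimate drifts toward the embedding dimension 3.
print(two_nn_id(noisy_circle(40000)))
```

The same mechanism underlies the plateau and subsequent growth of $I_d(N_r)$ in Fig. 7 (a).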
We now discuss how the temperature dependence of the $I_d$ is affected by the change in $N_r$. Figs. 8 (a1) and (b1) show $I_d(T)$ for different values of $N_r$ for the Potts and 2D XY data sets, respectively. Despite the change of the absolute value of the $I_d$ with $N_r$, the qualitative behavior of $I_d(T)$ is not modified. Most importantly, we observe that the position of the local minimum at $T^*$ does not shift with $N_r$ for $N_r > 10^4$. Furthermore, as expected, the scaling of the $I_d$ with $N_r$ is exponential, see Figs. 8 (a2) and (b2), at least in the vicinity of the phase transition. Similar results are obtained for other system sizes and for the Ising model. Summing up, our results indicate that, as long as $N_r > 10^4$, the universal scaling behavior exhibited by the $I_d$ is not affected by the scale at which the $I_d$ is measured.

[Fig. 9: Contour plot of the average residuals projected on the direction of the optimal parameters. The white star points to the optimal parameters for our finite-size analysis.]

We choose a parametric functional hypothesis $f(x; \{a\})$.
In order to keep a low bias in the hypothesis function $f(x; \{a\})$, we consider polynomials $Q_k(x; a_0, a_1, \ldots, a_k)$ of various degrees $k$. Thus, we obtain a set of optimal $\{T_c\}$, $\{\nu\}$, $\{\alpha\}$ for each choice of degree $k$ and each set of system sizes $\{L_k\}$. Our estimates and error bars for the critical temperature and critical exponents are then given by the average and standard deviation of these sets, respectively. The analysis for the BKT transition (XY model) is performed in a similar fashion; the only difference is the choice of scaling variable. For the Ising, 3-state Potts and XY models we select polynomials of degrees $k \in \{5, 6, 7, 8\}$ and different sets of system sizes among those with $L \geq 70$. Our estimates for the XY model are $T_c = 0.92(1)$, $a = 1.4(1)$ and $\zeta = 0.40(1)$.
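The collapse procedure can be sketched on fabricated data: we generate curves that obey $y = g\bigl((T - T_c)\,L^{1/\nu}\bigr)$ with known parameters, then recover them by minimizing the residual of a common degree-$k$ polynomial fit. All names, the scaling function, and the parameter values here are illustrative, not the paper's actual data or pipeline:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic curves obeying y = g((T - Tc) * L**(1/nu)) with known
# Tc = 1.0, nu = 1.0 (illustrative values), plus a little noise.
Tc_true, nu_true = 1.0, 1.0
g = lambda x: x - 0.05 * x**3            # stand-in scaling function
data = []
for L in (10, 20, 40):
    T = np.linspace(0.9, 1.1, 41)
    y = g((T - Tc_true) * L ** (1.0 / nu_true)) \
        + 0.01 * rng.normal(size=T.size)
    data.append((L, T, y))

def collapse_residual(Tc, nu, k=5):
    """Mean squared residual of one degree-k polynomial fitted to all
    rescaled curves: small only when the curves actually collapse."""
    x = np.concatenate([(T - Tc) * L ** (1.0 / nu) for L, T, _ in data])
    y = np.concatenate([y for _, _, y in data])
    return np.mean((np.polyval(np.polyfit(x, y, k), x) - y) ** 2)

# Brute-force scan of the (Tc, nu) plane; the optimum plays the role of
# the white star in Fig. 9.  As in the text, one would repeat this for
# several degrees k and system-size subsets and average the optima.
Tcs, nus = np.linspace(0.95, 1.05, 51), np.linspace(0.7, 1.3, 61)
res = np.array([[collapse_residual(Tc, nu) for nu in nus] for Tc in Tcs])
i, j = np.unravel_index(res.argmin(), res.shape)
print(Tcs[i], nus[j])   # close to the true values 1.0 and 1.0
```

Fitting a single polynomial to all rescaled curves at once is what encodes the collapse condition: wrong $(T_c, \nu)$ spread the curves apart and inflate the residual.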
We visualize the resulting residuals in Fig. 9. Since the parameter space is three-dimensional, for convenience we plot the projections along the directions of the optimal critical parameters. Similar considerations hold for the Potts and XY models, although not shown here.