Replacing neural networks by optimal analytical predictors for the detection of phase transitions

Identifying phase transitions and classifying phases of matter is central to understanding the properties and behavior of a broad range of material systems. In recent years, machine-learning (ML) techniques have been successfully applied to perform such tasks in a data-driven manner. However, the success of this approach notwithstanding, we still lack a clear understanding of ML methods for detecting phase transitions, particularly of those that utilize neural networks (NNs). In this work, we derive analytical expressions for the optimal output of three widely used NN-based methods for detecting phase transitions. These optimal predictions correspond to the results obtained in the limit of high model capacity. Therefore, in practice they can, for example, be recovered using sufficiently large, well-trained NNs. The inner workings of the considered methods are revealed through the explicit dependence of the optimal output on the input data. By evaluating the analytical expressions, we can identify phase transitions directly from experimentally accessible data without training NNs, which makes this procedure favorable in terms of computation time. Our theoretical results are supported by extensive numerical simulations covering, e.g., topological, quantum, and many-body localization phase transitions. We expect similar analyses to provide a deeper understanding of other classification tasks in condensed matter physics.


I. INTRODUCTION
In recent years, machine learning (ML) has been used extensively to approach complex physics problems [1][2][3]. Among these applications, the task of classifying phases of matter and the identification of phase transitions is particularly exciting [4][5][6][7], as it could enable the autonomous discovery of novel phases of matter. Classical ML methods have successfully revealed the phase diagrams of a plethora of systems based on data from experimental measurements [8][9][10][11][12] and numerical simulations [4,5,. Many of the most powerful ML methods for detecting phase transitions utilize neural networks (NNs) at their core [4, 5, 14-26, 28, 29, 31-34, 36-39].
All three methods follow a similar workflow, which is illustrated in Fig. 1 (steps 1-3). They take as input samples that represent the state of a physical system at various values of a tuning parameter. The samples are processed by an NN whose parameters are tuned to minimize a specific loss function. By analyzing the NN predictions, one can compute a scalar quantity that highlights the critical value of the tuning parameter at which the system's state changes most. As such, this quantity highlights phase boundaries and serves as an indicator for phase transitions. The decision whether the change corresponds to a crossover or a phase transition does, however, requires further analysis, such as finite-size scaling. The three methods differ in their choice of loss function, i.e., in the formulation of the underlying classification or regression task, and thus in the resulting indicator for phase transitions.
NNs are universal function approximators [42][43][44][45]. This fact makes supervised learning, the learning-byconfusion scheme, and the prediction-based method extremely powerful and has played a central role in the original conception of these methods [4,5,29]. Namely, the use of NNs for detecting phase transitions from data has been inspired by the success of deep NNs (DNNs) in image recognition tasks [46]. The more expressive a ML model [40,47,48], such as an NN, the more resources are needed to train it, and the more difficult it is to interpret the underlying functional dependence of its predictions on the input [49,50]. Therefore, NNs typically act as black boxes that can correctly highlight phase transitions but whose internal workings remain opaque to the user. Since the proposal of supervised learning, the learning-by-confusion scheme, and the prediction-based method, there have been numerous attempts to understand their working principle, particularly through the extraction of order parameters. As an example, (kernel) support vector machines, which are easier to analyze than NNs due to their inherent linear nature, were used as predictive models [51][52][53][54]. Other approaches to improve interpretability rely on systematic input engineering, such that the objective function that the NN learns is approximately linearly [55], or on a systematic reduction of the NN expressivity [15]. Another set of works [56][57][58][59] analyzed trained NNs using standard interpretability tools from ML, which rely on truncated Taylor expansions. Despite these efforts, we still understand little about the working principle of ML methods for the detection of phase transitions based on NNs, when they fail or succeed, and how they differ [2] -in particular when DNNs are used (i.e., in the limit of FIG. 1. Schematic representation of the setup and workflow of supervised learning, the learning-by-confusion scheme, and the prediction-based method for detecting phase transitions from data. The physical system under consideration is characterized by a tuning parameter p. The goal is to identify the critical value of the tuning parameter pc at which the system transitions from one phase to another. In a first step (step 1), the state x of the physical system is (repeatedly) sampled at various values of the tuning parameter {p1, p2, . . . , pK }, where {P1(x), P2(x), . . . , PK (x)} are the corresponding probability distributions. Based on these samples, a neural network (NN) is trained to perform a particular classification or regression task, i.e., its tunable parameters are updated to minimize a particular loss function (step 2). The three ML methods for detecting phase transitions differ in their formulation of the underlying NN tasks. Having trained the NN, its predictionsŷ are used to compute the value of an indicator of phase transitions I at fixed values of the tuning parameter (step 3). Ideally, the indicator has a local maximum at pc where the largest change in the state of the system occurs. As a result, the ML methods then autonomously highlight phase boundaries along the chosen scanning range of the tuning parameter. Note that the indicators of phase transitions obtained with supervised learning, the learning-by-confusion scheme, and the prediction-based method differ. The contribution of our work is highlighted in blue: We derive analytical expressions for the optimal predictionsŷ opt of the NNs used in these three methods. The optimal predictions minimize the corresponding loss function and are thus achieved by NNs whose capacity, i.e., ability to fit a wide variety of functions [40,41], is sufficiently high. The optimal predictions can solely be expressed in terms of the probability distributions underlying the physical system. Using the optimal predictionsŷ opt in place of the NN predictionsŷ, we further obtain analytical expressions for the optimal indicators of phase transitions I opt (step 2 * ). Evaluating these analytical expressions provides an alternative path for computing indicators of phase transitions without ever training NNs, see Tab. I where we compare the computation times of the two approaches.
high model expressivity). These open questions reflect the general scarcity of rigorous theory in ML [35].
Here, we address these gaps in knowledge by pursuing a novel approach based on deriving analytical expressions for the optimal predictions of the NNs underlying supervised learning, learning by confusion, and the prediction-based method. The predictions are optimal in the sense that they minimize the target loss function, i.e., the corresponding model performs the desired task (as specified by the loss function) optimally. Based on the optimal predictions, we find analytical expressions for the optimal indicators of phase transitions of these three methods.
The optimal indicators correspond to the output of the methods when using ideal highcapacity [40] predictive models, such as well-trained, highly expressive NNs. The inner workings of these methods are revealed through the dependence of the optimal indicators on the input data. Moreover, the analytical expressions make it possible to compute the optimal indicator directly from the input data without training NNs, see step 2 * ) in Fig. 1, manifesting an alternative numerical routine to infer phase transitions. We demonstrate the procedure in a numerical study on a variety of models exhibiting, e.g., symmetry-breaking, topological, quantum, and many-body localization phase transitions.
This work is structured as follows: In Sec. II, we introduce the task of detecting phase transitions from data in an automated fashion, including supervised learning, the learning-by-confusion scheme, and the prediction-based method. Section III discusses the analytical expressions of their optimal indicators of phase transitions, i.e., their output when using well-trained, highly expressive NNs. A numerical study of the optimal predictions and indicators for the Ising model, Ising gauge theory, XY model, XXZ model, Kitaev model, and Bose-Hubbard model is presented in Sec. IV. Finally, the results are discussed in Sec. V and conclusions are drawn in Sec. VI.

II. AUTOMATED DETECTION OF PHASE TRANSITIONS FROM DATA
In this section, we will formally introduce the task of automatically detecting phase transitions from data and how supervised learning (SL), learning by confusion (LBC), and the prediction-based method (PBM) approach this problem. We consider the following scenario: The physical system to be analyzed is characterized by a tuning parameter p sampled equidistantly with a grid spacing ∆p. In the following we denote the points at the boundary of the sampled region as p 1 and p K with K ∈ N sampled points in total (K = p K −p1 ∆p + 1). At each sampled point p k (1 ≤ k ≤ K) we draw M ∈ N samples from the system's state {S jk } M j=1 which constitute our available data. We allow for this data to be pre-processed via a mapping to a representation space R : S → x. At the core of each of the three methods for detecting phase transitions under consideration lies a predictive model m : x →ŷ, such as an NN, which takes the pre-processed data X = {x jk |1 ≤ j ≤ M, 1 ≤ k ≤ K} as input. We denote the available data at sampled point p k as X k = {x jk } M j=1 . Note that X may contain duplicates. LetX be the set of unique inputs obtained from X by removing all duplicates. We assume that the system is present either in a single phase A or two distinct phases A and B across the sampled range of the tuning parameter {p k } K k=1 . If a system exhibits multiple distinct phases, the parameter range can (in principle) be analyzed in a piece-wise fashion (for more details on this case, see Appendix A 1 and A 2). The task is then to compute a scalar indicator I(p), which peaks at the phase boundary if two distinct phases are present, i.e., has a local maximum, and does not exhibit a peak otherwise. More specifically, if the system is in phase A from p 1 to p c and phase B from p c to p K with critical point p c (not necessarily a sampled point), the indicator I(p) should exhibit a local maximum at the sampled point closest to the critical point arg min p k |p c − p k |.

A. Supervised learning
In SL, a predictive model m is trained on the data available in regions near the two boundaries of the chosen parameter range denoted by I and II. Region I and II are comprised of the set of sampled points {p k |1 ≤ k ≤ r I } and {p k |l II ≤ k ≤ K}, respectively. Here, r I , l II ∈ N denote the rightmost and leftmost parameter point in region I and II, respectively. In SL, we assume that there exist two distinct phases A and B, with the regions I and II being located deep within these phases. Without loss of generality we assign the label y = 1 and y = 0 to data obtained in region I and II, respectively. The predictive model is trained to minimize a cross-entropy (CE) loss where the sum runs over all M T data points in the training set T ⊆ X , T = x jk |1 ≤ j ≤ M, k ∈ {1, . . . , r I } ∪ {l II , . . . , K} . Let us denote the set containing all unique inputs present in T without repetition asT . The output of the predictive modelŷ(x) ∈ [0, 1] corresponds to the probability of input x having the label y = 1, whereas 1 −ŷ(x) is the probability that the input x carries the label y = 0.
After training the predictive model to minimize the loss function in Eq. (1), it is evaluated on all available data X . Averaging over the predictionsŷ(x) for all data X k at a given point p k (1 ≤ k ≤ K) yields a prediction as a function of the tuning parameter The indicator for phase transitions in SL, I SL , is then given by the negative derivative of the prediction with respect to the tuning parameter The estimated critical value of the tuning parameter in SL corresponds to the location of the global maximum in its indicator [Eq. (3)], which can easily be determined in an automated fashion without human supervision. If one chose to label data obtained in region I with y = 0 and region II with y = 1 instead, the same indicator signal can be recovered via a sign change I SL (p k ) → −I SL (p k ). Note that it is also common to identify the estimated critical value of the tuning parameter in SL as arg min p k |ŷ(p k ) − 0.5|, see Appendix D 1 for a comparison motivating our choice.
Intuitively, if there is a transition from one phase to another (phase A to phase B) when varying the tuning parameter p, the mean predictionsŷ SL (p) should drop fromŷ SL (p 1 ) = 1 (deep within phase A) toŷ SL (p N ) = 0 (deep within phase B) as p is increased. If the transition is sharp, the predictions should also change abruptly. Such a change results in a peak in the negative derivative of the predictions, i.e., in the indicator for phase transitions. In that case, the predictive model acts as an order parameter that approaches 1/0 deep within phase A/B. In general, one expects the predictions -and thus the indicator -to vary most strongly at the critical point p c . If there is only a single phase, one expects the predictions to be approximately constant, resulting in a flat indicator I SL (p). Our derivation of the optimal indicator I opt SL will provide a rigorous basis for these heuristic arguments underlying the SL method.

B. Learning by confusion
In LBC, predictive models are trained on all available data X . The labels are obtained by performing a split of the sampled parameter range into two neighboring regions labeled I and II. Each input x drawn in region I or II carries the label y = 1 or y = 0, respectively.
The values of the tuning parameters which realize each of the K + 1 possible bipartitions are given as p bp For a given bipartition point p bp k , region I and II are then comprised of the sampled points Note that for bipartitions 1 (p bp 1 = p 1 − ∆p/2) and K + 1 (p bp K+1 = p K + ∆p/2), region I or II encompasses the entire sampled parameter range and all data is assigned the label 1 or 0, respectively.
To each bipartition, i.e., choice of data labeling, we associate a distinct predictive model m k (1 ≤ k ≤ K + 1) which is trained to minimize a CE loss where the sum runs over all M X = KM data points. Again, the output of the predictive modelŷ(x) ∈ [0, 1] corresponds to the probability of input x having the label y = 1, whereas 1 −ŷ(x) is the probability of the input x carrying the label y = 0.
Once a predictive model has been trained to minimize the loss function in Eq. (4) for a given bipartition, it is evaluated on all available data points. In particular, we can compute the mean classification accuracy as a function of the bipartition parameter p bp k (1 ≤ k ≤ K +1) as where θ denotes the heaviside step function. The predictionsŷ(x) are obtained from the predictive model m k associated with the bipartition point p bp k , and y(x) are the corresponding labels.
Clearly, the mean classification accuracy I LBC will exhibit trivial local maxima at the points p bp 1 = p 1 − ∆p/2 and p bp K+1 = p 1 + ∆p/2, where the entire data is assigned the label 0 or 1, respectively. Therefore, a predictive model effortlessly reaches a perfect accuracy of 1, because it simply needs to predict a single label regardless of the input. However, given that the underlying data can be separated into two distinct classes of similar character (i.e., phases) through appropriate bipartitioning of the parameter range at p c , one also expects the classification accuracy to have a local maximum at p c . At such a point, the predictive model is "least confused" by the choice of data labeling. Hence, the mean classification accuracy serves as the indicator for phase transitions within LBC. The estimated critical value of the tuning parameter in LBC corresponds to the location of the largest local maximum (excluding the points p bp 1 and p bp K+1 at the boundary) in its indicator [Eq. (5)].

C. Prediction-based method
In PBM, a predictive model m is trained on all available data X to infer the value of the tuning parameter p k (1 ≤ k ≤ K) at which an input x was generated. While SL and LBC constitute supervised classification tasks, PBM corresponds to a supervised regression task, where the label is given by the tuning parameter itself We train the predictive model m to minimize a meansquare-error (MSE) loss function After training, the predictive model is evaluated on all available data points X . Averaging over the predictionŝ y(x) for all data X k at a given point p k yields a mean prediction as a function of the tuning parameter We then compute the deviation of the prediction from the true underlying value of the tuning parameter δy PBM (p k ) =ŷ PBM (p k ) − p k . The indicator for phase transitions of PBM, I PBM , is then given by the derivative of this deviation with respect to the tuning parameter The estimated critical value of the tuning parameter in PBM corresponds to the location of the global maximum in its indicator [Eq. (8)].
Intuitively, if there is only a single phase, in which inputs cannot be distinguished well by the predictive model, one expects the mean predictions to be approximately constant. This results in the deviations δy PBM varying approximately linear with the tuning parameter. Hence, the indicator I PBM will be approximately constant. However, if there is a transition from one phase to another as the tuning parameter is varied, the predictions and the corresponding deviations also vary sharply. This results in a peak in the derivative of the deviations, i.e., the indicator for phase transitions I PBM .
In particular, one expects that the predictions are most susceptible at the phase boundary. Thus, its derivative should vary most strongly at the critical point p c .
In many standard applications of NNs, it is typical to split the available data into multiple sets, in particular, to avoid overfitting [40]. For example, suppose we aim to construct an accurate on-the-fly classifier of individual samples into distinct phases of matter. In this case, it may be beneficial to split the available data into a training and validation set to avoid overfitting if only a limited amount of data is available. In the case of PBM and LBC, we did not explicitly split the data set X into a training set and test set (as well as a potential validation set). This can be done, e.g., to assess sampling convergence by comparing the predictions obtained on the training set and test set or to perform early stopping with NNs (see Appendix B 2 for concrete examples). Note, however, that the task we consider here is the detection of phase transitions given the data at hand. As such, the data set X does not necessarily need to be split. In particular, in the limit of a sufficient number of samples all splits of a data set coincide, assuming that all samples are drawn independently from the same probability distributions underlying the physical system (see Fig. 1). Therefore, the predictions and indicators obtained by training NNs using multiple distinct data sets will coincide with the values obtained using the entire data set for training and evaluation up to deviations arising from finite-sample statistics. That is, in the limit of a sufficient number of samples, the results obtained in the two scenarios coincide [40,60,61]. Moreover, given a fixed amount of data X , better statistics are achieved by utilizing the entire data for training and evaluation.

III. OPTIMAL INDICATORS OF PHASE TRANSITIONS
In this section, we discuss the optimal indicators of phase transitions I opt for each of the phase-classification methods presented in Sec. II. The optimal indicators can be directly calculated given the predictionsŷ opt (x) of an optimal model m opt which minimizes the corresponding loss function. The detailed proofs can be found in Appendix A 1. In the limit of sufficient data, i.e., given accurate estimates of the probability distributions underlying the physical system {P k (x)} K k=1 , such a model is also Bayes optimal [40,62]. Meaning, no other statistical model can outperform it on the classification or regression task at hand (on average). In this case, the optimal loss value it achieves coincides with the Bayes error [40,62], i.e., the intrinsic irreducible error inherent to the problem.
Supervised learning.-In SL (see Sec. II A), the optimal predictions are given aŝ where P I (x) = rI k=1 P k (x) (10) and are the (unnormalized) probabilities of drawing an input x in region I and II, respectively. Hence, the optimal prediction for a particular input corresponds to the probability of drawing that input in region I compared to region II. Here, P k (x) denotes the (normalized) probability to draw the input x at the sampled point p k . Given a data set X k , this probability is estimated is the number of times the input x is present in the data set X k . While having access to an analytical expression for the underlying probability distributions {P k (x)} K k=1 may ease computation and enable additional insights, it is not required to compute the optimal predictions (see Sec. IV for application to physical systems). An expression for the optimal value of the loss in SL, L opt SL , can be obtained by replacingŷ(x) withŷ opt Assuming that all inputs within the entire data set X are already present in the training set T , i.e.,T =X , the mean optimal prediction at a given point This corresponds to the probability of finding an input drawn at that point p k in region I compared to region II. We find this assumption to be (approximately) satisfied for all physical systems analyzed in this work and can estimate the errors arising from a violation, see Sec. IV and Appendix A 2. The optimal indicator of phase transitions in SL is then given as In general, there will be a transition point where the probability in Eq. (12) changes most and thus where its derivative, the optimal indicator in Eq. (13), peaks.
Learning by confusion.-For a given bipartition of the parameter range into regions I and II, the optimal predictions of LBC (see Sec. II B) are given aŝ which corresponds to the probability of drawing the input in region I compared to region II. This characteristic is inherent to the underlying classification task [compare Eqs. (9) and (14)]. The mean classification error associated with an input x is given by min{ŷ opt LBC (x), 1 − y opt LBC (x)}. This classification error arises from a "confusion" of the model: different labels can be assigned to the same input due to an overlap of the underlying probability distributions. The mean classification error over the entire parameter range given a particular choice of bipartition, i.e., labeling of the data, then corresponds to (15) This forms the optimal indicator for phase transitions in LBC. An expression for the optimal value of the loss in LBC, L opt LBC , can be obtained by replacingŷ(x) witĥ y opt LBC (x) in Eq. (4). The critical point p c is highlighted by a dip in the mean classification error, i.e., by a peak in the mean classification accuracy [Eq. (15)]. It corresponds to the bipartition point for which the probability distributions underlying the two regions have the least overlap (on average), resulting in the highest classification accuracy and the least confusion. While confusion can arise due to sub-optimal predictions of models with restricted capacity (see Appendix B 2 for a concrete example), we find that confusion can even persist in the limit of high model capacity if it is inherent to the underlying data. Based on the analytical expressions, we thus gained an intuitive and rigorous understanding of the concept of confusion underlying LBC [5].
Prediction-based method.-The optimal predictions within PBM (see Sec. II C) are given aŝ Here, the optimal prediction for a given input is obtained by a weighted sum over each point in the parameter range, where the weight of each point p k corresponds to the probability of obtaining the input at that point along the parameter range compared to all other points. Therefore, the prediction accuracy decreases if the same input can be drawn at multiple values of the tuning parameter, i.e., when the underlying probability distributions overlap. An expression for the optimal value of the loss in PBM, L opt PBM , can be obtained by replacingŷ(x) bŷ y opt PBM (x) in Eq. (6). The mean prediction of an optimal model m opt at a sampled point p k is given bŷ Thus, the optimal indicator for phase transitions is where δy opt PBM (p k ) =ŷ opt PBM (p k ) − p k . Recall that in PBM, phase transitions are detected by analyzing the dependence of the prediction error on the tuning parameter. The optimal indicator [Eq. (18)] highlights the value of the tuning parameter at which the mean predictions change most, i.e., where the overlap of the underlying probability distributions changes most. The optimal predictions and indicators of PBM have previously been derived in Ref. [34] but have neither been utilized in a numerical routine, nor been used to explain previous studies.
The optimal predictions of SL, LBC, and PBM can solely be expressed in terms of the probability distributions {P k (x)} N k=1 governing the input data. Crucially, this means that the optimal predictions -and thus the optimal indicators of phase transitions -do not depend on the particular nature of an input or how similar it is to other inputs. Such notions of similarity form the basis of a large set of other phase-classification methods, e.g., based on principal component analysis [13], diffusion maps [27], or anomaly detection [32]. The analytical form of the optimal predictions indicates that SL, LBC, and PBM ultimately gauge changes in the probability distributions governing the data akin to probability metrics [63]. Note that the same optimal predictions and indicators will be obtained for multiple choices of representations R given that the same probability distributions can still describe the data in the representation space. Consequently, knowledge of the symmetries of the system can be utilized to calculate indicators of phase transitions more efficiently. We will make use of this in Sec. IV.

A. Demonstration on prototypical probability distributions
In this section, we compute the optimal indicators of SL, LBC, and PBM for a set of simple probability distributions governing the input data. As we will see later, the probability distributions governing the data in physical systems can be regarded as generalizations of the special cases discussed in this section. Thus, they serve as a reasonable basis for understanding. We compare these results to the indicators obtained by numerical optimization of NNs. The details on the NN architecture and training, including the corresponding hyperparameters, can be found in Appendix B. This first demonstration shows how the analytical expressions can be used to calculate the optimal indicator directly from input data without NNs. Moreover, it confirms that the optimal predictive models can be recovered by training NNs with sufficient expressive power. Case 1.-Let us first consider the case where the probability distribution governing the data is identical across the parameter range, i.e., P k (x) = P(x) ∀1 ≤ k ≤ K. Clearly, in this case all three methods should indicate the presence of a single phase. The optimal prediction in SL isŷ opt SL (p) = corresponding to the relative size of region I compared to region II [see Fig. 2(b)]. Here, K I = r I and K II = K − l II correspond to the number of sampled parameter values in region I or II, respectively. Taking the derivative of Eq. (19) results in a flat indicator signal I opt SL = 0. In LBC, the optimal classification accuracy for a particular bipartition is given by I opt LBC = max{K I /K, K II /K}. This results in a characteristic V-shape [5], which has its minimum at the center of the parameter range under consideration, see Fig. 2(c). In PBM, the optimal mean prediction is also placed at the center of masŝ y opt PBM (p k ) = 1/K K k=1 p k = const., which results in a constant indicator I opt PBM = −1 [see Fig. 2(d)]. As such, all three methods yield optimal indicators that correctly signal the presence of a single phase, i.e., the absence of two distinct phases. For a concrete numerical demonstration we consider the case of binary inputs X = {0, 1} with equal probability P(0) = P(1) = 0.5. Figure 2(a)-(e) shows the results for all three methods using the analytical expressions as well as NNs. Note that the analytical predictions and indicators can be approximated well using NNs as predictive models.
Case 2.-Next, we consider the case where the input data naturally separates into two distinct sets. That is, the underlying probability distributions result in a bipartition of the parameter range into two regions A and B, where each input can only be drawn in one of the two regions. In these regions, we choose the probability distributions to be identical where 1 ≤ k, c ≤ K. This is a prototypical example for the case where the physical system transitions from phase A to B when crossing a critical value of the tuning parameter p c . Here, p c corresponds to a sampled value of the tuning parameter, which may, in general, not be the case.
Using SL, the optimal strategy corresponds tô This results in which diverges as ∆p → 0 and exhibits a peak at the two points which constitute the boundary between regions A and B [see Fig. 2 Here, we approximate the derivative in Eq. (13) by a symmetric difference quotient In LBC, one can reach a perfect (error-free) classification when matching the natural bipartition present in the data. Let us denote the region between the bipartition point underlying the data, p c , and the chosen bipartition point in the LBC scheme, p bp k , as III. The number of sampled parameter values within the smallest region between I, II, and III is K m k = min{K I , K II , K III }. Note that all input data drawn within one of these regions must be misclassified. Thus, the optimal strategy which yields the smallest classification error corresponds to misclassifying all input data drawn within the smallest region. The optimal classification accuracy is then given as This results in a characteristic W-shape of the indicator [5], see Fig. 2(h), where the middle-peak occurs at the bipartition point p bp i closest to p c .
In PBM, we havê where p A/B denotes the center of region A and B, respectively. This results in where we approximated the derivative in Eq. (18) by a symmetric difference quotient [see Fig. 2 The expression in Eq. (25) diverges as ∆p → 0 for k ∈ {c, c + 1} and results in a peak at the two points which constitute the boundary between regions A and B. As such, the optimal indicators of all three methods correctly indicate the presence of two distinct sets of data, i.e., two distinct phases. The results obtained using the analytical expressions can be approximated well using NNs as predictive models. This is illustrated in Figs. 2(f)-(j), where we consider the special case of binary inputs with P A (0) = 1, P B (0) = 0, and p c = 1.
Case 3.-Lastly, we consider the case where the probability distributions underlying the data do not overlap, i.e., the probability of drawing a given input at two distinct values of the tuning parameter vanishes. In particular, this situation can occur when dealing with large state spaces, which are prone to result in insufficient sampling statistics in practice. That is, even in scenarios where the ground-truth probability distributions underlying the data do overlap, the estimated probabilities P k (x) ≈ M k (x)/M based on the drawn data set X may not (see Appendix A 5 for a concrete physical example). Many image classification tasks encountered in traditional ML applications [64][65][66][67][68][69][70] a priori fall into this category. In particular, the probability distributions underlying the data are typically not known in these cases. Therefore, constructing optimal models, in particular Bayes optimal models, largely remains conceptual in nature [62,71].
Here, an optimal predictive model is capable of distinguishing between samples obtained at distinct values of the tuning parameter with perfect accuracy. This results in I opt LBC (p bp k ) = 1 ∀1 ≤ k ≤ K + 1 for LBC [see Fig. 2(m)]. In the case of PBM, we haveŷ opt PBM (p k ) = p k such that I opt PBM (p k ) = −1 ∀1 ≤ k ≤ K, see Fig. 2(n). In both cases, the indicator signals the absence of two distinct sets of data, i.e., phases. The optimal predictions of SL for x ∈X are underdetermined: only the predictions for inputs within the training data x ∈T are fixed after training and the assumption thatX =T is violated in this particular case [see Fig. 2(l)]. Note, however, that the predictions are, in principle, also unconstrained when using SL with NNs. For a simple numerical example, we consider the case where a single unique (scalar) input is drawn at each point along the parameter range with where 1 ≤ k ≤ K. The results are shown in Figs. 2(k)-(o). In practice, NNs will tend to predict similar outputs for similar inputs. The continuous nature of the NN results in SL highlighting the value of the tuning parameter p = 2 where a discontinuity in the input data is present.
We also observe this tendency for the NNs in LBC and PBM during training.

B. Computational cost
We can use the analytical expressions to assess the computational cost associated with the evaluation of the mean optimal predictions and optimal indicators of SL, LBC, and PBM for a given set of input data (see Appendix A 3 for proofs). In our estimation, we neglect the overhead arising from the computation of the probability distributions {P k (x)} K k=1 which is identical for all three methods. In the case of SL, the computation of its optimal predictions (as a function of the tuning parameter) and indicator scales as O(MX K). Here, we assume that the number of sampled values of the tuning parameter during training is small compared to the total number of sampled points K I + K II K. For PBM and LBC the computation scales as O(MX K 2 ) and O(MX K 3 ), respectively. By saving the optimal predictions for each input y opt (x) instead of recomputing it, the computational cost can be reduced and scales as O(MX K), O(MX K), and O(MX K 2 ), in the case of SL, PBM, and LBC, respectively. Note the appearance of MX which can result in an exponential scaling for quantum problems due to the exponential growth of the Hilbert space H (and thus the state space MX ).

IV. APPLICATION TO PHYSICAL SYSTEMS
In this section, we will compute the optimal predictions and indicators of phase transitions of SL, LBC, and PBM, directly from data using the analytical expressions introduced in Sec. III for the Ising model, Ising gauge theory, XY model, XXZ model, Kitaev model, and Bose-Hubbard model. For the classical systems, namely the Ising model, Ising gauge theory, and XY model, spin configurations are sampled from a thermal distribution at various temperatures T k using the Metropolis-Hastings algorithm [72]. Here, the temperature serves as a tuning parameter. The probability that a system in equilibrium at inverse temperature β k = 1/k B T k (where k B is the Boltzmann constant) is found in a state with spin configuration σ is given by a Boltzmann distribution where Z k = σ e −β k H(σ) is the partition function and H is the respective system Hamiltonian. In principle, one could use the raw spin configurations as input, i.e., estimate the underlying probability distributions as P k (σ) = M k (σ)/M . However, the probability of drawing a particular spin configuration only depends on its energy [see Eq. (28)]. One can show that the optimal predictions and indicators remain identical when the energy is used as input instead of the raw configurations, i.e., when the probability distributions governing the data are given by where g(E) is the degeneracy factor (see Appendix A 4 for a proof). Using the energy as input instead of the raw configurations reduces both the input dimension and the size of the associated state space. This, in turn, reduces the cost of computing the optimal predictions and indicators. Similarly, one could take advantage of the symmetries of the system by adopting a symmetryadapted representation.
In the quantum case, we will typically be looking at a state associated with a Hamiltonian H(p) that depends on the tuning parameter p. This state could, for example, be the ground state or a state which has undergone unitary time evolution starting from a fixed initial state. Having chosen a complete orthonormal basis {|j } d j=1 to study the system [d = dim(H)], the relevant quantum state at p k can be written as |Ψ k = d j=1 c jk |j . Thus, the probability distribution P k associated with a given value p k of the tuning parameter is P k (j) = |c jk | 2 with 1 ≤ j ≤ d and 1 ≤ k ≤ K. The value of P k (j) corresponds to the probability of measuring the system in state |j given that the value of the tuning parameter is p k . This corresponds to using the indices of the basis states |j (1 ≤ j ≤ d) as inputs, which are governed by the probability distributions {P k (j)} K k=1 . For simplicity, we choose MX = d. In the case of spin systems, we use the S z basis, whereas we choose the Fock basis for bosonic and fermionic systems. This choice of bases corresponds to experimentally accessible local measurements [73][74][75][76][77][78][79]. In this work, we obtain the ground states through exact diagonalization. Thus, we have direct access to the underlying probability distributions and do not rely on sampling. In Appendix A 5, we show that the optimal indicators can also be obtained from individual samples, i.e., measurement outcomes (similar to the classical case). As such, the procedure is in principle applicable to experimental scenarios.
In general, we can consider scenarios where a state |Ψ i is drawn with probability a k (i) at p k . Then, the relevant quantum state is given by a classical The probability distribution associated with such a state is P k (j) = i a k (i)|c ij | 2 . This case will be particularly relevant for the study of many-body localization phase transitions where disorder is naturally present (see Sec. IV F). Here, the tuning parameter p k itself characterizes a distribution a k . Further details on the data generation can be found in Appendix C.
Clearly, in the quantum case there is an ambiguity in the choice of input, or equivalently, the choice of mea- surement basis. Changing the measurement basis may change the probability distributions underlying the data, and thus the corresponding optimal predictors and indicators. In turn, the estimated critical value of the tuning parameter may change (which is difficult to assess a priori for a given system). In order to avoid an explicit choice of measurement basis, sampling over various classical projections can be performed. Classical representations of quantum states obtained via classical shadow tomography [35,80,81] are an example of this. Alternatively, measurements given by informationally complete (IC) positive operator-valued measures (POVMs) can be used [82,83]. However, projective measurements in a single basis have been the most common choice, reflecting experimental constraints or prior knowledge of the system [10,11,31,37,84,85].

A. Ising model
The two-dimensional square-lattice ferromagnetic Ising model is described by the following Hamiltonian where the sum runs over all nearest-neighboring sites (with periodic boundary conditions) and J is the interaction strength (J > 0). At each lattice site k, there is a discrete spin variable σ i ∈ {+1, −1}. This results in a state space of size 2 L×L for a square lattice of linear size L. The system is completely characterized by its spin configuration σ = (σ 1 , σ 2 , . . . , σ L×L ). Two example spin configurations of the Ising model at different temperatures are shown in Fig. 3(a). The Ising model exhibits a symmetry-breaking phase transition at a critical temperature of [86] T c = 2J The system undergoes a transition between a paramagnetic (disordered) phase at high temperature and a ferromagnetic (ordered) phase at low temperature. Spontaneous magnetization occurs below the critical temperature T c , where the interaction is sufficiently strong to cause neighboring spins to align spontaneously. This spontaneous symmetry breaking leads to a non-zero mean magnetization. Above T c , thermal fluctuations dominate over spin alignment resulting in a vanishing magnetization. Consequently, the phase transition can be characterized by the magnetization M (σ) = L 2 i=1 σ i which serves as an order parameter that is zero within the paramagnetic phase and approaches one in the ferromagnetic phase, see Fig. 3(h). The phase transition can also be revealed by the heat capacity which diverges at T c [see Fig. 3 The results for the Ising model are shown in Fig. 3. Interestingly, SL fails to predict the correct critical temperature even for large lattices [see Figs. 3(b),(e)]. In fact, we can further analyze the special case when the inputs are governed by Boltzmann distributions [Eq. (29)]: For training data obtained at T 1 = 0 (region I) and T K > 0 (region II), the mean optimal prediction of SL at an intermediate temperature T k iŝ which approaches P k (E gs ) in the thermodynamic limit as T K → ∞ (see Appendix A 4 for a proof). Here, E gs denotes the ground-state energy. Therefore, in this case, the optimal indicator in SL peaks at the temperature at which the probability of drawing the ground state changes most [see the blue-dashed line in Fig. 3(f)]. The location of the peak tends to zero as one approaches the thermodynamic limit, see Fig. 3(e).
The optimal indicator of PBM shows two distinct peaks. One coincides with the peak of the optimal indicator in SL, whereas the other coincides with the critical temperature of the Ising model [see Fig. 3 This observation suggests a deeper connection between SL and PBM. A similar indicator signal (with two distinct peaks) was observed in Ref. [29] with NNs after a sufficient number of training epochs. In principle, the finite-size scaling analysis allows one to identify the dominant peak as erroneous without prior knowledge of T c , because it shifts towards T = 0 as the lattice size is increased, whereas the small peak remains stable. In the same fashion, the output of SL can be identified to be erroneous. Note that the fluctuations present in the optimal indicator signal of PBM can be attributed to finite-sample statistics. A detailed study of the effect of finite-sample statistics on the optimal predictions and indicators can be found in Appendix A 5. Crucially, the analytical expression for the optimal indicator signal allows us to disentangle the stochasticity inherent to the NN training from other sources of noise, which was not rigorously possible in previous works.
In Ref. [4], SL with NNs was able to predict the critical temperature of the Ising model for various lattice sizes correctly. In this case, small NNs with restricted expressive power in combination with 2 regularization were used. Similarly, using PBM in Ref. [29] a single, distinct peak at T c was observed after a small number of training epochs with a second peak emerging after longer training. Training time, NN size, and explicit and indicators. We recover the same behavior using NNs as in Ref. [4,29] by restricting the model capacity, e.g., by choosing a small NN, stopping the training early, or using strong 2 regularization (see Appendix B 1 for details). As these restrictions are lifted, i.e., by choosing a larger NN, training for longer, or reducing the regularization strength, the NN-based predictions and indicators approach the corresponding optimal predictions and indicators displayed in Fig. 3. Thus, our analysis demonstrates that SL and PBM necessarily rely on models with restricted capacity and hyperparameter tuning to correctly predict the critical temperature of the Ising model.
Finally, the optimal indicator of LBC correctly highlights the critical temperature of the Ising model for various lattice sizes matching the results of Ref. [5], see Figs. 3(c),(e). Overall, the optimal indicators of all three methods show peaks at temperatures where the probability distribution underlying the data varies strongly. Recall the finding from Sec. III that all three methods gauge changes in the probability distributions underlying the data. We have confirmed that the results shown in Fig. 3 are stable against small perturbations of the chosen parameter range, including regions I and II in SL.

B. Ising gauge theory
Wegner's Ising gauge theory (IGT) [88] is described by the following Hamiltonian where P refers to plaquettes on the lattice, see Fig 4(a). The IGT is a prototypical example of a classical system that exhibits a topological phase of matter [89]. It is a spin model (σ i ∈ {+1, −1}) defined on a square lattice of linear size L (with periodic boundary conditions) where the spins are placed on the lattice bonds [see Fig. 4(a)]. The IGT ground state is a degenerate manifold made up of all states which fulfill the condition that the product of spins on each plaquette is i∈P σ i = 1 corresponding to a topological phase. These topological constraints can be violated at finite temperature, where the system leaves its ground state. Note that there is no phase transition at finite temperature: the critical temperature approaches zero in the thermodynamic limit. In finite-sized systems, however, the violations of local constraints are suppressed. Therefore, the system exhibits a crossover from the topological phase at low temperature to a phase with violated topological constraints at high temperature. The crossover temperature T c is defined by the first appearance of a violated local constraint and scales as T c ∝ 1/ ln 2L 2 [87]. Figure 4(a), which shows typical spin configurations of the IGT, highlights that the phases of the IGT are hard to distinguish visually without prior knowledge of the local constraints or a dual representation [4,31]. Note that the heat capacity fails to identify the crossover, see Fig. 4(f). The topological character of the ground-state phase can be revealed through Wilson loops. These are formed by connecting edges with spins of the same orientation, see Fig. 4(a). In the ground-state phase, all such loops are closed. The violation of a plaquette constraint breaks a loop.
Recall that SL, LBC, and PBM are a priori sensitive to both phase transitions and crossovers. The results for the crossover in the IGT are shown in Fig. 4. The optimal indicator of SL [ Fig. 4(b)] shows an appropriate scaling behavior. Moreover, the corresponding estimated critical temperature highlights the first appearance of violated local constraints, see Figs. 4(e),(f). This can be confirmed explicitly as SL can be shown to measure changes in the probability of drawing the ground state (cf. Sec. IV A). Observe that the underlying probability distribution undergoes a large change at the crossover temperature, see Fig. 4(e). SL and PBM were found to correctly highlight the crossover temperature of the IGT using NNs in Refs. [4] and [31], respectively. In fact, the optimal model underlying PBM for the IGT coincides with the physically motivated density-of-states-based model proposed in Ref. [31], see Appendix D 2 for details. We find that the optimal indicator of PBM correctly marks the crossover temperature of the IGT except at small lattice sizes. As for the Ising model, the optimal indicator of PBM exhibits two peaks in this case. The peak located at the crossover temperature dominates for large lattice sizes. Note that for the IGT is is not beneficial to reduce the model capacity when using PBM or SL, which leads to an erroneous peak closely matching the specific heat [see Fig. 4(f)], given that the corresponding optimal indicators correctly highlight the crossover temperature.
The optimal indicator of LBC correctly highlights the crossover temperature via its local maximum at small lattice sizes, but shows slight deviations from the appropriate scaling behavior for large lattices. In Ref. [31] difficulties were observed to identify the crossover temperature using LBC due to a distorted W-shape of its indicator. Choosing the same range for the tuning parameter, we can qualitatively reproduce their results using our analytical expression for the optimal indicator of LBC, see Appendix D 2. Using NNs, it is difficult to make concrete statements on whether a method succeeds or fails at identifying a given phase transition due to the inherent stochasticity arising during NN training and the choice of hyperparameters, such as the NN size. Our theoretical analysis allows for rigorous statements to be made about the optimal outcome when applying ML methods for detecting phase transitions to a given system (i.e., data set). In this particular example, the analytical ex- pressions allow us to determine that when training highly expressive NNs for sufficiently long, the indicator signal of LBC is indeed ambiguous (as reported in Ref. [31]). Note that restricting the model capacity is not found to resolve this issue [31].

C. XY model
Next, we consider the two-dimensional classical XY model that exhibits a Berezinskii-Kosterlitz-Thouless (BKT) transition driven by the emergence of topological defects [91,92]. The model is described by the following Hamiltonian where ij denotes the sum over nearest neighbors (with periodic boundary conditions) of a square lattice of linear size L. The angle θ i ∈ [0, 2π) corresponds to the orientation of the spin at site i. The formation of topological defects (i.e., vortices and antivortices) results in a quasi-long-range-ordered phase. The transition between the quasi-long-range-ordered phase at low temperature and a disordered phase at high temperature is a BKT transition, and the associated critical temperature is k B T c /J ≈ 0.8935 [90]. Below T c , vortex-antivortex pairs form due to thermal fluctuations, but they remain bound to minimize their total free energy [see Fig. 5(a)]. At T c , the entropic contribution to the free energy equals the binding energy of a pair which triggers vortex unbinding. These unbinding events drive the BKT phase transition. Note that the heat capacity has a peak at T > T c which is associated with the entropy released when most vortex pairs unbind [93,94], see Fig. 5(g). Moreover, while the XY model has strictly zero magnetization for all T > 0 in the thermodynamic limit, a non-zero value is found for systems of finite size [95], see Fig. 5(h). Instead, the critical temperature can, for example, be estimated based on the helicity modulus [94,96] (see Appendix C).
The results for the XY model are shown in Fig. 5. Here, SL fails to predict the critical temperature correctly. This failure is linked to the fact that the optimal indicator of SL highlights changes in the probability to obtain the ground state (cf. Sec. IV A), which quickly vanishes with increasing temperature, see Fig. 5(f). In a similar spirit, in Ref. [23] it was found that "naive" SL (without engineering the features or NN architecture) fails to yield accurate estimates of the critical temperature. Here, we explicitly confirm that a classification based on detecting vortices does not correspond to the most optimal strategy. The peak in the optimal indicator of LBC matches the peak in the heat capacity at k B T /J ≈ 1, see Figs. 5(c) and Overall, the behavior of the optimal indicators of all three methods closely resembles our previous example regarding perfectly distinguishable input data (see case 3 in Sec. III A). This can be traced back to the small overlap of the underlying probability distributions, see phase-classification methods based on the similarity of input data [27] may provide more valuable insights. In particular, we find that the indicators peak close to the transition temperature, i.e., near the location of the peak in the heat capacity and drop in the magnetization, when restricting the model capacity, e.g., by stopping the NN training early (see Appendix B). Recall that this was also observed in the case of the Ising model (see Sec. IV A and Appendix B).

D. XXZ model
Having discussed classical models, we move on to the quantum case. First, we consider the spin-1/2 XXZ chain [97,98] with open boundary conditions whose Hamiltonian is given by where J is the coupling strength along the x-and y-direction and ∆ is the coupling strength in the z-direction. For ∆/J < 1, the XXZ chain is in the ferromagnetic phase, see Fig. 6(a). The ground state is spanned by the two product states where all spins point either in the z or −z direction which have a magnetization of M = 2 S z tot = ±L. The ferromagnetic phase exhibits a broken symmetry: these states do not exhibit the discrete symmetry of spin reflection S z i → −S z i under which the Hamiltonian is invariant. For ∆/J > 1, the XXZ chain is in the antiferromagnetic phase with broken symmetry and two degenerate ground states. These are product states with vanishing magnetization. For −1 < ∆/J < 1 the XXZ chain is in the paramagnetic XY phase characterized by uni-axial symmetry of the easy-plane type and vanishing magnetization.
Here, we restrict our analysis to the transition between the ferromagnetic and paramagnetic XY phases. The ground states are obtained through exact diagonalization. Figure 6 shows the results when the ground state with S z tot = +L/2 is selected in the ferromagnetic phase and S z is chosen as a measurement basis. The quantum phase transition can be revealed by looking at the magnetization, see Fig. 6(f). The optimal indicators of all three methods correctly highlight the phase transition. Looking at the underlying probability distributions [see Fig. 6(e)], the problem closely resembles the prototypical case of a bipartitioned data set (see case 2 in Sec. III A). Thus, the optimal predictions and indicators also qualitatively match the results obtained in this case. In particular, the optimal predictions of SL can be described by Eq. (33), where the ferromagnetic ground state takes the role of the ground state energy (see Appendix A 4 for proof). We verified that the optimal indicators also mark the phase transition when other states from the ground state manifold are selected in the ferromagnetic phase and when measurements are performed in the S x or S y basis.

E. Kitaev model
The Kitaev chain is a one-dimensional model based on L spinless fermions, which undergoes a quantum phase transition between a topologically trivial and non-trivial phase [99,100]. The Kitaev Hamiltonian is given by where we consider open boundary conditions, µ is the chemical potential, t is the hopping amplitude, and ∆ is the induced superconducting gap. In the following, we set ∆ = −t. The ground state of this model features a quantum phase transition from a topologically trivial (|µ/t| > 2) to a non-trivial state (|µ/t| < 2), see Fig. 7(a). In the topological phase, Majorana zero modes [101] are present. Here we restrict ourselves to µ/t ≤ 0. We compute the ground states through exact diagonalization. For results based on individual measurement outcomes (of projective measurements in the Fock basis), see Appendix A 5.
The topologically trivial and non-trivial phase can be distinguished through entanglement spectra and the cor-responding entanglement entropy [102]. Consider the reduced density matrix ρ A of a system in the pure state |Ψ obtained by subdividing the Hilbert space H into two parts, A and B, and tracing out the degrees of freedom of B ρ A = Tr B |Ψ Ψ|, (38) with {λ i } the spectrum of ρ A and {− ln(λ i )} the entanglement spectrum. Here, we consider the bipartition of the chain into left and right halves with L A = L B = L/2. The entanglement entropy can then be computed as The three largest eigenvalues of ρ A are shown in Fig. 7(g) and the resulting entanglement entropy is shown in Fig. 7(h). Both the spectrum and entanglement entropy exhibit the largest change close to the critical value µ c /t = −2. The entanglement entropy approaches zero deep within the topologically trivial phase, signalling that the two halves of the ground state of the chain are not entangled. In the topological phase, the entanglement entropy approaches a value of ln(2) characteristic of an entangled ground state. Figure 7 shows the results of SL, LBC, and PBM. The location of the local maxima of the optimal indicators based on all three methods converges to the critical value of µ c /t = −2 with increasing chain length. Considering the probability distributions governing the input data [see Fig. 7(f)], we observe that almost all basis states become occupied with non-negligible probability as the tuning parameter µ/t is tuned across its critical value. Note that in Ref. [5], the phase transition in the Kitaev model was successfully revealed using LBC with NNs where the entanglement spectrum of the ground state served as an input. The scaling behavior of the estimated critical value of the tuning parameter based on the optimal indicators of SL, LBC, and PBM is comparable to standard physical indicators, such as the eigenvalues of the reduced density matrix or the entanglement entropy [see Fig. 7(e)]. In the limit µ/t → −∞, the ground state of the Kitaev chain corresponds to the Fock state with each site being occupied. Thus, in the limit µ 1 /t → −∞, the optimal predictions of SL follow Eq. (33), where the aforementioned Fock state takes the role of the ground state energy.

76]. The system is described by the Hamiltonian
where J is the hopping strength and U is the on-site interaction strength [see top panel in Fig. 8(a)]. Here, we fix U/J = 2.9. The last term in Eq. (40) corresponds to a quasiperiodic potential h i = cos(2πβi + φ) mimicking on-site disorder with amplitude W , where we fix 1/β = 1.618. This system transitions to the MBL phase, where thermalization breaks down as the disorder strength is increased beyond a critical value W c /J, see bottom panel in Fig. 8(a). We analyze the system in the long-time limit tJ = 100 after unitary time-evolution starting from a Mott-insulating state with one particle per site by solving the Schrödinger equation numerically. We average over different disorder realizations obtained by sampling the phase φ ∈ [0, 2π) of the potential uniformly.
A popular way to differentiate between the thermalizing and MBL regimes relies on the study of spectral statistics using tools from random matrix theory [103][104][105]. In the thermal regime, the statistical distribution of level spacings is given by a Gaussian orthogonal ensemble (GOE), while a Poisson distribution is expected for localized states. The ratio of consecutive level spacings is with δ i = E i −E i−1 at a given eigenenergy E i . Averaging over the spectrum and multiple disorder realizations yields r , which varies from r GOE = 0.5307 within the thermalizing phase to r Poisson = 2 ln(2) − 1 ≈ 0.3863 within the MBL phase, see Fig. 8(g).
The results are shown in Fig. 8. All three methods correctly identify the MBL phase boundary, where we take W c /J ≈ 4 − 7 from Refs. [10,76] as a reference. This is in agreement with the spectral analysis: the crossover between the average ratio of consecutive level spacings for systems of size L = 6 and L = 8 is located at W c /J ≈ 4, see Fig. 8(g). Moreover, the phase boundary marks the range of the tuning parameter in which the most significant change in the underlying probability distribution occurs [see Fig. 8(e)]. A line-cut along the index corresponding to the initial Mott-insulating state is shown in Fig. 8(f). It corresponds to the disorder-averaged probability of retrieving the initial state after unitary time evolution. The MBL phase boundary is marked by the sudden increase in P retr [75] which is correctly picked up by SL, LBC, and PBM. Our results are also in agreement with Ref. [10], which examined the MBL phase transition within the same model using SL, PBM, and LBC with NNs on numerical and experimental data. As such, this example highlights the possibility of calculating optimal indicators directly from experimental data. Note that in Ref. [10], the authors attempted to construct a simplified indicator for phase transitions when using LBC by subtracting the Vshaped indicator signal in the case of indistinguishable data (see case 1 in Sec. III A) as a baseline. However, we find that this procedure biases the peak of the optimal indicator signal of LBC towards the center of the parameter range under consideration and is thus not a viable procedure, see Appendix D 3.

V. DISCUSSION
In the previous section, we have demonstrated that the optimal indicators of SL, LBC, and PBM successfully detect phase transitions and crossovers in a variety of different classical and quantum systems based on numerical data. Recall that the optimal analytical predictors correspond to an optimal model that reaches the global minimum of the loss function. A priori, it is unclear if the optimal predictors can be recovered in practice when training NNs, because the employed NNs are of finite size and local optimization techniques are used. In Appendix B, we demonstrate that the optimal predictions and indicators of all six systems studied in Sec. IV can be recovered by training NNs. This reachability further underpins the practical relevance of our analysis for the case when using SL, LBC, and PBM with NNs.
In a traditional NN-based approach, one searches for the optimal model by iteratively updating the parameters of an NN in order to minimize a loss function [see step 2) in Fig. 1]. In contrast, our numerical routine based on the derived analytical expressions allows for the optimal model to be constructed directly from data [see step 2 * ) in Fig. 1]. As such, evaluating the analytical predictors also compares favorably to the NN-based approach in terms of computation time. For each of the three methods and across all six studied physical systems, we find that the time needed to train an NN of minimal size (one hidden layer with a single node) for a single epoch is of the same order of magnitude as the time needed to compute the optimal predictions, optimal indicator, and optimal loss (see Tab. I). Therefore, the computation time associated with constructing and evaluating an optimal model is at worst comparable with training and evaluating an NN-based model. In practice, however, the latter approach typically requires significantly more computation time because larger NNs need to be used, the training takes many epochs, and hyperparameters need to be adjusted (see Appendix B for a detailed discussion). In particular, as the system size increases and the associated state space grows, converging to the global minimum of the loss function can become increasingly difficult. The convergence of the optimal model, on the other hand, is guaranteed by construction.
We have observed that the optimal indicator of a given method may fail to correctly highlight a phase transition. A failure can, for example, occur if only a limited amount of data is available and finite-sample statistics dominate. In this case, while the ground-truth probability distributions underlying the data show a significant overlap resulting in a peak in the indicator signal, the inferred probability distributions do not (see Appendix A 5 for a concrete example). However, even if the data set is sufficiently large, i.e., the ground-truth probability distributions are well approximated, the optimal model can fail (see classical systems in Sec. IV for examples). Both instances of failure can often be resolved by employing non-optimal models. Such a model can be realized by an NN whose capacity, i.e., its ability to fit a wide variety of functions [40,41], is restricted. This can be achieved, e.g., by reducing the NN size, performing early stopping, or the explicit addition of 2 regularization (see Appendix B 1 and B 2). In these instances, other phaseclassification methods which are inherently based on the similarity of input data [13,27,32,34] are also expected to provide valuable insights. These methods stand in contrast to the optimal predictors of SL, LBC, and PBM, which are not explicitly based on learning order parameters, i.e., recognizing prevalent patterns or orderings. Instead, the optimal predictors gauge changes in the probability distributions governing the data. Contrary to popular opinion, the failure of optimal models, or equivalently high-capacity NNs, does not always correspond to overfitting in the traditional sense [40]: the gap between training and test loss vanishes in the limit of a sufficiently large data set (which is available for the examples discussed in Sec. IV). Therefore, sub-optimal models, such as NNs with insufficient capacity, are in fact underfitting the data. This signals a fundamental mismatch between the classification or regression task underlying a particular ML method, i.e., the corresponding loss function, and the goal of detecting phase transitions. In particular, it raises the intriguing question of whether one can adjust the learning task in SL, PBM, and LBC such that the corresponding optimal models also correctly highlight the phase transition in these problematic cases, e.g., through an appropriate modification of the underlying loss functions or by enforcing explicit constraints.

VI. CONCLUSION AND OUTLOOK
The ML methods for detecting phase transitions from data given by SL, LBC, and PBM can be viewed under a unifying light: all three approaches have predictive models, such as NNs, at their heart which are trained to solve a given classification or regression task. Analyzing their predictions allows us to compute a scalar indicator that highlights phase boundaries. The power and success of these methods is largely attributed to the universal function approximation capabilities of their underlying NNs, which are often sacrificed in practice to regain interpretability [15,[51][52][53][54][55]106]. Here, we took an alternative approach to cope with the interpretability-expressivity tradeoff: By analyzing the class of predictive models that solve the classification and regression tasks underlying SL, LBC, and PBM optimally, we have derived analytical expressions for the indicators of phase transitions of these three methods.
Our work establishes a solid theoretical foundation for SL, LBC, and PBM, based on which we were able to explain and understand the results of a variety of previous studies [4,5,10,23,29,31]. We anticipate that similar analyses will be useful to gain an understanding of other methods for identifying phase transitions with NNs [21,28,32,36,38,107] and other classification tasks in condensed matter physics [85,[108][109][110][111][112][113]. In these cases, the optimal models can also serve as benchmark solutions that enable future studies aimed at investigate the learning process of NNs and improving their design and update routines [11,85,[114][115][116][117]. For example, in Refs. [4,16,20] it was shown that an NN trained to predict the phase transition in a given model using SL can successfully classify configurations generated from an entirely different Hamiltonian. An exciting prospect is to explore whether the success of this "transfer learning" can be rigorously explained based on our results.
The analytical expressions not only enable our understanding of the phase-classification methods under consideration -they also allow for the direct computation of their optimal predictions and indicators based on the input data without explicitly training NNs. We have demonstrated that this novel procedure can successfully reveal a broad range of different phase transitions in a numerical setting and is favorable in terms of computation time. Our results suggest a variety of avenues for further explorations. As a next step, one can consider whether tools from ML, especially for density estimation [118][119][120][121], can aid in the computation of the optimal indicators. In the quantum case, classical representations of quantum states obtained via classical shadow tomography [35,80] may help to evade the arising exponential complexity. We believe that optimal predictors will be a valuable tool to detect, interpret, and characterize phases of matter and their transitions from experimental data, particularly in the advent of digital quantum computers [122][123][124][125][126] and programmable quantum simulators [10,11,79,[127][128][129].
The code for computing the optimal predictions and indicators of SL, LBC, and PBM utilized in this work is open source [130]. ACKNOWLEDGMENTS We would like to thank Niels Lörch, Andreas Trabesinger, and Christoph Bruder for helpful suggestions on the manuscript. We thank Niels Lörch, Eliska Greplova, Eugene Demler, Florian Marquardt, and Christoph Bruder for stimulating discussions. We acknowledge financial support from the NCCR QSIT funded by the Swiss National Science Foundation S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

Appendix A: Optimal predictions and indicators
In this appendix, we provide detailed derivations of the optimal predictions and indicators of SL, LBC, and PBM. In particular, we discuss the assumptions underlying the derivation of the optimal predictions of SL and how the analytical predictors are evaluated in practice. This includes an analysis of the computational cost associated with constructing and evaluating the optimal models and the role of finite-sample statistics.

Derivation of optimal predictions and indicators
Here, we derive the form of the optimal predictions and indicators of phase transitions for SL, LBC, and PBM presented in Sec. II of the main text.
Supervised learning.-In SL, a predictive model m is trained to minimize the CE loss function given in Eq. (1). Now, consider a particular input contained within the training setx ∈T . We can determine the optimal model predictionŷ opt SL (x) for this particular input by minimizing the loss function in Eq. (1) with respect toŷ(x), i.e., by solving the necessary condition Using the explicit expressions for the labels (y = 1 and y = 0 for all inputs drawn in region I and II, respectively) in Eq. (A1), we have Here, M I/II (x) denotes the number of times the inputx is found in region I or II, respectively. In SL, the predictive model must, by definition, satisfyŷ(x) ∈ [0, 1] ∀x. Thus, Eq. (A2) is satisfied given predictions of the form .
The opposite choice of labeling (y = 0 and y = 1 for all inputs drawn in region I and II, respectively) is equally valid and would result in That is, the role ofŷ opt SL (x) and 1 −ŷ opt SL (x) are swapped. In this work, we stick to the former choice [Eq. (A3)]. The optimality of the predictions in Eq. (A3) can be confirmed by calculating the second derivative of the loss function The probability distribution governing the input data is denoted as P k (x) ≈ M k (x)/M (1 ≤ k ≤ K). This allows for Eq. (A3) to be expressed aŝ where are the (unnormalized) probabilities of drawing the input x in region I and II, respectively. Repeating the above procedure for all inputs within the training setT , we obtainŷ opt SL (x) = which matches Eq. (9) reported in the main text.
Relaxations of the assumption in SL that there are only two distinct phases to be distinguished will be discussed in Appendix A 2.
Note that the same optimal predictions are obtained when training on a MSE loss function instead of a CE loss function. Again, consider a particular inputx contained within the training setT . We can determine the optimal model predictionŷ opt SL (x) for this input by minimizing the loss function in Eq. (A10) with respect toŷ(x), i.e., by solving Plugging the expression for the labels given by a one-hotencoding in Eq. (A11), we have This coincides with the condition for the predictions given in Eq. (A2) obtained from a CE loss function. Their optimality can be confirmed via Therefore, in SL, the optimal predictions and indicators associated with optimal models trained on a CE or MSE loss function are identical.
Learning by confusion.-To reveal the phase transition by means of LBC, we perform several splits of the parameter range into two neighboring regions labeled I and II. For a fixed bipartition, we minimize a CE [Eq. (4)] or MSE loss function Following the analysis of SL presented above, we obtain a similar expression for the optimal predictionŝ with T = X in LBC. Thus, we recover Eq. (14) of the main text. Their optimality can be confirmed via in the case of a CE or MSE loss, respectively. The value of the indicator in LBC for a given bipartition corresponds to the mean classification accuracy [Eq. (5)], where the continuous predictionsŷ(x) ∈ [0, 1] are mapped to binary labels via θ (ŷ(x) − 0.5). Using the optimal prediction in Eq. (A15), the mean classification error for a given input x is min{ŷ opt LBC (x), 1 −ŷ opt LBC (x)}. Weighting the contribution of each input x to the mean classification error by its probability P k (x), we arrive at Eq. (15) of the main text. Note that, in principle, the assumption in LBC that there are only two phases to be distinguished can be relaxed [5]. In this case, the optimal indicator may show multiple distinct peaks highlighting the different phase boundaries [131].
Prediction-based method.-In PBM, a predictive model m : x →ŷ(x) is trained to minimize the MSE loss function L PBM specified in Eq. (6). Consider a particular inputx ∈X . We can determine the optimal model predictionŷ opt PBM (x) for this input by minimizing the loss function in Eq. (6) with respect toŷ(x), i.e., by solving .

(A19)
This prediction is indeed optimal, as Repeating this procedure for all available inputs x ∈X yieldsŷ opt Thereby, we recover Eq. (16) of the main text. Note that this derivation can be generalized to higher dimensional parameter spaces (which may host multiple distinct phases) in a straightforward manner (see Ref. [34]), resulting inŷ .

(A22)
Here, the sum runs over all sampled points p k in parameter space. The optimal indicator is then given as a divergence where δy opt PBM (p k ) = x∈X P k (x)ŷ opt PBM (x) − p k .

Assumptions for supervised learning
Let us we review the assumption ofX =T underlying the derivation for the optimal predictions and corresponding indicator of SL. In general, ifX =T the optimal predictions of SL can be expressed (A24) The first contribution in Eq. (A24) comes from predictions for inputs contained in the training data, which are determined through minimization of the corresponding loss function [see Eq. (A1)]. The second contribution comes from predictions for inputs not contained in the training data, which are a priori only restricted to the unit intervalŷ SL (x) ∈ [0, 1]. Therefore, this contribution to Eq. (A24) is bounded by the probability of drawing an input at p k that is not present in the training data, which is given by x / ∈T P k (x). When using SL with NNs, the predictions for inputs not contained in the training data [second contribution in Eq. (A24)] will be most susceptible to noise inherent to NN training and hyperparameter choices. As such, its physical relevance is questionable. It may be possible to obtain better bounds for this second contribution when using SL with NNs, e.g., based on the theory of neural tangent kernels [132].
Let us explicitly discuss the classical systems analyzed in this work, which are governed by Boltzmann distribution [Eqs. (28) and (29)]. Because the probability of drawing a particular configuration sample (or energy) at any non-zero temperature is non-zero, the assumption ofX =T holds given a sufficient number of samples. When computing the optimal indicator of SL numerically, we work with a finite number of samples. Thus, it can happen an input is encountered which is not part of the training data x ∈T . In practice, we can verify on-the-fly whether this is the case. If so, we set y SL (x) = 0 in Eq. (A24). Thereby, we effectively ignore the contribution to the predictions of SL from inputs not present in the training data. Note that because these predictions correspond to inputs with low probability, they are also most susceptible to finite-sample statistics. This procedure is further justified by the fact that the optimal predictionsŷ opt SL obtained in this manner track the ground-state probability with high accuracy [see Figs. 3(b), 4(b), and 5(b)]. That is, the optimal predictions closely match the expression in Eq. (33) valid in the case where deviations due to finite-sample statistics vanish.
In the quantum case, it is typically not straightforward to determine a priori whether the assumption ofX =T is met for a given system and choice of basis. Here, when calculating the optimal predictions and indicators numerically, we use the same procedure as described for the classical case. In our study, we only find cases where x ∈T for the XXZ model. The error resulting from neglecting the second contribution in Eq. (A24) is marginal, as the probability of drawing such inputs across the parameter range is found to be small. Note that the optimal indicator of SL obtained in such a manner correctly reveals the quantum phase transition in the XXZ (see Fig 6). In fact, the optimal predictions calculated via this procedure correspond to the probability of measuring the ferromagnetic ground state (see Sec. IV D). For the above reasons, we expect that the optimal predictions of SL are capable of revealing phase transitions even ifX =T .
A relevant scenario in which the assumption that X =T is violated occurs when the system transitions between multiple phases as the tuning parameter is varied. Then, inputs drawn in the phases present in the middle of the sampled range of the tuning parameter may not be present in the two boundary phases. By dropping the second contribution in Eq. (A24), we may still faithfully detect the transition between the first and second phase. However, all subsequent phase boundaries will then likely be missed. In the future, it will be of interest to lift the assumption ofX =T underlying the optimal predictions through appropriate interpolation schemes [31,35,132], which would allow for the generalization capabilities of SL to be explored.

Computational cost
Here, we will derive the scaling of the computational cost with the number of unique inputs MX and the number of sampled tuning parameter values N reported in Sec. III B of the main text. Note that we do not consider the overhead associated with computing the probability distributions {P k (x)} K k=1 ∀x ∈X from the data at hand (or any other constant overhead). The computation of the optimal predictions and indicators can be approached in two ways: Either the optimal predictions for a given inputŷ opt (x) are recomputed in each function call, or they are cached. We report the required number of floating-point operations in both instances, which can be counted based on the analytical expressions reported in Sec. III. This counting represents a rough, hardware-independent estimate of the required computational cost. In the following, we will assume that the optimal indicators in SL and PBM are computed using a symmetric difference quotient, cf. Eq. (22). Caching these values, the number of operations required to compute the optimal indicator is MX K 2 (F min + 2), where F min denotes the number of floating-point operations required to compute min{ŷ opt LBC (x), 1 −ŷ opt LBC (x)}. This corresponds to O(MX K 2 ) operations. Without caching, the optimal indicator requires MX K 3 + MX K 2 (F min + 2) + K operations to compute, resulting in a scaling of O(MX K 3 ).
Prediction-based method.-In PBM, the computation ofŷ opt PBM for all x ∈X requires MX (3K −1) floating-point operations. Caching these values, the number of operations required to compute the mean optimal prediction y opt PBM for all {p k } K k=1 is 5MX K −K −MX . Computing the optimal indicator then requires MX (5K − 1) + K operations. The computation of the mean optimal predictions and the optimal indicator each require O(MX K) operations. If the valuesŷ opt PBM (x) ∀x ∈X are not cached, computing the mean optimal prediction instead requires 3MX K 2 + K(MX − 1) operations. Computing the optimal indicator then requires 3MX K 2 + KMX + K operations. For both quantities, this results in a scaling of O(MX K 2 ).
Numerical implementation.-The measured computation times associated with calculating the optimal indicators of phase transitions of SL, LBC, and PBM, for all six physical systems discussed in the main text (see Sec. IV) are reported in Tab. I. The corresponding code is open source [130]. Again, we do not consider the computational cost associated with generating samples and estimating the underlying probability distributions. Overall, the computation times are remarkably low. For all systems, the optimal indicator of SL and PBM can be obtained in under a second, and the optimal indicator of LBC in under a minute. We observe that the computation times of SL and PBM are comparable, with PBM being slightly slower than SL. In contrast, the computations times of LBC are two orders of magnitude larger. Note that these are the evaluation times corresponding to the largest system sizes under consideration. We find that the computation times qualitatively agree with the complexity analysis described above (for the case where caching is performed). An additional speed-up can be gained through parallel execution. In particular, it is straightforward to compute optimal predictions (in the case of SL and PBM) and optimal indicators (in the case of LBC) at discrete values of the tuning parameter in parallel, e.g., via multithreading (which is implemented in [130]).

Boltzmann-distributed inputs
Let us discuss the special case when the drawn inputs x, such as spin configurations, follow a Boltzmann distribution The probability to draw a sample with energy E is thus given by where g(E) is the corresponding degeneracy factor Here, S denotes the state space of the samples x, i.e., the set of all unique samples without duplicates. Therefore, we have P k (x) = P k (H(x)) /g (H(x)) .
where we assume thatT =X = S. Using Eq. (12), we havê where S E is the set of unique energies corresponding to the state space S. To obtain an expression for the optimal loss, we can rewrite Eq. (1) as Using Eq. (A29), we have where we use the fact that y(x) = y(H(x)), i.e., the assigned labels remain identical. Equation (A32) can be simplified to using Eq. (A28).
Learning by confusion.-For a fixed bipartition in LBC, we can proceed in a similar manner. Plugging Eq. (A28) into Eq. (14) assumingX = S, we havê Using Eq. (15), this yields To obtain an expression for the optimal loss, we follow the above procedure outlined for SL starting with Eq. (4) and eventually arrive at Prediction-based method.-Plugging Eq. (A28) into Eq. (16) assumingX = S, we find that Using Eq. (17), we havê To obtain an expression for the optimal loss, we rewrite Eq. (6) as Using Eq. (A37), we have (A40) where y(x) = y(H(x)). With Eq. (A28) we finally get (A41) Thus, we have shown that the optimal predictions, indicators, and loss values of SL, LBC, and PBM remain identical when configuration samples which follow a Boltzmann distribution are used as input, or when the corresponding energies are used as input instead. In practice, given a finite set of samples the inferred probability distribution P k (x) ≈ M k (x)/M is only approximately Boltzmann, i.e.,T ,X ≈ S, and the two scenarios are only equivalent up to deviations due to finite-sample statistics. In particular, the inferred probability distribution P k (x) = M k (x)/M based on raw configuration samples may not correspond to the inferred probability distribution P k (E) = M k (E)/M based on the corresponding energy, where the degeneracy factor for the conversion is inferred from the samples as However, using the energy as input instead of configuration samples yields a more accurate estimate of the ground-truth distribution. This is because the associated state space S E is significantly smaller compared to the entire configuration space S, resulting in better statistics given a fixed number of samples. In the 2D Ising model, for example, the size of the configuration space is 2 L 2 , whereas there are L 2 − 1 unique number of energies (for even L). Therefore, the optimal predictions and indicators obtained using the energy as input converge significantly faster compared to the case where raw spin configurations are used. Note that the energy is readily available in numerical studies. However, in principle, one can obtain the same results without having access to the energy given that a sufficient number of raw configurations are sampled. In the future, it will be of interest to employ more elaborate techniques for density estimation [118][119][120][121] in order to obtain a more accurate estimate of the underlying distribution given a reduced data set size.
Finally, let us continue the analysis of the optimal predictions and indicators of SL in case of Boltzmanndistributed inputs. We take region I to be composed of a single point T 1 . Let T 1 → 0 such that where E gs is the ground-state energy. Plugging into Eq. (A9) yieldŝ We calculate the mean prediction at a given temperature asŷ Using Eq. (A44), this results in Assuming region II is composed of a single point T K , we have P II (E gs ) = P K (E gs ) and recover Eq. (33) of the main text. For T K → ∞, we have P K (E gs ) = g(E gs )/M S , where M S is the total number of unique system configurations. For the two-dimensional Ising model, for example, M S = 2 L×L . Approaching the thermodynamic limit, this yieldsŷ opt SL (T k ) → P k (E gs ).
Note that these results can be extended to non-Boltzmann distributions: Given that and following the same procedure as above, we havê In particular, Eq. (A48) can be used to explain the optimal indicator signals of SL in the XXZ chain (Sec. IV D) and Kitaev chain (Sec. IV E). In this case, x * corresponds to a ground state which is one of the chosen basis states.

Finite-sample statistics
Finally, we investigate how the optimal predictions and indicators of SL, LBC, and PBM change as the number of data points M per sampled value of the tuning parameter is varied. Recall that the results for the classical systems displayed in the main text were obtained using the energy from Monte Carlo sampling as input, where M = 10 5 spin configurations are drawn per temperature. For small lattice sizes, however, it is possible to enumerate all spin configurations explicitly. In Fig. 9, we compare the optimal predictions and indicators for the Ising model on a 4 × 4 lattice when enumerating all 2 16 = 65536 spin configurations explicitly or using Monte Carlo sampling with 10 5 number of configurations per sampled value of the tuning parameter. The results obtained based on the two distinct data sets are in good agreement, which is to be expected given that there are only 15 unique energies. The noise present in the indicator signals of SL and PBM when using Monte Carlo samples is absent when using exact enumeration. In the latter case, both indicators vary smoothly as a function of temperature. As such, this noise can be attributed to finite-sample statistics.
In general, for both the classical and quantum systems we observe that the overlap in the underlying probability distributions leading to a peak in the indicator signals decreases as the number of samples M is decreased. However, meaningful results can already be obtained when only a fraction of the total state space is covered. In the case of the Ising model on a 60 × 60 lattice, for example, we observe that the optimal predictions and indicators are already well converged for M = 10 2 , i.e., matching the results obtained with M = 10 5 . In particular, the key features in the indicators, i.e., the peak locations, can already be identified for M = 10. Compare this to the unique number of energies given by 3599. Figure 10 shows the optimal predictions and indicators of SL, LBC, and PBM for the Kitaev chain of length L = 20 given various values of M . Recall that the results for the quantum systems displayed in the main text (see Sec. IV) were obtained based on the "ground-truth" probability distributions from exact diagonalization. Here, we explicitly sample these probability distributions, i.e., perform projective measurements and infer the probability distribution based on the measurement results. In SL and PBM, accurate estimates for the critical value of the tuning parameter can be obtained based on M = 10 3 samples, whereas M = 10 4 samples are required for a local maximum to emerge in LBC. This only covers a fraction of the total state space comprised of MX = 524288 states. Notice that the indicator of LBC shows a plateau close to one in the topological phase for a small number of samples, which signifies the absence of "confusion" inherent to the data (see Fig. 10). Similarly, the optimal prediction of PBM is approximately linear in the topological phase for a small number of samples, corresponding to a model which can perfectly resolve the value of the tuning parameter associated with the input. This demonstrates the fact that while the ground-truth probability distributions may have substantial overlap, estimated probabilities based on a drawn data set may not.
The high level of uncertainty in the indicator of SL and PBM compared to LBC can be attributed to the symmetric difference quotient used to approximate the derivative. Moreover, in LBC we associate a distinct optimal predictive model to each bipartition point, whereas the optimal indicator is extracted from a single optimal model in case of SL and PBM. This leads to an additional suppression of fluctuations in case of LBC. In the future, it will be of interest to enhance the quality of the optimal predictions and indicators based on finite data through improved derivative computations in case of SL and PBM [133], as well as more elaborate techniques for density estimation [118][119][120][121].

Appendix B: Computation using neural networks
In this appendix, we discuss the application of SL, LBC, and PBM to the six physical systems discussed in the main text (see Sec. IV) using NNs. First, we show that one can recover the optimal analytical predictions and indicators by training NNs. Next, we discuss the computational cost associated with training NNs compared to constructing and evaluating optimal models. Finally, we investigate the influence of NN size, early stopping, regularization, and finite-sample statistics on the results. Before training the NNs, each input x = {x i } is standardized via the following affine transformation where x i and σ xi are the mean value and standard deviation of x i across the training data, respectively. Standardization generally leads to a faster rate of convergence when applying gradient-based optimizers [134]. Note that this bijective mapping does not change the probability associated with each input, i.e., P k (x) = P k (x ) ∀1 ≤ k ≤ K. Therefore, the optimal predictions and indicators remain unchanged.
Neural network architecture.-For simplicity, the NNs used in this work consist of a series of fully-connected layers, where rectified linear units (ReLUs), f (z) = max (0, z), are used as activation functions [40]. The NNs for SL and LBC have two output nodes, where a softmax activation function is used in the output layer to guarantee thatŷ(x ) ∈ [0, 1].
Here, the sum runs over all output nodes, andŷ corresponds to the value of one of the output nodes after application of the softmax activation function. In PBM, Training.-The NNs are implemented using Flux in Julia [135], where the weights and biases are optimized via gradient descent with Adam [136] to minimize the loss function over a series of training epochs. In SL and LBC, we train on a CE loss function [Eq. (1) and (4), respectively], whereas in PBM we train on a MSE loss function [Eq. (6)]. Gradients are calculated using backpropagation [40,137,138]. For the prototypical probability distributions discussed in Sec. III A, we train for 10000 epochs with a learning rate of 0.001. The number of training epochs and learning rate for all other models is reported in the corresponding figure captions.
Results. -Figures 11 and 12 show the predictions and indicators of the three methods obtained using NNs (dashed lines) after long training for all six physical systems considered in the main text. Here, we chose the smallest system sizes for convenience. Overall, they are in excellent agreement with the corresponding optimal predictions and indicators (bold lines). As the system size is increased, it becomes increasingly difficult to approximate the corresponding optimal predictions and indicator with high accuracy because the NN size has to be increased systematically, i.e., hyperparameters need to be adjusted more carefully. However, even for the largest system sizes considered in this work qualitative agreement can still be achieved with moderate NN sizes, see Appendix B 1 for an explicit example.
Computational cost.-Finally, let us touch upon the computational cost of training NNs. Table I Tab. I), calculating the loss function, obtaining the gradient via backpropagation, and performing a single gradient step. This represents a lower bound for the total computation time associated with obtaining NN-based predictions and indicators. In a typical application, however, larger NNs need to be used, the NNs need to be trained for multiple epochs, the NN parameters (or the corresponding predictions and indicator) need to be cached at regular intervals, hyperparameters need to be tuned, and finally the indicator needs to be computed based on the NN predictions. The computation time for a single epoch is also expected to increase if the data is processed in a batchwise fashion (albeit likely at the benefit of requiring less training epochs overall). We find that this lower bound on the training time is comparable with the evaluation time of the corresponding optimal predictions and indicators (and optimal loss) and the two times differ by less than an order of magnitude across all six physical systems studied in the main text. This empirical finding can be explained as follows: To construct the optimal model, the probability of all inputs needs to be evaluated. Similarly, in each training epoch the NN is evaluated at all inputs contained in the training data set. The computation time associated with evaluating a small NN for a given input is comparable with evaluating the corresponding optimal model prediction, and the overhead associated with the gradient computation via backpropagation is of the same order of magnitude as the NN forward pass [139]. Suppose one is interested in the predictions and indicators of SL, PBM, and LBC, in the limit of a perfectly trained, highly expressive NNs. Evidently, based on the discussion above, the evaluation of the analytical expressions is generally more efficient in that case. The precise timings will depend on the particular implementation, as well as the choice of hyperparameters. However, even in the case where small NNs are trained for short times the computation time associated with constructing and evaluating an optimal model is at worst comparable. Here, we have neglected any overhead associated with constructing probability distributions based on drawn samples. In principle, when using NN one does not rely on the estimated probability distributions, i.e., one can directly work with the unprocessed dataset. Note, however, that in many scenarios (including this work) the overhead of estimated probability distributions from the dataset is negligible. When studying quantum systems using exact diagonalization, one has direct access to the underlying probability distributions. Similarly, when performing Monte Carlo studies the energy statistics are readily available.

Controlling model capacity
Here, we investigate the effect of NN size, training time, and 2 regularization on the NN-based predictions and indicators and compare them with the corresponding optimal predictions and indicators. All three factors influence the capacity of the resulting model and thus determine its ability to approximate the optimal predictive model realizing the global minimum of the loss function corresponding to the optimal predictions and indicators [40,41]. As pointed out in the main text (see Sec. IV), there are instances where the optimal model does not correctly highlight the corresponding phase transition whereas simpler models do.
As an example, let us consider the application of PBM to the Ising model. Figure 13 shows the results for a 60 × 60 lattice obtained with NNs composed of a single hidden layer with a variable number of hidden nodes ranging from 2 to 2048. Figs. 13(b),(f) show the corresponding NN-based predictions and indicators after training for 10000 epochs. For NNs with 2 and 8 nodes, the indicator shows a clear peak at the critical value of the tuning parameter. As the number of nodes increases, the NN results start to resemble the optimal predictions and indicators (black) more closely. This reflects the fact that, the expressivity of an NN increases as the number of nodes is increased. A similar behavior is also visible in Fig. 13(a) which shows the loss over time, where NNs with more than 8 nodes achieve values close to the optimal loss (black), i.e., the global minimum. cators for the smallest NN (2 hidden nodes) evaluated at various training epochs. Here, the indicator gradually converges towards its final form, which exhibits a peak at the critical value of the tuning parameter. Similarly, Figs. 13(d) and (h) shows the results for the largest NN (2048 hidden nodes). Here, early on during training the indicator is sharply peaked near the critical value of the tuning parameter. As the training progresses, the indicator signal starts to wash out and converge to the optimal indicator signal. The evolution of the global maximum of the indicator signal as a function of the training epoch for the various NN sizes is shown in Fig. 13(e). These results quantify how accurately the estimated critical value of the tuning parameter based on the optimal indicator (black) is reproduced for a given NN size and training time. Figure 13(h) shows that even for the large NNs there seems to be an intermediate time period during training where the indicator peaks near the critical value of the tuning parameter correctly highlighting the phase transition. Looking at Fig. 13(a), during these intermediate time periods the corresponding loss function starts to saturate and display a kink. This suggests a procedure for early stopping, where the training is stopped once a kink in the loss function is observed [40]. Early stopping based on the validation loss will be discussed in the subsequent section (see Sec. B 2). During training, the model capacity increases as visible by the steady decrease in the corresponding loss [40,140,141]: initially the model cannot resolve anything, in the intermediate stages it can resolve between the two phases leading to the sharp peak, and eventually it approaches the optimal predictive model (which, in this case, does not correctly highlight the phase transition). By stopping the training at the intermediate stage (i.e., selecting the corresponding NN parameters after the training is complete) a model of intermediate resolution can be obtained. Thus, early stopping acts as an implicit regularization [40,140,141]. In the case of PBM, stopping the training early yields in an NN whose indicator peaks near the critical temperature of the Ising model. However, this is not always the case. In LBC, for example, the estimated critical temperature gradually improves during training, i.e., as the model capacity increases . Recall that the optimal indicator of LBC correctly highlights the phase transition. Qualitatively similar results can be obtained for the other methods and systems. In particular, in the Ising model and XY model, we find that the indicators of SL and PBM both show a clear peak near the critical transition temperature early on during training around the epochs marked by a kink in the loss function. The peak locations of the corresponding NN-based indicator signals coincide with the signals of physical indicators, such as the heat capacity or magnetization.
Lastly, we can also control the capacity of our model through explicit 2 regularization [40] where the sum runs over all tunable parameters θ i of the NN and λ 2 is the regularization strength. Figure 14 shows ing model becomes more complex and converges towards the optimal model that minimizes the loss function in the absence of regularization. Consequently, the predictions and indicators converge towards the optimal predictions and indicator. In the Ising model, we thus find that explicit regularization helps to construct a model of intermediate resolution whose indicator correctly highlights the critical temperature (similarly for SL). However, as mentioned above, models with restricted capacity may not always highlight the critical value of the tuning parameter correctly. In the IGT, for example, the indicator of regularized NNs tends to display an erroneous peak similar to the specific heat, see Fig. 4.

Finite-sample statistics: Splitting data into training, validation, and test sets
Here, we investigate NN-based predictions and indicators in the case where only a limited amount of data is available. In particular, we discuss the effect of splitting the data into a training, validation, and test set. Recall that in the limit of sufficient data, the training, validation, and test set will coincide as they are all sampled independently from the same probability distribution underlying the physical system, see Sec. II. Therefore, in the limit of sufficient data the training, validation, and test losses will decrease in lockstep during training. This is illustrated in Fig. 15 which shows the training, validation, and test loss of PBM for the Kitaev chain for different data set sizes. For small data sets, the training, validation, and test sets can differ, resulting in differing training, validation, and test losses. In particular, one can observe a characteristic increase of the validation loss after a certain time period attributed to overfitting [40]. This allows one to perform early stopping such that the minimum in the validation loss is realized [40]. Note that the location of the minima in the validation loss coincides with the kink in the corresponding training loss. The sharp local minimum in the validation loss fades as the data set size is increased further, leaving only the corresponding kinks in the training loss as a signal for early stopping. The latter situation has been discussed in Appendix B 1. Therefore, a splitting into training, validation, and test set may allow for a clearer signal to perform early stopping given a small data set.
Another effect arising when a limited amount of data is available and finite-sample statistics play a role is best illustrated by investigating the Kitaev chain using LBC. Figure 16 shows the NN-based indicator signal of LBC obtained for training, test, and validation sets of various sizes. For small data set sizes [see Fig. 16(a),(d)] the optimal indicator (black, solid) shows no local maximum due to the negligible overlap in the inferred probability distribution. The NN-based indicator of a sufficiently large NN closely matches the optimal indicator on the training set after training [ Fig. 16(a)], whereas a small NN is incapable of approximating the optimal indicator on the training set. However, interestingly the indicator signal of the small NN qualitatively matches the optimal indicator signal based on the ground-truth probability distributions. In particular, it features a local maximum allowing for an estimate of the critical value of the tuning parameter to be obtained. This is another example illustrating how simple models can lead to sharp indicator signals. While the inferred probability distribution only has a marginal overlap in the topological phase resulting in the absence of a local maximum in the optimal indicator signal (black), the data may be partially indistinguishable to a simple model. This illustrates how "confusion" can also arise due to models with restricted expressivity (see Sec. III). The same phenomenon can also be observed for the indicator signal of the large NN evaluated on the test set (or validation set), see Fig. 16(d). Here, the confusion arises because the predictions for the unseen data within the validation and test set are sub-optimal. In the future, it will be of interest to investigate whether this effect can be mimicked through appropriate interpolation of the optimal predictions [31,35,132]. Figures 16(b),(e) and Figs. 16(c),(f) show how the discrepancy between the optimal indicator signal based on a finite data set and the NN-based indicator vanishes for the large NN as the data set size increases. This arises because eventually the training, validation, and test sets become indistinguishable. Note, however, that the discrepancy persists for the small NN.

Appendix C: Data generation
In this appendix, we provide further details on the data-generation process for each of the physical systems analyzed in the main text (see Sec. IV). For the classical systems, given by the Ising model, IGT, and XY model, we use the Metropolis-Hastings algorithm [72] to sample spin configurations from the thermal distribution at a given temperature T . The lattice is initialized in a state with all spins pointing up for the Ising model, and a random spin configuration in the case of the IGT and XY model. The lattice is updated by drawing a random spin, which is flipped with probability min(1, e −∆E /T ), where ∆E is the energy difference resulting from the considered flip. In the XY model, instead of flipping a given spin, we add a perturbation ∆θ ∈ [−π, π], which is drawn uniformly at random. To ensure that the systems are sufficiently thermalized, we sweep the complete lattice 10 5 times, where each lattice site is updated once per sweep. After the thermalization period, we collect 10 5 samples, which we find to be sufficient for achieving convergence (see Appendix A 5). In the Ising model and IGT, we increase the temperature gradually, whereas it is decreased in the XY model. In the XY model, we can further validate the quality of the Monte Carlo samples by estimating the BKT transition point. One way to do this is to determine the temperature at which the helicity modulus Υ crosses 2T /π [94,96]. The helicity modulus is also referred to as spin stiffness or spin rigidity and measures the response of the system to an in-plane twist of the spins. We find that the estimated BKT transition point based on our samples matches the literature value well, see Fig. 17. Note that in the XY model, the angle of each spin can take on any value θ ∈ [0, 2π]. This results in a continuum of states. Hence, we discretize the energy in practice, which serves as an input for the ML methods. This discretization eases computation and, more crucially, results in overlapping probability distributions given finite-sample statistics (see case 3 in Sec. III A). The discretization is performed through simple histogram-binning using 1000 bins of equal size. The number of bins was increased systematically until a convergence of the optimal indicator signals was observed. In future works, histogram-binning may be replaced by more elaborate techniques for density estimation [118][119][120][121].
Let us move on to the quantum case. To perform exact diagonalization and solve the Schrödinger equation, we use the QuSpin package [143,144] in Python. Note that when computing the ground state of the Kitaev chain through exact diagonalization, we restrict ourselves to the even-particle sector whose corresponding ground state has a lower energy within the topologically trivial phase. In the topological phase, the ground state is doubly degenerate, and the two states can be distinguished by their fermionic parity. This is because of the presence of the pairing term in the Kitaev chain Hamiltonian [Eq. (37)]. As a consequence, H does not conserve the In this appendix, we provide additional material which facilitates the comparison to other works.

Alternative approach towards supervised learning
Here, we review our approach to SL (see Sec. II A) and put it into context. In Ref. [4], the authors originally proposed to identify the estimated critical value of the tuning parameter in SL as arg min p k |ŷ(p k )−0.5| . In all systems analyzed in the main text (see Sec. IV), this yields similar results compared to our approach based on identifying the peak location of the mean prediction's derivative [Eq. (3)]. Note that the latter approach has, e.g., already been mentioned as an alternative in Ref. [20]. Looking at Fig. 8(b), we observe that these two procedures would yield slightly different estimated critical values for the MBL phase transition. This discrepancy is even more prominent for the Mott insulator to superfluid transition in the Bose-Hubbard model. Here, we investigate the two-dimensional Bose-Hubbard model whose Hamiltonian is given by where J is the nearest-neighbor hopping strength, U is the on-site interaction strength, and µ is the chemical potential. This model undergoes a quantum phase transition at zero temperature from a Mott insulating phase to a superfluid phase as the tuning parameter J/U is increased at a fixed chemical potential. This gives rise to the characteristic Mott lobes [146,147], see Fig. 18(a).
We perform mean-field calculations based on a Gutzwiller ansatz in which the ground-state wave function is written as a product state where |n i denotes the Fock state with n bosons at site i [148]. We minimize the expectation value of the Hamiltonian with respect to the Gutzwiller coefficients {|f n | 2 } nmax n=0 by means of simulated annealing [21,149] with a maximum number of bosons per site of n max = 20. As such, the Gutzwiller coefficients {|f n | 2 } nmax n=0 represent the relevant probability distributions governing the data. Note that the simulated annealing algorithm can get stuck in local energy minima. To counteract this noise, we average the Gutzwiller coefficients obtained from 500 independent simulated annealing runs.
At the tip of the first Mott lobe (µ/U = 0.5) the phase transition occurs at J c /U = 1/(5.8z) [see Fig. 18(a)], where z is the coordination number (here z = 4) [142]. The phase transition can be revealed by looking at the average boson number per site n , see Fig. 18(g). The Mott insulator is characterized by an integer density enforced by the Mott energy gap ∝ U . As a result of the energy gap, the Mott insulator is incompressible. In contrast, the superfluid phase is compressible and is characterized by strong number fluctuations (even at low temperature). Figure 18 shows the results of SL, LBC, and PBM. Here, both SL and PBM correctly identify the quantum phase transition, whereas LBC fails. Looking at Fig. 18(e), we see that a large change in the underlying probability distributions occurs at the quantum phase transition. In Ref. [22], the Mott insulating to superfluid transition in the Bose-Hubbard model was correctly highlighted using LBC with NNs. However, in this case, the Gutzwiller coefficients directly served as input, whereas here the individual Fock basis states (i.e., their indices) constitute the input. Note that the phase transition would not be predicted with a high accuracy using SL if we estimated the predicted critical temperature as the value of the tuning parameter for whichŷ opt SL = 0.5, see black-dashed line in Fig. 18. This motivates our approach to SL compared to the procedure originally proposed in Ref. [4]. However, both approaches for obtaining estimated critical values are directly applicable given optimal predictions. Figure 19 shows the optimal indicator of LBC for the IGT with inverse temperature β as a tuning parameter. The signal qualitatively matches the indicator of LBC reported in Fig. C1 of Ref. [31] obtained with NNs, confirming that for high capacity models the indicator signal of LBC is indeed ambiguous in this case.

Analysis of Ising gauge theory
In Ref. [31], the authors also investigated the IGT with PBM using NNs. They empirically find that the NNbased predictions agree well with a physical model based on the underlying density of states, which was proposed in an ad hoc fashion guided by physical intuition. In our work, we explicitly confirm this physical intuition on what the NN learns by proving that the optimal prediction of PBM for a given configuration in the IGT corresponds to the most likely tuning parameter value based on the underlying Boltzmann distribution. Figure 20 shows the optimal indicator in LBC for all physical systems considered in the main text, as well as a modified version where the V-shaped indicator signal characteristic of indistinguishable data is subtracted. Note that this V-shaped indicator signal is computed separately for each system, i.e., parameter range. For all systems, we find that the modified indicator peaks near the center of the parameter range under consideration, whereas the original indicator signal peaks near the phase transition (red-dashed line). This bias arises because the subtracted signal is lowest near the center of the parameter range. As such, the bias can be easily missed if the transition point is indeed located in the center of the chosen parameter range, see Fig. 20(d).