Interpretable and unsupervised phase classification

Fully automated classification methods that yield direct physical insights into phase diagrams are of current interest. Here, we demonstrate an unsupervised machine learning method for phase classification which is rendered interpretable via an analytical derivation of its optimal predictions and allows for an automated construction scheme for order parameters. Based on these findings, we propose and apply an alternative, physically-motivated, data-driven scheme which relies on the difference between mean input features. This mean-based method is computationally cheap and directly interpretable. As an example, we consider the physically rich ground-state phase diagram of the spinless Falicov-Kimball model.

Phase diagrams and phase transitions are of paramount importance to physics [1-3]. While typical many-body systems have a large number of degrees of freedom, their phases are usually characterized by a small set of physical quantities like response functions or order parameters. However, the identification of phases and their order parameters is often a complex problem involving a large state space [4,5]. Machine learning methods are apt for this task [3,6-14] as they can deal with large data sets and efficiently extract information from them. Ideally, such machine learning methods should not require any a priori knowledge about the phases, i.e., the methods should be unsupervised [7,9,15-29]. Yet, they should also allow for straightforward physical insight into the character of phases. Significant progress has been made recently [20,21,25,26], but some open issues with interpretability remain. Thus, unsupervised and interpretable phase classification remains a challenging, but highly rewarding task.
A good example of both the progress in the field and the remaining issues regarding interpretability is the unsupervised method introduced in Ref. [18]. This approach is based on a predictive model trained to infer the parameters of a physical system from input data, obtained by experimental measurements or numerical simulations, that characterize the system's state. In the following, we refer to this approach as the prediction-based method. The predictions for the system parameters change most strongly near phase boundaries. Hence, the vector-field divergence of the deviations of the predicted system parameters from their true values serves as an indicator (label I in Fig. 1) of phase boundaries.
The prediction-based method was hitherto successfully applied to symmetry-breaking [18], driven-dissipative [18], quantum [19], and topological phase transitions [19,30] in various systems. The prediction-based method requires a predictive model with sufficient expressive power [31,32] to resolve different phases. The resulting phase classifications can therefore be hard to interpret if highly expressive models, such as deep neural networks (DNNs) [32], have to be used [20,26]. Additionally, the training of DNNs is computationally demanding. For complicated phase diagrams with a large number of phases, the applicability of the prediction-based method remains to be demonstrated.
Herein, we render the prediction-based method fully interpretable by deriving its optimal predictions. To devise local order parameters for the predicted phases, we employ linear models [25,26,33,34] to infer the system parameters. Such local order parameters distinguish neighboring phases. As the key result of this Letter, we demonstrate a physically motivated, general, data-driven, unsupervised phase classification approach which relies on the difference between mean input features as an indicator for phase transitions (Fig. 1). In the following, we refer to this approach as the mean-based method. The mean-based method eliminates the need for a predictive model, is computationally cheap, and is directly interpretable. As our example, we consider the spinless Falicov-Kimball model (FKM), which has been studied in the context of metal-insulator transitions [41-45], pattern formations in ultracold atoms in optical lattices [46-49], localization and correlations [50-56], and various nonequilibrium phenomena [57-64]. Hitherto, the classification of ground-state phases in the FKM was a manual and, due to the richness of the phase diagram [38-40], lengthy and cumbersome task. The complexity of the FKM phase diagram makes it a challenging example for unsupervised and interpretable phase classification methods.
The Hamiltonian of the spinless FKM is

H = −t Σ_⟨i,j⟩ d†_i d_j + U Σ_i f†_i f_i d†_i d_i .  (1)

Here, t is the hopping integral (energy unit throughout this work), U is the on-site Coulomb interaction strength, and f†_i (f_i) and d†_i (d_i) are the creation (annihilation) operators of heavy (f) and light (d) fermions at lattice site i. The number operator n_f,i = f†_i f_i commutes with the Hamiltonian for all i; we can replace it by its eigenvalues w_i ∈ {0, 1}. The ground state is thus determined by the classical f-particle configuration w = {w_i} that minimizes the system energy. We focus on the "neutral" case [38], characterized by an equal density of heavy and light particles, ρ ≡ N_f/L² = N_d/L². Here, N_f (N_d) is the total number of heavy (light) particles and L = 20, which we fix throughout this Letter, is the linear size of the square two-dimensional lattice with periodic boundary conditions (plane symmetry group: p4m [65]). Figure 2(a) shows a sketch of the expected phase diagram in the two-dimensional parameter space [38]. It highlights the stability regions of three main types of orderings, namely, (1) segregated, (2) diagonal, and (3) axial orderings. A multitude of other phases with smaller stability regions are expected to be present in the full diagram [38,39].
For each sampled point p = (U, ρ), we performed up to 64 independent simulated-annealing runs (Sec. S1 in SM [66]). Because simulated annealing does not always converge to the ground state for large systems, we investigate two cases: a "noise-free" case, where the best estimate w_0 is taken as the ground state, and a "noisy" case, where we take into account the 10 configurations with the lowest energies at each p. The latter case is important for checking the robustness of our methods and is particularly relevant for experiments, where thermal fluctuations are inevitable.
We start the analysis of the phase diagram using the prediction-based method with DNNs (Sec. S2 in SM [66]).
Since the predictions p̂ ≡ (Û, ρ̂) are most susceptible near the phase boundaries, maxima in the vector-field divergence of δp ≡ p̂ − p,

I(p) = ∇_p · δp ,  (2)

serve as an indicator I(p) (Fig. 1) of phase boundaries. We train the DNN to predict p = (U, ρ) using as input the magnitude of the two-dimensional discrete Fourier transform, |F_0|. Figures 2(b) and (c) show the obtained phase diagram in the noise-free and the noisy case, respectively. Using |F_0| instead of w_0 results in a shorter training time because data augmentation by lattice translations is not necessary. Moreover, it yields an improved signal-to-noise ratio for ∇_p · δp, because the peak in |F_0| at k = (0, 0) corresponds to N_f, i.e., ρ is directly fed into the DNN. Consequently, the vector field δp exhibits a horizontal structure (Sec. S6 in SM [66]), meaning that ρ is predicted with near-perfect accuracy in both cases. The horizontal structure of δp implies that maxima in ∇_p · δp indicate phase transitions along U at fixed ρ. The (U, ρ) parameter space can thus be analyzed with cuts along U; we will do this later on.
The largest connected regions with a negative divergence signal in Fig. 2(b) coincide with the three main regions within the sketched phase diagram displayed in Fig. 2(a), and their character can be confirmed by a simple order parameter analysis (Sec. S3 in SM [66]). Thus, the prediction-based method correctly identifies the expected main characteristics of the phase diagram. However, the phase boundaries are not reproduced with high contrast, and it is not easy to identify stability regions beyond those of the main orderings [1, 2, 3 in Fig. 2(a)]. Specifically, the prediction-based method indicates phase changes at several points within stable phase regions [Fig. 2(b)]. These artefacts intensify in the noisy case [Fig. 2(c)], where, in addition, the training of the DNN becomes computationally heavy. Moreover, a larger amount of input data is needed to obtain the phase diagram with sufficient accuracy. To cope with these problems, it is first necessary to understand the DNN predictions.
For this purpose, we derive the form of the optimal predictive model for the prediction-based method. In the general noisy case, the optimal prediction for an input x of a model trained to minimize a mean-square-error loss (derivation in Sec. S2 in SM [66]) is

p̂_opt(x) = Σ_i P̃_i(x) p_i .

Here, the sum runs over all sampled points {p_i} in parameter space. The probability of drawing the input x at p_i is governed by the distribution P_i, and P̃_i(x) ≡ P_i(x)/Σ_j P_j(x). By considering identical, nonzero values of P_i(x) within a particular region of parameter space and zero outside, we obtain the optimal predictions for the noise-free case. The vector-field divergence ∇_p · δp [Fig. 2(b)] matches the one of an optimal predictive model. We have, therefore, successfully rendered the phase classification of the prediction-based method interpretable. However, we still lack physical insights into the character of the predicted phases. An important observation concerning the noise-free case is that the method predicts a phase transition whenever neighboring configurations in U (at ρ = const.) are not related by transformations of p4m. The optimal model predictions for all points within such a phase are placed at its center of mass. This results in a signal of ∇_p · δp ≈ −1 within a phase. The value of ∇_p · δp at the two points p that constitute a phase boundary serves as a measure of the mean extent of the two corresponding phases (along U). Consequently, the large extent of the segregated phase in parameter space is the cause of the large, isolated maxima in ∇_p · δp within it [Fig. 2(b)]. Such isolated points are physically meaningless.
The noisy case can be understood from a single line-scan. At ρ = 63/400 [dashed line in Fig. 2(c)], a transition from a non-segregated to a segregated ordering occurs. Figures 3(a) and (d) show that the predictions Û and the corresponding divergence ∂δU/∂U obtained with a DNN indicate the corresponding phase boundary. Finite-sample statistics cause significant fluctuations in the distributions P_i(|F_0|) and varying predictions Û within the segregated phase. The fluctuations in P_i can yield divergence signals close to zero that, again, correspond to misleading predictions of phase boundaries within the segregated phase.
Nevertheless, we can now use the predictions Û to define automatically generated local order parameters. The example in Fig. 3(a) exhibits the largest change in Û at the transition from non-segregated to segregated orderings; Û increases only slowly within the segregated phase (U ≳ 3). Moreover, the predictions Û can be qualitatively reproduced by training a linear model instead of a DNN [Fig. 3(b),(e)]. The local order parameter Û based on the linear model is directly interpretable through its weights and biases. This approach also works in the noise-free case and with other inputs (Sec. S6 in SM [66]). We thus demonstrate that the construction of local order parameters can be automated by training a linear model for each individual phase transition previously identified using a DNN with the prediction-based method. Eventually, the DNN-based model can be replaced by the corresponding set of linear models, which is conceptually related to training local surrogate models [33].
Importantly, the form of the optimal model in the prediction-based method paves the way for a class of computationally cheap algorithms for unsupervised phase classification without any predictive model. Here, we focus on the mean-based method (other methods are discussed in Sec. S4 in SM [66]). The optimal model predictions reveal that the corresponding predicted phase diagram can be reproduced by detecting changes in observables derived from w_0 that are invariant under transformations of p4m. Ideally, the observable should not be overly sensitive to small changes in w_0 within a stability region.
A suitable, physically motivated choice are correlation functions κ^ξ_n that measure the order of a given configuration. Here, m^ξ_n is the number of constituents of the set {j^ξ_n}, which contains the lattice points entering three types of correlations measured over n lattice sites: square (ξ = sq), axial (ξ = ax), and diagonal (ξ = di) [illustration in Fig. 2(d) and definitions in Sec. S3 in SM [66]].
With these correlation functions, we define the following correlation indicator for the mean-based method:

∆κ(p) = ‖κ̄(U + ∆U, ρ) − κ̄(U − ∆U, ρ)‖ .

Here, the bar denotes the average over all inputs at a given point p, if multiple inputs are considered. The indicator ∆κ measures the magnitude of the change of the order quantified by κ^ξ_n.
Our results for both the noisy and the noise-free case demonstrate that the mean-based method with the indicator ∆κ reveals the phase diagram more clearly than ∇_p · δp. We now ask whether the mean-based method can be applied to the phase classification problem without a specific, physically motivated input. To this end, we extend the mean-based method to general inputs by defining a generic indicator,

∆x(p) = ‖x̄(U + ∆U, ρ) − x̄(U − ∆U, ρ)‖ ,  (6)

where x̄ denotes the mean input at a given point p. Since the generic indicator ∆x detects phase transitions, Eq. (6) establishes a general, data-driven scheme without a predictive model which is applicable in both the noise-free and the noisy case. In any such data-driven approach, the chosen input x crucially affects the performance of the phase classification; Fig. 2(e),(f) show that the set of correlation functions κ is an appropriate choice in the case of the FKM (results with different inputs in Sec. S6 in SM [66]).
Another suitable choice of input for the mean-based method are the Fourier-transformed configurations [Fig. 3(c),(f)]. Even in this case, the difference signal is a good indicator for the phase transition at U ≈ 2. This underpins the generality and robustness of the mean-based method and shows its possible applicability to other models beyond the FKM, where the inputs are generally different.
We stress that the indicators of phase transitions in the mean-based method [Eq. (6)] and the prediction-based method [Eq. (2)] differ fundamentally. They constitute two distinct approaches to characterizing changes in the underlying distributions P(x). However, the mean-based method has some clear advantages: it is computationally cheap and allows for direct physical insights. For example, by calculating an indicator [Eq. (6)] based on each individual element of κ or |F_0|, one can directly identify the elements with the greatest impact on the overall indicator. This further characterizes each phase transition and simplifies the subsequent analysis.
In conclusion, we have rendered the prediction-based method fully interpretable through the derivation of its optimal model predictions and the automatic generation of local order parameters using linear surrogate models. Moreover, we have presented a mean-based method that works outstandingly well as an unsupervised phase classification approach for various inputs and in the presence of noise. We infer that applications of our mean-based method to arbitrary phase diagrams featuring, e.g., quantum or topological phase transitions are feasible. Specifically, applications to quantum-classical systems such as the FKM and its numerous generalizations [40,57,67-69] are now straightforward. The success of the mean-based method suggests extensions to unsupervised phase classification methods based on higher-order moments or alternative distance measures.
This Supplemental Material contains technical details necessary for a reproduction of the results presented in the main text and additional arguments to support our conclusions. We start by addressing details of the simulated annealing procedure used to generate ground-state configurations of the Falicov-Kimball model (FKM). We provide technical details on the training and architecture of neural networks in the prediction-based method, as well as a derivation of its optimal predictive model in both the noise-free and the noisy case. Next, order parameters and correlation functions quantifying the presence of different orderings and correlations in the FKM are defined. To contextualize the prediction-based method and motivate the mean-based method, we compare the methods to two alternative phase classification schemes. Next, technical details on the calculation of the indicator in the mean-based method are given, and potential variations and extensions are discussed. Finally, we provide complementary figures (Figs. S5-S7) which support the discussion in the main text: the vector field arising in the prediction-based method, the inferred two-dimensional phase diagram using the mean-based method with |F_0| as input, as well as an analysis of the line-scan at ρ = 63/400 ≈ 0.16 for different types of inputs in the noise-free and noisy case. The code for the prediction- and mean-based methods utilized in this work is open source [70].

S1. SIMULATED ANNEALING

For a fixed f-particle configuration w, the Hamiltonian of the FKM in Eq. (1) in the main text can be transformed into

H(w) = Σ_{j,j'} h_{jj'} d†_j d_{j'} ,

where we introduce the matrix elements h_{jj'} = U w_j δ_{jj'} − t δ_{⟨j,j'⟩}, with δ_{⟨j,j'⟩} = 1 if j and j' are nearest neighbors and 0 otherwise. Its eigenvalues λ_α are obtained by numerical diagonalization. Finding the ground state of the FKM then means finding the configuration w which leads to the lowest energy

E_gs(w) = Σ_{α=1}^{N_d} λ_α(w) ,

i.e., the sum over the N_d lowest eigenvalues. However, even after accounting for the lattice symmetries, the ground-state configurations of systems with linear size L = 20 cannot, in general, be determined exactly by comparing the energies of all possible configurations w. An approximate method is required. Instead of using a reduced set of chosen orderings, as was done in previous studies [38,39] of the model, we determine the corresponding f-particle ground-state configuration w_0 using simulated annealing.
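To make the procedure concrete, the following minimal Python sketch (with illustrative function and variable names of our choosing) builds the single-particle matrix h for a given configuration w on the periodic square lattice, diagonalizes it, and sums the N_d lowest eigenvalues:

```python
import numpy as np

def fkm_energy(w, U, t=1.0):
    """E_gs(w): fill the N_d lowest levels of h_{jj'} = U w_j delta_{jj'} - t delta_<jj'>.

    w : (L, L) binary array of f-particle occupations; neutral case N_d = N_f."""
    L = w.shape[0]
    N = L * L
    w_flat = w.reshape(N)
    h = np.zeros((N, N))
    np.fill_diagonal(h, U * w_flat)          # on-site interaction term
    for x in range(L):
        for y in range(L):
            j = x * L + y
            # nearest-neighbor hopping with periodic boundary conditions
            for jp in (((x + 1) % L) * L + y, x * L + (y + 1) % L):
                h[j, jp] = h[jp, j] = -t
    eigvals = np.linalg.eigvalsh(h)          # ascending order
    N_d = int(w_flat.sum())                  # neutral case: N_d = N_f
    return eigvals[:N_d].sum()

rng = np.random.default_rng(0)
w = rng.integers(0, 2, size=(20, 20))
print(fkm_energy(w, U=4.0))
```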
We use an algorithm based on semi-classical Metropolis Monte Carlo [71], where we use E_gs(w) in the statistical weights instead of the free energy. This means that the candidate configuration w_c, generated by a random displacement of a single f-particle from the current configuration w, is accepted as the new configuration if

r < min{1, exp[−β(E_gs(w_c) − E_gs(w))]} ,

where r is a random number drawn from a uniform distribution r ∈ [0, 1] and β = 1/T is the inverse temperature (T). We first used a classical protocol, where we started at a relatively high temperature T ∼ 0.1t and cooled the sample to zero in 20-40 discrete temperature steps. A thermalization process consisting of 10²-10³ × L² updates was performed at every temperature step.
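A corresponding Metropolis update could look as follows; the move set shown (displacing one f particle to a random empty site) is one possible choice for illustration, not necessarily the exact move set used in our simulations:

```python
def propose_single_particle_move(w, rng):
    """Randomly displace one f particle to a random empty site (illustrative move)."""
    w_c = w.copy()
    occ = np.argwhere(w_c == 1)
    emp = np.argwhere(w_c == 0)
    src = occ[rng.integers(len(occ))]
    dst = emp[rng.integers(len(emp))]
    w_c[tuple(src)], w_c[tuple(dst)] = 0, 1
    return w_c

def metropolis_step(w, E, beta, U, rng):
    """One Metropolis update with weight exp(-beta * E_gs):
    accept the candidate with probability min(1, exp(-beta * dE))."""
    w_c = propose_single_particle_move(w, rng)
    E_c = fkm_energy(w_c, U)
    if rng.random() < np.exp(-beta * (E_c - E)):   # always accepts if E_c <= E
        return w_c, E_c
    return w, E
```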
However, we found that an alternative adaptive protocol was much more efficient in lowering the energy. Namely, we start the annealing with a long thermalization at a low temperature (typically T = 0.003t). In the subsequent steps, depending on whether or not the algorithm has found a configuration with lower energy at the current temperature, the temperature is either lowered by dividing its value by a factor between one and two (typically 1.25) or increased by multiplying it by the same factor. The modified protocol is better at escaping local minima and has fewer problems with the fact that the FKM can pass through more than one ordered phase with decreasing temperature [72,73].
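Reusing the sketches above, the adaptive protocol may be summarized as follows; the step counts, sweep lengths, and tolerance are illustrative placeholders:

```python
def adaptive_anneal(w0, U, T0=0.003, factor=1.25, n_steps=200, sweeps=1000, seed=0):
    """Adaptive protocol: lower T after an improving temperature step, raise it otherwise."""
    rng = np.random.default_rng(seed)
    w, E = w0, fkm_energy(w0, U)
    best_w, best_E = w, E
    T = T0
    for _ in range(n_steps):
        improved = False
        for _ in range(sweeps):
            w, E = metropolis_step(w, E, 1.0 / T, U, rng)
            if E < best_E - 1e-12:
                best_w, best_E, improved = w, E, True
        T = T / factor if improved else T * factor   # adapt the temperature
    return best_w, best_E
```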
We typically used a number of independent runs with random initial conditions. For small lattices (L ≤ 10), all simulations converged to configurations identical up to transformations of p4m. For L = 20, we used 64 runs with random initial conditions, plus several runs with initial configurations reflecting typical ground-state orderings identified for smaller lattices (L ≤ 16). We further investigated two cases in the main text. In the noisy case, we considered the 16 configurations with the lowest energies. In the noise-free case, we performed one additional step: at each investigated p = (U, ρ), we took the configuration w with the lowest energy and compared its energy with the energy calculated using the configuration obtained as the ground state for the same ρ but a different (neighboring) U. The configuration with the lowest energy was then taken as the final ground-state approximation.

S2. PREDICTION-BASED METHOD
In this section, we provide details on the architecture and training of deep neural networks (DNNs) and linear models used in the prediction-based method. In particular, we list the corresponding hyperparameters employed throughout this work.

S2.1. Deep neural networks
In this work, we analyzed the ground-state phase diagram of the spinless FKM using the prediction-based method with DNNs as predictive models. These are built as follows: if the NN input is image-like, such as the ground-state configurations w_0 or the magnitude of their discrete Fourier transform |F_0|, we first apply K different square filters with the same linear size L as the input image [32]. Subsequently, we apply a rectified linear unit, ReLU(x) = max(0, x), as an activation function [32]. This results in an output feature map of size 1 × 1 × K, which is then flattened to a feature vector with K elements. In the case of vector-like NN inputs, we skip this step. In both cases, we feed the corresponding vectors into a series of fully-connected layers (FCLs), where ReLUs are used as activation functions [32]. We note that the NN architecture remains to be optimized systematically. However, such a DNN with sufficiently many parameters can serve as an accurate predictive model.
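The following PyTorch sketch illustrates this architecture; the number of filters K and the hidden-layer sizes are placeholders of our choosing, as the actual hyperparameters are listed in Tab. S1:

```python
import torch
import torch.nn as nn

class PredictionDNN(nn.Module):
    """Sketch of the described architecture: K full-size square filters acting
    on an L x L input, followed by fully-connected ReLU layers that output
    the predicted parameters (U_hat, rho_hat)."""
    def __init__(self, L=20, K=64, hidden=(128, 128)):
        super().__init__()
        # A conv layer with kernel_size=L on an L x L image yields a 1 x 1 x K map.
        self.filters = nn.Conv2d(1, K, kernel_size=L)
        layers, d_in = [], K
        for d_out in hidden:
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]
            d_in = d_out
        layers.append(nn.Linear(d_in, 2))   # predict p_hat = (U_hat, rho_hat)
        self.fc = nn.Sequential(*layers)

    def forward(self, x):                   # x: (batch, 1, L, L)
        z = torch.relu(self.filters(x)).flatten(1)   # (batch, K) feature vector
        return self.fc(z)

model = PredictionDNN()
p_hat = model(torch.rand(16, 1, 20, 20))    # batch of 16 inputs -> (16, 2)
```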
For training the DNNs, the inputs are standardized by means of the affine transformation

x_i → (x_i − x̄_i)/σ_i ,

and the outputs are normalized as

x_i → x_i/σ_i ,

where x_i denotes the i-th input/output, and x̄_i and σ_i are the mean and standard deviation of the distribution of the i-th input/output over the entire training data. Standardization ensures that the distribution of each transformed input x_i over the entire training data is characterized by (x̄_i = 0, σ_i = 1), whereas normalization results in the distribution of each transformed output x_i being characterized by (x̄_i = x̄_i/σ_i, σ_i = 1). Scaling the inputs, here by means of standardization, is common practice in the data pre-processing step of machine learning tasks relying on gradient descent for optimization, as it generally yields a faster convergence rate [74]. The additional normalization of the outputs improves the model accuracy when training with a mean-square-error (MSE) loss function, as it ensures that the outputs do not differ in size or spread and are consequently treated on an equal footing during the optimization. The MSE loss function is defined as

L_MSE = (1/(N_p N_x)) Σ_p Σ_x ‖p̂(x) − p‖² ,  (S5)

where the sum runs over all N_p sampled points p in parameter space and all N_x inputs x at each point p. Here, p̂ = (Û, ρ̂) denotes the prediction of the DNN given a particular input x.
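In NumPy, the two scalings amount to a few lines; the toy arrays below stand in for the stacked inputs and target parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.random((1000, 400))   # e.g., flattened |F_0| inputs
y_train = rng.random((1000, 2))     # targets p = (U, rho)

# Standardization (inputs): zero mean and unit standard deviation per feature.
x_std = (x_train - x_train.mean(axis=0)) / x_train.std(axis=0)
# Normalization (outputs): unit standard deviation per target; the mean is kept.
y_norm = y_train / y_train.std(axis=0)
```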
The DNNs are implemented in PyTorch [75], where the weights and biases are optimized using the stochastic gradient-based optimizer Adam [76] to minimize the loss function L_MSE [Eq. (S5)] over a series of epochs. After each training epoch, the vector-field divergence is calculated based on the predictions p̂ for each sampled point p in parameter space. This is done using the symmetric difference quotient

∇_p · δp ≈ [δU(U + ∆U, ρ) − δU(U − ∆U, ρ)]/(2∆U) + [δρ(U, ρ + ∆ρ) − δρ(U, ρ − ∆ρ)]/(2∆ρ) ,

where δU = Û − U and δρ = ρ̂ − ρ. The divergence is averaged over all inputs per point p. The learning rate is reduced by a fixed factor f_r if the loss L_MSE does not drop below a certain relative threshold value within a given number of epochs, referred to as the "patience". Gradients are calculated using backpropagation. During training, weights and biases are updated batch-wise, i.e., during each epoch the entire training data is randomly split into batches of equal size. For each batch, the predictions and the resulting loss L_MSE are calculated, and the NN parameters are then updated accordingly.
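A minimal implementation of this difference quotient on the sampled (U, ρ) grid reads as follows; the function name and array layout are our own conventions:

```python
import numpy as np

def divergence(dU, drho, DU, Drho):
    """Vector-field divergence of delta-p = (dU, drho) via the symmetric
    difference quotient above. dU, drho : 2D arrays indexed as [i_U, i_rho];
    DU, Drho : grid spacings. The result lives on the interior grid points."""
    dU_term = (dU[2:, 1:-1] - dU[:-2, 1:-1]) / (2 * DU)
    drho_term = (drho[1:-1, 2:] - drho[1:-1, :-2]) / (2 * Drho)
    return dU_term + drho_term

rng = np.random.default_rng(0)
print(divergence(rng.random((10, 8)), rng.random((10, 8)), 0.1, 0.05).shape)  # (8, 6)
```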
To incorporate configurations related through transformations of p4m, we use online data augmentation, i.e., each time a configuration is revisited during training, a random transformation of p4m is performed. Note that if we use |F_0| as input, we do not need to consider any lattice translations. In the case of symmetry-invariant inputs, such as correlation functions, we do not need to apply any transformations beforehand. Data augmentation is crucial, as all configurations related through transformations of p4m have the same energy. Therefore, data augmentation removes physically irrelevant differences between configuration samples and forces the NN to pick up on patterns in the input which are equally present in all the transformed versions. The DNN hyperparameters employed in this work are collected in Tab. S1. The color scale in Fig. 2(b) and (c) in the main text was cut off at −1 and −2, respectively, for better visualization; there were very few distinct points in parameter space with a divergence signal ∇_p · δp below this cut-off.
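One possible implementation of such a random p4m transformation, composing a reflection, a C4 rotation, and a periodic lattice translation, is sketched below:

```python
import numpy as np

def random_p4m_transform(w, rng):
    """Draw a random transformation of the plane symmetry group p4m: an element
    of the point group of the square composed with a periodic translation."""
    if rng.random() < 0.5:
        w = np.flipud(w)                          # reflection
    w = np.rot90(w, k=int(rng.integers(4)))       # C4 rotation
    shift = tuple(int(s) for s in rng.integers(0, w.shape[0], size=2))
    return np.roll(w, shift, axis=(0, 1))         # periodic lattice translation

rng = np.random.default_rng(0)
w = rng.integers(0, 2, size=(20, 20))
w_aug = random_p4m_transform(w, rng)              # same energy as w
```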

S2.2. Linear models
To construct directly interpretable local order parameters for each predicted phase transition, we rely on the prediction-based method with linear models. The predictions p̂ of a linear model are given as

p̂ = θx + b ,

with an input vector x, a weight matrix θ, and a vector of additive biases b. Evidently, a linear model allows for a direct interpretation in terms of its weights θ and biases b. In addition to the MSE term in Eq. (S5), when training linear models we add an L2 regularization term to our loss function:

L = L_MSE + λ Σ_i θ_i² .  (S9)

Here, the regularization rate λ controls the strength of our preference for smaller weights θ_i, where the sum runs over all weights [32]. In this work, λ is chosen small enough that the trained model qualitatively yields the same predictions as a model trained to minimize L_MSE. The additional restriction posed on the model by the L2 regularization term thereby removes the remaining freedom in its parameters. This is of particular importance when trying to interpret the model through its weights and biases [21,25,26].
The linear models are trained to minimize L [Eq. (S9)] using the scikit-learn implementation of Ridge regression with default settings [77]. Data augmentation is performed offline by applying n_trafo random transformations of p4m to each input configuration beforehand, analogous to the online variant described previously. The hyperparameters for training the linear models employed in this work are collected in Tab. S2.
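In code, this amounts to a few lines; toy data is shown for concreteness, and the regularization rate is a placeholder, as the actual hyperparameters are listed in Tab. S2:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x_augmented = rng.random((500, 400))          # offline-augmented inputs near one boundary
U_targets = rng.uniform(1.0, 3.0, size=500)   # corresponding U values of the line-scan

# A small regularization rate (alpha plays the role of lambda in Eq. (S9))
# fixes the remaining freedom in the weights, which we can then inspect.
model = Ridge(alpha=1e-3)
model.fit(x_augmented, U_targets)
U_hat = model.predict(x_augmented)            # serves as a local order parameter
print(model.coef_.shape, model.intercept_)    # interpret weights theta and bias b
```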

S2.3. Derivation of optimal predictive model
In this section, we provide an analytical derivation of the optimal model predictions when using the prediction-based method for phase classification in the general noisy case, as well as the special noise-free case.
Noisy case: We assume a system with a single tunable parameter p, which we sample on an equidistant grid with grid spacing ∆p. At each grid point p_i, we draw inputs x from an underlying probability distribution, x ∼ P_i(x). This situation is illustrated in Fig. S1(a). We train a model f : x → f(x) to infer p from the samples {x} generated at p, i.e., to minimize the MSE loss function

L_MSE = (1/(N_p N_x)) Σ_p Σ_x [f(x) − p]² .  (S10)

Here, p runs over all N_p sampled grid points and x runs over all N_x inputs at p.
Let us pick a particular input x_j. We can determine the optimal model prediction f_opt(x_j) by minimizing L_MSE with respect to f(x_j), i.e.,

∂L_MSE/∂f(x_j) ∝ Σ_i N^j_x(p_i) [f(x_j) − p_i] = 0 .

Here, N^j_x(p_i) denotes the number of times the particular input x_j is present at point p_i, and P_i(x_j) ≡ N^j_x(p_i)/N_x is the associated probability. We can additionally define P̃_i(x_j) ≡ P_i(x_j)/Σ_p P_p(x_j) as the probability of drawing the particular input x_j at point p_i relative to all other sampled points p. We then obtain

f_opt(x_j) = Σ_i P̃_i(x_j) p_i .  (S12)

Repeating this step for all inputs {x}, we find that any model which minimizes L_MSE will output f_opt(x_j) for an input x_j. This implies that the optimal model predicts the center of mass for a particular input x_j, where each grid point is weighted according to the probability of drawing the input x_j. Note that there are no additional restrictions on the form of f.
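For a single input x_j, Eq. (S12) thus reduces to a weighted average over the sampled grid points; a minimal sketch (with names of our choosing) reads:

```python
import numpy as np

def f_opt(counts, p_grid):
    """Optimal prediction for one input x_j, Eq. (S12).

    counts : (N_p,) array of N_x^j(p_i), i.e., how often x_j was drawn at p_i.
    p_grid : (N_p,) array of the sampled parameter values p_i.
    Returns the center of mass of p_grid weighted by P-tilde_i(x_j)."""
    P_tilde = counts / counts.sum()
    return float(P_tilde @ p_grid)

# x_j drawn only in a region spanning p = 1.0 ... 1.8: prediction is its center.
print(f_opt(np.array([0, 1, 1, 1, 1, 1, 0]), np.linspace(0.8, 2.0, 7)))  # 1.4
```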
Given an optimal model f_opt, the divergence of δp at a point p_i is calculated as

∂δp/∂p |_{p_i} ≈ [δp(p_{i+1}) − δp(p_{i−1})]/(2∆p) .  (S13)

Here, δp = p̂ − p with p̂ = (1/N_x) Σ_x f_opt(x), where the sum runs over all N_x samples {x} drawn at point p. Hence,

∂δp/∂p |_{p_i} ≈ [p̂(p_{i+1}) − p̂(p_{i−1})]/(2∆p) − 1 .  (S14)

The generalization to a parameter space of arbitrary dimension is straightforward.
Here, we have derived the divergence signal of an optimal model f_opt for the most general (noisy) situation. This removes the need for further interpretation of the DNN, because it merely serves to approximate f_opt, and thereby renders the method interpretable. Additionally, it opens up the possibility of approximating f_opt without the need for DNNs as universal function approximators. Specifically, one may compute the optimal model predictions f_opt in Eq. (S12) from estimates of the underlying probability distributions P(x), e.g., obtained using Monte Carlo methods. A comparison of such approaches to NN-based ones will be the subject of future studies.
Note that even with the form of the optimal model f_opt at hand, we may still want to investigate the decision-making of DNNs trying to approximate f_opt to obtain physical insights, e.g., using state-of-the-art attribution methods following Refs. [21,22]. However, the scheme based on training linear models for predicted phase transitions to construct local order parameters, proposed in the main text, proved to be more effective at automating this task.
Noise-free case: Consider now the special noise-free case in which there are regions along p where samples {x} are identical up to transformations of p4m. A situation with two such regions, labelled I and II, is shown in Fig. S1(b). In particular, there is a single set x ∈ {x^I_0, x^I_1, ..., x^I_α} from which the samples are drawn at each point p ∈ {p_0, p_1, ..., p_n} within region I. Similarly, samples at all points p ∈ {p_{n+1}, p_{n+2}, ..., p_m} within region II are drawn from a set x ∈ {x^II_0, x^II_1, ..., x^II_β} not related by transformations of p4m to the set of region I. Consequently, P̃(x^I_j) = 0 for all points p in region II and P̃(x^I_j) = const. = 1/N^I_p for all points p in region I, and vice versa. Here, N^I_p denotes the number of sampled points in region I. Using Eq. (S12), we then obtain

f_opt(x^I_j) = (1/N^I_p) Σ_{p ∈ I} p ≡ p̄^I ,  (S15)

where the sum runs over all N^I_p sampled points in region I. Similarly, f_opt(x^II_i) = p̄^II. This means that the optimal model yields predictions at the center of mass of all samples within a given region.
Given an optimal model f_opt, the divergence of δp at a point p_i is given by Eq. (S14). Hence, for any point p_i within region j ∈ {I, II} that is not located directly in the vicinity of the boundary of the region, we have

∂δp/∂p |_{p_i} ≈ −1 .  (S16)

Conversely, for the two points at the boundary between regions I and II we have

∂δp/∂p |_{p_i} ≈ (p̄^II − p̄^I)/(2∆p) − 1 = (N^I_p + N^II_p)/4 − 1 .  (S17)

This shows that the prediction-based method predicts a phase transition whenever neighboring inputs cannot be related through transformations of p4m. Equivalently, it predicts phases to be regions in which neighboring inputs are simply related by transformations of p4m. The value of the divergence peak at a phase boundary serves as an indicator of the mean extent of the corresponding phases in parameter space. The size of a phase is thus linked to its stability, i.e., its robustness against variations in the system parameters. This procedure can be generalized straightforwardly to a parameter space of arbitrary dimension with an arbitrary number of phases. Figure S2 shows the vector-field divergence ∇_p · δp as a function of U at ρ = 35/400 for the DNN trained using |F_0| as input in the noise-free case, whose predicted two-dimensional ground-state phase diagram of the FKM is shown in Fig. 2(b) in the main text. The values of the divergence match the results of Eqs. (S16) and (S17) almost perfectly. This confirms that our trained predictive model is indeed optimal, i.e., minimizes L_MSE.
FIG. S2. Vector-field divergence ∇_p · δp as a function of U at ρ = 35/400, obtained analytically based on Eqs. (S16) and (S17), as well as numerically using a DNN trained with |F_0| as input, for the two-dimensional ground-state phase diagram in the noise-free case [see Fig. 2(b) in the main text for the full predicted phase diagram].

S3. ORDER PARAMETERS AND CORRELATION FUNCTIONS
In this section, we discuss order parameters for the segregated, diagonal, and axial orderings [cf. labels (1)-(3) in Fig. 2 in the main text], and provide definitions for the set of three correlation functions κ^ξ_n measuring square, axial, and diagonal correlations.

S3.1. Order parameters
To define order parameters for the diagonal (di) and axial (ax) orderings, we introduce appropriate filters F_ξ, ξ ∈ {di, ax}. The values of the order parameters are obtained by taking the Frobenius scalar product of the raw configurations w with the corresponding filters. To account for configurations that are related through transformations of p4m, we also subject the filters to the corresponding transformations. Ultimately, we take the maximum value over all symmetry-related filters {F_ξ} as the value of the order parameter O_ξ for a particular configuration sample w [Eq. (S18)], where the construction involves the element-wise product (⊙) of w with the filters, and the identity matrix 1 enters the diagonal filter. Figure S3 illustrates the filters. Defining an order parameter for the segregated (sg) ordering is conceptually simple: it amounts to determining whether the configuration sample contains a single, connected cluster of f particles. This is implemented by a backtracking algorithm [78]. We define a binary order parameter O_sg taking on the value 1 (0) if the configuration sample does (not) contain a single, connected cluster of f particles (as determined by the algorithm).
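As an illustration, O_sg can equivalently be evaluated with a breadth-first flood fill instead of the backtracking algorithm of Ref. [78]; a minimal sketch on the periodic lattice reads:

```python
import numpy as np
from collections import deque

def O_sg(w):
    """Binary segregation order parameter: 1 if all f particles form a
    single connected cluster on the periodic square lattice, else 0."""
    L = w.shape[0]
    sites = {(x, y) for x in range(L) for y in range(L) if w[x, y]}
    if not sites:
        return 0
    start = next(iter(sites))
    seen, queue = {start}, deque([start])
    while queue:                                  # breadth-first flood fill
        x, y = queue.popleft()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nb = ((x + dx) % L, (y + dy) % L)     # periodic neighbors
            if nb in sites and nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return int(len(seen) == len(sites))

w = np.zeros((20, 20), dtype=int)
w[0:3, 0:3] = 1
print(O_sg(w))   # 1: a single connected cluster
```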
Clearly, all three order parameters O_ξ, ξ ∈ {di, ax, sg}, share a common set of desired properties [79]. In particular, the order parameters are invariant when the input configuration samples are subjected to transformations of p4m. Furthermore, the maximum value of the order parameters is max_w O_ξ(w) = 1, which is only achieved for samples showing perfect ordering of type ξ. Additionally, the three order parameters defined by means of filters can take on values ranging from 0 to 1, indicating the partial presence of the corresponding pattern. Figure S4 shows the values of all three order parameters O_ξ, ξ ∈ {di, ax, sg}, for each sampled point p = (U, ρ) in parameter space for the FKM in the noise-free and noisy case. In the noisy case, we additionally average over all available configurations at each point p to obtain a scalar value. The order parameters reveal the presence of segregated, diagonal, and axial orderings, marked as (1), (2), and (3) in Fig. 2 in the main text, respectively.

S3.2. Correlation functions
As a set of observables derived from the configurations which remain invariant under transformations of p4m, we here consider a minimal set of three different correlation functions κ^ξ_n, ξ ∈ {sq, ax, di}. In particular, these measure square (κ^sq_n), axial (κ^ax_n), and diagonal (κ^di_n) correlations over n lattice sites. For a configuration sample w, the three correlation functions are calculated by averaging the occupation correlations between each lattice site and its m^ξ_n partner sites located n lattice sites away in the corresponding pattern [Eq. (S19)]. Since we assume periodic boundary conditions, the largest unique n is given by n = L/2. The computation of the correlation functions in Eq. (S19) is illustrated in Fig. 2(d) of the main text. In the case of the FKM on a square lattice, we find that these three correlation functions and combinations thereof are sufficient to describe most patterns and orderings. Therefore, they represent a physically motivated way of detecting phase transitions based on the magnitude of the change in order captured by the set of correlation functions.
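For illustration, a sketch of such correlation functions is given below; note that the displacement sets are our assumption for illustration purposes, while the exact site sets {j^ξ_n} are defined by Eq. (S19) and Fig. 2(d) of the main text:

```python
import numpy as np

# ASSUMED partner-site displacements standing in for the sets {j_n^xi};
# the exact sets follow Eq. (S19) and Fig. 2(d) in the main text.
DISPLACEMENTS = {
    "ax": lambda n: [(n, 0), (-n, 0), (0, n), (0, -n)],
    "di": lambda n: [(n, n), (n, -n), (-n, n), (-n, -n)],
    "sq": lambda n: [(n, 0), (-n, 0), (0, n), (0, -n),
                     (n, n), (n, -n), (-n, n), (-n, -n)],
}

def kappa(w, n, xi):
    """Sketch of kappa_n^xi(w): occupation correlations w_i * w_j averaged
    over all sites i and their m_n^xi partner sites j (periodic BC)."""
    shifts = DISPLACEMENTS[xi](n)
    stacked = [w * np.roll(w, s, axis=(0, 1)) for s in shifts]
    return float(np.mean(stacked))

w = np.random.default_rng(0).integers(0, 2, size=(20, 20))
features = [kappa(w, n, xi) for xi in ("sq", "ax", "di") for n in range(1, 11)]
```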

S4. ALTERNATIVE PHASE CLASSIFICATION METHODS
The derivation of the optimal model predictions has strengthened our understanding of the prediction-based method. While we have so far relied on DNNs to approximate f_opt, the derivation opens up the possibility of devising algorithms which yield the same predictions as f_opt or, equivalently, reproduce the predicted phase diagram. To reproduce the predicted ground-state phase diagram in the noise-free case obtained using the prediction-based method [see Fig. 2(b) in the main text], a model simply needs to be sensitive to any change in the configurations (up to transformations of p4m). If a change is detected, a new phase is declared. Once all the points in parameter space are analysed, the full phase diagram is obtained. The main approach we take in this work is to adopt a representation of the configurations which is invariant under transformations of p4m: the correlation functions in Eq. (S19). Consequently, a simple comparison of the representations reveals the phase diagram. Furthermore, the extension of this approach to the noisy case is straightforward and eventually leads to the formulation of the general mean-based method. In what follows, we describe two alternative approaches to reproduce the results of the prediction-based method in the special noise-free case.
As a naïve first approach, one could compare the ground-state configuration samples of different points in parameter space. If a symmetry transformation is found which relates the configuration samples, the corresponding points belong to the same phase; otherwise, a new phase is declared. Clearly, the computational complexity of such an approach is significantly reduced by considering |F_0| instead of w_0, which removes the need to consider lattice translations. Note that such an approach fails in the general noisy case.
As a second approach, we propose to use the system Hamiltonian, as motivated by the simulated annealing procedure. For a given point in parameter space (point I), we take the corresponding ground-state configuration sample and calculate its energy using the system Hamiltonian at a neighbouring point along U (point II). Additionally, we evaluate the energy of the ground-state configuration sample at point II using the system Hamiltonian at point II. If the difference in energy is smaller than an appropriate threshold value, we may regard the two samples as degenerate and assign them to the same phase. This is valid, since both samples are equally likely to be generated by the simulated annealing procedure. Otherwise, a new phase is declared. Analysing the entire two-dimensional parameter space in this fashion eventually yields the same phases as predicted by the prediction-based method. Note that such an approach assumes perfect knowledge of the system Hamiltonian, which is usually not available in experiments. Furthermore, an extension of this approach to the general noisy case may not be straightforward.

S5. MEAN-BASED METHOD
As a key result, we propose the mean-based method as a novel data-driven scheme for identifying and characterizing phase transitions in an automated fashion. It relies on a difference signal ∆x that serves as a generic indicator for phase transitions and is calculated as

∆x(p) = ‖x̄(U + ∆U, ρ) − x̄(U − ∆U, ρ)‖ .  (S20)

Here, x̄(p_i) = Σ_j P_i(x_j) x_j ≈ (1/N_x) Σ_j x_j(p_i) denotes the average input at a point p_i, where we average over all corresponding N_x inputs x_j.
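A minimal implementation of the indicator on the sampled grid, using the Euclidean norm as the magnitude, reads as follows (array layout of our choosing):

```python
import numpy as np

def mean_based_indicator(xbar):
    """Indicator of Eq. (S20) along the U axis.

    xbar : array of shape (N_U, N_rho, d) holding the average input at each
    sampled point p = (U, rho); d is the input dimension (e.g., len(kappa)).
    Returns the indicator on the interior U grid points."""
    diff = xbar[2:] - xbar[:-2]              # xbar(U + dU) - xbar(U - dU)
    return np.linalg.norm(diff, axis=-1)     # Euclidean norm as the magnitude

# Example: random means on a 30 x 20 grid of (U, rho) with 10-dimensional inputs.
xbar = np.random.default_rng(0).random((30, 20, 10))
print(mean_based_indicator(xbar).shape)      # (28, 20)
```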
Given a set of configurations {x} at a point p_i, we perform offline data augmentation (see the training procedure for DNNs and linear models in Sec. S2) by applying n_trafo random transformations of p4m to each configuration. Based on the augmented set of configurations, we then compute input features x (such as κ or |F_0|) or use the configurations themselves to obtain the corresponding average x̄. In this work, we choose n_trafo = 20 and 0 when considering the magnitude of the discrete Fourier transform |F| and the corresponding correlation functions κ, respectively. The difference in n_trafo for different inputs stems from the fact that a reduced number of transformations needs to be considered when the chosen input is invariant under (parts of) the transformations of p4m, as is the case for |F| or κ. Clearly, this results in a reduced computational cost compared to using an input which is (in general) not invariant under transformations of p4m, such as the raw configurations w themselves.
Clearly, the mean-based method relying on Eq. (S20) as an indicator fails at identifying phase transitions for which x̄ remains unchanged. As different inputs will be more suitable depending on the nature of the phase transitions under investigation, an appropriate choice of input can likely counteract such failures. However, we aim to provide a method which does not require significant tuning of the input, e.g., based on incorporating physical knowledge of the system at hand. To this end, one may extend the approach to detect changes in the m-th order moments µ_{x,m} of the underlying probability distributions P(x), as opposed to the mean. The corresponding indicators are then given as

∆µ_{x,m}(p) = ‖µ_{x,m}(U + ∆U, ρ) − µ_{x,m}(U − ∆U, ρ)‖ .  (S22)

Equation (S22) represents variants of the mean-based method relying on different measures that characterize changes in the underlying probability distributions P(x). In particular, such indicators may yield complementary information about the corresponding phase transitions.
We continue the thought of devising measures which quantify changes in the distributions P(x), and can thereby serve as indicators for phase transitions, by considering the Hellinger distance H(P_i, P_j) [81] as a measure for the similarity between two probability distributions P_i and P_j. The corresponding indicator I(p) for phase transitions would then simply be given as I(p) ≡ H(P_{(U+∆U,ρ)}, P_{(U−∆U,ρ)}). In future studies, such extensions of the mean-based method should be investigated and compared based on the insights into the phase diagrams they provide and their computational cost.
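For discrete (histogram) estimates of the distributions, the Hellinger distance can be computed in a few lines:

```python
import numpy as np

def hellinger(P, Q):
    """Hellinger distance between two discrete probability distributions,
    e.g., histograms of the inputs sampled at two neighboring grid points."""
    P = np.asarray(P, dtype=float); P = P / P.sum()   # normalize
    Q = np.asarray(Q, dtype=float); Q = Q / Q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(P) - np.sqrt(Q)) ** 2))

print(hellinger([1, 0, 0], [0, 0, 1]))   # 1.0: maximally distinguishable
```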

S6. COMPLEMENTARY FIGURES
This section contains complementary figures which depict the vector field arising in the prediction-based method (Fig. S5), the inferred two-dimensional phase diagram using the mean-based method with |F_0| (as opposed to correlation functions) as input (Fig. S6), and the analysis of the line-scan at ρ = 63/400 ≈ 0.16 for different types of inputs in the noise-free and noisy case (Fig. S7). The vector field exhibits a horizontal structure which demonstrates that ρ is predicted with near-perfect accuracy.
FIG. S6. Indicator ∆x [Eq. (S20)] based on |F_0| as input on the two-dimensional parameter space of the FKM in (a) the noise-free and (b) the noisy case. This shows that the mean-based approach also works when using |F_0|, as opposed to κ, as input [cf. Fig. 2 in the main text].

FIG. S7. Analysis of the line-scan at ρ = 63/400 ≈ 0.16 for different types of inputs in the noise-free and noisy case. These results show that the construction of local order parameters using linear models can equally be carried out in the noise-free case and using the set of correlation functions κ as input.