Efficient Construction Method for Phase Diagrams Using Uncertainty Sampling

We develop a method to efficiently construct phase diagrams using machine learning. Uncertainty sampling (US), an active learning technique, is used to sample intensively around phase boundaries. We demonstrate the construction of three known experimental phase diagrams by the US approach. Compared with random sampling, the US approach decreases the number of sampling points to about 20%. The reduction is more pronounced for more complicated phase diagrams. Furthermore, we show that with the US approach, undetected new phases can be found rapidly and a smaller number of initial sampling points is sufficient. Thus, we conclude that the US approach is useful for constructing complicated phase diagrams from scratch and will be an essential tool in materials science.


I. INTRODUCTION
Phase diagrams are crucial in materials development because they show which phase is stable under each set of conditions. However, numerous syntheses and measurements are necessary to complete a phase diagram. Thus, this indispensable task accounts for a large share of the effort in materials discovery.
In combinatorial materials science, machine learning techniques have been applied to construct phase diagrams [1][2][3][4][5][6]. In this field, a large number of materials in the phase diagram can be obtained simultaneously by high-throughput materials synthesis. Then, from measurement results such as XRD patterns, the categories of the many synthesized materials must be determined rapidly to complete the phase diagram. To realize automatic categorization, clustering and matrix factorization are utilized.
On the other hand, recent materials informatics studies aim to develop novel materials with as few syntheses or first-principles calculations as possible with the aid of machine learning. Many successful examples have been reported using both experiments and simulations [7][8][9][10][11][12][13][14][15][16][17]. In these investigations, machine learning efficiently recommends a candidate material possessing the desired properties even when only limited materials data exist. In accordance with this idea, we aim to use machine learning to propose which materials should be synthesized to complete a phase diagram. If appropriate proposals are made, a reliable phase diagram can be obtained even if the number of synthesized materials is small. This problem setting resembles that of active learning. Active learning is a learning framework that sequentially selects an informative sample to classify and checks its label in order to maximize the classification accuracy with fewer labeled data. Thus, we expect active learning to be an essential tool for efficiently constructing a phase diagram. A previous study employed an active learning method that uses a Gaussian process to sample the phase diagram [18]. Although this method dramatically reduces the number of sampling points, the demonstration was performed only for a phase diagram with two kinds of phases. Furthermore, this method would be difficult to apply in cases where multiple phases exist. To improve practicability, herein we propose a different active learning method that uses uncertainty sampling (US) [19] to efficiently construct a phase diagram.
* kei.terayama@riken.jp † K. Terayama and R. Tamura contributed equally to this work. ‡ tsuda@k.u-tokyo.ac.jp
US is a methodology that selects, as an informative sample, the sampling point with the highest uncertainty as calculated by a machine learning-based classification model. The most uncertain points are typically located near classification boundaries (phase boundaries). The US approach can be applied to any number of parameters (dimensions of a phase diagram) and categories (kinds of phases).
In this paper, the US approach is used to construct phase diagrams. This study reveals the following: (1) Phase boundaries can be efficiently obtained and an accurate phase diagram can be drawn even if the number of sampling points is small. (2) Because undetected new phases in a phase diagram can be rapidly found, this approach is especially useful for constructing complicated phase diagrams. (3) Fewer initial sampling points are sufficient, making the US approach well suited to constructing phase diagrams from scratch. These facts strongly suggest that the US approach will be a powerful tool for constructing phase diagrams in materials science. Our implementation is available on GitHub at https://github.com/tsudalab/PDC/.
The rest of the paper is organized as follows. Section II introduces the details of our US-based method for efficiently constructing phase diagrams. To estimate the probability of each phase at each point from the already checked points, the label propagation and label spreading methods are adopted. Furthermore, the methods for evaluating uncertainty, namely least confident, margin sampling, and the entropy-based approach, are explained. In Sec. III, the US approach is used to construct three known phase diagrams: H2O under lower and higher pressures, and a ternary phase diagram of glass-ceramic glazes. The US approach can efficiently sample to complete a complicated phase diagram from scratch. Section IV addresses the case with experimental constraints. The results with and without imposed constraints are comparable. Section V presents the discussion and summary.

II. METHOD BASED ON UNCERTAINTY SAMPLING
This section presents the framework for phase diagram construction using US. Figure 1 gives an overview of our procedure. First, several points are selected, and their phases are determined by experiments or simulations (1. Initialization). Next, the probability distributions of the phases are calculated for all the points in the parameter space using a machine learning technique (2. Phase estimation). From the probability distributions, the uncertainty scores are calculated for all unchecked points in the parameter space (3. Uncertainty score). Afterwards, an experiment or a simulation is performed at the point with the highest uncertainty score (4. Experiment). Steps 2-4 are repeated to construct an accurate phase diagram with a smaller number of sampling points. Each step is described in detail below.
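The four steps can be sketched as a short loop. This is an illustration under stated assumptions rather than the released implementation: the phase estimator here is a simple inverse-distance-weighted vote that stands in for LP or LS only to keep the example short, the uncertainty score is the least confident score in its standard form, and the function names, the one-dimensional toy grid, and the `oracle` callable (which plays the role of an experiment or simulation) are all hypothetical.

```python
import math


def idw_probabilities(points, labeled, x_index):
    """Stand-in estimator (not LP/LS): inverse-distance-weighted vote."""
    x = points[x_index]
    weights = {}
    for i, phase in labeled.items():
        d = math.dist(x, points[i])
        weights[phase] = weights.get(phase, 0.0) + 1.0 / (d + 1e-9)
    total = sum(weights.values())
    return {phase: w / total for phase, w in weights.items()}


def least_confident(probs):
    """High when even the most probable phase has a low probability."""
    return 1.0 - max(probs.values())


def uncertainty_sampling(points, oracle, initial, n_steps):
    """Run steps 1-4 and return {point index: measured phase}."""
    labeled = {i: oracle(points[i]) for i in initial}   # 1. Initialization
    for _ in range(n_steps):
        unchecked = [i for i in range(len(points)) if i not in labeled]
        if not unchecked:
            break
        # 2. Phase estimation and 3. Uncertainty score for unchecked points
        scores = {i: least_confident(idw_probabilities(points, labeled, i))
                  for i in unchecked}
        nxt = max(unchecked, key=lambda i: scores[i])
        labeled[nxt] = oracle(points[nxt])              # 4. Experiment
    return labeled


def oracle(x):
    """Hypothetical experiment: phase A below 0.62, phase B above."""
    return "A" if x[0] < 0.62 else "B"


points = [(i / 20,) for i in range(21)]
result = uncertainty_sampling(points, oracle, initial=[0, 20], n_steps=5)
```

In this toy case the first selected point is the midpoint between the two initial points, where the two phases are estimated to be equally likely.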

A. Initialization
The regions of parameter space and parameter candidates are prepared in advance. Parameter space can have two or more dimensions. As the initialization step, several points are selected, and their phases are determined by experiments or simulations. The points can be selected randomly or manually. This paper adopts random selection.
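As an illustration of this step, the following sketch prepares the candidate points as a regular grid and selects the initial points at random. The axis values, the number of initial points, the seed, and the function names are assumptions made for the example only.

```python
import itertools
import random


def make_candidates(axes):
    """Return every grid point (Cartesian product) of the per-parameter values."""
    return [tuple(p) for p in itertools.product(*axes)]


def pick_initial(candidates, n_init, seed=0):
    """Randomly choose n_init distinct candidate indices to measure first."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(candidates)), n_init))


# Hypothetical axes: 5 temperatures and 4 pressures in arbitrary units.
temperatures = [200 + 50 * i for i in range(5)]
pressures = [0.0, 1.0, 2.0, 3.0]
candidates = make_candidates([temperatures, pressures])
initial = pick_initial(candidates, n_init=9)
```

Working with indices into a fixed candidate list makes it straightforward to track which points are checked and which are unchecked in the later steps.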

B. Phase estimation
The probabilities of the observed phases are estimated for all unchecked points. This probability distribution is written as P(p|x), where x is the position vector of each unchecked point and p is the label of a phase that has already been observed. From this distribution, an estimated phase diagram is drawn by choosing the phase with the highest probability, i.e., arg max_p P(p|x). Herein we adopt two representative methods for estimating the probabilities: label propagation (LP) [20] and label spreading (LS) [21]. These are semi-supervised learning methods, which make use of not only labeled but also unlabeled data for learning. In these methods, the probability at each point is calculated by propagating the label information to nearby points. The probability of a phase p at x calculated by LP equals the probability that a random walk starting from x first reaches a checked point labeled p. In LP, the labels of the checked points are fixed. On the other hand, in LS, the labels of the checked points can change depending on the labels of the surrounding points. Thus, LS is effective when the label noise is large.
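To make the propagation idea concrete, the following is a bare-bones sketch of LP in its commonly used iterative form: a row-normalized similarity matrix is applied repeatedly while the rows of the checked points are reset to their known labels. The RBF kernel, the values of `gamma` and `n_iter`, and all names are assumptions for illustration; the released implementation may differ in these details.

```python
import math


def label_propagation(points, labeled, gamma=20.0, n_iter=200):
    """Estimate P(p|x) on every point by iterating F <- T F with clamped labels.

    points  : list of coordinate tuples (assumed already normalized)
    labeled : dict {point index: phase label} for the checked points
    """
    phases = sorted(set(labeled.values()))
    n, k = len(points), len(phases)
    # RBF similarities, row-normalized into a random-walk transition matrix T.
    T = []
    for a in points:
        row = [math.exp(-gamma * math.dist(a, b) ** 2) for b in points]
        s = sum(row)
        T.append([w / s for w in row])

    def clamp(F):
        # In LP the labels of the checked points stay fixed (one-hot rows).
        for i, phase in labeled.items():
            F[i] = [1.0 if q == phase else 0.0 for q in phases]

    F = [[1.0 / k] * k for _ in range(n)]  # uniform start for unchecked points
    clamp(F)
    for _ in range(n_iter):
        F = [[sum(T[i][j] * F[j][c] for j in range(n)) for c in range(k)]
             for i in range(n)]
        clamp(F)
    return phases, F


# Toy example: 11 points on a line, with the two end points checked.
points = [(i / 10,) for i in range(11)]
phases, F = label_propagation(points, {0: "A", 10: "B"})
```

Each row of `F` is the estimated distribution P(p|x) at one point, in the order given by `phases`. In the toy example the midpoint is equally likely to be A or B, so it is the point where an uncertainty score would be highest.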

C. Uncertainty score
The uncertainty score u(x) is calculated from the estimated probability distributions P(p|x) to determine the next candidate in a phase diagram. In this paper, we adopt three representative methods of the US strategy: Least Confident (LC) [22], Margin Sampling (MS) [23], and Entropy-based Approach (EA) [24]. For each point x, the LC, MS, and EA scores are calculated as follows:
$$u_{\mathrm{LC}}(x) = 1 - \max_p P(p|x),$$
$$u_{\mathrm{MS}}(x) = 1 - [P(p_1|x) - P(p_2|x)],$$
$$u_{\mathrm{EA}}(x) = -\sum_p P(p|x) \log P(p|x),$$
where P(p_1|x) and P(p_2|x) in u_MS(x) are the highest and second highest probabilities at x. From the definitions, the uncertainty scores are highest when the probabilities of all phases are equal. The LC score is influenced only by the highest probability at each point, while the MS score is affected by the first and second highest probabilities. For the EA score, the whole distribution is taken into account. The next candidate is the unchecked point with the highest uncertainty score, i.e., arg max_x u(x). Then an experiment or a simulation is performed for this point. If a previously undetected phase is obtained, the next phase estimation step includes the new phase. To handle data uniformly, the parameters are normalized using min-max normalization [25] for the phase estimation and the evaluation of the uncertainty score.
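The three scores and the normalization can be sketched as follows. The formulas are the standard forms of LC, MS, and EA, which is an assumption about the exact definitions used here, and the function names are illustrative. Each function takes the list of phase probabilities at a single point.

```python
import math


def least_confident(probs):
    """u_LC = 1 - (highest probability)."""
    return 1.0 - max(probs)


def margin_sampling(probs):
    """u_MS = 1 - (highest probability - second highest probability)."""
    # Assumption: with a single observed phase, the second highest is taken as 0.
    top, second = (sorted(probs, reverse=True) + [0.0])[:2]
    return 1.0 - (top - second)


def entropy_score(probs):
    """u_EA = -sum_p P(p|x) log P(p|x), with 0 log 0 taken as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def min_max(values):
    """Rescale one parameter axis to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


near_boundary = [0.5, 0.5, 0.0]  # two phases equally likely
interior = [0.9, 0.1, 0.0]       # one phase clearly dominant
```

All three scores rank `near_boundary` above `interior`, which is why the selected points concentrate around phase boundaries.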

III. PHASE DIAGRAM CONSTRUCTION BY UNCERTAINTY SAMPLING
We report the performance of the proposed US-based strategies compared with random sampling (RS) for three known phase diagrams: H2O under lower pressure (H2O-L) and under higher pressure (H2O-H) [26][27][28], and the ternary phase diagram of glass-ceramic glazes of SiO2, Al2O3, and MgO (SiO2-Al2O3-MgO) [29]. Here, the next point in RS is randomly selected from the unchecked points, and the phase diagram is estimated using the phase estimation methods described above. However, the information in the estimated phase diagram is not used to select the next point in RS.
The nine triangles denote the initial points, which are located at the same positions as in the US approach. Since RS selects many points in regions away from the phase boundaries, efficient sampling is not realized. In addition, phases occupying relatively small areas, such as ice III in H2O-H and tridymite and sapphirine in SiO2-Al2O3-MgO, are difficult to find by RS. However, these phases can be detected by the US approach, as shown in Figs. 2(e) and 2(f). These results suggest that the US approach can efficiently sample near the phase boundaries, allowing smaller phases to be rapidly detected. As supplemental material, movies of the sampling behavior in each case for the LP+LC approach compared with LP+RS are provided (see Supplemental Movie 1).

B. Quantitative comparison between uncertainty and random sampling
We quantitatively compare the US approach with the RS approach. To evaluate the accuracy of the phase diagram estimated by the LP or LS method, we adopt the macro-averaged F1-score (Macro-F1), which is commonly used as an evaluation metric for classification problems in the machine learning community. This score measures the agreement between the experimentally obtained phase diagram and the estimated phase diagram. The F1-score for a phase indexed by p is the harmonic mean of precision P(p) and recall R(p), which is given as
$$F_1(p) = \frac{2 P(p) R(p)}{P(p) + R(p)},$$
where precision P(p) is the number of points correctly estimated as p (true positives) divided by the total number of points estimated as p. On the other hand, recall R(p) is the number of true positives divided by the total number of true p points. A phase that has yet to be detected has an F1-score of 0. We calculate the Macro-F1 score by averaging the F1-scores of all the true phases. Thus, when the Macro-F1 score is small (≪ 1), the difference between the true and the estimated phase diagrams is very large. A Macro-F1 score of 1 indicates that the estimated phase diagram exactly reproduces the true one.
The top row in Fig. 3 shows the Macro-F1 scores as functions of the number of sampling points for H2O-L, H2O-H, and SiO2-Al2O3-MgO. Here, the number of initial sampling points is fixed at nine. Since these results depend on the selection of initial points, we repeated the trials 200 times using different initial points and averaged the results. The black lines (solid and dashed) depict the results of RS, and the other lines show the results of the US approach. The label A+B denotes the combination of A as the phase estimation method (i.e., LP or LS) and B as the sampling method (i.e., LC, MS, EA, or RS). The combinations LP+LC and LP+MS show relatively good performance compared with the other methods for the three phase diagrams.
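The Macro-F1 definition given in this subsection can be written out as a short function. This is a sketch with illustrative names and toy labels; the two lists hold the true and the estimated phase at the same grid points, in the same order.

```python
def macro_f1(true_labels, est_labels):
    """Average the per-phase F1-scores over all phases in the true diagram."""
    phases = sorted(set(true_labels))
    scores = []
    for p in phases:
        tp = sum(t == p and e == p for t, e in zip(true_labels, est_labels))
        n_est = sum(e == p for e in est_labels)    # points estimated as p
        n_true = sum(t == p for t in true_labels)  # points that truly are p
        precision = tp / n_est if n_est else 0.0
        recall = tp / n_true
        # An undetected phase has no true positives and scores 0.
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Toy example: phase C exists in the true diagram but has not been detected.
true_labels = ["A", "A", "A", "B", "B", "C"]
est_labels = ["A", "A", "B", "B", "B", "B"]
score = macro_f1(true_labels, est_labels)
```

Here phase A has an F1-score of 0.8, phase B of 2/3, and the undetected phase C of 0, so the Macro-F1 score is about 0.49. Because every true phase contributes equally, missing a small phase lowers the score substantially even if it covers few points.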
Compared with RS, every US approach provides better Macro-F1 scores even if the number of sampling points is small. Table I summarizes the numbers of sampling points necessary to reach a Macro-F1 of 0.95. On average, using the LP+LC method instead of the LP+RS method reduces the number of sampling points to 0.36, 0.20, and 0.20 times that of LP+RS for the three phase diagrams, respectively. This result implies that the US approach is more useful for constructing complicated phase diagrams quickly.
For these phase diagrams, the Macro-F1 results indicate that the LP method is better suited than LS. This demonstration employs cases where the phase boundaries are properly determined and the outliers do not appear. Since it is not necessary to consider the noise for the labels, LS does not work effectively. Furthermore, we found that EA is not useful. If the number of phases is small, MS is better suited, whereas LC is powerful when many phases exist in a phase diagram. Thus, an efficient selection can be realized using LC to construct complicated phase diagrams.

C. Capability of new phase detection
In Sec. III A, we showed that small phases are detected more quickly by the US approaches than by RS. In this subsection, we examine how many sampling points are needed to detect all the phases in each of the phase diagrams H2O-L, H2O-H, and SiO2-Al2O3-MgO. The middle row in Fig. 3 shows the dependence of the number of detected phases on the number of sampling points, averaged over 200 independent runs. Since H2O-L has three large phases, the detection performances of the US approaches and RS are almost the same. In the case of H2O-H, which has one small phase (ice III), all the phases are detected by LP+LC using 30 sampling points at most, whereas RS requires more than 200 sampling points. For SiO2-Al2O3-MgO, over 600 sampling points are needed to find all the phases using RS because there are multiple small phases such as sapphirine, tridymite, and enstatite. These results indicate that a small phase adjoining the boundaries of other large phases can be found relatively early using the US approach because areas near boundaries are investigated preferentially. Thus, we conclude that our US approach is a powerful tool for detecting new phases in complicated phase diagrams.

D. Effect of initial sampling
We discuss the dependence on the number of initial sampling points. The bottom row in Fig. 3 shows the average number of sampling points needed to reach a Macro-F1 of 0.95 using the LP+LC approach as a function of the number of initial sampling points. To evaluate the average, 200 independent runs are performed with different initial points. Interestingly, the accuracy remains almost the same even as the number of initial sampling points increases. Furthermore, in these cases, the optimum number of initial sampling points is 1 or 4, for which some phases are not yet detected at initialization. This finding indicates that, from an early stage of constructing a phase diagram, it is better to use the information from the machine learning phase estimation than to select points blindly. Consequently, the US approach is particularly useful when constructing new phase diagrams from scratch.

IV. SAMPLING WITH PARAMETER CONSTRAINT
In the US approach described above, there are no restrictions on how the parameters change when selecting the next point. However, changing all parameters in an experiment often incurs a huge cost. To address this problem, we construct a sampling method called Uncertainty Sampling with Parameter Constraint (USPC). USPC constrains the changes in parameters. To select the next point, candidate points are chosen under the condition that only one parameter is changed from the previous point. For example, in the H2O phase diagrams, candidate points are prepared along the horizontal or vertical direction from the previous point. Then the candidate with the highest uncertainty score is selected. The other steps are the same as in the US approach. Note that if no candidates satisfy the condition, the next point is selected after removing the constraint. Thus, USPC can reduce the cost associated with parameter changes in the experiment. The results with and without the constraint are comparable (see Supplemental Table S1). The USPC approach also shows similar tendencies for new phase detection and the effect of initial sampling (see Supplemental Fig. S1). These results show that phase diagrams can be efficiently constructed even under a constraint suited to experiments. As supplemental material, movies of the sampling behavior in each case for the LP+LC approach with the parameter constraint compared with LP+RS are provided (see Supplemental Movie 2).
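The USPC selection rule can be sketched as follows, assuming that the candidate points lie on a grid and that the uncertainty scores of the unchecked points have already been computed. The function name, the 3 x 3 grid, and the score values are hypothetical.

```python
def uspc_next(points, unchecked, scores, previous):
    """Pick the most uncertain unchecked point that differs from the previous
    point in exactly one parameter; if none exists, drop the constraint."""
    prev = points[previous]
    candidates = [i for i in unchecked
                  if sum(a != b for a, b in zip(points[i], prev)) == 1]
    if not candidates:
        candidates = list(unchecked)  # constraint removed when nothing qualifies
    return max(candidates, key=lambda i: scores[i])


# Toy 3 x 3 grid; the point at index 3 * t + p has coordinates (t, p).
points = [(t, p) for t in range(3) for p in range(3)]
scores = {0: 0.9, 1: 0.2, 2: 0.1, 3: 0.4, 5: 0.3, 6: 0.1, 7: 0.35, 8: 0.8}
unchecked = [0, 1, 2, 3, 5, 6, 7, 8]
constrained = uspc_next(points, unchecked, scores, previous=4)
fallback = uspc_next(points, [0, 8], scores, previous=4)
```

From the center point (1, 1), the unconstrained choice would be index 0 with score 0.9, but it differs in both parameters, so USPC selects index 3, the best point along the same row or column. In the second call no unchecked point shares a row or column with the previous point, so the constraint is removed and index 0 is selected.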

V. DISCUSSION AND SUMMARY
We proposed an efficient method to construct phase diagrams using uncertainty sampling (US). This method selects as the next point the one with the highest uncertainty in the phase diagram, as evaluated by machine learning. In general, the next point selected by this approach is located near a phase boundary, allowing the true phase boundaries to be rapidly drawn. In our method, the uncertainty is evaluated using the probabilities of the observed phases at each point, which are obtained by the label propagation (LP) or label spreading (LS) method.
By comparing the US approach with random sampling, we confirmed that our approach can decrease the number of sampling points to about 20% and still construct an accurate phase diagram. Furthermore, the US approach can find undetected new phases rapidly, and a smaller number of initial sampling points is sufficient to obtain an accurate phase diagram. These advantages indicate that our method can make significant contributions, especially when deriving new complicated phase diagrams from scratch.
We also considered the case where only one parameter is changed from the previous point when selecting the next candidate point, which matches the conventional experimental setting. Even if such a constraint is imposed, this approach can realize efficient sampling to complete a phase diagram. To demonstrate the usability of our method further, we should construct new experimental phase diagrams using the US approach. In this case, depending on the accuracy of the experiments used to detect a phase, LS might be more useful than LP for evaluating the uncertainty score because of label noise. These results will be reported elsewhere.
From a different perspective, the US approach provides useful information about the reliability of the experiments in which the phase at each point is determined. In our approach, the probabilities of each phase are evaluated for all points in a phase diagram. Thus, if the estimated probability of the experimentally detected phase is extremely small, it may indicate that the experiment is wrong. This would be important information for constructing valid phase diagrams.
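A reliability check of this kind could look like the following sketch. It assumes that the probabilities were estimated before the measurement in question was added as a checked point, since in LP the labels of checked points are fixed. The threshold value, the function name, and the phase labels are hypothetical.

```python
def flag_suspicious(estimated_probs, observed, threshold=0.05):
    """Return the indices of points whose measured phase was estimated to be
    very unlikely before the measurement.

    estimated_probs : {point index: {phase: probability}}
    observed        : {point index: measured phase}
    """
    flagged = []
    for i, phase in observed.items():
        probs = estimated_probs[i]
        # A phase missing from the estimate may be a newly found phase,
        # so it is not treated as a suspicious measurement here.
        if phase in probs and probs[phase] < threshold:
            flagged.append(i)
    return flagged


estimated_probs = {
    3: {"ice": 0.97, "water": 0.03},
    7: {"ice": 0.02, "water": 0.98},
}
observed = {3: "ice", 7: "ice"}
suspicious = flag_suspicious(estimated_probs, observed)
```

Here the measurement at point 7 is flagged because "ice" was estimated at only 2% there. A flagged point is a candidate for a repeated experiment rather than proof of an error, since it may also lie on a boundary that the estimate has not yet resolved.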
The US approach can realize efficient sampling for phase diagrams. Therefore, we believe that the US approach will accelerate the discovery of new materials. Hence, this method will become an essential tool in materials science.