Bayesian optimization of chemical composition: a comprehensive framework and its application to $R$Fe$_{12}$-type magnet compounds

We propose a framework for optimization of the chemical composition of multinary compounds with the aid of machine learning. The scheme is based on first-principles calculation using the Korringa-Kohn-Rostoker method and the coherent potential approximation (KKR-CPA). We introduce a method for integrating datasets to reduce systematic errors in a dataset, where the data are corrected using a smaller and more accurate dataset. We apply this method to values of the formation energy calculated by KKR-CPA for nonstoichiometric systems, improving them with a small dataset for stoichiometric systems obtained by the projector augmented-wave (PAW) method. We apply our framework to the optimization of $R$Fe$_{12}$-type magnet compounds (R$_{1-\alpha}$Z$_{\alpha}$)(Fe$_{1-\beta}$Co$_{\beta}$)$_{12-\gamma}$Ti$_{\gamma}$, and benchmark the efficiency in determining the optimal choice of elements (R and Z) and ratios ($\alpha$, $\beta$ and $\gamma$) with respect to magnetization, Curie temperature and formation energy. We find that the optimization efficiency depends significantly on the descriptor. The variables $\beta$ and $\gamma$ and the number of electrons from the R and Z elements per cell are important in improving the efficiency. When the descriptor is appropriately chosen, Bayesian optimization becomes much more efficient than random sampling.


I. INTRODUCTION
Machine learning is attracting much attention these days, and its application to data obtained by first-principles calculation is a promising way to accelerate the exploration of novel materials. The basic idea is as follows: (i) introduce a numerical representation x for the materials, which is called a descriptor, (ii) calculate a property y for materials from the search space of the descriptor by first-principles calculation, and (iii) infer a relation y = f(x) between x and y from the data thus obtained by modeling f. Many efforts have been made to develop models and descriptors that work in materials discovery. [1-9] These models can be used to identify promising candidates by predicting the property f(x) for unknown materials x. It is also possible to perform the modeling and the sampling alternately to obtain the optimal x as quickly as possible, which is called optimization.
Bayesian optimization (BO) is a powerful technique to find the maximum (or the minimum) of an unknown function along this idea. It is based on Bayesian modeling using a dataset collected in the previous sampling-modeling iterations, and does not require an explicit form of the function y = f(x). This method is efficient because it takes account of the uncertainty of a model in addition to the mean value. Figure 1 illustrates a typical situation in which BO is efficient. The dashed line is the true model. Suppose we have four sampled points, denoted by black circles. By Bayesian modeling, we obtain the mean value (solid line) and the uncertainty (gray region). In this situation, the mean value does not give a good prediction for the highest-score point. However, by considering the information on the uncertainty, one can find a significant probability that the true model has its maximum between the two rightmost data points.

FIG. 1. Schematic illustration of a typical situation in which Bayesian optimization works efficiently. The dashed line represents the true model; the black circles denote sampling points; the solid line shows a model obtained from the sampling points, and the gray region shows a confidence interval of the model.
BO has recently been applied to various problems in materials science. [10-12] It also has potential for application to the optimization of a chemical composition, [13] but to our knowledge there have been no reports quantitatively estimating the efficiency in such a problem while avoiding possible overestimation by mere luck. In such applications, we need to properly choose a search space, a descriptor for the candidate systems, and a score describing the performance that are suitable for the problem; otherwise, the efficiency of the scheme deteriorates. A descriptor, i.e. the form into which the input data are encoded, is especially crucial. [2,5-8,14] Accuracy of the first-principles calculation is also of great importance in the computer-aided materials search. However, conventional methods are often insufficient to achieve adequate accuracy, while sophisticated schemes are too time-consuming for the purpose. For example, the magnetic transition temperature is overestimated in the mean-field approximation. Systematic errors also arise from numerical factors, such as a limited number of basis functions.
One promising idea is to improve the data by using a smaller dataset from more accurate calculations or experiments. [15,16] This idea is also seen in the notion of transfer learning, which uses referential datasets that are different from the target dataset and transfers knowledge from the reference to the target. [17,18] However, no single method works for every purpose, and we need to devise a method suitable for each problem on the basis of knowledge about the origin of the error.
In this paper, we propose a practical framework for optimizing the nonstoichiometric composition of multinary compounds based on Bayesian optimization and first-principles calculation. We perform a benchmark of our scheme and discuss its efficiency using a dataset obtained by first-principles calculation with the KKR-CPA method for systems with nonstoichiometric compositions. We investigate the performance of descriptors, and discuss how an efficient one can be chosen for a given problem. To set up a pragmatic problem, we deal with an issue from the cutting edge of materials research on hard magnets. We also present a method for correcting systematic errors in the formation energy by using a smaller but more accurate dataset. Our idea is to construct a model of the errors on the basis of our understanding of their origin.
This paper is organized as follows: in Section II, we describe our problem setup for the benchmark, providing the background. In the first part of Section III, we present a brief summary of the whole framework. We then provide details of the Bayesian optimization and the first-principles calculation in the subsequent subsections. Section III C is devoted to the method for integrating datasets that we use to improve the formation energy. We present the results of the benchmark and the data integration in Section IV. Finally, we conclude the paper with a summary in Section V.

II. PROBLEM SETUP AND ITS BACKGROUND
RFe12-type compounds having the ThMn12 structure have been considered as a possible main phase of strong hard magnets because they are expected to have high magnetization due to their high Fe content, and high magnetocrystalline anisotropy if the R element is properly chosen. [19-37] The magnetic properties of NdFe12N were evaluated theoretically a few years ago, [38] and its high magnetization and anisotropy field were confirmed by the successful synthesis of an NdFe12Nx film. [39,40] Unfortunately, NdFe12N and its mother compound NdFe12 are thermally unstable: they cannot be synthesized as a bulk without substituting another element for part of the Fe atoms. Titanium is a typical stabilizing element. [41,42] The introduction of Ti, however, reduces the magnetization significantly. Co also has the potential to work as a stabilizing element according to a prediction by first-principles calculation. [43] Compared to Ti, Co is favorable in terms of magnetization. In fact, a recent experiment on a Sm(Fe0.8Co0.2)12 film showed that it has saturation magnetization and anisotropy field superior to those of Nd2Fe14B, the main phase of the current strongest magnet. [44] Chemical substitution at the R site also affects the structural stability. Zirconium has attracted attention as a stabilizing element at the rare-earth site. [45-48] A recent first-principles calculation predicted that Dy also works as a stabilizer. [49] Therefore, optimization of the chemical composition of RFe12-type compounds in terms of stability and magnetic properties is an important issue for the development of next-generation permanent magnets.
Bearing these in mind, we take RFe12-type magnet compounds as the target systems. Specifically, we optimize the chemical formula of (R1−αZα)(Fe1−βCoβ)12−γTiγ (R = Y, Nd, Sm; Z = Zr, Dy) so that it maximizes the magnetization or the Curie temperature, or minimizes the formation energy from the unary systems, in the benchmark. The problem is therefore a combination of optimization with respect to the composition (α, β and γ) and optimization with respect to the choice of the elements R and Z. We discuss the efficiency of the optimization by comparing the number of iterations required in Bayesian optimization with that in random sampling. We also study how the efficiency is affected by the choice of the descriptor.

III. METHODS
Figure 2 shows the workflow of our optimization framework. At the beginning of the scheme, the user prepares a list of candidate compounds. The candidates are expressed in the form of a descriptor. In this study, we prepare 11 types of descriptors for the (R1−αZα)(Fe1−βCoβ)12−γTiγ (R = Y, Nd, Sm; Z = Zr, Dy) systems, which we discuss in Section III A.
Then, the candidate list is passed to the optimizer. The role of the optimizer is to pick one system from the candidate list so that a system with a high score is found quickly in the whole scheme. Because it does not have enough data to perform Bayesian optimization at the beginning, it randomly chooses a system from the list. It receives feedback from the scorer later in the scheme and records it. When the record reaches a certain size, the sampling method is switched to Bayesian optimization. For the role of the optimizer, we use a Python module called "COMmon Bayesian Optimization library" (COMBO). [10,50] We present the parameters used in our benchmark in Section III A.
In the next stage, a quantum simulator calculates physical properties of the chosen system; this is the most time-consuming process in the scheme. The details of our simulation are described in Section III B.
Then, the scorer converts the calculated properties into a score. It also improves the estimated values by using the referential data before generating the score. In our application, we use the value of the magnetization, the Curie temperature or the formation energy as the score. As for the formation energy, we improve it by the method for integrating datasets presented in Section III C. The preparation of the referential data is described in Section III B. The score is fed back to the optimizer to increase the size of the dataset used in the Bayesian optimization.
The iteration loop is repeated until the number of iterations reaches a criterion; until then, the workflow goes back to the optimizer. After the loop has ended, the candidate with the best score found in the iterations is output.

A. Bayesian optimization
As mentioned above, we use COMBO [10,50] for the role of the optimizer in Fig. 2. We use Thompson sampling as a heuristic for the exploration-exploitation trade-off in the optimization. The dimension of the random feature maps, which determines the degree of approximation of the Gaussian kernel, is set to 5000. The first 10 samples are chosen randomly without Bayesian optimization. The number of iterations is set to 100, including the first 10 iterations with random sampling.
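Concretely, such a sampling loop can be driven through COMBO's discrete-search interface. The following is a minimal sketch assuming the interface of the publicly released COMBO package; the input files, the score array and the Simulator wrapper are illustrative stand-ins for the blocks in Fig. 2, not part of the original workflow code.

```python
import numpy as np
import combo

class Simulator:
    """Illustrative stand-in for the 'Simulator'/'Scorer' blocks in
    Fig. 2: maps the index of a chosen candidate to its score."""
    def __init__(self, scores):
        self.scores = scores            # one precomputed score per candidate

    def __call__(self, action):
        return self.scores[action]

# X: one descriptor vector per candidate system (3630 rows in our benchmark)
X = combo.misc.centering(np.load('descriptors.npy'))   # hypothetical file
policy = combo.search.discrete.policy(test_X=X)
policy.set_seed(0)

simulator = Simulator(np.load('scores.npy'))           # hypothetical file
# The first 10 samples are chosen at random, as described in the text.
policy.random_search(max_num_probes=10, simulator=simulator)
# The remaining 90 iterations use Bayesian optimization with Thompson
# sampling ('TS'); num_rand_basis=5000 sets the dimension of the random
# feature maps approximating the Gaussian kernel.
res = policy.bayes_search(max_num_probes=90, simulator=simulator,
                          score='TS', interval=5, num_rand_basis=5000)
```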
We use 11 different sets of descriptors, listed in Table I. The descriptors consist of the number of electrons per cell (N), the number of electrons from the R element per cell (N_R), the number of electrons from the Z element per cell (N_Z), N_R + N_Z (≡ N_2a), the number of electrons from the transition elements (namely Fe, Co and Ti) per cell (N_T), the atomic number of the R element (Z_R), the atomic number of the Z element (Z_Z), an index for the R element (n_R = 0, 1, 2 corresponding to R = Y, Nd, Sm), an index for the Z element (n_Z = 0, 1 corresponding to Z = Zr, Dy), the Z content per cell (α), the Co/(Fe+Co) ratio (β), the Ti content (γ), and the values of α_1, α_2, α_3 and α_4 when the chemical formula is expressed in a form that explicitly specifies the occupancies of the five candidate 2a-site elements (Y, Nd, Sm, Zr and Dy).
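To make the electron-count variables concrete, the sketch below computes N, N_R, N_Z, N_2a and N_T from a composition by weighting atomic numbers with occupancies. It counts electrons per formula unit (multiply by the number of formula units per cell to match a per-cell convention); the function name is ours.

```python
# Atomic numbers of the elements in the search space
Z_OF = {'Y': 39, 'Nd': 60, 'Sm': 62, 'Zr': 40, 'Dy': 66,
        'Fe': 26, 'Co': 27, 'Ti': 22}

def electron_counts(R, Z, alpha, beta, gamma):
    """Electron counts per formula unit of (R1-a Za)(Fe1-b Cob)12-g Tig."""
    n_R = (1.0 - alpha) * Z_OF[R]          # electrons from the R element
    n_Z = alpha * Z_OF[Z]                  # electrons from the Z element
    n_T = ((12.0 - gamma) * ((1.0 - beta) * Z_OF['Fe'] + beta * Z_OF['Co'])
           + gamma * Z_OF['Ti'])           # electrons from Fe, Co and Ti
    return {'N_R': n_R, 'N_Z': n_Z, 'N_2a': n_R + n_Z,
            'N_T': n_T, 'N': n_R + n_Z + n_T}

# Example: a Sm-based composition with Zr substitution
print(electron_counts('Sm', 'Zr', alpha=0.2, beta=0.1, gamma=1.0))
```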

B. First-principles calculation
In the "Simulator" block in Fig. 2, we perform first-principles calculation based on density functional theory with the local density approximation. [51,52] We use the open-core approximation [53-55] for the f electrons in Nd, Sm and Dy, and apply the self-interaction correction. [56] We assume the ThMn12 structure (Fig. 3) for the (R1−αZα)(Fe1−βCoβ)12−γTiγ systems. The lattice parameters are determined by linear interpolation from those for RFe12, RFe11Ti, ZFe12, ZFe11Ti and RCo12. These values for the stoichiometric systems were calculated with the PAW method [57,58] using the software package QMAS [59]. We use the PBE exchange-correlation functional [60] of the generalized gradient approximation (GGA) to obtain adequate structures. The optimized inner coordinates for DyFe11Ti and ZrFe11Ti are listed in Table IV. In the treatment with the coherent potential approximation (CPA) [63-65], we assume quenched randomness for the random occupation of the elements: the elements R and Z are assumed to occupy the 2a site (see Fig. 3 for the Wyckoff positions), titanium is assumed to occupy the 8i site, and iron (cobalt) is assumed to occupy the 8f, 8i and 8j sites with a common probability of 1 − β (β) at these sites.
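In code, these occupancy rules can be stated compactly. The sketch below reflects our reading of the assignment above; the site multiplicities are those of the ThMn12 Wyckoff positions per formula unit, and the helper name is ours.

```python
# Wyckoff-site multiplicities per formula unit in the ThMn12 structure
SITES = {'2a': 1, '8f': 4, '8i': 4, '8j': 4}

def cpa_occupations(alpha, beta, gamma):
    """Site occupancies under the quenched-randomness rules in the text:
    R/Z share the 2a site, Ti sits only on 8i, and Fe/Co fill the rest
    of 8f, 8i and 8j in the common ratio (1 - beta):beta."""
    occ = {'2a': {'R': 1.0 - alpha, 'Z': alpha}}
    for site in ('8f', '8i', '8j'):
        # gamma Ti atoms per formula unit, all on the four 8i positions
        ti = gamma / SITES['8i'] if site == '8i' else 0.0
        occ[site] = {'Ti': ti,
                     'Fe': (1.0 - ti) * (1.0 - beta),
                     'Co': (1.0 - ti) * beta}
    return occ
```

Summing the Fe and Co occupancies over the site multiplicities recovers 12 − γ transition-metal atoms per formula unit, as the chemical formula requires.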
We calculate the magnetization, the Curie temperature, and the formation energy from the unary systems. We use the raw value of the magnetization from KKR-CPA. To estimate the Curie temperature, we calculate the intersite magnetic couplings by using Liechtenstein's method, [66] and convert them to the Curie temperature using the mean-field approximation. [61] Although this procedure overestimates the Curie temperature, we can expect from previous results that it captures materials trends, because theoretical values within the mean-field approximation show a significant correlation with experimental Curie temperatures. [67] As for the formation energy, the KKR method requires too much computational effort to obtain an accurate energy difference between systems whose structures differ strongly from each other. We use the method described in the following subsection to correct the energy difference calculated by KKR-CPA with referential total-energy data obtained by PAW.
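For orientation, the standard single-sublattice mean-field relation between the calculated couplings and the Curie temperature reads

$$k_\mathrm{B} T_\mathrm{C} = \frac{2}{3} \sum_{j \neq 0} J_{0j},$$

where $J_{0j}$ is the exchange coupling between site 0 and site $j$ obtained by Liechtenstein's method; for multi-sublattice compounds such as the present ones, the sum is replaced by the largest eigenvalue of the inter-sublattice coupling matrix.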

C. A method for integrating datasets
Let us consider the formation energy from the unary systems defined as follows:

∆E = E[(R1−αZα)(Fe1−βCoβ)12−γTiγ] − E[(the unary systems)],  (1)

where "(the unary systems)" is defined by

E[(the unary systems)] = (1−α)E[R] + αE[Z] + (12−γ){(1−β)E[Fe] + βE[Co]} + γE[Ti],  (2)

and E[·] denotes the total energy of the system in the square bracket.
Because the structures of (R1−αZα)(Fe1−βCoβ)12−γTiγ and each of the unary systems are very different from one another, it is not efficient to calculate this formation energy directly with the KKR method, even though it can deal with nonstoichiometric systems through CPA. Our idea is to calculate the formation energies of stoichiometric systems more accurately by another method, and to use the calculated energies as reference data.
We construct a stochastic model for the total energy of (R1−αZα)(Fe1−βCoβ)12−γTiγ based on the expectation that the smaller the structural difference between two systems, the more accurate the energy difference given by KKR-CPA. To quantify the difference between systems, we consider a descriptor with which the difference between two systems (x and y) is well described by the distance |x − y|. Let us denote the reference systems in the form of the descriptor by

x^R_1, x^R_2, ..., x^R_M,

where M is the number of the reference systems. The descriptor here does not have to be identical to the descriptor used in the Bayesian optimization. In the demonstration, we use the set (α′, β′, γ′) ≡ (α, β(12−γ)/12, γ/12), with which the search space can be expressed as (R1−α′Zα′)(Fe1−β′−γ′Coβ′Tiγ′)12 irrespective of the choice of the descriptor in the optimization.
For each of the reference points x^R_i, we construct a stochastic model Ẽ_i[y] for the total energy of a system y (see also the graphs outside the box in Fig. 4):

Ẽ_i[y] − E[x^R_i] = E′[y] − E′[x^R_i] + ε_i.  (3)

The two E′'s on the right-hand side are total energies calculated with KKR-CPA, whereas E[x^R_i] on the left-hand side is evaluated by a more accurate method, for which we use PAW in the present work. ε_i is a random variable whose distribution is N(0, S_i²), i.e. the normal distribution whose mean is zero and whose variance is S_i², where

S_i² = σ² |y − x^R_i|²,

and σ² is a parameter we will estimate later. This model describes the expectation that the deviation of the energy difference (the first two terms on the right-hand side) from the true difference (the left-hand side) tends to be large when the difference, y − x^R_i, is large. The graphs outside the box in Fig. 4 depict how the Ẽ_i behave: there are three models corresponding to the three reference systems (x^R_1, x^R_2, x^R_3), and the error ε_i in each model is large when the distance of the reference point from y is large.

FIG. 4. A schematic diagram of the procedure for constructing an integrated model. The three graphs outside the box denote the models defined by Eq. (3). They are integrated into the model described by Eq. (10).
We then integrate these models, Ẽ_1, ..., Ẽ_M, into a single model Ẽ by imposing the following condition on the distribution of ε_i:

Ẽ_1[y] = Ẽ_2[y] = ... = Ẽ_M[y] ≡ Ẽ[y].  (4)

This condition can be rewritten as follows:

ε_i = Ẽ[y] − E[x^R_i] − (E′[y] − E′[x^R_i]).  (5)

Therefore, the conditional probability distribution of Ẽ is

P(Ẽ[y] = E) ∝ ∏_{i=1}^{M} P_i(E − E[x^R_i] − E′[y] + E′[x^R_i]),  (6)

where P_i denotes the probability distribution function of N(0, S_i²). It is straightforward to see that this conditional distribution is the normal distribution whose mean μ̃ and variance S̃² are

μ̃ = Ω⁻¹ Σ_{i=1}^{M} ω_i (E′[y] − E′[x^R_i] + E[x^R_i]),  (7)

S̃² = Ω⁻¹,  (8)

where ω_i is a weight and Ω is a normalization factor that are defined as

ω_i = S_i⁻², Ω = Σ_{i=1}^{M} ω_i.  (9)

In another expression, our integrated model is

Ẽ[y] = μ̃ + ε̃,  (10)

where ε̃ is a random variable whose distribution is N(0, S̃²). Algorithm 1 summarizes the construction of the model explained so far in the form of a pseudocode. Note that the value of the input variable σ² has not yet been determined. The characteristics of Ẽ[y] are illustrated in the right-bottom panel of Fig. 4. Although Ẽ[y] is singular at y = x^R_i, it is easy to see that this singularity is removable and lim_{y → x^R_i} Ẽ[y] = E[x^R_i].

Algorithm 1. A pseudocode of the algorithm for determination of μ̃ in Eq. (7) and S̃² in Eq. (8). This needs σ² and ∆_i as inputs, which are calculated by the algorithm shown in Alg. 2.
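As a concrete rendering of Algorithm 1, the following NumPy sketch computes μ̃ and S̃² under the reconstruction of Eqs. (7)-(9) above; all function and variable names are ours.

```python
import numpy as np

def integrated_model(y, x_ref, delta, E_prime, sigma2):
    """Mean and variance of the integrated model E~[y] (Eqs. (7)-(8)).

    y       : descriptor vector of the target system
    x_ref   : (M, d) array of reference descriptors x^R_i
    delta   : (M,) array of offsets Delta_i = E[x^R_i] - E'[x^R_i]
    E_prime : KKR-CPA total energy E'[y] of the target system
    sigma2  : variance parameter estimated by Algorithm 2
    """
    d2 = np.sum((x_ref - y) ** 2, axis=1)       # |y - x^R_i|^2
    exact = d2 < 1e-12                          # removable singularity:
    if exact.any():                             # the model is exact at a
        i = int(np.argmin(d2))                  # reference point
        return E_prime + delta[i], 0.0
    w = 1.0 / (sigma2 * d2)                     # precisions omega_i = 1/S_i^2
    Omega = w.sum()
    mu = E_prime + np.dot(w, delta) / Omega     # precision-weighted mean
    return mu, 1.0 / Omega                      # variance S~^2 = 1/Omega
```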
The formation energy from the unary systems can then be calculated as ∆E = Ẽ[y] − E[(the unary systems)], where E[(the unary systems)] is calculated by an accurate method.
To complete the formulation, we discuss the estimation of σ² based on the data for the reference systems. We use maximum likelihood estimation. However, it cannot be directly applied to our model because the distribution of ε̃ becomes the delta function in the limit y → x^R_i, regardless of the value of σ².

FIG. 5. A schematic plot of Ẽ_LOO,i for i = 1 and M = 3.

To avoid this singularity,
we consider a model Ẽ_LOO,i that is constructed from all the reference systems but x^R_i. Figure 5 depicts the construction of such a model. Now, we consider the probability of Ẽ_LOO,i[y] at y = x^R_i. Regarding the probability for Ẽ_LOO,i[x^R_i] = E[x^R_i] as a likelihood L_i, we select the value of σ² that maximizes L = ∏_i L_i. We then obtain

σ² = (1/M) Σ_{i=1}^{M} W_i (∆_i − W_i⁻¹ Σ_{j≠i} w_ij ∆_j)²,  (11)

where

∆_i ≡ E[x^R_i] − E′[x^R_i],  w_ij ≡ |x^R_i − x^R_j|⁻²,  W_i ≡ Σ_{j≠i} w_ij.  (12)

The determination of σ² is summarized in the form of a pseudocode in Alg. 2. The output values of σ² and ∆_i correspond to those in the algorithm described in Alg. 1.
Algorithm 2. A pseudocode of the algorithm for the determination of σ² in Eq. (8). The energy difference ∆_i is also output, to be used in the algorithm described in Alg. 1.
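Correspondingly, a minimal sketch of Algorithm 2 under the leave-one-out estimate of Eqs. (11) and (12); again the names are ours, and the linear calibration of Appendix B is omitted for brevity.

```python
import numpy as np

def estimate_sigma2(x_ref, delta):
    """Maximum-likelihood estimate of sigma^2 from the reference data
    (Algorithm 2), using the leave-one-out models E~_LOO,i."""
    M = len(delta)
    sigma2 = 0.0
    for i in range(M):
        d2 = np.sum((x_ref - x_ref[i]) ** 2, axis=1)
        d2[i] = np.inf                           # leave reference i out
        u = 1.0 / d2                             # sigma^2-free precisions w_ij
        U = u.sum()                              # W_i
        resid = delta[i] - np.dot(u, delta) / U  # E[x^R_i] - mu_LOO,i
        sigma2 += U * resid ** 2 / M             # Eq. (11), term by term
    return sigma2
```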
In the actual application, we calculate E′[x^R_i] with the KKR-LDA method and E[x^R_i] with the PAW-GGA method. We need to calibrate E[x^R_i] with a linear term because the difference in the treatment of the core electrons is another major source of error in the total energy, and it is described well by a linear function. We deal with it by extending our model to include an adjustable linear term, and use the extended model in the actual calculation. We discuss this extension of the model in Appendix B.

IV. RESULTS

A. Integration method
We show how the integration model explained in Section III C works before presenting the benchmark of the whole scheme in the next subsection. Let us take a simple example first: we consider a one-dimensional function E[x] = sin(πx) as the true model. Assume that we have many data for E′[x], an approximate function that actually obeys E′[x] = sin(πx) + 2x. Because their derivatives differ from each other by 2, the difference between the approximate energy difference E′[x] − E′[x^R_i] and the true one E[x] − E[x^R_i] grows in proportion to the distance from the reference point. The assumption of the model described in Section III C was that the error in E′[x] is large when |x − x^R_i| is large. This is correct in the magnitude of the errors, but there is a bias in their sign. In this example, we prepare a dataset of E′[x] for x = −0.5 to 0.5 with an interval of 0.02. We choose five reference points x^R_i (i = 1, ..., 5) from the x values, and prepare a dataset of E[x^R_i]. The result of the prediction is shown in Fig. 6. The original points of E′[x] are shown as points in cyan. The purple points show the mean value of the model; the light purple region shows the values of the standard deviation. The green curve shows E[x], the true model, and the green circles are the reference points used in the prediction.
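This one-dimensional demonstration can be reproduced with the two sketches above (integrated_model and estimate_sigma2); since the text does not list the five reference points, the choice below is an assumption.

```python
import numpy as np
# uses integrated_model and estimate_sigma2 from the sketches above

E_true = lambda t: np.sin(np.pi * t)               # E[x], the true model
E_approx = lambda t: np.sin(np.pi * t) + 2.0 * t   # E'[x], biased data

x_grid = np.arange(-0.5, 0.5 + 1e-9, 0.02).reshape(-1, 1)    # 51 points
x_ref = np.array([[-0.5], [-0.26], [0.0], [0.26], [0.5]])    # assumed choice
delta = (E_true(x_ref) - E_approx(x_ref)).ravel()            # Delta_i

sigma2 = estimate_sigma2(x_ref, delta)
pred = [integrated_model(x, x_ref, delta, E_approx(x).item(), sigma2)
        for x in x_grid]
mu = np.array([m for m, _ in pred])     # purple points in Fig. 6
sd = np.sqrt([v for _, v in pred])      # light purple band in Fig. 6
```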
The mean values, which may be used as a point estimate, are improved from the original data in most of the region. The error region also covers the line of the true model except in the region x > 0.26.
It is noteworthy that the bias in the sign of the error mentioned above makes the uncertainty of the model large. This would be improved if we assumed an asymmetric distribution for ε_i in Eq. (3); however, the resultant form of the integrated model would become more complicated.

Next, let us see the results for the formation energy obtained by this method. Figure 7 shows the formation energy of (Sm1−αZrα)(Fe1−βCoβ)12−γTiγ from the unary systems for γ = 0 (left) and γ = 0.5 (right). The mean values are shown in the top two panels, and the standard deviations are shown in the bottom two panels. In each panel, the horizontal axis shows the Co/(Fe+Co) ratio (β), and the vertical axis shows the Zr concentration (α). The standard deviation of the model is zero at the corners of the γ = 0 panel (left bottom) because they correspond to the reference points, namely SmFe12, ZrFe12, SmCo12 and ZrCo12. Therefore, the mean values at the corners of the left top panel are identical with the reference data (obtained by PAW-GGA). Because the other reference points are SmFe11Ti and ZrFe11Ti (both at β = 0, γ = 1), the contours are placed slightly to the right in the γ = 0 panels. This is more obvious in the plot for γ = 0.5 (the right bottom panel).
Although we use the mean values (μ̃) in the Bayesian optimization in Section IV B, it is important to check how much uncertainty the model has in its prediction, which is typically represented by the standard deviation S̃: the model needs more reference points when S̃ is too large compared with the value of μ̃. One can also make use of the upper (lower) confidence bound μ̃ + kS̃ (μ̃ − kS̃), where k is a positive adjustable parameter, instead of μ̃ to take account of the uncertainty in a pessimistic (optimistic) manner.

B. Bayesian Optimization
In this subsection, we show the performance of the Bayesian optimization and discuss the results. Figure 8 shows one of the optimization processes with respect to magnetization using the descriptor #8. The highest magnetization found in the first i iterations, max_{j≤i} µ0M(j), is plotted against the number of iterations, i. In this run, the highest µ0M among all the 3630 systems was found at the 30th iteration, where we define the zeroth iteration as that with the first sample. This process depends on the random sequence used in the sampling from the candidate list. In order to take statistical profiles, we repeat the optimization scheme (which we call a session) 1,000 times.

To analyze the efficiency as a function of the number of iterations, we consider a cumulative distribution D_i(s) that is defined as the number of sessions in which a system with a score higher than s is found in the first i iterations. We show a plot of D_i(s) in the Bayesian optimization of magnetization using the descriptor #8 in the left panel of Fig. 9, where the horizontal axis shows the number of iterations, i, and the vertical axis shows the score variable, s. We also show a plot of the cumulative probability P_i(s) in the random sampling, obtained analytically, in the right panel. Because we took a sufficient number of sessions, there is negligible difference between the two panels for the first 10 iterations. The efficiency in the left panel suddenly improves when the Bayesian optimization is switched on.
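For completeness, D_i(s) and the 95% contours used later (Fig. 11) can be computed directly from the per-session records; the array layout below is our assumption.

```python
import numpy as np

def cumulative_distribution(best_scores, s):
    """D_i(s): for each iteration i, the number of sessions whose best
    score within the first i iterations exceeds s.

    best_scores : (n_sessions, n_iters) array; entry [k, i] is the best
                  score found in session k up to and including iteration i
    """
    return (best_scores > s).sum(axis=0)

def contour95(best_scores):
    """Score exceeded by 95% of the sessions at each iteration, i.e.
    the 95% contour of D_i(s)."""
    return np.quantile(best_scores, 0.05, axis=0)
```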
Figure 10 shows the success rate of finding the systems with the top 10 values of the target properties (magnetization (µ0M), Curie temperature (T_C) and formation energy from the unary systems (∆E)) within 50 steps. The results with Bayesian optimization (BO) are compared with the search by random sampling (RS). The numbers with # in the figure denote the descriptors listed in Table I. We find that the efficiency depends significantly on the choice of the descriptor. It is obvious from the figure that the descriptors #7-11 are very efficient in Bayesian optimization and far superior to #1-6 and to random sampling. This clearly shows that β and γ are important factors, because the descriptors #7-11 differ from #2-6 only in that β and γ are used instead of N_T.
This example demonstrates how we can incorporate domain knowledge into machine learning. It is known that the magnetization of ferromagnetic random alloys of Fe and another transition metal is usually a well-behaved function of the number of valence electrons, known as the Slater-Pauling curve. This curve is reproduced well by first-principles calculation with CPA [68], and so is the Slater-Pauling-like curve for the Curie temperature. [69] These effects have also been observed experimentally in ThMn12-type compounds, [27,44] and explained theoretically. [38,43,49,61,70] On the basis of these previous studies, we could expect in advance that including β [the Co/(Fe+Co) ratio] and γ (the Ti content) in the descriptor would improve the efficiency of the search.
However, we also find that β and γ alone do not work as an efficient descriptor. The results of the Bayesian optimization using β, γ, and the pair of them as descriptors are also shown in Fig. 10. Their success rates are significantly lower than the rates with the descriptors #7-11. This is because there are 66 [= 3 (for R) × 2 (for Z) × 11 (for α)] candidates that have common values of β and γ on the list, and 50 steps are not enough to obtain an adequate Bayesian model and draw one of the top 10 systems by chance. It is noteworthy that the efficiency of the Bayesian optimization is largely improved by adding only N_2a, as in the descriptor #7.
Figure 11 shows the 95% (950 sessions) contour of D_i(s) obtained with the descriptors #1-11. The best 95% of the data points lie above those curves. The panels show results with the random search and with the Bayesian optimization using the descriptors #1-11. As shown in these graphs, the efficiency of the search depends strongly on the target property to optimize. The descriptors #7-11, which include β and γ, have satisfactory efficiency, with which even 20 iterations are enough to obtain a nearly best score regardless of the target property. The situation is quite different with the descriptors #1-6. In the optimization of the Curie temperature, these descriptors work efficiently. However, the optimizations of the magnetization and the formation energy progress even more slowly than random sampling.
This difference between the Curie temperature and the other targets is also seen in Fig. 12, where the pair of β and γ is used as the descriptor. The pair descriptor works as efficiently as the descriptors #7-11 in the optimization of the magnetization and the formation energy. However, a discernible difference in efficiency exists in the optimization of the Curie temperature. This suggests the importance of information about the elements at the 2a site (R and Z), which is consistent with Dam et al.'s observation that the concentration of rare-earth elements is important in explaining the Curie temperature of binary alloys composed of a rare-earth element and a 3d transition metal. [6]

The dependence of the search efficiency on the dimension of the descriptor is also noteworthy. When the dimension of a descriptor is large, the descriptor can accommodate a large search space on the one hand; on the other hand, modeling tends to be difficult in a higher-dimensional space, which is referred to as the "curse of dimensionality". Note that we have descriptors with 4 different dimensions in the groups of the descriptors #1-6 and #7-11. Figure 11 shows that the dimension has only a minor effect on the efficiency. This dependence is magnified by the more stringent criterion of the top-10 benchmark, as shown in Fig. 10, especially in the results with the descriptors #1-6. However, the differences among the descriptors #7-11 are still subtle. Therefore, we expect that it would be safe to include six or a few more variables in the descriptor when optimizing for another target property.

V. CONCLUSION
In this paper, we presented a machine-learning scheme for searching for high-performance magnetic compounds. Our scheme is based on Bayesian optimization, and has a much higher efficiency than random sampling. We demonstrated its efficiency by taking the example of optimizing the magnetization, the Curie temperature and the formation energy over a search space of magnet compounds having the ThMn12 structure. One typical result is the success rate of finding one of the top 10 systems when 50 systems are sampled from a candidate list of 3630 systems (Fig. 10). The success rate is more than 90% with our scheme when the descriptor is appropriately chosen, while it is approximately 10% with random sampling.
The efficiency is maximized when we include the Co content (β), the Ti content (γ), and information about the R and Z elements (e.g. N_2a) in the descriptor. This improvement is what we could expect from previous studies of magnet compounds. We stress that it is important to incorporate domain knowledge into the choice of a descriptor. We also discussed how many variables a descriptor can accommodate without deteriorating the search efficiency. Although an excessive addition of variables to the descriptor can reduce the efficiency of the search, there was no significant loss when we doubled the dimension of the optimal descriptor.
We also proposed an integration scheme for two datasets that improves the accuracy of an inexpensive large dataset by using an accurate small dataset (the reference dataset). The algorithm (Algs. 1 and 2) is easy to implement and fast. Prediction with a confidence bound (or the standard deviation S̃) is another feature of the scheme. We have also shown how it works in the calculation of the formation energy (Fig. 7).

FIG. 11. The 95% contours of the cumulative distribution D_i(s) in the random sampling (gray dashed) and in the Bayesian optimizations with the descriptor #1 (black), the descriptors #2-6 (green) and the descriptors #7-11 (blue). The horizontal axis shows i; the vertical axis shows s. The top, middle and bottom panels show results in the optimization of the magnetization, the Curie temperature within the mean-field approximation, and the formation energy from the unary systems, respectively.
FIG. 12. The 95% contours of the cumulative distribution D_i(s) in the Bayesian optimizations with the descriptor β (cyan), the descriptor γ (purple) and the descriptor (β, γ) (orange). Those in the random sampling and in the Bayesian optimization with the other descriptors are also shown as gray lines. The horizontal axis shows i; the vertical axis shows s. The top, middle and bottom panels show results in the optimization of the magnetization, the Curie temperature within the mean-field approximation, and the formation energy from the unary systems, respectively.

ACKNOWLEDGMENTS

The computation in this work was partly conducted using the facilities of the Supercomputer Center, the Institute for Solid State Physics, the University of Tokyo, and the supercomputer of ACCMS, Kyoto University. This research also used computational resources of the K computer provided by the RIKEN Advanced Institute for Computational Science through the HPCI System Research project (Project ID: hp170100).

Appendix B: Extension of the model with a linear term

As noted in Section III C, the difference between E′ (KKR-LDA) and E (PAW-GGA) arising from the treatment of the core electrons is described well by a linear function, a·x + b. Because only energy differences enter the model, the constant b cancels, and Eq. (3) becomes

Ẽ_i[y] − E[x^R_i] = E′[y] − E′[x^R_i] + a·(y − x^R_i) + ε_i.  (B1)

Therefore, our model does not depend on the variable b. Equation (7) is modified as follows:

μ̃ = Ω⁻¹ Σ_{i=1}^{M} ω_i (E′[y] − E′[x^R_i] + a·(y − x^R_i) + E[x^R_i]),  (B2)

while Eq. (8) is left unchanged.
The estimation by the maximum likelihood method with Ẽ_LOO,i described in Section III C is applicable to the variables a and σ². The equation for a is

(Σ_{i=1}^{M} W_i ξ_i ⊗ ξ_i) a = Σ_{i=1}^{M} W_i (∆_i − W_i⁻¹ Σ_{j≠i} w_ij ∆_j) ξ_i,  (B3)

where ξ_i is defined as

ξ_i ≡ x^R_i − W_i⁻¹ Σ_{j≠i} w_ij x^R_j,  (B4)

and the symbol ⊗ denotes the dyadic product, with which (α_1, α_2, α_3) ⊗ (β_1, β_2, β_3) is defined as the matrix whose (i, j) component is α_i β_j. This equation can be solved for a without determining σ².
The equation for σ² is modified as

σ² = (1/M) Σ_{i=1}^{M} W_i (∆_i − W_i⁻¹ Σ_{j≠i} w_ij ∆_j − a·ξ_i)².  (B5)
Appendix C: Dimensions for R and Z

We prepared dimensions to include information about the R and Z elements in our design of the descriptors in Table I. Because there are only 6 combinations of these elements (R = Y, Nd, Sm; Z = Zr, Dy), while it is known that high dimensionality often causes problems in modeling, readers may doubt their necessity. We compare the efficiency with that of a search in the smaller space in which those elements are fixed to R = Y and Z = Zr in Fig. 13. The solid line along the cross symbols denotes the frequency of finding the system with the best score out of 1000 sessions in the restricted space when we use the set of α, β and γ as the descriptor. We also show the curve elongated 6 times along the x axis (the dotted line), because one also has to optimize for the other combinations of R and Z to obtain the optimal system for the full space, which takes approximately 6 times as long. For comparison, we show the result of the optimization for the full space using the descriptor #9 by the line along the circle points. We set the number of iterations before Bayesian optimization to 5, which is smaller than the number used above (10), because we know that the full-space search is so fast that the search in the smaller space cannot catch up if it starts 60 (= 10 × 6) iterations behind (Figs. 10 and 11). We see from Fig. 13 that the search in the full space is more efficient, even with this setup, than repeating the search in the small space 6 times. Therefore, the dimensions for R and Z actually contribute to the search efficiency.

TABLE IV. Inner coordinates (x, y, z) of Fe and Ti in DyFe11Ti and ZrFe11Ti, where Dy and Zr are placed at (0, 0, 0). The position of Ti is indicated by parentheses. These values correspond to the point (ax, by, cz) in Cartesian coordinates, where a = 8.455, b = 8.262 and c = 4.715 in ZrFe11Ti, and a = 8.518, b = 8.410 and c = 4.727 in DyFe11Ti.

              Fe(8f)                  Fe(8i)                   Fe(8j)
              x      y      z         x       y       z        x       y      z
DyFe11Ti      0.255  0.251  0.251    (0.375   0.000   0.000)   0.272   0.500  0.000
              0.255  0.749  0.749    −0.348   0.000   0.000   −0.266   0.500  0.000
              0.755  0.249  0.751     0.005   0.355   0.000    0.506   0.286  0.000
              0.755  0.751  0.249     0.005  −0.355   0.000    0.506  −0.286  0.000
ZrFe11Ti      0.256  0.250  0.251    (0.381   0.000   0.000)   0.276   0.500  0.000
              0.256  0.750  0.750    −0.342   0.000   0.000   −0.264   0.500  0.000
              0.756  0.250  0.751     0.007   0.352   0.000    0.507   0.300  0.000
              0.756  0.750  0.250     0.007  −0.352   0.000    0.507  −0.300  0.000

FIG. 3. The 2a, 8j, 8i and 8f Wyckoff positions in the ThMn12 structure. Some bonds are shown as guides to the eye for the three-dimensional structure.

FIG. 6. Results of the integration scheme when E[x] = sin(πx) (green line) and E′[x] = sin(πx) + 2x (cyan points). The green circles denote the reference points, E[x^R_i]. The mean value of the improved model, Ẽ[x], is denoted by the purple points. The standard deviation is denoted by the region in light purple.

FIG. 7. The mean values (top) and the standard deviations (bottom) of the integrated model for Sm(Fe,Co)12 (left) and Sm(Fe,Co)11.5Ti0.5 (right). The intervals of the contours are 0.2 eV for the mean values and 0.02 eV for the standard deviations.

FIG. 8. The best magnetization found in the loop of a session as a function of the number of iterations. The first result in a session corresponds to the point at the 0th iteration.

FIG. 9. (Left) The cumulative distribution of frequency D_i(s) in the optimization of magnetization using the Bayesian optimization with the descriptor #8. (Right) The cumulative probability P_i(s), calculated analytically, for the optimization of magnetization with the random sampling.

FIG. 13. Frequency of finding the system with the highest magnetization (top), the highest Curie temperature (middle), and the lowest formation energy (bottom) among 1000 sessions. Those in the search in the small space, where R and Z are fixed to R = Y and Z = Zr, are denoted by the solid line along the cross symbols. The dotted line along the cross symbols shows the same curve elongated 6 times along the x axis. The line along the circle symbols shows the result of the search in the full space using the descriptor #9 in Table I.

TABLE I. Descriptors used in the Bayesian optimization. See the text for descriptions of the variables N, N_R, N_Z, N_2a, N_T, Z_R, Z_Z, n_R, n_Z, α, β, γ, α_1, α_2, α_3 and α_4.