Active Learning Algorithm for Computational Physics

In large-scale computations of physics problems, one often encounters the problem of determining a multi-dimensional function, which can be time-consuming when computing each point in this multi-dimensional space is itself time-demanding. In this work, we propose that an active learning algorithm can speed up such calculations. The basic idea is to fit the multi-dimensional function with neural networks, and the key point is to make the query of labelled data economical by using a stratagem called "query by committee". We present the general protocol of this fitting scheme, as well as the procedure for further computing physical observables from the fitted functions. We show that this method works well with two examples: the quantum three-body problem in atomic physics and the anomalous Hall conductivity in condensed matter physics. In these examples, we show that one reaches an accuracy of a few percent error for the computed physical observables with less than $10\%$ of the data points required by uniform sampling. With these two examples, we also visualize that, with the active learning algorithm, the queried data are added mostly in the regime where the function varies most rapidly, which explains the efficiency of the algorithm. We expect broad applications of our method to various kinds of computational physics problems.

Neural network (NN) based supervised learning methods have nowadays found broad applications in studying the quantum physics of condensed matter materials and atomic, molecular and optical systems [1,2]. On the theoretical side, applications include, for example, finding orders and topological invariants in quantum phases [3-9], generating variational wave functions for quantum many-body states [10-14] and speeding up quantum Monte Carlo sampling [15,16]. On the experimental side, these methods can help both optimize experimental protocols [17,18] and analyze experimental data [19-21]. Usually the supervised learning scheme requires a huge set of labelled data. However, in many physics applications, labelling data can be quite expensive; for instance, performing computations or experiments repeatedly can be time- and resource-demanding. Therefore, in many cases labelled data are not abundant, a challenge that has prevented many applications.
Active learning is a scheme to address this problem [22]. It starts by training a NN with a small initial dataset, then actively queries labelled data based on the predictions of the NN and iteratively improves the performance of the NN until the goal of the task is reached. With this approach, sampling the large parameter space can be done more efficiently, and the request for labelled data is usually much more economical than in normal supervised learning methods. Recently, a few works have applied the active learning algorithm to determine inter-atomic potentials in quantum materials [23-25] and to optimize control in quantum experiments [26,27], where labelled data have to be obtained either by ab initio calculation or by repeating experiments, both of which are time-consuming.

* Electronic address: hzhai@tsinghua.edu.cn
In this work we focus on a class of general and common tasks in computational physics: numerically determining a multi-dimensional function, say F(α_1, α_2, ..., α_n), where the α_i are parameters. Considering a uniform discretization, if we discretize each parameter into L points, there are in total L^n data points that need to be calculated. In many cases, the calculation of each point already takes quite some time, and thus the total computational cost is massive. Nevertheless, for most functions there are regimes where the function varies smoothly and regimes where it varies rapidly. Ideally, one should sample more points in the steep regimes and fewer points in the smooth regimes in order to obtain a good fit of the entire parameter space efficiently. However, this seems to be a paradox, because one does not know in which regimes the function varies more rapidly prior to computing the function.
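To get a sense of scale, the cost of a uniform scan grows exponentially with the number of parameters; a short illustration (the value L = 100 is taken from the examples later in the text):

```python
# Cost of uniformly discretizing an n-dimensional parameter space:
# each of the n parameters is sampled at L points, giving L**n
# expensive evaluations of F.
L = 100
for n in (2, 3, 4):
    total_points = L ** n
    # already at n = 2 this is 10^4 evaluations; at n = 4 it is 10^8
    print(n, total_points)
```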
Here we show that this goal can actually be achieved by using the active learning algorithm with the "query by committee" stratagem [28] to add data points iteratively. Below we first introduce the general protocol and then demonstrate the algorithm on two concrete problems. One is the quantum three-body problem and the other is the anomalous Hall conductivity problem. These are representative examples from atomic and condensed matter physics, respectively. Using these two examples, we illustrate how active learning and "query by committee" guide the addition of data to the steep regimes of the fitted function. We also discuss how to compute physical observables from the fitted functions.
FIG. 1: Active learning protocol for fitting a multi-dimensional function. Here "NN" represents "neural network".

II. GENERAL PROTOCOL.
The main procedure is summarized in Fig. 1 and explained as follows: 1) We start with an initial dataset of S_0 ≪ L^n points, whose values of F have been computed exactly. We use this dataset to train a number of different NNs.
2) We ask all NNs to predict F on all L^n data points, and for each point we compute the variance among the predictions made by the different NNs.
3) We select the S_t points with the largest variance, again with S_t ≪ L^n, and query the accurate value of F at these points by numerical calculation. 4) We add the S_t new data points to the training set, train all NNs again, and then repeat from step 2). We repeat this procedure for m epochs until the results from all NNs converge. 5) We use the NNs to calculate the value of F on all L^n data points.
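Steps 1)-5) can be sketched in code as follows. This is an illustrative implementation only, not the authors' code: `compute_F` is a cheap toy stand-in for the expensive calculation, and a small scikit-learn committee of differently shaped regressors stands in for the NNs described later.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def compute_F(X):
    # Toy stand-in for the expensive calculation of F: a sharp feature
    # near the origin and a smooth tail elsewhere.
    return np.exp(-20.0 * (X[:, 0] ** 2 + X[:, 1] ** 2))

# All L^n candidate points on a uniform grid (L = 50, n = 2 here).
L = 50
g = np.linspace(0.0, 1.0, L)
X_all = np.array([(x, y) for x in g for y in g])

# A committee of structurally different NNs ("query by committee").
committee = [
    MLPRegressor(hidden_layer_sizes=h, activation='tanh',
                 max_iter=500, random_state=s)
    for s, h in enumerate([(20, 20), (30, 20), (30, 20, 6)])
]

# Step 1: small initial dataset with S_0 << L^n exactly labelled points.
idx = list(rng.choice(L * L, size=100, replace=False))
S_t, m_epochs = 10, 5
for _ in range(m_epochs):
    X_train, y_train = X_all[idx], compute_F(X_all[idx])
    for nn in committee:                      # retrain the whole committee
        nn.fit(X_train, y_train)
    # Steps 2-3: predict everywhere, query where the committee disagrees most.
    preds = np.stack([nn.predict(X_all) for nn in committee])
    variance = preds.var(axis=0)
    variance[idx] = -1.0                      # never re-query known points
    idx += list(np.argsort(variance)[-S_t:])  # step 4: enlarge training set

# Step 5: the trained committee now provides F on all L^n points.
F_fit = np.stack([nn.predict(X_all) for nn in committee]).mean(axis=0)
```

After m epochs the training set holds only S_0 + m S_t = 150 of the 2500 grid points, yet the fit is refined exactly where the committee was uncertain.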
In this protocol, one only needs to query the value of F at S_0 + mS_t points, and we should keep S_0 + mS_t ≪ L^n. The trade-off is that we need to train a number of different NNs and keep using all of them to make predictions on all data points. This saves computational cost whenever training the NNs and making predictions with them is much cheaper than computing F, which is often the case in many applications.
In this protocol, the key idea is "query by committee" [28]. Here the committee is made of different NNs, which contain different numbers of layers, different numbers of nodes in each layer, and different activation functions. Thanks to the great expressive power of NNs, we do not need to assume a specific form for the fitted function.

FIG. 2: Schematic of the fully connected neural network used in these two examples.

In the regime where the function varies smoothly, different NNs quickly reach a consensus and the variance is small. In the regime where the function varies rapidly, on the other hand, it is hard for the different NNs to converge and the variance is large. Hence, we can use this variance to guide the sampling of data points. In fact, as we will see in the examples below, the data points added in later epochs all lie in the regime where the function changes rapidly.
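This intuition can be checked with a toy committee. In the sketch below, polynomials of different degree stand in for the different NNs (an illustration of the disagreement-tracks-steepness idea only, not the scheme used in the paper):

```python
import numpy as np

# A 1D function with a steep step near x = 0.5 and flat regions elsewhere.
f = lambda x: np.tanh(20.0 * (x - 0.5))

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(0.0, 1.0, 15))
y_train = f(x_train)

# "Committee": the same sparse data fitted by models of different capacity.
committee = [np.polynomial.Polynomial.fit(x_train, y_train, deg)
             for deg in (5, 7, 9, 11)]

x_all = np.linspace(0.0, 1.0, 201)
preds = np.stack([p(x_all) for p in committee])
variance = preds.var(axis=0)
# The disagreement is typically concentrated around the steep step at
# x = 0.5, so that is where new data would be queried.
```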

III. THREE-BODY PROBLEM
In the first example, we consider a quantum three-boson problem. These bosons interact with a two-body pairwise potential described by an s-wave scattering length a_s and a high-energy cut-off Λ. When a_s is positive, there exists a two-body bound dimer state, and it is important to compute the atom-dimer scattering length a_ad by solving the three-body problem. Here a key quantity is the scattering kernel U(k, k′). Focusing on s-wave scattering only, U depends only on the momentum amplitudes k = |k| and k′ = |k′|, and we can simplify U as U(k, k′). Once we know U(k, k′), we can calculate a_ad through the Skorniakov–Ter-Martirosian (STM) equation [29]. The details of how to compute U(k, k′) and how to obtain a_ad from it are summarized in Appendix A.
In the conventional method, we uniformly discretize both k and k′ into L = 100 points between zero and Λ, and we need to compute U at all 10^4 data points. We then solve the STM equation with these U(k, k′) to obtain a_ad^exact, and we have checked that such a discretization ensures convergence; this is referred to as the exact result and is shown by the dashed line in Fig. 3. Here we follow the general procedure described above to fit U(k, k′) using NNs. We will show that i) at most 300 data points are needed to obtain a reliable fit; ii) most of the queried data occupy the parameter regime where the function U(k, k′) is steep; and iii) when we use the trained NNs to generate U(k, k′) on all data points and solve the STM equation, the error is only a few percent compared with the exact result.
To be more concrete, we start with an initial dataset of uniformly sampled S_0 = 100 points. Here we design five different fully connected NNs, whose structures are schematically shown in Fig. 2. The input of every NN is the two numbers k and k′, and the output is U. Each layer of a NN is characterized by its number of nodes N_α and an activation function f_α, and we describe each NN by (N_1, f_1; N_2, f_2; ...). The five NNs used in this work are

1: (20, tanh; 20, tanh; 6, LS)
2: (20, tanh; 20, LS; 6, tanh)
3: (30, tanh; 20, tanh; 6, LS)
4: (30, tanh; 20, LS; 4, tanh)
5: (30, tanh; 20, tanh; 10, LS; 4, tanh),   (1)

where LS denotes the logistic sigmoid activation function f(a) = 1/(1 + e^{-a}). It is important to keep these NNs different, but their detailed structures are not important for the final results. For each NN, the training results also depend on the initialization, particularly so when the number of data points is not sufficient. Therefore, for each NN, we also consider 20 different initializations.
Therefore, at each iteration, we have in total 5 × 20 trained NNs to predict U on all {k, k′} points in the set M. For each point i ≡ {k, k′}, we compute the variance σ_i over all 100 predictions. We select the S_t = 10 points with the largest variance, compute U at these points, and add them to the training set. At each iteration, we also compute the mean variance σ_mean = (1/L^2) Σ_i σ_i over all points in M. When the mean variance σ_mean is small enough, we start to evaluate the physical quantity a_ad.
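With the committee predictions collected into an array of shape (members, points), the per-point variance, the mean-variance stopping criterion, and the queried points take only a few lines (the numbers here are illustrative, not from the actual calculation):

```python
import numpy as np

# Rows: committee members; columns: candidate points {k, k'}.
preds = np.array([[1.0, 2.0, 5.0],
                  [1.1, 2.2, 3.0],
                  [0.9, 1.8, 7.0]])

sigma = preds.var(axis=0)          # disagreement sigma_i at each point
sigma_mean = sigma.mean()          # stop refining once this is small enough

S_t = 1
query_idx = np.argsort(sigma)[-S_t:]   # points with the largest variance
# here query_idx picks the third point, where the committee disagrees most
```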
Here comes another important consideration of our method. On the one hand, we have utilized the discrepancy between NNs to guide the query process more efficiently; on the other hand, when we start to compute observables, we need to properly average out these variances between different NNs to obtain a converged result. Hence, we compute a_ad as follows: 1) For each NN, since there are 20 copies with different initializations, we first average the predictions over these 20 copies to obtain a mean value Ū_t(k, k′) at each point. We then choose the U_t^j(k, k′) that is closest to Ū_t(k, k′), and use this U_t(k, k′) (ignoring the upper index) as the representative of the t-th NN.
2) We solve the STM equation with U_t(k, k′) and obtain a_ad^t. Among all five a_ad^t, we discard the largest and the smallest and average over the remaining three; this is taken as the a_ad predicted by the active learning approach. The results are shown in Fig. 3 with blue circles. One can see that the results fluctuate around the exact value, and the fluctuation decreases as S increases.
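Step 2) amounts to a simple trimmed average; a minimal sketch with hypothetical values of a_ad^t:

```python
def trimmed_mean(values):
    # Discard the largest and the smallest prediction, average the rest.
    vals = sorted(values)
    return sum(vals[1:-1]) / (len(vals) - 2)

# Hypothetical a_ad predictions from the five representative NNs.
a_ad_committee = [1.02, 0.98, 1.00, 1.30, 0.70]
a_ad = trimmed_mean(a_ad_committee)   # averages 0.98, 1.00, 1.02
```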
3) To further suppress the fluctuation, we can take a self-average of a_ad, that is, average a_ad over five successive iterations. This result is denoted by ā_ad and shown in Fig. 3 with yellow solid dots. Indeed, one can see that the fluctuation is already strongly suppressed at S ∼ 200.
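The self-average in step 3) is just a trailing window over the iteration history; a minimal sketch with made-up numbers:

```python
import numpy as np

# a_ad estimates from successive iterations (hypothetical values).
a_ad_history = np.array([1.05, 0.97, 1.02, 0.99, 1.01, 1.03])

window = 5
a_bar = np.convolve(a_ad_history, np.ones(window) / window, mode='valid')
# a_bar[i] is the average of iterations i .. i+4
```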
When the result no longer changes with the training epoch, we take a self-average over the last five iterations to obtain ā_ad, and we compare this result with a_ad^exact. For instance, when a_sΛ = 10, the relative error (ā_ad − a_ad^exact)/a_ad^exact is about 1%. We apply this method to scan different values of a_s. We stop the iteration at S = 300, a number much smaller than the 10^4 uniformly sampled data points, and take a self-average over the last five iterations to obtain ā_ad. The results are shown in Fig. 4. Our calculation obtains the correct locations of the atom-dimer resonances, and the active learning method reproduces the Efimov scaling law well. Comparing with the exact result, we define the relative error as (ā_ad − a_ad^exact)/a_ad^exact. The relative error around the first resonance is shown in the inset of Fig. 4, which shows that the relative error does increase near the Efimov resonance, but overall it stays within a few thousandths.
To further reveal the mechanism of how our method works, we turn to Fig. 5.

IV. ANOMALOUS HALL CONDUCTIVITY PROBLEM
In the second example, we consider the anomalous Hall conductivity of the magnetic Weyl semimetal Mn3Ge [30], which has been extensively studied recently [31,32]. The anomalous Hall conductivity σ_H is an intrinsic quantity induced by the Berry curvature of the band structure. We compute σ_H = Σ_k σ(k_x, k_y, k_z) based on ab initio band structure calculations. The model of this material and the details of computing the Berry curvature are given in Appendix B. For each (k_x, k_z) point, we obtain a value σ(k_x, k_z) by performing an integration over k_y. Here, without the active learning method, when we discretize both k_x and k_z into 100 points, there are in total 10^4 data points to be computed. The total Hall conductivity is obtained by summing over k_x and k_z as σ_H = Σ_{k_x,k_z} σ(k_x, k_z).
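The final assembly of σ_H from the fitted two-dimensional function can be sketched as follows. A toy σ(k_x, k_z) replaces the ab initio data here, and the k_y integration that produces it is assumed to have been done during labelling:

```python
import numpy as np

Lk = 100
kx = np.linspace(0.0, 1.0, Lk)
kz = np.linspace(0.0, 1.0, Lk)
KX, KZ = np.meshgrid(kx, kz, indexing='ij')

# Toy stand-in for sigma(kx, kz): a single broad peak at the zone center
# (the real function, from the Berry curvature, has eight peaks).
sigma_kxkz = np.exp(-((KX - 0.5) ** 2 + (KZ - 0.5) ** 2) / 0.01)

# Total anomalous Hall conductivity: sum the fitted values over the grid.
sigma_H = sigma_kxkz.sum()
```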
As one can see from Fig. 6(a), the function σ(k_x, k_z) is much more singular than U(k, k′) in the previous case. There are four broad peaks and four narrow peaks, located around (0.5 ± 0.1, 0.5 ± 0.4) and (0.5 ± 0.4, 0.5 ± 0.15), respectively. We follow the same active learning procedure discussed above. In this case we take five different initializations for each NN. Compared with the first example, computing the physical observable is easier, because we only need to average over the five different initializations for each NN. Similarly, to suppress the fluctuation, at each iteration we also average over three different predictions of σ_H by discarding the largest and the smallest results, and we take a self-average over five successive iterations. As we show in Fig. 7, the prediction of the active learning method approaches the exact value when S > 700, which is less than 10% of the total number of uniformly sampled data, and the relative error is about 0.8‰. As shown in Fig. 6(b), the "query by committee" stratagem guides most of the queried data into the areas of the four broad peaks. In Fig. 6(c) and (d), we show the function σ(k_x, k_z) generated by the NNs. For S = 1200, shown in Fig. 6(c), the fit has already captured the four main peaks, and since the contribution to the Hall conductance mainly comes from these peaks, the result for σ_H already converges very well. Nevertheless, when we continue to add labelled data up to S = 4000, as shown in Fig. 6(d), these extra data are mainly added at the four narrow peaks, such that these small peaks are also reproduced very well.

V. CONCLUSION AND OUTLOOK
In summary, we have developed a neural network based machine learning method to determine a multi-dimensional function efficiently. The method combines the great expressivity of NNs with the advantage of the active learning scheme to reduce the demand for labelled data to a minimum. There are two key ingredients of this method. On the one hand, we make use of the variances between NNs to guide the queried data toward the regime where the function varies rapidly, which makes the calculation more efficient. On the other hand, we need to properly average out these variances when calculating physical observables from the fitted results. With two examples, we have shown that our method works remarkably well. Compared with uniform sampling, our method achieves an error below a few percent with less than 10% of the data points, even when the function has multiple sharp peaks. We believe our method can find broad applications in many areas of computational physics.

FIG. 3 :
FIG. 3: The atom-dimer scattering length a_ad calculated with the active learning method. a_ad converges with increasing size S of the queried dataset. The blue empty circles are the results averaged over different NNs at each step, and the yellow solid dots are results averaged over five adjacent steps. The dashed line denotes the exact result obtained with all 10^4 data points. Here the initial dataset has S_0 = 100 points, and at each step S_t = 10 data points are added. a_sΛ = 10.

FIG. 4 :
FIG. 4: a_ad as a function of a_sΛ. The divergence of a_ad is the atom-dimer resonance due to the Efimov effect. The black dots are calculated by evaluating U with uniform sampling of all data points, and the yellow diamonds are results from the active learning algorithm with self-averaging. The arrows mark the values of a_sΛ for which the training processes are demonstrated in Fig. 3. The inset plots the relative error around the first atom-dimer resonance.

FIG. 5 :
FIG. 5: (a) Profile of the two-dimensional function U(k, k′) in the whole momentum space, plotted with all uniformly sampled data points. (b) Visualization of the dataset queried during the active learning iterations. The blue dots denote the queried data with 100 < S ≤ 300. The red lines are the contours of the function in (a). (c-d) Difference between the function U(k, k′) generated by the NNs and the U(k, k′) generated by uniform sampling. Here we have averaged over all five different NNs and twenty different initializations. The number of labelled data is S = 120 for (c) and S = 320 for (d).
Fig. 5(a) shows the function U(k, k′) generated by uniform sampling of all 10^4 points. All the data points added during the iterations (with 100 < S ≤ 300) are shown in Fig. 5(b). It is very clear that nearly all points are added in the regime around k ∼ 0 and k′ ∼ 0, where U(k, k′) is steepest, as one can see from Fig. 5(a) and the contours in Fig. 5(b). In Fig. 5(c) and (d), we compare the function U(k, k′) generated by the NNs with the function generated by uniform sampling. Here the NN results are averaged over the five different NNs and their different initializations. It shows that the fit is excellent once the calculation of a_ad has converged.

FIG. 6 :
FIG. 6: (a) Profile of the two-dimensional function σ(k_x, k_z) in the whole momentum space, plotted with all 10^4 uniformly sampled data points. (b) Visualization of the dataset queried during the active learning iterations. The blue dots denote the queried data with 100 < S ≤ 900. The red lines are the contours of the function in (a). σ_H is in units of S/cm. (c-d) The function σ(k_x, k_z) generated by the NNs, averaged over all five different NNs and five different initializations for each NN. The number of labelled data is S = 1200 for (c) and S = 4000 for (d).

FIG. 7 :
FIG. 7: The Hall conductivity σ_H calculated with the active learning method. σ_H converges with increasing size S of the queried dataset. The blue empty circles are the results averaged over different NNs at each step, and the yellow solid dots are results averaged over five adjacent steps. The dashed line denotes the exact result obtained with all 10^4 uniformly sampled data points. Here the initial dataset has S_0 = 100 points, and at each step S_t = 50 data points are added.