Fast, hierarchical, and adaptive algorithm for Metropolis Monte Carlo simulations of long-range interacting systems

We present a fast, hierarchical, and adaptive algorithm for Metropolis Monte Carlo simulations of systems with long-range interactions that reproduces the dynamics of a standard implementation exactly, i.e., the generated configurations and consequently all measured observables are identical, allowing in particular for nonequilibrium studies. The method is demonstrated for the power-law interacting long-range Ising model with nonconserved order parameter and a Lennard-Jones system both in two dimensions. The measured runtimes support an average complexity $O(N\log N)$, where $N$ is the number of spins or particles. Importantly, prefactors of this scaling behavior are small, which in practice manifests in speedup factors larger than $10^4$. The method is general and will allow the treatment of large systems that were out of reach before, likely enabling a more detailed understanding of physical phenomena rooted in long-range interactions.


I. INTRODUCTION
The statistical physics of interacting N -body systems poses many important scientific problems that can be solved by analytic methods only approximately or in certain limits.Therefore, they are nowadays often investigated by means of computer simulations which can be categorized into two main groups: Molecular Dynamics (MD) simulations solve a system's equations of motion numerically and Monte Carlo (MC) simulations explore its phase space in a stochastic manner.In both cases the interaction among the constituents of the system has to be taken into account.For MD as forces and for MC as energy changes associated with random moves of the components.With short-range interactions only a few other partners have to be considered while in the long-range case all the other constituents of the system are involved.This severely limits the accessible system size N , since updating all constituents once naively requires ∼ N 2 operations, usually labeled as complexity O(N 2 ).Since systems with long-range interactions are omnipresent in nature [1][2][3][4][5][6][7][8], fast algorithms for their investigation are highly desirable.Consequently, there has been a lot of research proposing several methods addressing this inherent computational challenge.
Two major classes of such methods are: i) Methods based on splitting the evaluation of the potential into short-and long-range contributions, with one important example being the Ewald summation [9] and ii) hierarchical methods where groups of components are treated collectively such as the Barnes-Hut algorithm [10].Algorithms from these two classes reduce the computational complexity to O(N 3/2 ) and O(N log N ), respectively.However, they have some disadvantages in certain situations (periodic vs. free boundaries, very large prefactors, control of systematic errors) and cannot all equally well be employed in MD and MC.Only in MD, where all components progress synchronously, advanced algorithms based on fast multipole methods, particle mesh Ewald, and multigrid techniques which calculate all forces at the same time lead to even further reduced computational complexity.Most of these studies focus on Coulomb interactions; for reviews see [11,12].In contrast, typically MC algorithms work asynchronously, i.e., they change only small parts of the system at a time.There, these advanced methods cannot be used successfully, since calculating all interactions after each local update is wasteful even if done efficiently.To achieve a similar improvement also for (asynchronous) Metropolis MC simulations of long-range systems, we here present a hierarchical adaptive algorithm that for the here considered systems reduces the computational complexity while maintaining a small prefactor without introducing any systematic errors.The basic idea of our algorithm is the combination of the inverted Metropolis criterion with an adaptive tree-like spatial decomposition of the interaction energy.Our algorithm reproduces exactly the same Markov chain as a traditional Metropolis implementation and can therefore be used as a one-to-one replacement, enabling in particular nonequilibrium studies as well.This is a major conceptual difference to MC methods that deviate from conventional Metropolis dynamics.A prominent example are non-local cluster algorithms [13][14][15] for the simulation of spin systems which can reduce complexity to O(N ), overcome critical slowing down and hence be more efficient for equilibrium studies close to criticality than any algorithm with local dynamics including the one presented here.Furthermore, there is the rejection-free event-chain MC method for systems with continuous degrees of freedom [16].It was first applied with great success to hard-sphere systems and has later been developed further to treat systems with general interactions [17,18].It exploits that additive terms in the Hamiltonian transpose to factors in the Boltzmann weight and thus allow the application of a factorized Metropolis filter [17,19].This idea has also been used in the recent development of the clock MC method [20] which in contrast to event-chain MC is applicable to Ising systems as well and has a reported complexity of O(N ).
In this study, we demonstrate the details of our algorithm by applying it to the nonconserved long-range Ising model (LRIM) with power-law decaying potential, where we focus on integrable interactions, and an off-lattice Lennard-Jones (LJ) system both in two dimensions.In equilibrium, we find that the algorithm's performance is excellent and runtimes are in agreement with a complexity of O(N log N ) for both systems.For the LRIM which undergoes a second-order phase transition at the critical temperature T c we also consider two nonequilibrium processes: First quenches from a disordered configuration to a temperature T < T c into the ordered phase and secondly to T c itself.The first case is referred to as phase-ordering kinetics where coarsening and aging phenomena [21][22][23] occur.In the latter setup critical aging [23,24] can be investigated.Such nonequilibrium processes are governed by local dynamics and consequently need to be modeled by local algorithms like the one presented here.We will show that in all cases a significant speedup is achieved and will point out other systems where a similar performance can be expected.
The rest of the paper is structured as follows: In Section II we first sketch the general procedure of our algorithm.We then apply it to the LRIM in Section III and the Lennard-Jones system in Section IV.In Section V we outline the applicability for a variety of other important (spin) models.Finally, we conclude and give an outlook in Section VI.

II. METHOD
In this section we present a general formulation of the algorithm, with no reference to any system-specific properties.We consider a system of N components q = (q 1 , . . ., q i , . . ., q N ) where q i can for example stand for the spatial position r i , a binary Ising spin s i = ±1, or other internal degrees of freedom (which may also be vector valued).The components q i and q j interact via a symmetric pairwise potential V i,j = V j,i such that the total energy reads For a traditional Metropolis MC simulation with local dynamics at each step the update of a single randomly chosen component q i is proposed q old = (q 1 , . . ., q old i , . . ., q N ) → q new = (q 1 , . . ., q new i , . . ., q N ) . ( The proposed update is accepted with the Metropolis probability, where ∆E = E new − E old is the energy difference resulting from the update and β is the inverse temperature.This acceptance probability is then compared to a (pseudo-)random number ρ ∈ [0, 1).If the calculated probability is larger than this random number the update is accepted and otherwise rejected.As only one component q i of the system is updated we can write the change in energy of the whole system as, Since in long-range interacting systems all constituents of the system interact with each other the calculation of ∆E requires the evaluation of 2 To construct the bounds ∆E min/max we perform a spatial decomposition of the simulation domain which is based on an extrinsic tree-like structure, in contrast to the intrinsic decomposition for self-avoiding walks [25,26] and polymers [27].We note that any d-dimensional simulation box of linear size L can be split into 2 d boxes of size L/2.Of course, each of these boxes can again be split into 2 d boxes of size L/4 and so on.This is repeated until each box contains no more than one constituent.All theses boxes are thus automatically arranged hierarchically on a tree T .Inner nodes contain only the collective information needed for the estimation of the interaction, whereas within each leaf the single contained constituent is stored.The construction of T has complexity O(N ) and rebuilding T completely at each update step would, therefore, be inefficient.Instead we update T locally FIG. 1.A visual sketch of an example progression for the decomposition of the interaction.The red dot marks the position of the component for which the update is proposed.The interacting boxes are enumerated in ascending order in which they are placed into the decomposition.From left to right always one box with high uncertainty is split into smaller boxes leading to more accurate bounds for ∆E.
after an accepted update.This requires O(log N ) operations, since only the collective information of all the ancestor nodes of the leaf containing the updated component needs to be modified.
This spatial decomposition of the simulation domain allows us to split the energy difference which follows from an update, where D is the set of currently selected, non-overlapping boxes (which may be of different size) covering the simulation space and ∆E B = E new B − E old B is the exact change in energy contributed by the interaction with the constituents of box B. Accordingly, if we can find strict lower and upper bounds ∆E General albeit not very tight bounds can be constructed by assuming that all constituents of a box are located at the points of minimal or maximal interaction, respectively.
We aim at finding bounds ∆E min/max that are just accurate enough to decide about the acceptance/rejection of a proposed update.The general strategy is illustrated with an example progression of the decomposition D in Fig. 1.For easy visualization it is shown in two dimensions and for a non-moving component such as an Ising spin.We start with the initial decomposition which is just the box containing the whole system D = {B 0 }, see panel (a) of the sketch in Fig. 1, where the position of the component q i to be updated is marked by a red dot.The bounds ∆E min/max = ∆E min/max B0 of this initial decomposition are in most cases not accurate enough to take the decision about the proposed update.Thus we replace B 0 in D with the child boxes B 1 , . . ., B 2d of B 0 in T , see Fig. 1(b).The bounds ∆E min/max are updated according to Eq. (8) In order to do so, a strategy is needed to decide which box to split next.It is generally beneficial to split boxes which have a high uncertainty ∆ B , since there the potential for improving the bounds is greatest.A natural approach is, therefore, to always select the box with the greatest ∆ B for splitting.Since searching an unordered set is computationally very expensive this would require to keep all elements of D strictly ordered with respect to their uncertainties and new boxes could only be added to D in logarithmic time O(log |D|).An overall faster way is to group boxes according to the integer part of their logarithmic uncertainties δ B = ⌊log 2 ∆ B ⌋ and always select some box from the non-empty set with the highest δ B .The number of the necessary operations is now independent of the size of D and a significant computational overhead can be avoided this way.
The process of sequential decomposition of boxes is sketched out in Fig. 1.The box B 2 in Fig. 1(b) had a high uncertainty and was replaced by its four child boxes in T , see Fig. 1(c).The next box to be split was B 4 .Such an adaptive spatial decomposition can be performed in most cases: on lattices, on graphs with different geometries and even in case of continuous spatial degrees of freedom although the nodes of T might not always correspond to simple square or cubic boxes.
With this refinement protocol, we have effectively constructed a hierarchical, adaptive (spatial) decomposition of the interactions, which depends strongly on the current configuration, the energy difference due to the proposed update ∆E, and the threshold energy ∆E th .After the decision has been made the next update starts again with D = {B 0 }, since if a different component is to be updated the final decomposition will likely be completely different.

A. Model
As a concrete application we first consider the LRIM on a two-dimensional (d = 2) L × L square lattice where every spin s i = ±1 interacts with every other spin of the system via a decaying power-law potential.This model has mostly been investigated for nonconserved order parameter, i.e., with varying magnetization M = s i , in numerous equilibrium studies [13][14][15][28][29][30] and has recently gained renewed interest in phase-ordering kinetics [31][32][33][34][35][36].
The Hamiltonian of this system is given by where for free boundary conditions the interaction couplings decay with distance like Here, r(i, j) = |r(i, j)| is the Euclidean distance, d is the spatial dimension and σ > 0 is a tuneable parameter controlling the decay of the potential.The self-interaction of the individual spins is set to zero, i.e., J i,i = 0. To reduce finite-size effects we employ periodic boundary conditions, implemented via Ewald summation [37], which takes all periodic images of the system into account.For spins on fixed lattice positions the Ewald summation can be incorporated directly into the couplings [30]: which can be used in conjunction with the simple minimum image convention, avoiding a significant computational overhead.
For lattices with linear size L = 2 n with n being a positive integer it is intuitive to decompose the lattice into smaller squares, which for the purpose of hierarchical access are arranged on a quad-tree.Starting with the full lattice as the original box B 0 (cf.Fig. 1) which encloses all the spins, we decompose the lattice into four boxes with half the linear size of the original box.This is repeated until reaching the single spin level, where each box contains only a single spin.We denote the level of decomposition as Γ ∈ {0, . . ., n} where Γ = 0 corresponds to the full lattice and Γ = n is the single spin level.The linear size of a box is L Γ = 2 n−Γ and the number of spins inside this box is for the interaction of a test spin pointing in positive direction with a box B, normalized by the total interaction strength J int Γ defined in Eq. (13).A pair of bounds always consists of an upper and a lower bound which are kept in the same style.N + B is the number of spins pointing up and NΓ the total number of spins in the box.Bounds 1 are given by 2NΓ multiplied with the maximum interaction.Bounds 2 are refined by making use of the magnetization of the box.Bounds 3 additionally employ the value of the interaction with the fully magnetized box J int Γ (R).Bounds 4 are the tightest bounds which can be obtained with the knowledge of the magnetization and the set of interactions involved in a spin-box interaction.For more information see text.

B. Bounds ∆E min/max B for the Spin-Box Interactions
Proposing a flip of spin s i , the contribution of the box B to the energy difference can be written as where we identify the box B with the set of the indices of the contained spins.Here, we exploit that upon flipping s i , For a lattice spin system with periodic boundary conditions the set of couplings which are involved in a spin-box interaction is always uniquely determined by the vector R from the spin to the center of the box and the size of the box N Γ .For the construction of the bounds ∆E min/max B , we need the minimal/maximal coupling in this set J min/max Γ (R) (for monotonically decaying couplings they can be calculated in constant complexity O(1) and can, therefore, be determined on the fly) and the integrated interaction which corresponds to the total interaction strength with a fully magnetized box B.
For Bounds 1 we do not discriminate between spins pointing up or down and only use the number of spins N Γ contained in the box.It is clear that for each summand of Eq. ( 12) holds.This allow us to formulate the following bounds for Eq. ( 12), which are plotted in Fig. 2 as horizontal dashed-dotted lines as a function of N + B /N Γ for one example situation, where N + B is the number of positive spins in the box which is related to the magnetization of the box to rewrite Eq. ( 12) as For each box of the spatial decomposition we keep track of the number of elements of B ± , i.e., the number of spins pointing up or down N ± B .For the sums in Eq. ( 16) it follows that Assuming that s old i points into the positive direction, the rhs of Eq. ( 16) becomes maximal if the first term is maximal and the second one is minimal and vice versa for its minimum.This yields the following lower and upper bounds for the spin-box interaction: If the test spin points in the negative direction, N + B and N − B have to be exchanged accordingly.In this picture the upper bound ∆E max B corresponds to the situation where all spins pointing in the same direction as the test spin are placed at the position of the strongest interaction and the spins pointing in the opposite direction at the spot of the weakest interaction.For the lower bound ∆E min B the positions are switched.Looking at the dashed lines in Fig. 2, one clearly sees that these bounds indeed depend on the box magnetization M B and their uncertainty ∆ B is much smaller than for the rather loose Bounds 1.
For Bounds 3 we use J int Γ (R) from Eq. ( 13).Since the J int Γ (R) do not depend on the spin configuration they can be precalculated.The required computational effort is negligible compared to the typical simulation time, whereas the memory demands scale as O(N log N ).This can become challenging for systems that are significantly larger than the here considered system sizes.Using J int Γ (R), Eq. ( 12) can be rewritten as This equation shows that the interaction of the test spin with a box with some arbitrary configuration can be seen as the sum of the interaction with the fully magnetized box where all spins point in the same direction as the test spin and twice the interaction of the spins inside the box which point in the opposite direction.Alternatively, the interaction can also be calculated using the interaction with the fully magnetized box with all spins pointing in the opposite direction of the test spin and adding twice the interaction with the spins parallel to the test spin, In order to derive two new pairs of bounds for the spinbox interaction we can again use Eq. ( 17), inserting it into Eq.( 19) and Eq. ( 20).As both pairs of bounds are valid we can combine the two bounds for the minimum and the two bounds for the maximum by taking the tighter of the two and obtain (again for s old i = 1), These bounds are exact for fully magnetized boxes and still very accurate for almost fully magnetized ones (see the dotted lines in Fig. 2).This greatly enhances the performance of the algorithm in presence of large magnetic domains, because the bounds of the boxes which fully lay inside a domain are very accurate and thus the decomposition can be coarser.At very large distances )/2 so that the crossing of the pairs of bounds for the minimum or the maximum would occur at N + B /N Γ ≈ 0.5.Exploiting the fact that the set of J i,j can be sorted we can obtain even narrower Bounds 4. For the second term in the brackets of Eq. ( 19) we replace the bounds from Eq. ( 17), for which we assumed that all spins interact with J min/max Γ (R), with the sum of the first N − B of the couplings sorted in ascending/descending order.These are the tightest bounds which can be established using only the magnetization of the box and the set of couplings involved in the spin-box interaction.These bounds are plotted in Fig. 2 as solid lines and as the previous bounds can be calculated in O(1) complexity if the above-mentioned sums are all computed and stored before the simulation.The memory complexity of the algorithm using these bounds would scale as O(N 2 log N ), however, which limits the applicability for large system sizes.Therefore, in the following we will employ Bounds 3 which embody the best compromise between performance and memory requirements.For the iterative refinement of the decomposition of the interaction we proceed as described in the general outline of the algorithm, with one modification.It turned out to be beneficial to evaluate the interactions with small boxes via a direct summation of the spin-spin interactions since this is of comparable speed and has no uncertainty.

C. An Example Decomposition
In Fig. 3 we demonstrate the basic principle of our algorithm using Bounds 3 by showing a single example snapshot and the corresponding spatial decomposition of the interaction for a simulation with σ = 0.8 and L = 256 at T = T c in equilibrium (the system is chosen to be relatively small, so that details can still be observed).Here, the spin under consideration is positioned close to the center of the snapshot and is marked in red.In the vicinity of the test spin -the green-shaded region -maximal resolution is reached, all the boxes are of size 1 × 1 and contain only a single spin each.To ensure the possibility of arbitrarily precise estimation of ∆E, interactions with these spins have to be considered exactly irrespective of which bounds are used, although we note that this is formally equivalent to the use of Bounds 2-4.
From top left to bottom right, the estimates of ∆E max and ∆E min are more refined and approach ∆E.As one can see, this is achieved by reducing the box sizes.The decomposition adapts to the given configuration, i.e., regions which have a bigger influence on the decision are covered by smaller boxes.The scenario in Fig. 3 required a rather fine-grained decomposition, but this is not representative and usually decisions can be made with a much coarser decomposition.It is nonetheless a very illustrative example, since it nicely demonstrates the progression of the algorithm.

D. Analysis of the Runtimes
The speed of our algorithm strongly depends on the state of the simulation, i.e., the current configuration, the choice of the spin to be updated, the temperature, etc.The decision becomes harder the closer the involved change in energy ∆E is to the threshold energy ∆E th .For the extreme case of ∆E = ∆E th the interaction of the updated spin with all other individual spins has to be evaluated exactly, implying a worst case complexity of the algorithm of O(N 2 ) per sweep.In the case of high temperatures, one has ∆E th → ∞, which means that the actual change in energy of the proposed update becomes irrelevant, so that the initial bounds are sufficiently narrow to accept it.Only the updating of the tree T has to be performed, which yields the best case complexity O(N log N ).
It is not straightforward to predict the average complexity from the algorithm's design alone, so that, in the following, we will record the resulting runtimes τ per sweep for different physical settings.For all measurements we take care of minimizing possible hardwarerelated influences.To be on the safe side and to not draw any wrong conclusions in our analysis, we have nonetheless assumed an error of 10% on our estimated runtimes to account for remaining errors.

Equilibrium
For equilibrium simulations, the computational cost can relatively straightforwardly be extracted from rather short runs, which allows us to consider a broad temperature range.A further advantage is that the runtimes only weakly depend on the initial conditions and can be averaged over the full run after the equilibration.The benchmarking was performed on a single hardware configuration: Single socket motherboard equipped with an Intel Core i5-8500T CPU and 16GB DDR4-2667 dual-channel RAM.The algorithm was implemented in C++17 and compiled using GCC 8.3.Since modern processors do not run using a constant frequency and use speculative execution, results of benchmarks can fluctuate.This is especially a problem, if the compute nodes are occupied by other tasks.Therefore, we have taken care to exclusively run a single simulation at a time per compute node in order to minimize possible fluctuations.
In Figs.4(a) and (b) the runtimes per spin update τ /N (in units of µs) for different fractions of T c in dependence of the system size are presented in a semi-log plot for (a) σ = 0.8 and (b) σ = 1.5.In both cases the growth of the runtimes crosses over to linear behavior on the semi-log scale irrespective of the temperature T , in a manner compatible with O(N log N ) complexity.This we deem very plausible considering the hierarchical progression of the algorithm through the use of a tree.Based on this data (and Fig. 5 where we plot the runtimes on a log-log scale), a power-law complexity O(N 1+α with a small exponent α cannot, however, be completely ruled out.In order to corroborate either of the two hypotheses, significantly larger systems need to be considered.While a dedicated investigation of moderately larger system sizes could still be feasible in principle, significantly larger sizes are out of reach with the hardware used in this study, due to the large memory requirements, and a final assessment of the asymptotic scaling has to be left for future studies.We find that there is a clear dependence of the runtimes on the simulation temperature, which is also visualized in the inset of Fig. 4(b).This can be understood qualitatively from the considerations made before: The maximum close to the critical temperature stems from the fact that the average threshold energy E th is on average close to the actual energy difference ∆E involved in the proposed spin flips, which makes a fine decomposition of the interaction necessary.In the case of very high temperature one has ∆E th ≫ ∆E and the actual change in energy of the proposed spin flip becomes irrelevant.At low temperatures an opposite mechanism is at work.Since ∆E th → 0 almost only spin flips are accepted that do not increase the energy, but a typical configuration at these temperatures is (nearly) ordered.Thus, a single spin flip on average has ∆E ≫ ∆E th so that also this decision can be made with loose bounds.
The comparison most relevant is likely the achieved speedup factor, which can be visually appreciated from Figs. 5(a) and (b), where we show the runtimes on a log-log scale in comparison to an effective field simula- tion [31] (dashed colored lines in the same color as the original data points) and a direct summation (solid black line) of all interactions (for the sake of better comparability, here we have replaced T = 10T c by T = 0.3T c ).The effective field approach uses a relatively simple storage trick to save many calculations whenever an update is not accepted.This yields a massive speedup whenever low acceptance rates are encountered, but requires the same number O(N ) of operations for an accepted update as in the direct summation, resulting in the same computational complexity O(N 2 ) [38].Thus, the resulting runtimes are strongly dependent on the acceptance rate of the simulation, which means that this approach is especially fast at low temperatures where the acceptance rates are low [39].At high temperatures, many more proposed updates are accepted, which gives the effective field approach only a minor advantage over a direct summation.At the critical temperature (for which our algorithm has the highest runtime), we can report a speedup factor of ≈ 5500 for σ = 0.8 and ≈ 12300 for σ = 1.5 compared to the effective field approach (≈ 11000 resp.≈ 35000 compared to direct summation).For a temperature below T c , e.g., for T = 0.5T c we observe a speedup of ≈ 1700 for σ = 0.8 and ≈ 500 for σ = 1.5 (≈ 40000 resp.≈ 43000 compared to direct summation).For increas-ing system size all these factors grow steadily.For each fixed temperature there will be a crossover size above which our new algorithm is faster than the effective field method.From Fig. 5 one reads off that the crossover for T = 0.3T c occurs at N ≈ 10 6 for the considered values of σ and respectively N ≈ 10 5 for T = 0.5T c .For larger values of T the crossover happens already for much smaller systems while for the very low temperature T = 0.1T c the crossover cannot yet be observed for the considered system sizes of up to N = 10 8 due to the extremely low acceptance rates (≈ 10 −6 ) and hence a very small prefactor of the runtimes of the effective field approach.
For investigations of the phase transition, i.e., for equilibrium simulations in the proximity of T c , our method, like any other local algorithm, is not a serious contender, at least when non-local cluster algorithms are available as for the Ising model.In such cases, its main field of application are nonequilibrium scenarios.

Nonequilibrium
We focus on two cases: Quenches from a disordered start configuration i) to a temperature substantially below the critical temperature or ii) to the critical temperature.To mimic a physical evolution, in these cases only local dynamics that preserve the dynamical properties of the system may be used.Non-local update schemes, including cluster algorithms, are not allowed, which makes these scenarios the prime field of application for our algorithm.In the first case, the system undergoes an ordering process and consequently the dynamics of growth of ordered structures in the system is of interest, both from coarsening and aging perspective involving single-and two-time quantities, respectively [21][22][23].The physical properties of this model during this process have recently been investigated in Refs.[31,[34][35][36] and are not part of the discussion here.In Fig. 6(a) we present the time dependence of the runtime of our algorithm per spin update τ /N (in units of µs) as a function of simulation time for a phase-ordering quench with σ = 0.8 to T q = 0.1T c for large systems of linear sizes L = 4096 and 8192.Also shown, for sake of comparison, are the runtimes for the effective field method and a direct summation of the interactions.We could quench to arbitrary temperatures with our new algorithm, but choose T q = 0.1T c to have the same setting as in Ref. [31].
We observe that the time needed per update τ /N is strongly dependent on how far the system has proceeded in its ordering process for our algorithm and, even more, for the effective field approach.Since the temperature in our algorithm is set to a low temperature T q = 0.1T c , we typically draw threshold energies E th comparatively close to zero, i.e., spin flips which significantly increase the energy are usually not accepted.At the start of our simulation the configuration is completely disordered, and for many proposed spin flips ∆E ≈ 0 ≈ ∆E th , so that ∆E has to be known rather accurately.In the course of the simulation, when the configurations are already partly ordered, proposed spin flips in domains have typically a large ∆E and can mostly be rejected with very loose bounds.The decision is more involved for spins positioned at domain boundaries, where ∆E is typically much closer to zero, the interactions often have to be resolved in more detail, and the probability of acceptance is higher.With growing domains fewer spins are situated at the domain boundaries so that the average acceptance rate decreases, and the effective field simulations become faster.Ultimately, the runtimes of both methods will reach their respective equilibrium runtimes at T q .
Over the full process, our new approach is significantly faster than the already very fast effective field method, resulting in a ≈ 100 times faster total runtime until finitesize effects are reached.A direct summation becomes in these cases prohibitively expensive (a factor of ≈ 40000 slower than our new algorithm), and cannot be used to simulate systems of this size.This factor will grow significantly for increasing system size since the runtimes of the two algorithms scale differently.
The second case of interest is critical aging, i.e., the behavior of two-time quantities during quenches from a disordered starting configuration to the critical temperature [23,24].With some modifications, such as a small initial magnetization, it is also possible to investigate short-time dynamics during such processes [40], but we here focus on the completely disordered start.We present in Fig. 6(b) the obtained runtimes per spin update τ /N (again in units of µs) in our simulation with σ = 0.8, where we use the same notation as in (a) for the different methods.Here, the runtimes both for the new algorithm and the effective field approach remain more or less constant throughout the whole simulation, although the system has not yet reached equilibrium.In both cases only a small initial decay of the runtimes is visible.Here, the advantage of the effective field simulation over a direct summation is relatively small, since the acceptance rates are of the order of 1. Albeit the equilibrium simu-lation close to the critical temperature is also one of the most difficult situations for our algorithm, it nonetheless produces much smaller runtimes than the effective field approach and the direct summation.In this scenario the runtimes are close to those found in the equilibrium simulations.We find a speedup of ≈ 6000 compared to the effective field approach and ≈ 8000 to the direct summation, allowing for the investigation of this process for the presented system sizes, which was entirely out of reach before.A similar acceleration is also expected for other nonequilibrium simulations with comparably high acceptance rates.
Another local algorithm for long-range interacting spin systems is the clock MC method [20] based on the factorized Metropolis filter [17,19].It, too, is potentially applicable in the scenarios discussed above.Yet, so far it has only been applied to disordered long-range interacting spin systems in equilibrium.We have implemented this method for the ferromagnetic LRIM and tested it for the above nonequilibrium setup.In its basic form where spin-spin interactions are considered individually, we find that the times of crossover to the asymptotic scaling behavior become prohibitively large for quenches to low temperatures.This is due to drastically reduced acceptance rates of the factorized Metropolis filter, when compared to conventional Metropolis dynamics.In the framework of this method there is, however, the possibility to treat multiple factors collectively.This shifts the dynamics towards traditional Metropolis, increasing the acceptance rate, but also the computational effort.For each physical setting (i.e., combination of T , L, σ, . . . ) a different grouping of spins may yield the best performance.A detailed analysis and comparison is beyond the scope of this study and will be presented elsewhere [41].

IV. LENNARD-JONES SYSTEM WITH FULL INTERACTION RANGE
To highlight the power of our algorithm also for continuous degrees of freedom, we finally demonstrate the applicability of its general concept to a LJ system [42] with potential Here, we keep the full interaction range, i.e., do not truncate (and shift) the potential at the often employed cutoff r c = 2.5σ.It is well known that, e.g., the critical temperature and critical density of a LJ system do depend on r c [43][44][45].We consider N interacting particles in a volume L d whose linear extent L can be adjusted to yield the desired density ρ = N/L d .Periodic boundary conditions are applied which, due the fast decay of V LJ ∝ −r −6 ij can easily be realized by the minimum image convention, i.e., in this application Ewald summation is not necessary.
For the here presented simulations, we used the most general bound introduced in Sec.II, i.e., we virtually collect all particles of a box at the points of minimal or maximal interaction, respectively.This is analogous to Bounds 2 introduced in Sec.III B for the LRIM.In equilibrium simulations of particles the proposed MC moves can be freely chosen.Here we perform 90% local displacements within radius r = σ and 10% nonlocal moves where the particle's potential new position is chosen randomly in the whole simulation box.In Fig. 7 we show for d = 2 the runtimes per update τ /N for different N but fixed density ρ = 0.35 and varying temperature T , covering both the (oversaturated) vapor and vapor-liquid phase.We find clear evidence for O(N log N ) complexity.
The runtimes appear largely independent of the temperature, which is in contrast to the results for the LRIM in equilibrium.For smaller densities, we generally find faster runtimes for the same number of particles.An implementation for d = 3 is straightforward, too, and will be presented elsewhere [46].

V. APPLICABILITY & LIMITATIONS
Although in this paper we only presented in detail the application of the algorithm to two different models, it can be used for other lattice spin and off-lattice particle systems as well.Of course, we are neither restricted to two spatial dimensions nor to hypercubic lattices.The method is versatile; besides the classical Ising model general O(n) spin models can be treated efficiently in a very similar manner.In particular, systems with quenched disorder [47] are not excluded: Random field models [48] are trivially accommodated within the framework introduced above since the extra field simply enters as an offset to ∆E.Another class of models are site-diluted spin systems which model crystal defects through unoccupied lattice sites [49,50].To treat for instance a site-diluted Ising system with our algorithm, modified Bounds 3 can be used, where N 0 B is the number of vacancies in box B. This opens a way to treat q state Potts models as well [51], where the components which are inert to both the old and the newly proposed state would be treated as vacancies.Now, we need to store the population of each of the q spin states for all boxes.
Random field and site dilution are forms of disorder that can easily be managed since they affect individual spins and their interaction with the environment as a whole.More challenging are models where disorder manifests as variation of the interaction of pairs of spins such as systems with bond dilution [52] or the Edwards-Anderson spin-glass model [53,54].Here, the order parameter is not as closely related to the spin-box interaction energy as the magnetization in the case of the pure Ising model.This implies that it is difficult to formulate an estimator similar to Bounds 3 or Bounds 2. Also in this more difficult case we have checked that a reduction in complexity is achieved in simulations using the basic Bounds 1, although the speedup is less pronounced as compared to the pure Ising case.Another large class of problems are spin systems with competing interactions, e.g., antiferromagnetic short-range and ferromagnetic long-range interactions or vice versa [55,56].
Here, one could for example evaluate the short-range interactions directly and treat the long-range interactions using our algorithm.
The LJ system considered above can easily be generalized to two-body potentials of other functional forms and extended to multi-species systems.Depending on the specific details of the different interaction between the particles many scenarios can be imagined.A general approach would be to use a separate decomposition of the system for each particle type.
Yet, the question about possible limitations of the algorithm arises.Cases where it might not be successful are systems with interactions that do not decay sufficiently fast (and thereby cannot be grouped together with decaying interaction strength).For mean-field models the algorithm may thus not be used (efficiently).

VI. CONCLUSION & OUTLOOK
We have presented a general, hierarchical, and adaptive algorithm for Metropolis Monte Carlo simulations of long-range interacting systems.The range of possible applications of the algorithm is very broad.The formulation does not depend on the lattice structure and is thus valid for both general lattice spin models and systems with long-range interacting particles in continuous space as long as the interaction decays sufficiently fast with distance.In the two applications considered here, viz. the nonconserved long-range Ising model and a Lennard-Jones system in two dimensions we observe runtimes that support an average asymptotic complexity of O(N log N ) (where N is the number of spins or particles) but the existing data for the former may also be described by a power-law with a small exponent.However, the scenario of a logarithmic scaling seems more likely due to the hierarchical, tree-like nature of the algorithm.
Importantly, our method has small prefactors for the asymptotic scaling of the runtimes, resulting in speed-up factors which exceed 10000 in relevant physical scenarios.In a single day, we can perform simulations which before would have taken ≈ 30 years with any of the established methods, enabling the exploration of parameter ranges that were hitherto not accessible.
Until recently it was only possible to investigate the nonequilibrium properties of the long-range Ising model during quenches to low temperatures where the low acceptance rates allow an efficient simulation via the effective field approach [31].The application of the new algorithm is not limited to this scenario, proving very efficient also in case of large acceptance rates, as encountered, e.g., during quenches to the critical temperature.While here we exemplified our algorithm for the long-range Ising model with nonconserved order parameter and a Lennard-Jones system, our method can easily be applied to other spin and off-lattice systems.Another promising application is the phase separation in a conserved order parameter simulation of the long-range Ising model where the system evolves at the quench temperature through spin exchanges [57].Other nonequilibrium simulation settings where long-range interactions are of interest are field-driven hysteresis [58] and Kibble-Zurek like processes employing a slow quench [59][60][61].Especially interesting is the extension to quantum systems where the Suzuki-Trotter mapping [62,63] of a d-dimensional quantum system to the corresponding (d+1)-dimensional classical system allows the application of our algorithm, which is designed for general d.Very recently, motivated by D-Wave experiments [64], Bando and Nishimori [65] investigated the generalized quantum Kibble-Zurek mechanism in the transverse-field Ising model coupled to an external bath where long-range interactions arise naturally in Trotter direction.Also of great interest are models where the quantum spins themselves interact via a tunable long-range potential [66,67], which describe many experimental situations [68,69].The algorithm presented here constitutes an important step towards an efficient simulation of the corresponding classical system.
min/max B of ∆E B , we can establish bounds for the total energy change as well, ∆E min/max = B∈D ∆E min/max B .
The parameters for this example are linear lattice size L = 128, decay exponent of the potential σ = 0.8, distance R = (50.5,44.5) from the spin to the center of the box B, and box size N Γ = L 2 Γ = 256.For other parameter values the curves in Fig. 2 would look different, but the main features would remain unchanged.Since these simple bounds do not make use of the box magnetization M B they are constant and much wider than the more refined bounds considered next.In contrast, Bounds 2 do depend on the magnetization M B in a box B. To make use of M B we split the box B into the sets of indices of spins pointing up B + or down B − ,

FIG. 3 .
FIG. 3. Example decomposition of the LRIM lattice with σ = 0.8 and L = 256 simulated in equilibrium at T = Tc.The spin which is to be updated is marked by the red dot close to the center of the box.The boundaries of the boxes are represented as blue lines.Going from top left to bottom right, the accuracy of the bounds of ∆E min and ∆E max increases with shrinking size of the boxes in the decomposition.Boxes of size 8 × 8 are broken up directly into 1 × 1 boxes (green area); 2 × 2 and 4 × 4 boxes do therefore not occur.

5 FIG. 4 .
FIG.4.Runtime per spin update τ /N versus system size N on a semi-log scale for (a) σ = 0.8 and (b) σ = 1.5 and several (equilibrium) temperatures T .In the inset of (b) we show the combined data for both values of σ and the biggest system size L = 8192 as a function of T , demonstrating the maximum around Tc.

FIG. 5 .
FIG.5.Runtimes per spin update τ /N versus system size N plotted on a log-log scale for (a) σ = 0.8 and (b) σ = 1.5 and several (equilibrium) temperatures T .As dashed lines in the same color as the original data points, we have included the calculated runtimes for an effective field simulation.Also included is the runtime obtained from a naive Metropolis MC simulation using direct summation.

FIG. 6 .
FIG.6.Runtime per spin update τ /N versus simulation time t for σ = 0.8 in quenches from a disordered start to (a) Tq = 0.1Tc and (b) Tq = Tc on large lattices with L = 4096 and 8192.In both plots, we present as solid lines with markers the runtimes obtained by our algorithm.The dashed lines in the same color show the runtimes calculated for the effective field (eff.field) algorithm and the solid lines are the runtimes of the (naive) direct summation (dir.sum).

FIG. 7 .
FIG.7.Runtime per update τ /N versus system size N plotted on a log-normal scale for a Lennard-Jones system with untruncated interactions at constant density ρ = 0.35.The inset shows the same data on a log-log scale together with the runtimes obtained from a standard Metropolis simulation using a direct summation of all interactions.
. With each pair of bounds ∆E