Neural Monte Carlo Renormalization Group

The key idea behind the renormalization group (RG) transformation is that properties of physical systems with very different microscopic makeups can be characterized by a few universal parameters. However, finding the optimal RG transformation remains difficult owing to the many possible choices of the weight factors in the RG procedure. Here we show that, by identifying the conditional distribution of the restricted Boltzmann machine (RBM) with the weight factor distribution of the RG procedure, an optimal real-space RG transformation can be learned without prior knowledge of the physical system. This neural Monte Carlo RG algorithm allows for direct computation of the RG flow and critical exponents. The scheme naturally generates a transformation that maximizes the real-space mutual information between the coarse-grained region and the environment. Our results establish a solid connection between the RG transformation in physics and the deep architecture in machine learning, paving the way for further interdisciplinary research.


I. INTRODUCTION
The renormalization group (RG) [1] formalism provides a systematic method for the quantitative analysis of critical phenomena. Among all RG schemes, the real-space renormalization group (RSRG), first proposed by Kadanoff [2], is the most intuitive and natural way to perform RG transformations on lattice models [3]. These methods allow a straightforward construction of the critical surface and calculation of the critical exponents using numerical methods such as the Monte Carlo renormalization group (MCRG) [4][5][6]. However, the RSRG transformation typically generates long-range couplings not present in the original Hamiltonian, and truncation is necessary to keep the method manageable. From the physical point of view, we expect that the range of the renormalized interactions of a physical lattice system near the fixed point should not grow. Finding the optimal way to coarse-grain the Hamiltonian, systematically eliminating the irrelevant degrees of freedom, is crucial for the success of any RSRG scheme. The fundamental difficulty lies in the enormous freedom in choosing the weight factors of the RG transformation. Several attempts have been made in the past to find the optimal transformation. Swendsen proposed an optimal MCRG scheme that introduces variational parameters into the RG procedure [7]. Blöte et al. proposed modifying the Hamiltonian and the weight factors such that the corrections to scaling are small [8]. Ron et al. proposed choosing the parameters such that the critical exponent of interest remains nearly constant during the MCRG iterations [9]. However, it remains unclear how to determine the weight factors without prior knowledge of the system.
The general guideline in searching for an optimal RG transformation is to identify and eliminate the irrelevant degrees of freedom in the RG flow while retaining the relevant ones. However, it is difficult to determine a priori which degrees of freedom should be eliminated. This resembles the question in machine learning (ML) of how to extract relevant features from raw data. Deep learning (DL) [10] using deep neural networks (DNNs) has significantly improved machines' abilities in many areas, such as speech recognition [11], object recognition [12], and Go and video game playing [13][14][15], and has aided discoveries in various fields of physics [16][17][18][19][20]. Multiple layers of representation are used to learn distinct features directly from the training data. The similarity between the structure of a DNN and coarse-graining schemes in statistical physics has inspired many efforts to establish a connection between variational RG [21] and the unsupervised learning of DNNs [22][23][24][25][26][27][28][29][30]. Here, we want to address a different question: how can we train a DNN to obtain an optimal RSRG transformation? This issue has been partially addressed from the information-theoretic perspective [25,26], where an optimal RG transformation is obtained by maximizing the real-space mutual information (RSMI). However, the proposed RSMI algorithm requires a mutual information proxy in order to probe the effective temperature (coupling) of the system along the RG flow, rendering it less practical. A more direct and transparent method that enables direct computation of the corresponding RG flow and critical exponents is thus highly desirable.
Here we present a scheme called neural Monte Carlo RG (NMCRG) that parametrizes the RG transformation in terms of a restricted Boltzmann machine (RBM) [31]. The optimal RG transformation can be learned by minimizing the Kullback-Leibler (KL) divergence between the system distribution and the marginal weight factor distribution (defined in Eq. (5)). This provides an explicit link between the RG transformation and the RBM, allowing us to use modern ML techniques to find the optimal RG transformation. In addition, the scheme is readily integrated with MCRG techniques to directly determine the effective couplings and critical exponents along the RG flow. We demonstrate the accuracy of the approach on the two- and three-dimensional classical Ising models. We find that the optimal transformation leads to an efficient RG flow to the fixed point with short-range renormalized couplings, and saturates the mutual information toward its upper bound.

II. PARAMETRIZATION OF REAL-SPACE RENORMALIZATION GROUP
Consider a generic lattice Hamiltonian

H(σ) = Σ_α K_α S_α(σ),  (1)

where the interactions S_α are combinations of the original spins σ and the K_α are the corresponding coupling constants. A general RG transformation [3,26] can be written as

e^{H′(µ)} = Σ_σ P(µ|σ) e^{H(σ)},  (2)

with parametrized weight factors

P(µ|σ) = e^{Σ_ij W_ij σ_i µ_j} / Σ_µ e^{Σ_ij W_ij σ_i µ_j},  (3)

where the renormalized spins µ_j = ±1 enter the renormalized Hamiltonian H′(µ) = Σ_α K′_α S_α(µ) with renormalized couplings K′_α, and the W_ij are variational parameters to be optimized. In particular, if the W_ij are infinite within a local block of spins and zero everywhere else, we recover the majority-rule transformation [4]. Importantly, this parametrization satisfies the so-called trace condition

Σ_µ P(µ|σ) = 1,  (4)

which is required to correctly reproduce the thermodynamics [3,24,26]. To make the connection with the RBM in the following discussion, we define the weight factor distribution

P(σ, µ) = e^{Σ_ij W_ij σ_i µ_j} / Z,  (5)

where Z = Σ_{σ,µ} e^{Σ_ij W_ij σ_i µ_j}. The weight factor Eq. (3) is then simply the conditional distribution of the weight factor distribution, that is, P(µ|σ) = P(σ, µ)/Σ_µ P(σ, µ).
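Because the exponent in Eq. (3) is linear in each µ_j, the conditional distribution factorizes over the renormalized spins, and the trace condition holds for any choice of W_ij. A minimal numerical sketch of this parametrization (the toy sizes and random weights are illustrative assumptions, not the paper's actual setup):

```python
import numpy as np

def p_mu_plus(W, sigma):
    """P(mu_j = +1 | sigma) for the weight factor of Eq. (3).
    The exponent is linear in each mu_j, so the conditional factorizes:
    P(mu_j | sigma) = exp(a_j mu_j) / (2 cosh a_j), a_j = sum_i W_ij sigma_i."""
    a = sigma @ W
    return 1.0 / (1.0 + np.exp(-2.0 * a))

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4, 2))   # 4 original spins -> 2 renormalized spins
sigma = rng.choice([-1, 1], size=4)
p_plus = p_mu_plus(W, sigma)

# Trace condition, Eq. (4): summing P(mu|sigma) over all mu must give 1
a = sigma @ W
total = 0.0
for m0 in (-1, 1):
    for m1 in (-1, 1):
        mu = np.array([m0, m1])
        total += np.prod(np.exp(a * mu) / (2.0 * np.cosh(a)))
print(round(total, 10))  # -> 1.0
```

Enumerating µ explicitly confirms Σ_µ P(µ|σ) = 1, which is what licenses interpreting P(µ|σ) as a proper weight factor.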
An RBM is a generative model and a staple deep-learning tool for tasks involving unsupervised learning [32,33]. The hidden layers of an RBM can extract meaningful features from data [34]. In this regard, an RBM with fewer hidden variables than visible variables resembles coarse-graining in RG, as first pointed out by Mehta and Schwab [22]. However, their proposed mapping from the variational RG procedure to the unsupervised training of a DNN does not satisfy the trace condition Eq. (4) and thus does not constitute a proper RG transformation (see Appendix for a detailed comparison). Here we propose a direct mapping between the RBM and the weight factors such that Eq. (4) is naturally satisfied.
An RBM can be written in terms of weights W_ij, hidden variables h_j, and visible variables v_i as

Q(v, h) = e^{Σ_ij W_ij v_i h_j} / Z_RBM,  (6)

where Z_RBM = Σ_{v,h} e^{Σ_ij W_ij v_i h_j}. The empirical feature distribution p̂′(h) can be extracted from the empirical distribution p̂(v) through

p̂′(h) = Σ_v Q(h|v) p̂(v),  (7)

where Q(h|v) = Q(v, h)/Σ_h Q(v, h) is the conditional distribution of the hidden variables given the values of the visible variables [32]. The optimal parameters of the RBM are chosen by minimizing the KL divergence between the empirical distribution p̂(v) and the marginal distribution Σ_h Q(v, h),

min_W D(p̂(v) ‖ Σ_h Q(v, h)),  (8)

where D(p ‖ q) = Σ_σ p(σ) log(p(σ)/q(σ)) for two discrete distributions p(σ) and q(σ). Motivated by the similarity between Eqs.
(2) and (7), we identify the conditional distribution Q(h|v) of the RBM with our parametrized weight factor P(µ|σ), associating the hidden and visible variables of the RBM with the renormalized and original spins, respectively. In analogy to the optimization scheme of an RBM, we propose to choose the parameters of the weight factors by minimizing the KL divergence between the system distribution and the marginal weight factor distribution,

min_W D(e^{H(σ)}/Z ‖ Σ_µ P(σ, µ)),  (9)

which can be carried out using standard ML techniques.
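For a handful of spins, both sides of the criterion in Eq. (9) can be evaluated exactly by enumeration, which makes a useful sanity check on an implementation. A sketch, with a uniform toy "system" distribution standing in for the Boltzmann weight of a real model:

```python
import numpy as np
from itertools import product

def rbm_marginal(W):
    """Exact marginal P_W(sigma) = sum_mu e^{sigma.W.mu} / Z for a tiny RBM,
    enumerating all +/-1 configurations. The mu-sum factorizes:
    sum_mu e^{sum_j a_j mu_j} = prod_j 2 cosh(a_j)."""
    nv, _ = W.shape
    sigmas = np.array(list(product([-1, 1], repeat=nv)))
    unnorm = np.prod(2.0 * np.cosh(sigmas @ W), axis=1)
    return sigmas, unnorm / unnorm.sum()

def kl(p, q):
    # D(p || q) = sum_x p(x) log(p(x)/q(x)) over a shared discrete support
    m = p > 0
    return float((p[m] * np.log(p[m] / q[m])).sum())

rng = np.random.default_rng(4)
W = 0.2 * rng.standard_normal((3, 2))   # 3 original spins, 2 renormalized spins
sigmas, q = rbm_marginal(W)
p = np.full(len(q), 1.0 / len(q))       # toy stand-in for e^{H(sigma)}/Z
print(kl(p, p))  # -> 0.0
```

In an actual run, p would be the empirical distribution over Monte Carlo samples, and the enumeration would be replaced by the stochastic gradient procedure of the next section.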

III. STOCHASTIC OPTIMIZATION FOR THE OPTIMAL CRITERION
The optimization problem is solved by stochastic gradient descent, where the parameters are updated by decrementing them along the gradient of the KL divergence. We replace the system distribution e^{H(σ)}/Z by the empirical distribution p̂(σ) over Monte Carlo samples drawn with the Wolff algorithm [35] and write the KL divergence Eq. (9) as an expectation value over the empirical distribution,

D(p̂(σ) ‖ Σ_µ P(σ, µ)) = ⟨log p̂(σ) − F(σ)⟩_p̂ + log Z,  (10)

where F(σ) is the free energy defined as F(σ) = log Σ_µ e^{Σ_ij W_ij σ_i µ_j}. The gradient G_ij of the KL divergence Eq. (10) with respect to W_ij can be derived as

G_ij = −⟨∂F(σ)/∂W_ij⟩_p̂ + ⟨∂F(σ)/∂W_ij⟩_{P_W},  (11)

with ∂F(σ)/∂W_ij = σ_i tanh(Σ_i′ W_i′j σ_i′). The first term in Eq. (11) is simply a sample average of the derivative of the free energy and can be readily computed. The second term is approximated using the contrastive divergence algorithm [36] (CD_k), where the expectation value is calculated from samples drawn from a Markov chain initialized with the data distribution and implemented by Gibbs sampling with k Markov steps. We update the weights in the direction of the negative gradient,

W^{(k+1)}_ij = W^{(k)}_ij − η G_ij,  (12)

where the superscript of the weight W^{(k)} indicates the number of training epochs. We initialize W^{(0)} randomly around zero. Along the gradient descent we obtain a sequence of weight factors, which can be used to compute critical exponents and renormalized couplings, revealing what feature distribution (p̂′(h) in Eq. (7)) the RBM is trying to learn. For translationally invariant systems, a translationally invariant parametrization of the weight factor distribution Eq. (5) can be achieved via convolution [37].
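One training step of this procedure can be sketched as follows. The positive phase is the exact sample average of ∂F/∂W_ij = σ_i tanh(Σ_i′ W_i′j σ_i′); the negative phase is the CD-k estimate from k Gibbs steps. This is a minimal sketch, not the authors' production code: the random ±1 "data" stands in for actual Wolff samples, and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def fields(W, sigma):
    # Local fields a_j = sum_i W_ij sigma_i on each renormalized spin
    return sigma @ W

def sample_hidden(W, v):
    p = 1.0 / (1.0 + np.exp(-2.0 * fields(W, v)))  # P(mu_j = +1 | sigma)
    return np.where(rng.random(p.shape) < p, 1, -1)

def sample_visible(W, h):
    p = 1.0 / (1.0 + np.exp(-2.0 * (h @ W.T)))     # P(sigma_i = +1 | mu)
    return np.where(rng.random(p.shape) < p, 1, -1)

def cd_k_step(W, data, k=1, lr=0.05):
    """One CD-k update. Positive phase: exact data average of
    dF/dW_ij = sigma_i tanh(a_j). Negative phase: the same average over
    configurations reconstructed by k Gibbs steps."""
    pos = data.T @ np.tanh(fields(W, data)) / len(data)
    v = data
    for _ in range(k):
        v = sample_visible(W, sample_hidden(W, v))
    neg = v.T @ np.tanh(fields(W, v)) / len(v)
    # Descending the KL gradient of Eq. (11) amounts to W += lr * (pos - neg)
    return W + lr * (pos - neg)

W = 0.01 * rng.standard_normal((8, 2))     # 8 original spins, 2 renormalized
data = rng.choice([-1, 1], size=(50, 8))   # placeholder for Wolff samples
W = cd_k_step(W, data, k=1)
print(W.shape)  # -> (8, 2)
```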

IV. TWO-DIMENSIONAL ISING MODEL
To validate our scheme, we first consider the two-dimensional (2D) Ising model, H(σ) = K_1 S_nn(σ), where σ_i = ±1, K_1 is the nearest-neighbor coupling, and S_nn denotes the collection of nearest-neighbor inter-spin interactions. In the following, we consider a 2D lattice of size 32 × 32 with periodic boundary conditions. We analyze the ability of the optimal weight factors to remove long-range interactions by directly calculating the renormalized couplings, and we extract the critical exponents [38]. Finding the optimal representation takes seconds to several minutes on a single GPU. Figure 1 shows the weight factors along the optimization process (at the 10th, 30th, and 50th epochs, corresponding to Fig. 2(a)) learned with a translationally invariant filter of size 8 × 8. The filters are initialized uniformly around zero. Localized features emerge after a few epochs of training and progressively aggregate toward the center, in agreement with the conventional wisdom that renormalized and original spins close to one another should couple more strongly than those farther apart [39]. On the other hand, the RBM also picks up nonlocal correlations between the renormalized and original spins, whose interaction strength falls off exponentially with distance.
We proceed to investigate the effect of the criterion of minimizing the KL divergence, to see what the machine is trying to learn. In Fig. 2(a), we show the thermal critical exponents calculated from the weight factors W^(k) along the optimization flow. At the beginning of the training, the partially optimized weights give a poor estimate of the thermal critical exponent at the first step of the RG transformation. After the 30th epoch, the value grows rapidly and converges to the exact value. In Figs. 2(b) and 2(c), we use the weights obtained at each training epoch to calculate the renormalized coupling parameters along the training trajectory. The renormalized couplings, in machine-learning terms, completely describe the energy model underlying the empirical feature distribution (see Eq. (7)) extracted by the machine from the Ising empirical distribution. In Fig. 2(b), we see that the interactions are dominated by the nearest- (K_1) and next-nearest-neighbor (K_2) couplings. The values of the longer-range interactions flow progressively toward zero, as shown in Fig. 2(c). The trend shows that our optimality criterion aims to remove longer-range coupling parameters in the renormalized Hamiltonian. At the critical temperature, the coupling parameters flow toward the fixed point. Slightly away from the critical point, the coupling parameters flow away to the infinite- (zero-) temperature trivial fixed points. Figures 3(b) and 3(c) show the renormalized coupling parameters along the RG flow. The coupling parameters coarse-grained with the optimal weight factors reach K_1 = 0.3109(3), K_2 = 0.1051(2), and K_3 = −0.0184(2) at the third RG step. The values of the longer-range interactions are strongly suppressed compared with those obtained by the majority-rule transformation. Since the renormalized Hamiltonians should be dominated by short-range couplings, our learned weight factors are superior to those of the majority-rule transformation. Table I shows the critical exponents of the 2D Ising model computed using both the RBM and majority-rule transformations.
Surprisingly, although the weights are learned without any prior knowledge of the model, the exponent is very close to the exact value already at the first step of the renormalization transformation, giving y_t = 1.000(2), consistent with the exact value within the statistical error. Equally surprising is that an RBM trained on such a small training set of only 10^4 samples generalizes well. In contrast, the majority-rule transformation gives y_t = 0.975(3) at the first RG iteration. Even though the convergence of the thermal critical exponents is excellent, the scheme overestimates the magnetic critical exponent in the first RG step. The discrepancy in the magnetic exponents has also been noted previously [7,40].
The weight factors considered in the literature are mostly short-range [41] (decimation and majority-rule transformations), i.e., they couple one renormalized spin only to a few original spins in its immediate vicinity. However, despite this apparent locality, such weight factors generally lead to an infinite proliferation of interactions upon renormalization. With our proposed criterion, the learned weight factors contain nonlocal terms that act as counterterms, making the renormalization transformation more local; therefore, only a few short-range interactions are generated during the RG transformation. We note that this strategy of transferring the complexity of the renormalized Hamiltonian to the weight factors has yielded the first exactly soluble RG transformation [39].
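For reference, the majority-rule baseline against which the learned filters are compared is simple to state in code. A sketch for b × b block majority on a ±1 lattice, breaking ties at random on even-sized blocks (a common convention; the paper does not spell out its tie-breaking):

```python
import numpy as np

def majority_rule(sigma, b=2, rng=None):
    """Coarse-grain an L x L +/-1 lattice by b x b block majority.
    Ties (possible for even b) are broken at random."""
    if rng is None:
        rng = np.random.default_rng()
    L = sigma.shape[0]
    # Group the lattice into (L/b) x (L/b) blocks and sum each block
    blocks = sigma.reshape(L // b, b, L // b, b).sum(axis=(1, 3))
    mu = np.sign(blocks)
    ties = mu == 0
    mu[ties] = rng.choice([-1, 1], size=int(ties.sum()))
    return mu

rng = np.random.default_rng(2)
sigma = rng.choice([-1, 1], size=(32, 32))
mu = majority_rule(sigma, b=2, rng=rng)
print(mu.shape)  # -> (16, 16)
```

In the language of Eq. (3), this is the limit of a filter with W_ij → ∞ inside each block and zero outside.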

V. THREE-DIMENSIONAL ISING MODEL
The scheme is easily generalized to higher dimensions, as long as we can train an RBM to represent the optimal RG transformation. Table II shows the thermal critical exponents computed using optimal filters starting from a system of size 64 × 64 × 64. The trailing numbers in parentheses indicate the linear size of the filters. The filters at the first (64 → 32) and second (32 → 16) steps are learned. The following RG steps (16 → 8 and 8 → 4) reuse the filter obtained at the second step. We compare the results with the values obtained from the majority rule [42]. Only the first twenty of the total 53 couplings in Ref. [42] are used. The 2 × 2 × 2 optimal filter gives the exponent closest to the best Monte Carlo estimate y_t = 1.587 [43]. The 2 × 2 × 2 optimal filter is quite homogeneous, with an average value of 0.5254(2), comparable to the optimal choice 0.4314 of Ref. [9]. The average weight values at the second, third, and fourth steps are 0.5057(9), 0.510(1), and 0.544(2), respectively.

VI. REAL-SPACE MUTUAL INFORMATION
We have now established that by parametrizing the weight factors as an RBM, we can learn the optimal RG transformation. On the other hand, the RSMI scheme argues that an optimal RG transformation can be obtained by maximizing the RSMI [25,26]. A natural question is how these two schemes are related. In particular, we would like to see if our optimal RG transformation also maximizes the RSMI.
The RSMI measures the information that knowledge of the environment degrees of freedom E provides about the relevant (coarse-grained) degrees of freedom H. If E completely determines H, the information gained is maximal and I(H; E) reduces to the self-information (the entropy) of the relevant degrees of freedom H, which is itself upper bounded by the logarithm of the number of possible configurations of H.
Adopting the definitions of Refs. [25,26], we consider a system described by a quadripartite distribution P(V, E, H, O) (Fig. 4(a)). We define the RSMI of the system as I(H; E), i.e., the mutual information between the hidden and environment random variables. The distributions needed to compute I(H; E) are the appropriate marginals of P(V, E, H, O).
Here we consider a 4 × 4 Ising model with periodic boundary conditions, for which the RSMI can be computed exactly. We train a 3 × 3 filter on the system to obtain an optimal weight factor distribution. Figure 4(c) shows the evolution of the RSMI during training. Random initialization of the filters gives zero RSMI; as the training progresses, the RSMI saturates the upper bound ln 2 ≈ 0.693. This shows clearly that the optimal weight factors obtained from our algorithm saturate the RSMI, as proposed in Refs. [25,26]. However, our scheme additionally allows a direct calculation of the renormalized coupling parameters and critical exponents using MCRG algorithms, without resorting to proxy systems.
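The saturation value ln 2 can be checked against the elementary definition of mutual information: when a binary H is completely determined by E, I(H; E) equals the entropy of H. A sketch computing I(X; Y) from a joint probability table:

```python
import numpy as np

def mutual_information(P):
    """I(X;Y) in nats from a joint probability table P[x, y]."""
    P = P / P.sum()
    px = P.sum(axis=1, keepdims=True)   # marginal of X
    py = P.sum(axis=0, keepdims=True)   # marginal of Y
    m = P > 0
    return float((P[m] * np.log(P[m] / (px @ py)[m])).sum())

# If H is completely determined by E (here, a perfect copy), I(H;E)
# equals the entropy of H: ln 2 for a single binary spin.
P_copy = np.array([[0.5, 0.0],
                   [0.0, 0.5]])
print(round(mutual_information(P_copy), 6))  # -> 0.693147
```

For the 4 × 4 Ising check in the text, the joint table would instead be built by exact enumeration of the quadripartite distribution P(V, E, H, O) and marginalization onto (H, E).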

VII. CONCLUSIONS
We demonstrate a scheme based on the RBM that is capable of learning the optimal RG transformation from Monte Carlo samples. The similarity between the standard RBM and the weight factors means that we can take advantage of progress in ML architectures and techniques to parametrize and train the filters for RG. The algorithm is flexible and can be easily applied to disordered systems [44]. Although we focus on RBMs with binary variables, for models with continuous variables such as the XY or Heisenberg models one can use Gaussian-Bernoulli RBMs to better model the RG transformation [45]. Generalization of the current scheme to quantum systems should be straightforward via the quantum-to-classical mapping of a d-dimensional quantum system to a (d+1)-dimensional classical system [46]. It would be interesting to test the NMCRG scheme on fermionic systems to see how the fermion sign problem manifests itself. Finally, we note that in the 2D Ising model the filter has to reach a size of 8 × 8 to obtain reasonable critical exponents, while in the 3D case a 2 × 2 × 2 filter suffices to give the best result. Whether this can be associated with the logarithmic correction in the 2D Ising model warrants further study [47].

This work was supported by Ministry of Science and Technology (MOST) of Taiwan under Grants No.
108-2112-M-002-020-MY3 and No. 107-2112-M-002-016-MY3, and partly supported by National Center of Theoretical Science (NCTS) of Taiwan. We are grateful to the National Center for High-performance Computing for computer time and facilities. The code that generates the data used in this paper is available at https://github.com/unixtomato/nmcrg.

Appendix A: Monte Carlo Renormalization Group
Here we summarize the MCRG method used to calculate the critical exponents and renormalized coupling parameters from Monte Carlo samples for a given filter [38].
To determine the critical exponents, we need the derivatives of the RG transformation, T^(n)_αβ ≡ ∂K^(n)_α/∂K^(n−1)_β, which are given by the solution of the linear equations [4]

Σ_α (∂⟨S^(n)_γ⟩/∂K^(n)_α) T^(n)_αβ = ∂⟨S^(n)_γ⟩/∂K^(n−1)_β.

Here ⟨S^(n)_γ⟩ is the expectation value of the spin combinations at the nth RG iteration. The derivatives of these expectation values are obtained from the correlation functions

∂⟨S^(n)_γ⟩/∂K^(m)_β = ⟨S^(n)_γ S^(m)_β⟩ − ⟨S^(n)_γ⟩⟨S^(m)_β⟩.  (A4)

Given a set of spin configurations sampled from some Hamiltonian H = Σ_α K_α S_α, we would like to infer the coupling parameters of H. Define the spin-dependent expectation

⟨S_α,l⟩_l = (1/z_l) Σ_{σ_l} S_α,l e^{H_l},

where z_l = Σ_{σ_l} e^{H_l}, H_l = Σ_α K_α S_α,l, and the S_α,l are the combinations of spins in S_α that include σ_l. Here z_l, H_l, and hence ⟨S_α,l⟩_l depend on the spins neighboring σ_l. The summation over σ_l can be carried out analytically, and we obtain

⟨S_α,l⟩_l = S̃_α,l tanh(Σ_β K_β S̃_β,l),

where S̃_α,l ≡ σ_l S_α,l. The correlation functions can then be written in another form,

⟨S_α⟩ = (1/m_α) Σ_l ⟨S̃_α,l tanh(Σ_β K_β S̃_β,l)⟩,

where m_α is the number of spins in the combination S_α. Introducing a second set of coupling parameters {K̃_α}, we define the analogous expectations with K̃_β in place of K_β, evaluated over the same samples; the differences vanish when K̃_α = K_α, which allows the renormalized couplings to be solved for iteratively. Figure 5 shows the couplings used for the calculation of the renormalized coupling parameters of the two-dimensional Ising model. The first seven even couplings in (a) are used to compute the thermal critical exponent; the odd couplings in (b) are used to compute the magnetic critical exponent.
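The exponent extraction above reduces to linear algebra on sample covariances: build the two covariance matrices, solve the linear system for the linearized RG matrix T, and read off y = ln λ / ln b from its leading eigenvalue. A sketch, with synthetic Gaussian samples standing in for measured spin combinations:

```python
import numpy as np

def linearized_rg_matrix(S_n, S_nm1):
    """Solve  D T = C  for T_{alpha,beta} ~ dK^(n)_alpha / dK^(n-1)_beta, with
    D[g, a] = cov(S^(n)_g, S^(n)_a) and C[g, b] = cov(S^(n)_g, S^(n-1)_b),
    both estimated from MC samples (rows = configurations, cols = couplings)."""
    Sn = S_n - S_n.mean(axis=0)
    Sm = S_nm1 - S_nm1.mean(axis=0)
    D = Sn.T @ Sn / len(Sn)
    C = Sn.T @ Sm / len(Sn)
    return np.linalg.solve(D, C)

def thermal_exponent(T, b=2):
    # Leading eigenvalue lambda of T gives y = ln(lambda) / ln(b)
    lam = np.abs(np.linalg.eigvals(T)).max()
    return float(np.log(lam) / np.log(b))

# Synthetic correlated samples (placeholders, not real Ising measurements)
rng = np.random.default_rng(3)
S_nm1 = rng.standard_normal((500, 3))
S_n = S_nm1 @ rng.standard_normal((3, 3)) + 0.1 * rng.standard_normal((500, 3))
T = linearized_rg_matrix(S_n, S_nm1)
print(T.shape)  # -> (3, 3)
```

With real data, the rows would be Wolff-sampled configurations measured at successive RG steps, and y_t ≈ 1 would be expected for the 2D Ising model.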
Appendix B: Comparison with Other RBM-based Schemes

RG transformation and Normalizing Condition
Consider again a general RG transformation,

e^{H′(µ)} = Σ_σ P(µ|σ) e^{H(σ)},  (B1)

where P(µ|σ) is the weight factor. The weight factor is required to satisfy the trace condition

Σ_µ P(µ|σ) = 1.  (B2)
We argue that the trace condition is indispensable, since it leads to the invariance of the free energy under renormalization and to the fundamental relation

f(K) = g(K) + b^{−d} f(K′),  (B3)

where f(K) is the free energy density of the system in the thermodynamic limit, g(K) is an analytic background, b is the scale factor, and d is the spatial dimension. For K consisting of a nearest-neighbor coupling and a magnetic field, a suitable transformation brings the singular part to f_s(t, h) = b^{−d} f_s(b^{y_t} t, b^{y_h} h), where y_t and y_h are the often sought-after thermal and magnetic critical exponents.
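The scaling relation for the singular part of the free energy density, f_s(t, h) = b^{−d} f_s(b^{y_t} t, b^{y_h} h), yields the exponents by standard RG bookkeeping (textbook material, not specific to this scheme): iterating n times and choosing the scale so that the rescaled temperature argument is of order one gives

```latex
f_s(t, h) = b^{-nd}\, f_s\!\left(b^{n y_t} t,\; b^{n y_h} h\right),
\qquad
b^{n y_t} |t| = 1
\;\Longrightarrow\;
f_s(t, 0) = |t|^{d/y_t}\, f_s(\pm 1, 0),
```

so that, for example, the correlation-length exponent is ν = 1/y_t and the specific-heat exponent follows from α = 2 − d/y_t; the magnetic exponents follow analogously from y_h.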
In the following, we review the schemes proposed in Refs. [22] and [25] and point out the shortcomings in each scheme.

Variational RG and Mehta and Schwab's Mapping
In Ref. [22], the weight factor is defined as

P(µ|σ) = e^{Σ_ij W_ij σ_i µ_j − H(σ)}.  (B4)

Here H(σ) is the original Hamiltonian, e.g., H(σ) = K Σ_{⟨ij⟩} σ_i σ_j, and the W_ij are the variational parameters. This form of the weight factor does not satisfy the trace condition and, in general, it is not possible to choose the parameters W_ij so that the trace condition (B2) is satisfied. The fundamental relation (B3) is then only approximate.
We note that in the original procedure of the variational renormalization group [21], the form of the weight factor is chosen such that it satisfies the trace condition for all values of the variational parameters. The variational parameters are used, instead, to optimize the lower bound of the approximate free energy density.
Define the distribution of the weight factor with variational parameters W_ij,

P_W(σ) = Σ_µ e^{Σ_ij W_ij σ_i µ_j} / Σ_σ Σ_µ e^{Σ_ij W_ij σ_i µ_j}.  (B5)
In Ref. [22], the variational parameters are chosen to make the divergence

D(p(σ) ‖ P_W(σ)),  (B6)

with p(σ) = e^{H(σ)}/Z, as small as possible. This completely fixes the variational parameters, leaving no room for optimizing the lower bound of the free energy approximation. That is to say, the variational approximation in machine learning (B6) and the variational approximation of the variational renormalization theory operate at completely different levels.
The rationale of the criterion (B6) for choosing the variational parameters is that it is a necessary, but not sufficient, condition for the trace condition to be satisfied. If the trace condition holds, then Σ_µ e^{Σ_ij W_ij σ_i µ_j} = e^{H(σ)}, so the normalization factor Σ_σ Σ_µ e^{Σ_ij W_ij σ_i µ_j} equals the partition function of the original Hamiltonian, denoted Z, and the divergence (B6) is exactly zero. The criterion is not sufficient since, even when (B6) vanishes, we only have

Σ_µ e^{Σ_ij W_ij σ_i µ_j} = c e^{H(σ)},

where the trace condition fails up to some unknown constant c not necessarily equal to one. On the other hand, with the parametrized form of the weight factor as in (B4), the renormalized Hamiltonian describes the marginal distribution P_W(µ) of the RBM. Define P_W(µ) to be

P_W(µ) = Σ_σ e^{Σ_ij W_ij σ_i µ_j} / Σ_σ Σ_µ e^{Σ_ij W_ij σ_i µ_j}.  (B8)

The normalization factor Σ_σ Σ_µ e^{Σ_ij W_ij σ_i µ_j} is thus equal to the partition function Z′ of the renormalized Hamiltonian, irrespective of the choice of the variational parameters W_ij. Therefore P_W(µ) = e^{H′(µ)}/Z′. In this respect, we can say that the hidden variables of the machine are described by the renormalized Hamiltonian.

Real-space Mutual Information Algorithm
In Ref. [25], the weight factor factorizes as

P_Λ(H|V) = Π_j P_Λ(H_j|V_j),

where H_j = {µ_j} consists of a single renormalized spin and V_j = {σ^j_1, σ^j_2} consists of two original spins in the case of a one-dimensional system (and a 2 × 2 block in the case of a two-dimensional system); see Fig. 6. The local weight factor is parametrized as

P_Λ(µ_j|V_j) = e^{µ_j Σ_i Λ_i σ^j_i} / (2 cosh(Σ_i Λ_i σ^j_i)).
The variational parameters Λ are obtained from a single copy of the local weight factor, and hence we omit the subscript j in what follows. The parameters are chosen to make the real-space mutual information

I_Λ(H; E) = Σ_{H,E} P_Λ(E, H) log ( P_Λ(E, H) / (P_Λ(H) P(E)) )

as large as possible, where the distributions on the right-hand side are defined as above. Since P(E) is independent of Λ, one instead maximizes a proxy A_Λ = Σ_{H,E} P_Λ(E, H) log(P_Λ(E, H)/P_Λ(H)) of the mutual information. However, to evaluate the proxy A_Λ, further approximations have to be made.
In order to perform a quantitative analysis, the authors construct a "thermometer" function T(A_Λ) that maps the proxy A_Λ to a temperature. The thermometer extracts the effective temperature of the renormalized system. To construct such a thermometer, one must generate sets of MC samples at different temperatures. For each set of samples, one computes the proxy A_Λ and hence knows the mapping from A_Λ to the temperature T for that set. For a given type of system (e.g., Ising), we can write T(A_Λ) as T(T_0, L, b, l), where T_0 is the temperature of the initially prepared system, L is the initial system size, b is the scale factor, and l is the scaling length (l = 0 refers to the original system, l = 1 to a one-step renormalization, and so on). One can then fit a function to these sets of samples and construct the thermometer. M. Koch-Janusz and Z. Ringel postulate a scaling function of the form f((L/b^l)^{1/ν}) related to the effective renormalized temperature T(T_0, L, b, l) by

(T(T_0, L, b, l) − T_c)/(T_0 − T_c) = f((L/b^l)^{1/ν}),

where T_c is the critical temperature of the original system. Finally, one can collapse the plot of (T − T_c)/(T_0 − T_c) as a function of (L/b^l)^{1/ν} to estimate the values of ν and T_c.

Neural Monte Carlo Renormalization Group
In our work, we define the weight factor to be

P_W(µ|σ) = e^{Σ_ij W_ij σ_i µ_j} / Σ_µ e^{Σ_ij W_ij σ_i µ_j}.  (B14)

For a translationally invariant system, the variational parameters are shift invariant; that is, for different j and j′ we have, in the case of a one-dimensional system with scale factor b = 2,

W_{i+2, j+1} = W_{i, j}.  (B15)

The weight factor satisfies the trace condition for all values of the W_ij. Let us define a joint distribution out of this weight factor,

P_W(µ, σ) = e^{Σ_ij W_ij σ_i µ_j} / Σ_σ Σ_µ e^{Σ_ij W_ij σ_i µ_j}.  (B16)

Here P_W(µ, σ) has exactly the same form as an RBM, and the weight factor can be viewed as the conditional distribution P_W(µ|σ) = P_W(µ, σ)/Σ_µ P_W(µ, σ). Consider one of the breakthroughs in the realm of deep learning, where Hinton introduced a greedy layer-wise unsupervised learning algorithm (see Sec. 2.3 of [32]). Denote by P_W(µ|σ) the posterior over µ associated with the trained RBM (we recall that σ is the observed input). This gives rise to a (feature) empirical distribution p′(µ) over the hidden variables µ when σ is sampled from the data empirical distribution p(σ):

p′(µ) = Σ_σ P_W(µ|σ) p(σ).  (B17)

The samples of µ with empirical distribution p′(µ) become the input for another layer of RBM. We can thus view the RBM as extracting features µ from inputs σ. Note the similarity between the RG transformation (B1) and the feature-extraction process (B17). We postulate that the input distribution p(σ) is determined by some Hamiltonian H(σ) with p(σ) = e^{H(σ)}/Z, and that the posterior distribution P_W(µ|σ) of the RBM acts as a weight factor performing the RG transformation, e^{H′(µ)} = Σ_σ P_W(µ|σ) e^{H(σ)}. The feature-extraction process (B17) then becomes a necessary condition for the system to perform the RG transformation. In other words, the feature distribution extracted by the machine is described by the renormalized Hamiltonian. The variational parameters in the weight factor P_W(µ|σ) are now free to change: every choice of parameters yields a well-defined RG transformation.
The criterion for choosing the parameters is, from the perspective of performing RG, otherwise arbitrary: we do not know a priori which weights W_ij give a "nicer" RG flow. A nice RG flow, however, should bring the original Hamiltonian to the fixed point quickly. It should also remove long-range coupling parameters, both for the practical purpose of performing RG and, loosely speaking, for killing the irrelevant scaling fields. Critical exponents and coupling parameters can then be readily computed using the MCRG techniques described in the previous section.
In the realm of machine learning, the weights of an RBM are chosen to make the divergence (B6) as small as possible. We note that this criterion is entirely machine-learning-theoretical. In contrast, in Ref. [22] the criterion also serves as a necessary condition for the weight factor to satisfy the trace condition, a notion which is RG-theoretical.
In summary, our NMCRG scheme provides an ansatz for the weight factors in the RG transformation such that the trace condition is always satisfied and the optimal RG transformation can be learned. It also allows a direct computation of the renormalized coupling parameters and critical exponents. As demonstrated in the main text, the NMCRG scheme also naturally saturates the RSMI. The simplicity and flexibility of the scheme should find further applications in the future.