Optimal Renormalization Group Transformation from Information Theory

The connections between information theory, statistical physics and quantum field theory have been the focus of renewed attention. In particular, the renormalization group (RG) has been explored from this perspective. Recently, a variational algorithm employing machine learning tools to identify the relevant degrees of freedom of a statistical system by maximizing an information-theoretic quantity, the real-space mutual information (RSMI), was proposed for real-space RG. Here we investigate analytically the RG coarse-graining procedure and the renormalized Hamiltonian, which the RSMI algorithm defines. By a combination of general arguments, exact calculations and toy models we show that the RSMI coarse-graining is optimal in a sense we define. In particular, a perfect RSMI coarse-graining generically does not increase the range of a short-ranged Hamiltonian, in any dimension. For the case of the 1D Ising model we perturbatively derive the dependence of the coefficients of the renormalized Hamiltonian on the real-space mutual information retained by a generic coarse-graining procedure. We also study the dependence of the optimal coarse-graining on the prior constraints on the number and type of coarse-grained variables. We construct toy models illustrating our findings.


I. INTRODUCTION
The conceptual relations between physics and information theory date back to the very earliest days of statistical mechanics; they include the pioneering work of Boltzmann and Gibbs on entropy [1,2], finding its direct counterpart in Shannon's information entropy [3], and investigations of Szilard and Landauer [4,5]. In the quantum regime research initially focused on foundational challenges posed by the notion of entanglement, but soon gave rise to the wide discipline of quantum information theory [6], whose more practical aspects include quantum algorithms and computation.
In recent years there has been a renewed interest in applying the formalism and tools of information theory to fundamental problems of theoretical physics. The motivation mainly comes from two, not entirely unrelated, directions. On the one hand the high-energy community is actively investigating the idea of holography in quantum field theories [7][8][9], originally inspired by black-hole thermodynamics. On the other hand in condensed matter theory there is a growing appreciation of the role of the entanglement structure of quantum wave functions in determining the physical properties of the system. This is exemplified by the short- and long-range entanglement distinguishing the symmetry-protected topological phases [10][11][12] (e.g. topological insulators) from genuine, fractionalized topological orders (e.g. fractional quantum Hall states). The conceptual advances also led to constructive developments in the form of new ansätze for wave functions (MPS [13], MERA [14]) and numerical algorithms (DMRG [15], NQS [16]).
The focus of this work is on the renormalization group (RG). One of the conceptually most profound developments in theoretical physics, in particular condensed matter theory, it provides, beyond more direct applications, a theoretical foundation for the notion of universality [17][18][19][20][21]. The possible connections of RG to information theory have been explored in a number of works [22][23][24][25][26][27][28] in both classical and quantum settings. In particular, in a previous work [28] some of the present authors introduced a numerical algorithm for real-space RG of classical statistical systems, based on the characterization of the relevant degrees of freedom supported in a spatial block as the ones sharing the most mutual information with the environment of the block. The algorithm employs machine learning techniques to extract those degrees of freedom and combines this with an iterative sampling scheme, similar in spirit to Monte Carlo RG [29,30], though, in a crucial difference, the form of the RG coarse-graining rule is not given, but rather learned. Strikingly, the coarse-graining rules discovered by the algorithm for the test systems were in a certain sense optimal: they ignored irrelevant short-scale noise, and they resulted in simple effective Hamiltonians or matched nontrivial analytical results.
The above suggests that real-space RG can be universally defined in terms of information theory, rather than based on problem-specific physical intuition. Here we develop a theoretical foundation inspired by, and underlying, those numerical results. We show that they were not accidental, but rather a consequence of general principles. To this end we study analytically the coarse-graining procedure maximizing the real-space mutual information with the environment (RSMI), and the effective Hamiltonian it defines. For the solvable example of the 1D Ising model we perturbatively derive the coupling constants of the renormalized Hamiltonian resulting from, and the mutual information captured by, an arbitrary coarse-graining procedure, and show decay of the higher-order and/or long-range terms with increased mutual information. We show that this holds true more generally: an ideal, full-RSMI-retaining coarse-graining of a nearest-neighbour Hamiltonian results in a strictly nearest-neighbour effective Hamiltonian, in any dimension. We then theoretically investigate the effects generally imposed by the constraints on the number and type of coarse-grained variables, which force a non-ideal coarse-graining where part of the relevant information is lost. Furthermore, we construct simple toy models providing intuitive understanding of our results, in particular the difference between the preferred solutions in one and higher dimensions.
The combination of the analytical results on idealized and more realistic schemes, the toy models, as well as numerical results in Ref. [28] strongly supports the notion of RSMI-maximization as a model-independent variational principle defining the optimal RG coarse-graining. In contrast to fixed schemes, this RG transformation is, by construction, informed by the physics of the system under consideration, including the position in the phase diagram. This could allow application of well-behaved RG schemes to systems, for which they are currently not known, avoiding many of the pitfalls befalling fixed RG transformations.
The paper is organized as follows: in Sec. II the RSMI algorithm and the information-theoretic formalism it uses are reviewed; in Sec. III we prove that an RSMI-maximizing coarse-graining does not generate longer-range interactions in the situation of full information capture. In Sec. IV we discuss the derivation of the effective (renormalized) Hamiltonian and its properties. In Sec. V we investigate the complexity of the effective Hamiltonian in the realistic situation of partial information capture, using the example of coarse-grainings of the 1D Ising model. In Sec. VI we study more generally the effect of constraints on the number and type of coarse-grained degrees of freedom on the optimal RG procedure. We introduce toy models explaining the differences in optimal coarse-graining procedures in 1D and 2D. Finally, in Sec. VII we discuss implications of the results, possible generalizations and open questions. A number of appendices give technical details of the derivations of the statements in the main text and additional information.

II. THE RSMI ALGORITHM
The real-space mutual information (RSMI) algorithm is defined in the context of real-space RG, originally introduced by Kadanoff for lattice models [17]. The goal of real-space RG [21] is to coarse-grain a given set of degrees of freedom X in position space in order to integrate out short-range fluctuations and retain only long-range correlations, and in so doing to construct an effective theory. An iterative application of this procedure should result in recursive relations between coupling constants of the Hamiltonian at successive RG steps; those are the RG flow equations formalizing the relationship between effective theories at different length scales. Consider a generic system with real-space degrees of freedom X described by the Hamiltonian H[X] and a canonical partition function

Z = Tr_X e^{-βH[X]} = Tr_X e^{K[X]},

with the inverse temperature β = 1/k_B T and the reduced Hamiltonian K := −βH. Equivalently, the system is specified by a probability measure

P(X) = e^{K[X]} / Z.

The coarse-graining transformation X → X' between the set of the original degrees of freedom and a (smaller) set of new degrees of freedom is given by a conditional probability distribution P_Λ(X'|X), where Λ is a set of parameters completely specifying the rule (note that the rule can be totally deterministic, in which case P_Λ is a delta function). The probability measure of the coarse-grained system is then

P'(X') = Tr_X P_Λ(X'|X) P(X).

If P'(X') is (or at least can be approximated by) a Gibbs measure, then the requirement to correctly reproduce thermodynamics enforces Z' = Z, and a renormalized Hamiltonian H'[X'] in the new variables X' can be defined implicitly via

e^{K'[X']} := Tr_X P_Λ(X'|X) e^{K[X]}.    (4)

The procedure is often implemented in the form of block RG [21,31].
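For intuition, the relation P'(X') = Tr_X P_Λ(X'|X) P(X) can be evaluated by brute-force marginalization for a very small chain. The sketch below is illustrative only (chain size, coupling and the choice of a deterministic decimation rule as P_Λ are not taken from the paper):

```python
import itertools, math

K = 0.5   # illustrative reduced coupling
N = 6     # small chain, periodic boundary conditions

def weight(x):
    # unnormalized Gibbs weight e^{K[X]} for a nearest-neighbour chain
    return math.exp(K * sum(x[i] * x[(i + 1) % N] for i in range(N)))

states = list(itertools.product([-1, 1], repeat=N))
Z = sum(weight(x) for x in states)
P = {x: weight(x) / Z for x in states}

# Deterministic coarse-graining rule P_Lambda (decimation): keep every second spin.
def coarse(x):
    return tuple(x[i] for i in range(0, N, 2))

# P'(X') = sum over X of P_Lambda(X'|X) P(X), here a delta-function rule
Pp = {}
for x, p in P.items():
    xp = coarse(x)
    Pp[xp] = Pp.get(xp, 0.0) + p

# the coarse-grained measure is again a normalized probability distribution
assert abs(sum(Pp.values()) - 1.0) < 1e-12
```

Since the rule here is a delta function, the trace over X simply bins the original probabilities by their coarse-grained image; a stochastic P_Λ would instead distribute each weight over several X'.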
This corresponds to a factorization of the conditional probability distribution into independent contributions from equivalent (assuming translation invariance) blocks V ⊂ X:

P_Λ(X'|X) = ∏_{j=1}^{n} P_Λ(H_j|V_j),

where {V_j}_{j=1}^{n} and {H_j}_{j=1}^{n} are partitions of X and X', respectively, and P_Λ now defines the coarse-graining of a single block (and therefore Λ contains substantially fewer parameters). Concrete examples of such P_Λ include the standard "decimation" or "majority-rule" transformations [see Eqs. (20,21)].
Not every choice of P Λ is physically meaningful. It should at least be consistent with the symmetries of the system under consideration, for instance. This is, however, not sufficient in practice. While it may be difficult to formulate a concise criterion for the choice of the coarse-graining transformation it is clear that in order to derive the recursive RG equations the effective Hamiltonian cannot proliferate new couplings at each step. If there is to be a chance of analytical control over the procedure, the interactions in the effective Hamiltonian should be tractable (short-ranged, for instance). That is to say, if one chooses the "correct" degrees of freedom to describe the system, the resulting theory should be "simple". Numerous examples of failure to achieve this can be found in the literature [32,33], and include cases as simple as decimation of the Ising model in 2D. Implicit in this is the notion that there does not exist a single RG transformation which does the job, but rather the transformation should be designed for the problem at hand [34].
Recently, some of us proposed the maximization of the real-space mutual information (introduced below) as a criterion for a physically meaningful RG transformation [28]. The idea behind it is that the effective block degrees of freedom, in whose terms the long-wavelength theory is simple, are those which retain the most of the information (already present in the block) about long-wavelength properties of the system. This informally introduced "information" can be formalized by the following construction. Consider a single block V at a time and divide the system into four regions X = V ∪ B ∪ E ∪ O: the visibles (i.e. the block) V, the buffer B, the environment E and the remaining outer part of the system O (which is only introduced for algorithmic reasons; conceptually the environment E could also contain this part). Fig. 2 depicts this decomposition in the case of a 1D spin model, but it trivially generalizes to any dimension. The real-space mutual information between the new (coarse-grained) degrees of freedom H and the environment E of the original ones (i.e. of the block) is then defined as:

I_Λ(H : E) = Σ_{H,E} P_Λ(E, H) log [ P_Λ(E, H) / (P_Λ(E) P_Λ(H)) ],

where P_Λ(E, H) and P_Λ(H) are marginal distributions of P_Λ(H, X) = P_Λ(H|V) P(X). Thus I_Λ(H : E) is the standard mutual information between the random variables H and E. Exclusion of the buffer B (in contrast to other adaptive schemes, see for instance [35]), generally of linear extent comparable to V, is of fundamental importance: it filters out short-range correlations, leaving only the long-range contributions to I_Λ(H : E).

FIG. 2. Schematic decomposition of the system for the purpose of defining the mutual information I_Λ(H : E) (in 1D, for concreteness). The full system is partitioned into blocks of visibles V (yellow) embedded into a buffer B (blue) and surrounded by the environment E (green). The remaining part of the system is denoted by O in the main text. The conditional probability distribution P_Λ(H|V) couples V to the hiddens H (red).
The RSMI satisfies the following bounds (see also Appendix A):

0 ≤ I_Λ(H : E) ≤ min{ H(H), I(V : E) },    (8)

where H(H) denotes the information entropy of H and I(V : E) is the mutual information of the visibles with the environment. The optimization algorithm starts with a set of samples drawn from P(X) and a differentiable ansatz for P_Λ(H|V), which in Ref. [28] takes the form of a Restricted Boltzmann Machine (RBM), parametrized by Λ (see Appendix C 2), and updates the parameters using a (stochastic) gradient descent procedure. The cost function to be maximized is precisely I_Λ(H : E), which in the course of the training is increased towards the value of I(V : E). The iterative procedure is shown in Fig. 1. Using the trained P_Λ(H|V) the original set of samples drawn from P(X) can be coarse-grained and the full procedure re-computed for a subsequent RG step.
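As a minimal numerical illustration of the quantity appearing in these bounds, mutual information is computed directly from a joint distribution table. The joint probabilities below are made up for illustration (a binary "hidden" variable correlated with a binary environment summary):

```python
import math

# illustrative joint distribution P(h, e) for two binary +/-1 variables
P = {(+1, +1): 0.4, (+1, -1): 0.1, (-1, +1): 0.1, (-1, -1): 0.4}

# marginals P(h) and P(e)
Ph = {h: sum(p for (hh, e), p in P.items() if hh == h) for h in (+1, -1)}
Pe = {e: sum(p for (h, ee), p in P.items() if ee == e) for e in (+1, -1)}

# I(H:E) = sum P(h,e) log2[ P(h,e) / (P(h) P(e)) ], in bits
I = sum(p * math.log2(p / (Ph[h] * Pe[e])) for (h, e), p in P.items() if p > 0)
# here I ~ 0.278 bits, safely between 0 and min{H(H), H(E)} = 1 bit
```

The same function of a joint table, applied to P_Λ(E, H), is what the RSMI cost evaluates.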

III. OPTIMALITY: THE MEASURE AND THE EFFECTIVE HAMILTONIAN
In what sense could the RSMI construction reviewed above be described as optimal? As we have alluded to in the introduction, this is linked to the short-rangedness and/or simplicity (i.e. the absence of arbitrarily high-order interactions) of the resulting effective Hamiltonian. In fact, the question can also be approached at the level of the probability measure (which is the fundamental object the RSMI algorithm works with). Here we make those statements more concrete.
Let us first consider the following situation: given a 1D system specified by a short-ranged Hamiltonian, we introduce a coarse-graining {V_j} with a block size chosen so that the Hamiltonian is nearest-neighbour with respect to the blocks. Let us choose an arbitrary block V_0, denote its immediate neighbours V_{±1} as the buffer B, and all the remaining blocks {V_{j<−1}} and {V_{j>1}} as the environment E_0, or in more detail, as the left- and right-environment E_{L/R}(V_0), respectively. Assume now that H_0, the coarse-grained variable for V_0, is constructed in such a way that I(H_0 : E_0) = I(V_0 : E_0), i.e. the coarse-grained variable retains all of the information which the original block V_0 contained about the environment and, by extension, about any long-wavelength physics. In this idealized situation of full information capture the following holds true (see Appendix B): if one considers the probability measure on the coarse-grained variables P({H_j}) and integrates (traces) out the neighbours H_{±1} of H_0, then, for H_0 clamped, the probability measure factorizes:

P(E_L(H_0), E_R(H_0) | H_0^*) = P(E_L(H_0) | H_0^*) P(E_R(H_0) | H_0^*),

where H_0^* denotes the clamped variable. In other words, for fixed H_0 the probabilities of its left and right environments E_{L/R}(H_0) are independent. In the effective Hamiltonian language this corresponds to the statement that the energy contains no direct coupling terms between E_L(H_0) and E_R(H_0):

K'[{H_j}] = K'_L[E_L(H_0), H_0^*] + K'_R[H_0^*, E_R(H_0)].

Since the variables {H_{j<−1}} and {H_{j>1}} are decoupled after integrating out H_{±1}, there generically could not have been any longer-range interaction (in particular, next-nearest-neighbour) involving H_{±1} in the renormalized Hamiltonian; otherwise the measure/Hamiltonian would not factorize. Furthermore, since the choice of V_0, E_0 was arbitrary, we have that the full coarse-grained Hamiltonian is nearest-neighbour.
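The factorization under clamping can be verified directly at the level of the original nearest-neighbour variables, where it is simply the Markov property of the 1D chain. A brute-force sketch (chain length and coupling are illustrative):

```python
import itertools, math

K, N = 0.4, 7          # short open Ising chain, illustrative coupling
mid = 3                # clamp the middle spin
left, right = (0, 1, 2), (4, 5, 6)

def w(x):
    # unnormalized Gibbs weight for an open nearest-neighbour chain
    return math.exp(K * sum(x[i] * x[i + 1] for i in range(N - 1)))

for clamp in (-1, 1):
    states = [x for x in itertools.product([-1, 1], repeat=N) if x[mid] == clamp]
    Z = sum(w(x) for x in states)
    joint, pl, pr = {}, {}, {}
    for x in states:
        l = tuple(x[i] for i in left)
        r = tuple(x[i] for i in right)
        p = w(x) / Z
        joint[l, r] = joint.get((l, r), 0.0) + p
        pl[l] = pl.get(l, 0.0) + p
        pr[r] = pr.get(r, 0.0) + p
    # P(E_L, E_R | x*) = P(E_L | x*) P(E_R | x*): left and right sides
    # are independent once the middle spin is clamped
    assert all(abs(joint[l, r] - pl[l] * pr[r]) < 1e-12 for (l, r) in joint)
```

The content of the idealized-RSMI argument is that this conditional independence survives the coarse-graining when H_0 retains all of V_0's information about the environment.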
The argument, under very mild additional assumptions, generalizes to any dimension D. Taking a regular coarse-graining pattern with the radius of the blocks sufficiently large to make the short-ranged Hamiltonian nearest-neighbour with respect to the blocks, and under the analogous assumption of full information capture, we can repeat the above reasoning, clamping, instead of a single arbitrary variable H_0, a hyperplane of dimension D − 1 separating the coarse-grained variables {H_j} into two disconnected sets, to show that no longer-ranged interactions across the hyperplane can exist. Since the choice of hyperplane is arbitrary we conclude that the effective Hamiltonian is nearest-neighbour, as the original one was (see Appendix B). A perfect RSMI scheme does not, therefore, increase the range of a short-ranged Hamiltonian. Note that we do not make any statements about the order of the interactions at this point (i.e. two-spin, three-spin, ...).
While very appealing, the above results have one serious shortcoming: in practical coarse-graining schemes we do not typically satisfy the assumption I(H_0 : E_0) = I(V_0 : E_0). This is due to the fact that the block size as well as the number and character (Ising spin, Potts spin, ...) of the coarse-grained variables are usually chosen a priori. We therefore investigate a more realistic setup, in which the RSMI is maximized under constraints on the number and type of coarse-grained degrees of freedom. Additionally, since the RG rule is optimized iteratively, we can study the approach to the optimal solution by considering the properties of the renormalized Hamiltonian defined by the coarse-graining rule at any stage of the training. In order to do this analytically we now show how the effective Hamiltonian can be expressed by an appropriate cumulant expansion [31] (though the RSMI algorithm deals with the probability measure as the basic object, and at no point computes the Hamiltonian, the Hamiltonian picture is physically more interpretable). Subsequently we apply this machinery to the 1D Ising model.

IV. THE CUMULANT EXPANSION
Consider a generic Hamiltonian K[X]. We split it into two parts [31]:

K[X] = K_0[X] + K_1[X],    (11)

where K_0 contains intra-block terms, i.e. those which only couple spins within a single block, and K_1 contains inter-block terms, i.e. those that couple spins from different blocks. Such a decomposition simplifies the calculations significantly. Due to translation invariance the intra-block terms are all of the same form:

K_0[X] = Σ_{j=1}^{n} K_0[V_j].    (12)

Using the decomposition Eqs. (11) and (12) the definition of the renormalized Hamiltonian in Eq. (4) can be rewritten as an intra-block average of the inter-block part of the Hamiltonian:

e^{K'[X']} ∝ ⟨ e^{K_1[X]} ⟩_{Λ,0} [X'],    (13)

where the average ⟨·⟩_{Λ,0} is over P_{Λ,0}(X|X') ∝ P_Λ(X'|X) e^{K_0[X]} as a probability distribution in X and thus introduces a dependence on the new spin variables X'. We indicate this dependence by square brackets [·] after the average. Equation (13) lends itself to a cumulant expansion:

log ⟨ e^{K_1} ⟩_{Λ,0} = Σ_{k=1}^{∞} (1/k!) C_k,    (14)

with the standard expressions for the cumulants in terms of moments, the first few of which are given by:

C_1 = ⟨K_1⟩_{Λ,0},
C_2 = ⟨K_1^2⟩_{Λ,0} − ⟨K_1⟩_{Λ,0}^2,    (15)
C_3 = ⟨K_1^3⟩_{Λ,0} − 3⟨K_1^2⟩_{Λ,0} ⟨K_1⟩_{Λ,0} + 2⟨K_1⟩_{Λ,0}^3,

where for brevity we did not indicate the dependence on X'. The powers of K_1 inside the averages induce couplings between multiple blocks and naturally lead to new coupling terms in the effective Hamiltonian. The cumulant expansion Eq. (14) allows one to determine the new Hamiltonian by taking the logarithm of Eq. (13):

K'[X'] = Σ_{k=1}^{∞} (1/k!) C_k[X'] + const.    (16)

The renormalized coupling constants are not apparent in Eq. (16). In order to identify them we introduce the following canonical form of the Hamiltonian:

K'[X'] = Σ_{(α_1,…,α_n)} K_{α_1,α_2,…,α_n} Σ_{j=1}^{n} ∏_{ℓ=1}^{n} (x'_{j+ℓ−1})^{α_ℓ},    (17)

with α_1 = 1 and α_ℓ ∈ {0, 1} for all ℓ > 1. Here, addition of the indices is to be understood modulo n (i.e. with periodic boundary conditions). Note that arbitrary orders k of the cumulant expansion C_k contribute to each coupling constant K_{α_1,α_2,…,α_n}.
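The moment-to-cumulant relations can be sanity-checked numerically: for any bounded random variable playing the role of K_1, the truncated sum C_1 + C_2/2 + C_3/6 should approximate log⟨e^{K_1}⟩. A sketch with made-up numbers:

```python
import math

def cumulants(m1, m2, m3):
    # first three cumulants from the first three moments
    C1 = m1
    C2 = m2 - m1**2
    C3 = m3 - 3 * m2 * m1 + 2 * m1**3
    return C1, C2, C3

# toy two-valued distribution standing in for K_1 (illustrative numbers)
vals, probs = [0.3, -0.2], [0.6, 0.4]
m1, m2, m3 = (sum(p * v**k for v, p in zip(vals, probs)) for k in (1, 2, 3))
C1, C2, C3 = cumulants(m1, m2, m3)

exact = math.log(sum(p * math.exp(v) for v, p in zip(vals, probs)))
approx = C1 + C2 / 2 + C3 / 6   # cumulant expansion truncated at k = 3
assert abs(exact - approx) < 1e-3
```

For the actual calculation the averages are over the intra-block measure P_{Λ,0}, and the X'-dependence of the moments is what generates the renormalized couplings.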

V. EXAMPLE: THE 1D ISING MODEL
To investigate analytically the relationship between the effective Hamiltonian and the real-space mutual information for practical coarse-graining procedures, we consider the example of the one-dimensional Ising model with nearest-neighbour interactions and periodic boundary conditions. Using the tools introduced in the previous section, we first derive the effective Hamiltonian corresponding to an RSMI-maximizing coarse-graining and show it remains nearest-neighbour. We then compute explicitly the amount of RSMI captured by an arbitrary coarse-graining and examine the corresponding effective Hamiltonian to establish a general relation between the two. The Ising Hamiltonian reads:

K[X] = K Σ_{j=1}^{N} x_j x_{j+1},    (18)

with x_j = ±1 and x_{N+1} ≡ x_1. The sizes of the block, buffer and environment regions introduced in Sec. II are given by L_V, L_B and L_E. Accordingly, there are n = N/L_V blocks.
To best illustrate the results we now specialize to the (typical) case of blocks of two visible spins V = {v_1, v_2}, coarse-grained into a single hidden spin h (computations for general L_V are analogous). The RG rule is parametrized by an RBM ansatz:

P_Λ(h|V) = exp[h(λ_1 v_1 + λ_2 v_2)] / 2cosh(λ_1 v_1 + λ_2 v_2),    (19)

with Λ = (λ_1, λ_2) describing the quadratic coupling of visible to hidden spins (see Appendix C 2 for a discussion of the ansatz). In Fig. 2 the decomposition of the system and the RG rule are schematically shown. The standard decimation and majority-rule coarse-graining schemes are given in our language by:

P(h|V) = δ_{h, v_1},    (20)

and by:

P(h|V) = δ_{h, sgn(v_1 + v_2)} (with ties broken with equal probability),    (21)

respectively. They are easily seen to correspond to the choices Λ_dec = (λ, 0) and Λ_maj = (λ, λ) in the limit λ → ∞. For the case of decimation an exact calculation using the transfer matrix approach yields an effective Hamiltonian of the same nearest-neighbour form, albeit with a renormalized coupling constant [36,37]:

K' = arctanh[tanh^2(K)].    (22)

For the majority rule, and any other choice of parameters Λ, a perturbative expansion of the effective Hamiltonian can be obtained via the cumulant expansion Eq. (16), as discussed in Sec. IV. The cumulants can be expressed in terms of averages of the form ⟨K_1[X]^k⟩_{Λ,0}, which factorize into averages of operators from a single block. In our example the non-vanishing single-block averages are ⟨v_1⟩_{Λ,0}, ⟨v_2⟩_{Λ,0} and ⟨v_1 v_2⟩_{Λ,0}, as shown in Appendix C 1.
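The decimation recursion Eq. (22) can be iterated numerically; the equivalent closed form K' = ½ log cosh(2K) is the standard 1D Ising result. A minimal sketch of the resulting RG flow:

```python
import math

def renorm(K):
    # exact 1D Ising decimation for blocks of two: tanh K' = tanh(K)^2
    return math.atanh(math.tanh(K) ** 2)

# equivalent closed form K' = (1/2) log cosh(2K)
assert abs(renorm(0.7) - 0.5 * math.log(math.cosh(2 * 0.7))) < 1e-12

# iterate the flow from an illustrative starting coupling
K, flow = 1.0, [1.0]
for _ in range(6):
    K = renorm(K)
    flow.append(K)
# the coupling flows to the trivial K = 0 fixed point:
# no finite-temperature phase transition in 1D
assert flow[-1] < flow[0]
```

The RSMI-favoured solution discussed below reproduces this recursion perturbatively, order by order in the cumulant expansion.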
The binary nature of h and the Z_2 symmetry of the Ising model imply

⟨v_i⟩_{Λ,0} = a_i h  (i = 1, 2),   ⟨v_1 v_2⟩_{Λ,0} = b,    (24)

with the effective block-parameters a_1, a_2, b independent of the coarse-grained variable h and functions of Λ and K only, whose closed-form expressions can easily be found (see Appendix D). Consequently, the averages ⟨K_1^k⟩_{Λ,0}, and thus also the Hamiltonian K', are polynomials in the new degrees of freedom X', the reduced temperature K and the block parameters, which gives rise to Eq. (17). In practice the cumulant expansion is terminated at a finite order M, which results in an expansion of K' and thus of each coupling constant K_{α_1,α_2,…,α_n} up to that order in K. All the information about the RG rule (except for the size of H, which is fixed at the outset) is contained in the dependence of the effective block-parameters on Λ (and on N, K).
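The block parameters and their symmetry structure can be checked by brute force over the four configurations of a two-spin block. The sketch below assumes the intra-block measure P_{Λ,0}(V|h) ∝ P_Λ(h|V) e^{K v_1 v_2}; the values of K and Λ are illustrative:

```python
import itertools, math

K = 0.3            # illustrative reduced coupling
lam = (1.0, 0.4)   # generic (asymmetric) RBM parameters

def p_h(h, v):
    # RBM coarse-graining rule, Eq. (19)
    a = lam[0] * v[0] + lam[1] * v[1]
    return math.exp(h * a) / (2 * math.cosh(a))

def block_avgs(h):
    # intra-block measure P_{Lambda,0}(V|h) ~ P_Lambda(h|V) e^{K v1 v2}
    ws = {v: p_h(h, v) * math.exp(K * v[0] * v[1])
          for v in itertools.product([-1, 1], repeat=2)}
    Z = sum(ws.values())
    av = lambda f: sum(f(v) * w for v, w in ws.items()) / Z
    return av(lambda v: v[0]), av(lambda v: v[1]), av(lambda v: v[0] * v[1])

v1p, v2p, bp = block_avgs(+1)
v1m, v2m, bm = block_avgs(-1)
# Z2 symmetry: <v_i> = a_i h (odd in h), <v1 v2> = b (independent of h)
assert abs(v1p + v1m) < 1e-12 and abs(v2p + v2m) < 1e-12
assert abs(bp - bm) < 1e-12
a1, a2, b = v1p, v2p, bp
```

Closed-form expressions for a_1, a_2, b as functions of Λ and K are given in Appendix D; the brute-force version is only meant to make the symmetry argument explicit.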
Expressing the moments ⟨K_1^k⟩_{Λ,0} appearing in the cumulant expansion in terms of the new variables X' is a combinatorial problem. Each term in K_1 couples spins from neighbouring blocks j and j + 1, so that:

⟨K_1^k⟩_{Λ,0} = K^k Σ_{j_1,…,j_k} ⟨ (x_{2j_1} x_{2j_1+1}) ⋯ (x_{2j_k} x_{2j_k+1}) ⟩_{Λ,0},

where x_{2j} and x_{2j+1} denote the boundary spins of blocks j and j + 1. The average of each summand factorizes into contributions from each block, whose value [see Eq. (24)] is determined by the arrangement of j_1, …, j_k. Thus, the calculation is reduced to finding and grouping all equivalent (using the fact that for Ising variables x_j^2 = 1) configurations (j_1, …, j_k). Bringing the resulting polynomial into the canonical form (17) is an inverse problem and is solved by recursively eliminating non-canonical terms. For a given M we can thus finally arrive at expressions for the coupling constants K_{α_1,α_2,…,α_n} as functions of Λ and K (see Appendix D for details).
We are now in a position to examine the effective Hamiltonian obtained by applying the RSMI-maximization procedure [Fig. 1] to the model Eq. (18). Anticipating the results in Fig. 4, in Fig. 3 we compare, for varying K and order of cumulant expansion M, the renormalized nearest-neighbour (NN) coupling obtained in the RSMI-favoured solution with the exact, nonperturbative one, Eq. (22) [which we refer to as "exact decimation"]. The two results converge with increasing M, and the convergence is faster for weak coupling/higher temperatures, which is unsurprising since the cumulant expansion is in powers of K. We emphasize again that the RSMI algorithm itself works on the level of the probability measure, and at no point does it compute the effective Hamiltonian. It is only when we want to examine the renormalized Hamiltonian to which the converged RSMI solution (converged in the sense of saturating the mutual information during optimization of the Λ parameters) corresponds, that we perform the cumulant expansion.
Since "exact decimation" leads to a strictly NN effective Hamiltonian in the 1D Ising case, and since perturbatively the RSMI-favoured solution converges to the decimation value for the NN coupling, it is instructive to inspect the behaviour of the m-body couplings in the effective Hamiltonian for larger order m. Denoting the m-spin coupling with distances ℓ_1, ℓ_2, …, ℓ_m between the spins by K_m(ℓ_1, ℓ_2, …, ℓ_m), with K_m(ℓ) short for K_m(ℓ, ℓ, …, ℓ), we observe that, in the limit of weak coupling (small K), both ℓ ↦ K_2(ℓ), i.e. arbitrary-range two-body interactions, as well as m ↦ |K_m(1)|, i.e. arbitrary-order NN interactions, decay exponentially. This is shown in Figs. (8) and (9) in Appendix D. The decay is characterized by the ratios K_2(2)/K_2(1) and K_m(1)/K_2(1), respectively. Thus, the RSMI approach indeed converges to the "exact decimation" in this case, which is known to be the optimal choice.
To further strengthen the link between the amount of RSMI retained and the resulting properties of the effective Hamiltonian we now consider a generic coarse-graining, suboptimal from the RSMI perspective (i.e. away from the maximum the RSMI algorithm strives for). To this end we compute the mutual information I_Λ(H : E) captured for the Ising model by a general coarse-graining rule Eq. (19) with parameters Λ = (λ_1, λ_2). This calculation can be performed exactly using the transfer matrix method. Defining the correlator G(n) := tanh(K)^n, the binary entropy

H_b(p) := −p log p − (1 − p) log(1 − p),

and the probability distribution p(e_L, e_R) of the closest environment spins on either side of the buffered block, the result can be written as

I_Λ(H : E) = H(H) − Σ_{e_L, e_R} p(e_L, e_R) H_b( P_Λ(h = 1 | e_L, e_R) ),    (28)

where the dependence on the buffer size L_B enters through correlators G over distances spanning the block and the buffers, and there is no dependence on the environment size L_E (see Appendix D 4). Equipped with the result Eq. (28), for an arbitrary coarse-graining defined by a choice of Λ, we can now compute both the amount of mutual information with the environment retained (RSMI), as well as the effective Hamiltonian generated. In Fig. (4a) the amount of information captured is shown as a function of (λ_1, λ_2), in units of I(V : E) (for concreteness, all plots are for K = 0.1 and a single-site buffer: L_B = 1). A few observations can be made: the choices of Λ retaining more RSMI are not symmetric in |λ_1| and |λ_2|, but instead tend to (±λ, 0) and (0, ±λ) for large enough |λ|, i.e. they resemble decimation Eq. (20) [the four plateaux in Fig. (4) are not exactly flat, as also examined in Fig. (5)], as opposed to the majority rule Eq. (21) which, in fact, captures the least information. The symmetries of the plot are due to the global Z_2 Ising symmetry as well as an additional Z_2 symmetry of the mutual information: correlation and anti-correlation of random variables are equivalent from the point of view of information. Furthermore, the lack of information retained for small ||Λ||_2 is due to the fact that in this case the coarse-graining Eq. (19) only weakly depends on the visible spins and essentially randomly assigns the value of the hidden spin (i.e. it is dominated by random noise). In other words, it only makes sense to think of Eq. (19) as a coarse-graining if it depends strongly on the original spins, i.e. for large ||Λ||_2.
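For small systems these observations can be checked without the transfer-matrix machinery, by enumerating all configurations. The sketch below uses illustrative sizes (and K = 0.3 rather than 0.1, to make the mutual information less tiny) and compares a decimation-like with a majority-like choice of Λ:

```python
import itertools, math

K, N = 0.3, 8
V_sites, E_sites = (0, 1), (3, 4, 5, 6)  # single-site buffers at sites 2 and 7

states = list(itertools.product([-1, 1], repeat=N))
w = {x: math.exp(K * sum(x[i] * x[(i + 1) % N] for i in range(N))) for x in states}
Z = sum(w.values())

def rsmi(lam):
    # I_Lambda(H:E) for the RBM rule of Eq. (19), by brute-force enumeration
    joint = {}
    for x, wx in w.items():
        a = lam[0] * x[V_sites[0]] + lam[1] * x[V_sites[1]]
        e = tuple(x[i] for i in E_sites)
        for h in (-1, 1):
            p = (wx / Z) * math.exp(h * a) / (2 * math.cosh(a))
            joint[h, e] = joint.get((h, e), 0.0) + p
    ph = {h: sum(p for (hh, e), p in joint.items() if hh == h) for h in (-1, 1)}
    pe = {}
    for (h, e), p in joint.items():
        pe[e] = pe.get(e, 0.0) + p
    return sum(p * math.log2(p / (ph[h] * pe[e]))
               for (h, e), p in joint.items() if p > 0)

dec, maj = rsmi((3.0, 0.0)), rsmi((3.0, 3.0))
# decimation-like parameters retain more RSMI than majority-like ones in 1D
assert dec > maj > 0
```

The majority-like rule loses information whenever v_1 = −v_2, since the hidden spin is then assigned at random, consistent with the plateaux structure described above.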
The properties of the corresponding effective Hamiltonians can be understood with the help of Figs. (4b) and (4c), where the ratio of next-nearest-neighbour (NNN) to NN terms as well as the ratio of NN 4-body to 2-body terms in the effective Hamiltonian are plotted as a function of Λ (note the inverted color scale!). It is apparent that decimation-like choices, which maximize RSMI, result also in vanishing NNN and 4-body terms (and more generally long-range or high-order terms, as discussed previously and shown in Figs. (8) and (9) in Appendix D). This is examined in more detail in Fig. 5: trajectories in the parameter space Λ are chosen according to λ(cos(θ), sin(θ)) with θ ∈ [0, π], for different magnitudes |λ|. The ratios in Figs. (4b) and (4c), which we dubbed "rangeness" and "m-bodyness" for brevity, are plotted against the mutual information along the trajectories.

FIG. 5. The two proxy measures of complexity of the renormalized Hamiltonian discussed in the text are shown against mutual information retained: the "rangeness", i.e. the ratio of the NNN to the NN coupling constants, and the "m-bodyness", i.e. the ratio of the NN four-point to two-point coupling constants. The mutual information is scaled to the total mutual information the block V shares with the environment. The curves are obtained by parametrizing the RG rule as λ(cos(θ), sin(θ)) and varying θ ∈ [0, π] for different magnitudes of λ. In the physically relevant limit of large λ the maximum of mutual information corresponds to a minimum of "rangeness" and "m-bodyness". The plots are discussed in more detail in Appendix D 3.

The mutual information is maximized for θ = 0 and θ = π and the maximum increases with λ (though it saturates: there is little difference between λ = 3 and λ = 1000). Simultaneously, for large enough |λ| both ratios in Figs. (5b,c) vanish, rendering the effective Hamiltonian two-body and nearest-neighbour. It is now clear how the RSMI maximization results in a decimation coarse-graining for the 1D Ising model. A more detailed discussion of Figs. (4,5) [including asymmetries in Fig. (5a) and accidental vanishings in Fig. (5b)] can be found in Appendix D, but it does not change the general picture: maximizing RSMI results in decay of longer-ranged and higher-order terms in the Hamiltonian.
The superiority of decimation over the majority rule in our example can be understood intuitively from a physical perspective by considering fluctuations of the original (visible) spins for a fixed (clamped) configuration of the new variables X'. In 1D, decimation fixes every other spin in X, which prevents all but isolated fluctuations of the remaining degrees of freedom, which are being integrated out in the clamped averages of Eqs. (13,14). Consequently, only nearest neighbours in X' are coupled in the effective Hamiltonian. In contrast, the majority rule fixes a linear combination of the visibles (the average), thereby allowing fluctuations of orthogonal linear combinations. These fluctuations can span multiple blocks and thus generate higher-order coupling terms. In the following section an alternative, information-theory-based intuition is offered, which also explains the difference between the optimal coarse-graining procedures in 1D and 2D.
Finally, we note that the results described above from a static perspective, i.e. considering the properties of an arbitrary coarse-graining for a fixed, potentially suboptimal choice of Λ, can also be interpreted dynamically. In this sense they would characterize the convergence of the RSMI algorithm of Ref. [28] as the Λ parameters are iteratively optimized during the training [see Fig.(1)].

VI. THE "SHAPE" OF THE COARSE-GRAINED VARIABLES
So far we have motivated on physical grounds (the properties of the effective Hamiltonian) why maximizing RSMI generally provides a guiding principle for constructing a real-space RG procedure. We then investigated, using the example of the 1D Ising system, the properties of such a scheme in a typical situation, when the RSMI maximization problem is additionally constrained by the number and type of degrees of freedom the system is coarse-grained into. In particular, we gave physical intuitions which justify the solution RSMI converges to in the 1D case, i.e. decimation. This is to be contrasted with the situation in 2D, where the decimation procedure is known to immediately generate long-range and many-spin interactions and can be shown not to possess a nontrivial fixed point at all [32]. For the square-lattice Ising model in two dimensions the majority rule transformation is preferable: numerical evidence, at least, points to the existence of a fixed point [38]. Remarkably, the RSMI solution in 2D converges (numerically) towards a majority-rule block transformation (for 2-by-2 blocks) [28]. In this section we provide an information-theory-based explanation of these observations. In doing so we also elucidate and quantify the non-trivial influence on the RG scheme of the constraints imposed by the properties (type and number) of the new coarse-grained variables, for the general case. Finally, we exemplify our findings using simple and intuitive toy models.
To this end let us revisit the inequality Eq.(8). We refine it by explicitly introducing the random variables V_Λ, which the hidden degrees of freedom h_i ∈ H couple to in an RG scheme parametrized by Λ = {λ_ij}. For instance, in the RBM parametrization discussed previously, while generically H depends on the full V, the coarse-graining defined by the conditional probability P_Λ(H|V) only makes each h_i ∈ H dependent on the combination:

V_Λi = Σ_j λ_ij v_j.    (29)

Note that the overall normalization in the definition is not important, only the relative strengths of the λ_ij, which define the linear combination of degrees of freedom in the block. The following now holds:

I_Λ(H : E) ≤ I(V_Λ : E) ≤ I(V : E),    (30)

that is: the information about the environment carried by the particular chosen variables V_Λ is potentially smaller than the overall information about the environment contained in the block, I(V : E). Still less of this information may ultimately be encoded in the degrees of freedom H.
Where do the inequalities Eq.(30) originate from? Formally, because we have a Markov chain:

E → V → V_Λ → H,    (31)

but the more pertinent question is what can make those inequalities sharp. The second one is rather trivial: if we decide to keep only a few (one, as is often the case) variables V_Λi, then their entropy may simply be too small to store the full information I(V : E). Still, for the same entropy, there may be choices of Λ which result in bigger or smaller I(V_Λi : E). Crucially though, I(V_Λi : E) does not depend on the nature of h_i ∈ H (i.e. on whether h_i is a binary variable or not, for instance). It only characterizes how good the particular set of physical degrees of freedom V_Λ is at describing fluctuations in the environment E. Whether this information can be efficiently encoded in H is a different question entirely. The answer, and the origin of the first inequality in Eq.(30), is revealed by:

I_Λ(H : E) = I(V_Λ : E) − I(V_Λ : E|H),    (32)

where I(V_Λ : E|H) is the conditional mutual information and we have used the chain rule and the Markov property Eq.(31). Since I(V_Λ : E) is independent of H in the sense described above, I(V_Λ : E|H) quantifies the failure of the encoding into H due to the properties of H itself (conditional mutual information being always non-negative).
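The decomposition above can be checked numerically on any toy Markov chain. The following sketch (a randomly generated chain E → V_Λ → H; the alphabet sizes and distributions are illustrative choices, not taken from the model in the text) verifies that I_Λ(H : E) = I(V_Λ : E) − I(V_Λ : E|H):

```python
import itertools
import math
import random

random.seed(0)

def normalized(xs):
    s = sum(xs)
    return [x / s for x in xs]

def mutual_info(pxy):
    """I(X:Y) in nats from a joint distribution {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), v in pxy.items():
        px[x] = px.get(x, 0.0) + v
        py[y] = py.get(y, 0.0) + v
    return sum(v * math.log(v / (px[x] * py[y]))
               for (x, y), v in pxy.items() if v > 0)

# Random Markov chain E -> V_Lambda -> H (sizes are arbitrary toy choices)
nE, nW, nH = 3, 3, 2
pE = normalized([random.random() for _ in range(nE)])
pW_E = [normalized([random.random() for _ in range(nW)]) for _ in range(nE)]
pH_W = [normalized([random.random() for _ in range(nH)]) for _ in range(nW)]

p = {(e, w, h): pE[e] * pW_E[e][w] * pH_W[w][h]
     for e, w, h in itertools.product(range(nE), range(nW), range(nH))}

I_HE = mutual_info({(e, h): sum(p[(e, w, h)] for w in range(nW))
                    for e in range(nE) for h in range(nH)})
I_WE = mutual_info({(e, w): sum(p[(e, w, h)] for h in range(nH))
                    for e in range(nE) for w in range(nW)})

# Conditional mutual information I(V_Lambda : E | H)
pH = {h: sum(p[(e, w, h)] for e in range(nE) for w in range(nW))
      for h in range(nH)}
I_WE_H = sum(pH[h] * mutual_info({(e, w): p[(e, w, h)] / pH[h]
                                  for e in range(nE) for w in range(nW)})
             for h in range(nH))
```

For any distributions with this chain structure, the identity holds exactly; the non-negativity of I_WE_H is what makes the first inequality of Eq.(30) an inequality.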
We have thus managed to identify the contributions to I_Λ(H : E) resulting from coupling to a certain choice of physical modes in V, and to isolate them from the losses incurred due to the impossibility of encoding this information perfectly in a particular type of H. The conditional mutual information I(V_Λ : E|H) can be thought of as describing the mismatch of the probability spaces of the random variables H and V_Λ: it tells us how much information is still shared between V_Λ and E after V_Λ has been restricted to only values compatible with a given outcome of H. For example, in the 1D Ising case we examined previously, the majority rule defines V_Λ = v_1 + v_2, for which the set of possible outcomes is equivalent to {−1, 0, 1}. The entropy of V_Λ is bounded by, and possibly equal to, log_2(3). Since the system is Z_2 symmetric, unless Prob[V_Λ = 0] = 0 this cannot be faithfully encoded into any probability distribution of a single binary variable H. Below we construct simple toy models to provide more examples and intuitions for the somewhat abstract notions we introduced here.
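The counting argument above can be made concrete in a few lines. The sketch below assumes a minimal Z_2-symmetric distribution over a single pair of spins with Boltzmann weight e^{K v_1 v_2} (an illustrative choice, not the full model of the text), and computes the entropy of V_Λ = v_1 + v_2, which exceeds the one bit a single binary H can hold:

```python
import math

def h_VLambda_bits(K):
    """Entropy (bits) of V_Lambda = v1 + v2 for a Z2-symmetric spin pair
    with Boltzmann weight exp(K * v1 * v2). Toy distribution for illustration."""
    w_aligned = math.exp(K)   # weight of (+,+) and (-,-)
    w_anti = math.exp(-K)     # weight of (+,-) and (-,+)
    Z = 2 * w_aligned + 2 * w_anti
    # outcomes V_Lambda = -2, 0, +2
    probs = [w_aligned / Z, 2 * w_anti / Z, w_aligned / Z]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# At K = 0 the outcomes occur with probabilities 1/4, 1/2, 1/4 -> 1.5 bits,
# already more than the 1 bit a single binary variable H can store.
print(h_VLambda_bits(0.0))  # 1.5
```

Only in the strong-coupling limit, where Prob[V_Λ = 0] → 0, does the entropy drop to one bit and the mismatch disappear.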
First, let us stress, though, that the RSMI prescription maximizes I_Λ(H : E) as a whole, and that, for a type of H fixed at the outset, the procedure cannot be split into maximization of I(V_Λ : E) followed by a linear coupling of H to the V_Λ found. Such a naive greedy approach does not necessarily lead to an optimal solution; the toy models below provide an explicit counterexample. The RSMI-based solution of P_Λ(H|V) thus converges to the optimal trade-off between finding the best modes in V to describe E, and finding those whose description can be faithfully written in H of a given type.
To illustrate the above considerations we construct minimal toy models. In 1D this consists of four coupled Ising spins: v_1, v_2 in the block V, and e_1, e_2 representing the left and right environments (in 1D the environment is not simply connected), with the Hamiltonian:

K[V, E] = K_V v_1 v_2 + K_VE (e_1 v_1 + v_2 e_2),    (33)

where, as before, the coupling constants contain a factor of β = 1/k_B T. The two spins in V are coupled to a single hidden spin H using the RBM ansatz Eq.(19), and the random variable V_Λ is defined as in Eq.(29). In Fig.(6) the results of the calculation of the mutual informations I_Λ(H : E) and I(V_Λ : E) for decimation and the majority rule are shown. In the regime of strong coupling to the environment K_VE [see Fig.(6a)], for small K_V both visible spins are nearly independent and almost copy the state of the left and right environments, respectively. Consequently, V_Λ for the majority rule carries almost log_2(3) bits of information about the environment, while V_Λ for decimation, being a binary variable, carries at most one bit. However, when I_Λ(H : E) is examined it becomes apparent that for decimation it is exactly equal to I(V_Λ : E), while for the majority rule it is significantly lower, so much so that overall decimation is better across the whole parameter regime! The difference between the solid and dashed curves in Fig.(6a) is precisely the mismatch I(V_Λ : E|H) that was mentioned previously.

[Fig. 7 caption: Mutual information for the model Eq.(34), as a function of the coupling K_V between the visibles. The coupling pattern to the visibles in different RG rules Λ is shown schematically in the (physically relevant) large ||Λ||_2 limit. The majority rule (and interestingly, also coupling to three spins, depicted by a red line, coinciding with the purple one) consistently retains more information than decimation, or any coupling to two spins (blue and yellow, coinciding with green). Again, the distinction vanishes for large K_V when all visible spins are bound into a single one.]

In the large K_V limit both spins in V become bound into an effective single binary variable and the distinction between the two rules vanishes. In Fig.(6b) we show the same in the regime when the spins in V are only weakly coupled to the environment (or the temperature is high). Again, decimation perfectly encodes the information I(V_Λ : E) into H and is overall better. Let us contrast this with the situation in higher (in particular: two) dimensions, when the environment is simply connected. Based on the discussion above, we may anticipate that the optimal solution could be different, and that the majority rule may instead be preferable. This is because, on the one hand, for the same coupling strength to the environment and the same linear dimension L_V = 2 of the block, the ratio of I(V_Λ : E) for the majority rule to the one for decimation increases with increasing dimension (a consequence of all visible spins interacting with the same environment). On the other hand, the mismatch I(V_Λ : E|H) for the majority rule decreases, compared to 1D, since the probability of V_Λ = Σ_i v_i being zero is smaller. This fact is due both to dimensional considerations and to (again) the environment being simply connected, the importance of which, even in 1D, we illustrate in Appendix E.
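The 1D toy-model comparison can be reproduced by brute-force enumeration. The sketch below assumes the four-spin Hamiltonian in the reconstructed form given above, K[V, E] = K_V v_1 v_2 + K_VE (e_1 v_1 + v_2 e_2), together with the RBM conditional of Eq.(C6); the particular parameter values are illustrative:

```python
import itertools
import math

def joint(K_V, K_VE, lam):
    """Joint P(e1, v1, v2, e2, h) for the four-spin chain e1-v1-v2-e2,
    with an RBM hidden spin h coupled to (v1, v2) via weights lam.
    Hamiltonian: K[V,E] = K_V*v1*v2 + K_VE*(e1*v1 + v2*e2) (toy model)."""
    p, Z = {}, 0.0
    for e1, v1, v2, e2, h in itertools.product([-1, 1], repeat=5):
        theta = lam[0] * v1 + lam[1] * v2
        w = math.exp(K_V * v1 * v2 + K_VE * (e1 * v1 + v2 * e2))
        w *= math.exp(h * theta) / (2 * math.cosh(theta))  # RBM conditional
        p[(e1, v1, v2, e2, h)] = w
        Z += w
    return {k: v / Z for k, v in p.items()}

def mi_h_env(p):
    """I(H : E) with E = (e1, e2), in nats, by direct summation."""
    pHE, pH, pE = {}, {}, {}
    for (e1, v1, v2, e2, h), v in p.items():
        pHE[(e1, e2, h)] = pHE.get((e1, e2, h), 0.0) + v
        pH[h] = pH.get(h, 0.0) + v
        pE[(e1, e2)] = pE.get((e1, e2), 0.0) + v
    return sum(v * math.log(v / (pE[(e1, e2)] * pH[h]))
               for (e1, e2, h), v in pHE.items() if v > 0)

# Illustrative parameters: strong coupling to the environment, large |Lambda|
K_V, K_VE, lam = 0.3, 1.0, 5.0
I_dec = mi_h_env(joint(K_V, K_VE, (lam, 0.0)))  # decimation-like rule
I_maj = mi_h_env(joint(K_V, K_VE, (lam, lam)))  # majority-like rule
```

In this regime the decimation-like rule retains at least as much information as the majority-like one, in line with the 1D discussion above.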
We verify those expectations using a simple toy model of the 2D setting: the environment is represented by a single random variable E with a large number of states, to which all the spins in V couple. These states should be thought of, intuitively, as fluctuations of some large environment at wavelengths longer than the size of the coarse-graining cell. The Hamiltonian is given by Eq.(34). As before, the spins in block V are coupled to a single hidden spin H with an RBM ansatz parametrized by Λ. In Fig.(7) the mutual information I_Λ(H : E) is computed for the model Eq.(34) for different coarse-graining rules given by Λ. Indeed, decimation is now inferior to the majority rule across the full parameter range. This is also consistent with the known properties of decimation and the majority rule for the 2D Ising model, and suggests their information-theoretic origin.

VII. CONCLUSIONS AND OUTLOOK
In this work we have investigated the properties of a real-space RG procedure based on variational maximization of the real-space mutual information (RSMI) [28]. We have shown that such a procedure does not generically increase the range of interactions of an arbitrary, initially short-ranged Hamiltonian in any dimension, provided all the relevant information is retained. We then relaxed this restriction and studied the effect on the optimal solution of the constraints imposed by a priori assumptions about the type of coarse-grained variables. We considered a detailed example: for the case of the 1D Ising model we explicitly calculated the mutual information retained by an arbitrary coarse-graining and we perturbatively derived the corresponding effective coarse-grained Hamiltonian. We showed the decay of longer-ranged and many-body interactions with increasing RSMI, and ultimate convergence to the theoretically best solution, a nearest-neighbour Hamiltonian. We also constructed intuitive toy models highlighting the differences between preferred solutions in one and higher dimensions, explaining the numerical findings on the 2D Ising model in Ref. [28].
The above results provide a formal underpinning for the physical intuition behind the RSMI maximization: the effective long-wavelength description of the system is simple in terms of the degrees of freedom which carry the most information about its large-scale behaviour. They motivate us to postulate that the optimal RG coarse-graining procedure for a given statistical physical system is the one defined by maximizing the real-space mutual information with the environment. It has the advantage of being a model-independent variational principle, and it results in operationally desirable properties (a tractable Hamiltonian).
This information-theoretic perspective may prove very useful, both conceptually and practically. On the one hand it reinforces the importance of information in physical systems, which, somewhat paradoxically given the history of the subject, is better appreciated in the quantum setting. This is especially pertinent to the RG programme, which very naturally can be thought of as a lossy compression scheme analogous to Ref. [39], and could possibly offer a way around the known analytical problems plaguing real-space procedures [32,33]. On the other hand, especially when combined with modern numerical optimization or machine learning techniques, it could allow for the extension and practical application of the RG toolbox to systems which have so far been beyond reach.
Consequently, a number of distinct further research directions are possible, and called for. On the more formal part of the spectrum, a mathematically rigorous investigation of the probability measure defined by the RSMI coarse-graining, in the spirit of Refs. [32,33,40], is desirable. Conceptually, an interesting question is whether the type and number of coarse-grained variables can also be variationally optimized (as opposed to being chosen at the outset, as is usually the case) to retain most of the mutual information I(V_Λ : E) studied in Sec. VI. This would have the interpretation of "discovering" whether the best variables to describe a system, originally given in terms of, say, Ising spins, are the same, or whether some emergent degrees of freedom are preferable (see also Refs. [41][42][43]). It is ultimately related to the question of a better physical understanding of the optimal variables in the block, which the hidden spins couple to. More practically, the results invite the application of the RSMI method to the study of further physical systems, in particular ones with more exotic phase transitions or disordered systems, where real-space RG is indispensable [44]. To this end a more numerical efficiency/stability-oriented investigation of the ansatz itself, as well as of the optimization procedure used, would be beneficial, further leveraging the recent progress in machine learning algorithms.
Finally, another obvious frontier is the quantum case, where one can try to gradually coarse-grain the ground state wave function. Indeed, information bottleneck approaches [39] have been recently extended to the quantum realm [45]. In this setting the conditional probability we used becomes a quantum channel. It would be interesting to explore how quantum physics manifests itself in the properties of these optimal channels, and to see whether this approach can be combined with MERA to facilitate its interpretation and practical implementation.

Appendix B

The lattice is divided into hyperplanes X_i = ∪_{j∈J_i} V_j, so that X = ∪_i X_i. Thus, in terms of the hyperplanes we end up with a quasi-one-dimensional structure. Let us choose an arbitrary hyperplane X_0, denote its immediate neighbours X_{±1} as the buffer B, and the union of the remaining hyperplanes X_{<−1} and X_{>1} as the environment E_0(X_0), or, in more detail, as the left- and right-environments E_{L/R}(X_0), respectively.
Assume now that the coarse-grained variables H_j for the blocks V_j in X_0 are constructed in such a way that I(𝒳_0 : E_0) = I(X_0 : E_0), where 𝒳_0 = ∪_{j∈J_0} H_j denotes the coarse-grained hyperplane. This is the full information capture condition for the hyperplane, generalizing the condition for the single block in 1D (note though, that we still optimize variables H_j for each block, and not some new collective hidden variables for the entire hyperplanes). Strictly speaking, this requires an additional assumption (compared to 1D): it is equivalent to assuming I(H_j : E_0) = I(V_j : E_0) separately for each individual block in the hyperplane X_0. This seems reasonable for a short-ranged Hamiltonian, at least in the isotropic case. Under those assumptions the probability measure on the coarse-grained variables P(𝒳), with X_{±1} integrated out and 𝒳_0 clamped, factorizes. To show that, first note that from the full information capture assumption it follows that:

I(X_0 : E_0 | 𝒳_0) = I(X_0 𝒳_0 : E_0) − I(𝒳_0 : E_0) = I(X_0 : E_0) − I(𝒳_0 : E_0) = 0,    (B2)

where the first equality is the chain rule for mutual information and the second is due to the fact that the coarse-grained variables 𝒳_0 are a function of X_0 only. Vanishing of the mutual information is equivalent to the probability distribution factorizing, and therefore:

P(X_0, E_L, E_R | 𝒳_0) = P(X_0 | 𝒳_0) P(E_L, E_R | 𝒳_0).    (B3)

Locality implies that I(E_L : E_R | X_0) = 0 and thus:

P(E_L, E_R | X_0) = P(E_L | X_0) P(E_R | X_0).    (B4)

Comparing Eqs.(B3) and (B4), we find that:

P(E_L | X_0) P(E_R | X_0) P(X_0 | 𝒳_0) = P(E_L, E_R | 𝒳_0) P(X_0 | 𝒳_0).    (B5)

For a given 𝒳_0 let us denote the set of X_0 such that P(X_0 | 𝒳_0) ≠ 0 by {X_0(𝒳_0)}. For all such "compatible" X_0 ∈ {X_0(𝒳_0)} we can divide by P(X_0 | 𝒳_0) and obtain:

P(E_L, E_R | 𝒳_0) = P(E_L | X_0) P(E_R | X_0).    (B6)

Crucially, the left hand side does not depend on X_0, and so long as X_0 ∈ {X_0(𝒳_0)} the equality holds and the conditional probability factorizes independently of the particular X_0. We can be a bit more careful in this argument:

P(E_L, E_R | 𝒳_0) = Σ_{X_0} P(E_L, E_R | X_0) P(X_0 | 𝒳_0)
                  = Σ_{X_0 ∈ {X_0(𝒳_0)}} P(E_L | X_0) P(E_R | X_0) P(X_0 | 𝒳_0)
                  = P(E_L | X_0) P(E_R | X_0),    (B7)

where we used Eq.(B4) in the second equality and Eq.(B6) to take the X_0-independent product out from under the restricted summation in the third.
The last line holds for any X_0 ∈ {X_0(𝒳_0)}, and we have thus constructed an explicit factorization of P(E_L, E_R | 𝒳_0), which implies:

P(E_{L/R} | 𝒳_0) = P(E_{L/R} | X_0),    (B8)

and hence we can simply write:

P(E_L, E_R | 𝒳_0) = P(E_L | 𝒳_0) P(E_R | 𝒳_0).    (B9)

The factorization of the conditional probability Eq.(B9) under the full information capture assumption is the key element in showing Eq.(B1). Let us then consider the coarse-grained probability measure defined by Eqs. (3) and (5): Denoting the product of the block conditional probability distributions in the hyperplanes by P(𝒳|X) and integrating out X_{±1} we have: (B11) Using the definition of conditional probability and the fact that 𝒳_0 only directly depends on X_0 we have: which allows us to write: where to obtain Eq.(B14) we conditioned on 𝒳_0 and used the full information capture assumption, to obtain Eq.(B15) we used the factorization Eq.(B9), and in the last line we performed the summation over X_0. Equation (B15) shows that for a fixed (clamped) 𝒳_0 the probability P(𝒳_{j≤−2}, 𝒳_0, 𝒳_{j≥2}) factorizes into a product over left and right environments. As described in the main text, together with the arbitrariness of the choice of the hyperplane this implies (barring a pathological fine-tuned scenario in which integration over X_{±1} exactly cancels all pre-existing NNN couplings) that the effective Hamiltonian in terms of the new variables is still nearest-neighbour (in all directions).
where the probability over which we average is: In particular, the factorization holds for the operators K_1[X]^k that appear in the expressions for the cumulants.
In the RBM ansatz the joint probability of the visible and hidden degrees of freedom is approximated by a Boltzmann distribution:

P_Λ(V, H) = Z_Λ^{−1} e^{−E_Λ(V, H)},

with a quadratic energy function:

E_Λ(V, H) = −Σ_{i,j} λ_j^i v_i h_j − Σ_i α_i v_i − Σ_j β_j h_j,

where v_i ∈ V, h_j ∈ H and Λ collectively denotes the set of parameters {λ_j^i}_{i,j}, {α_i}_i and {β_j}_j, which are to be variationally optimized so that the P_Λ(V, H) they define is as close as possible to the target distribution P(V, H). Note that the energy function only couples the visible to the hidden degrees of freedom, and includes no couplings within the visible or the hidden sets. This peculiarity (which the word "restricted" in RBM refers to) is crucial to the existence of fast algorithms [58] for training and sampling from the trained distribution P_Λ(V, H).
The conditional probability is then given by:

P_Λ(H|V) = P_Λ(V, H) / Σ_H P_Λ(V, H).

It is easy to see that the parameters {α_i}_i drop out in P_Λ(H|V). Additionally, because of the Ising Z_2 symmetry the bias (magnetic field) term for h_j is not allowed: β_j = 0 for all j. Due to the absence of interactions between hiddens, the expression factorizes and the summation over H is trivial. In the case of a 1D system and a single hidden spin H = {h} the conditional probability is then given explicitly by:

P_Λ(h|V) = e^{h Σ_i λ_i v_i} / [2 cosh(Σ_i λ_i v_i)],    (C6)

with Λ = {λ_i}_i. The choice of the parameters defines the RG rule. It is intuitively clear that while one could, in principle, consider any choice of Λ, the physically meaningful choices correspond to the limit ||Λ||_2 → ∞, i.e. when the value of h actually strongly depends on V. In that limit Eq.(C6) becomes a Heaviside function. This is also what happens in practice during the RSMI training (see Supplemental Materials in Ref. [28]). Thus the virtue of the RBM ansatz is twofold: first, it provides an efficient tool from the algorithmic perspective of the RSMI implementation, and second, it provides a well-behaved, differentiable analytical ansatz, which we use to explicitly calculate the quantities of interest. We emphasize, though, that conceptually the RBM ansatz is not essential to the RSMI approach. Any other parametrization of P_Λ(H, V) can also be used, at the expense of having to devise efficient algorithms to fix the parameters of this new ansatz.
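The behaviour of the conditional of Eq.(C6), including its Heaviside limit for large ||Λ||_2, can be sketched directly (the spin configuration and weights below are illustrative choices):

```python
import math

def p_h_given_v(h, v, lam):
    """Eq.(C6): P_Lambda(h|V) = exp(h * theta) / (2 cosh(theta)),
    with theta = sum_i lam_i * v_i."""
    theta = sum(l * vi for l, vi in zip(lam, v))
    return math.exp(h * theta) / (2 * math.cosh(theta))

v = (1, -1)
p_small = p_h_given_v(+1, v, (0.1, 0.05))  # weak coupling: h nearly random, ~1/2
p_large = p_h_given_v(+1, v, (50.0, 0.0))  # decimation-like, large ||Lambda||_2: h locks to v1
```

The first value is close to 1/2 (the rule is almost independent of V), the second is essentially 1: in the large-||Λ||_2 limit the rule becomes a deterministic sign function.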

Appendix D: The 1D Ising model
For the 1D Ising model, Eq.(18), and a single hidden spin we define: The Hamiltonian decomposition Eq.(11) gives: The 1D Ising model with nearest-neighbour interactions can be solved exactly using the method of transfer matrices. To this end define the transfer matrix T with components ⟨x_1|T|x_2⟩ := e^{K x_1 x_2}. The matrix elements of arbitrary integer powers of T can be computed by diagonalization:

⟨x_1|T^n|x_2⟩ = (1/2) (2 cosh K)^n [1 + x_1 x_2 tanh(K)^n].    (D4)
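The closed form (D4) can be checked against direct matrix multiplication; the following sketch does so for a few illustrative values of K and n:

```python
import math

def transfer_power(K, n, x1, x2):
    """<x1|T^n|x2> by repeated matrix multiplication, <x1|T|x2> = exp(K*x1*x2)."""
    T = [[math.exp(K), math.exp(-K)],
         [math.exp(-K), math.exp(K)]]
    M = [[1.0, 0.0], [0.0, 1.0]]  # identity
    for _ in range(n):
        M = [[sum(M[i][k] * T[k][j] for k in range(2)) for j in range(2)]
             for i in range(2)]
    idx = {+1: 0, -1: 1}
    return M[idx[x1]][idx[x2]]

def closed_form(K, n, x1, x2):
    """Spectral form of Eq.(D4): (1/2)(2 cosh K)^n (1 + x1*x2*tanh(K)^n)."""
    return 0.5 * (2.0 * math.cosh(K)) ** n * (1.0 + x1 * x2 * math.tanh(K) ** n)
```

The agreement follows from the eigenvalues 2 cosh(K) and 2 sinh(K) of T, whose ratio tanh(K)^n controls the decay of correlations.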

Exact decimation
For the purpose of numerical comparison with the RSMI solution we perform one step of the exact decimation RG transformation Eq.(20). Following Eq.(4), for every block j the delta-like conditional probability P_Λ(x'_j|{x_{2j−1}, x_{2j}}) strictly enforces x'_j = x_{2j−1} and does not involve x_{2j}; thus, x_{2j} can simply be integrated out. The result has, up to a multiplicative constant e^c, the same form as e^{K[X]} with a new coupling constant K', such that we can set e^c T' = T^2. From that we obtain:

tanh(K') = tanh(K)^2,

such that the renormalized Hamiltonian is:

K'[X'] = K' Σ_j x'_j x'_{j+1}.
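The recursion can be verified numerically: the constant e^c drops out of the ratio of the two independent matrix elements of T^2, so e^c T' = T^2 is equivalent to requiring that this ratio equal e^{2K'}. A minimal sketch:

```python
import math

def decimate(K):
    """Exact 1D Ising decimation step: tanh(K') = tanh(K)**2."""
    return math.atanh(math.tanh(K) ** 2)

def t_squared_elements(K):
    """The two independent matrix elements of T^2, with T = exp(K*x1*x2)."""
    same = math.exp(2 * K) + math.exp(-2 * K)  # <+|T^2|+> (= <-|T^2|->)
    diff = 2.0                                  # <+|T^2|-> (= <-|T^2|+>)
    return same, diff
```

Requiring same/diff = e^{2K'} reproduces tanh(K') = tanh(K)^2, since (e^{2K} + e^{−2K})/2 = cosh(2K) = (1 + tanh(K)^2)/(1 − tanh(K)^2).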

The effective Hamiltonian
Here we compute the effective block parameters Eq.(24) of the 1D Ising model for general block size L V .
Using Eq.(D4), the partition function of the intra-block contribution to the Hamiltonian is given by: The expectations of powers of the inter-block couplings, K_1[X]^k, appearing in the cumulant expansion can be written as a sum of products of operators acting on single blocks (see Appendix C 1). We have: We now consider one term in the above sum and rearrange the factors according to blocks: Depending on the values of k_{j−1} and k_j, the block operator o_j is one of the following three operators: Hence, the average ⟨O⟩_{Λ,b} factorizes into: The Z_2 symmetry of the 1D Ising model can be used to extract the dependence of P_{Λ,b}(h) and the above three quantities on the single hidden spin h: where we used the fact that for a Z_2-symmetric system the coarse-graining satisfies P_Λ(H|V) = P_Λ(−H|−V). Since P_{Λ,b}(h) is normalized we have:

P_{Λ,b}(h) = 1/2.    (D11)

Using similar arguments, we find that each of the remaining block averages also has a definite h-parity p.
Since h only assumes the values ±1, p = +1 implies that the average is actually independent of h, while p = −1 implies it is linear in h. Thus: The last expression can actually be explicitly calculated, independently of the choice of the RG rule: Since v_i = ±1 we also have: Every term in the expanded expression is of the form O[V] tanh(K)^m for an operator O which is a product of several consecutive pairs v_i v_{i+1}. Only O of odd V-parity, and those for which the average does not vanish, contribute; the expression contains only two such contributions: 1 and v_1 v_{L_V} tanh(K)^{L_V−1}. It follows that the result is a Λ-independent constant. The remaining two averages depend on the choice of Λ, and closed expressions for them are given below for the case of block size L_V = 2.
As discussed previously, the cumulants can be expressed in terms of the effective block parameters Eqs.(24). The actual computations can be done by brute-force summation of all possible terms in Eq.(25). This, however, is rather impractical for obtaining higher-order cumulants. We have instead implemented a simple algorithm based on the combinatorial considerations discussed in the main text. Specializing to blocks of two visible spins results in: and the effective block parameters are found to be: As discussed in the main text, both the two-point correlator as a function of the distance between the spins [Fig.(8)] and the m-point correlator as a function of the number of consecutive spins m [Fig.(9)] decay exponentially for small K for the RSMI-favoured solution. This solution, unsurprisingly, is decimation, as can be seen from Figs.(4) and (5). Additionally, in Fig.(10) we show the convergence to the large-λ results of Fig.(5a) with increasing order of the cumulant expansion.

We also comment on the asymmetry (around 0) of the curves in Figs.(5a,b). The curves result from traversing the path λ(cos θ, sin θ) in Fig.(4), which is not fourfold symmetric (instead there are two reflection symmetries with respect to the diagonals). Starting from θ = 0 at the peak, the trajectory traces out the lower branch of the curves in Figs.(5a,b), reaching the lowest point at θ = π/4, before turning around and exactly retracing the trajectory towards the peak at θ = π/2. The trajectory then moves on the upper branch, reaching the uppermost point at θ = 3π/4 and retracing towards the peak again at θ = π. This exact retracing is due to two independent Z_2 symmetries: that of the Ising model and that of the mutual information. Since Z_2 × Z_2 is not isomorphic to Z_4, we do not have a fourfold symmetry in Fig.(4) and consequently we do not have a symmetry around 0 in Figs.(5a,b).
Physically this is easily understood: the mutual information in Fig.(4a) on the λ_1 = −λ_2 diagonal is lower than on the λ_1 = λ_2 one, since for the ferromagnetic Ising model we simulated the neighbouring spins are more likely to be aligned than not. Then for the majority of the spin configurations we have λ_1 v_1 − λ_2 v_2 = 0 on the λ_1 = −λ_2 diagonal, and hence the coarse-graining rule decides the orientation of the effective spin at random, reducing the mutual information. We emphasized before that the physically relevant coarse-graining rules are in the limit of large ||Λ||_2. For small values of ||Λ||_2 the coarse-graining rule is essentially independent of the underlying variables V (equivalently, the rule can be thought of as having a large white-noise component). This manifests itself in Fig.(4a) as low mutual information in the centre. Nevertheless, Figs.(4b,c) seem to have some (different-looking) areas of vanishing "rangeness" and "m-bodyness" ratios in the centre. Those are entirely accidental and non-universal. It is important to understand that since the central area corresponds to deciding the coarse-grained spin entirely at random, the effective Hamiltonian (which would therefore have hardly anything to do with the physics of the underlying system) would not even contain nearest-neighbour terms. The central areas in Figs.(4b,c) thus correspond to ratios of two vanishing quantities. Similarly, in Fig.(5a) the position of the peak not being exactly at 0 for small ||Λ||_2 is due precisely to the accidental features in the centre of Fig.(4b).
A slightly more practical lesson can be taken from Fig.(4c), where even for larger λ multiple crossings of the 0-axis can be observed (i.e. the "m-bodyness" ratio vanishes also for some smaller values of the mutual information, compared to the value at the peak, when the "rangeness" ratio is still large). This is also accidental, but it teaches us that the proper metric to observe is the saturation of the mutual information (corresponding to the peak), and not the vanishing of some particular coefficient in the Hamiltonian (which may be accidental).

Mutual information
Here we explicitly calculate the information-theoretic quantities studied in the main text for the case of the NN Ising model in 1D given by Eq.(18), with a visible region of size L_V coupled to a single hidden spin H = {h}. The system is split into four regions [see Fig.(2)] with their respective sizes satisfying N = L_V + 2L_B + 2L_E + L_O. We denote the spin variables in the three inner regions of the system by: The mutual information can be calculated from Eq.(A1). Since H is a binary random variable, the two entropies appearing in Eq.(A1) can be rewritten in terms of the binary entropy h_2 defined in Eq.(26), with the conditional probability distribution P_Λ(h|E). Thus, the mutual information is given by:

I_Λ(H : E) = h_2[P_Λ(h)] − Σ_E P(E) h_2[P_Λ(h|E)].    (D22)

The relevant probability distributions, P_Λ(h) and P_Λ(h|E), can be computed using transfer matrices (the result is always given in the limit L_O → ∞). For the former, we observe that: which implies that: in the thermodynamic limit. We have already found P_Λ,b(h) in Eq.(D11) to be 1/2, such that the first term in Eq.(D22) gives h_2(1/2) = log(2). The other relevant probability distribution is:

P_Λ(h|E) = Σ_V P(h|V) P(V|E),

where P(h|V) is given by the RBM ansatz Eq.(C6), and to obtain P(V|E) the two distributions P(V, E) and P(E) need to be computed. In the thermodynamic limit L_O → ∞, we obtain by Eq.(D4):

P(V, E) ∝ (1 + v_1 e_{−1} G(L_B + 1)) (1 + v_{L_V} e_1 G(L_B + 1)) e^{K Σ_{⟨e,e'⟩} e e'} (2 cosh(K))^{2(L_E − 1)},

since tanh(K)^m → 0 for m → ∞ and finite K, and Z = (2 cosh(K))^N in the thermodynamic limit. Similarly: Thus:

P(V|E) ∝ (1 + e_{−1} v_1 G(L_B + 1)) (1 + v_{L_V} e_1 G(L_B + 1)) / [1 + e_{−1} e_1 G(L_V + 2L_B + 1)],

which results in:

P_Λ(h|E) ∝ (1 + e_{−1} v_1 G(L_B + 1)) / (1 + b e_{−1} e_1 G(2(L_B + 1))),

where we recognized P_Λ,b(V|h) from Eq.(C2) and used the fact that:

G(L_V + 2L_B + 1) = tanh(K)^{L_V − 1} G(2(L_B + 1)) = b G(2(L_B + 1)).

b. Mutual information between the visibles and the environment
Equation (8) states that the mutual information between the hiddens and the environment, I_Λ(H : E), is bounded from above by the mutual information between the visibles and the environment, I(V : E). We now compute the latter explicitly. By definition:

I(V : E) = Σ_{V,E} P(V, E) log [P(V, E) / (P(V) P(E))],

where all the probability distributions involved are already known, see Eqs.(D23), (D26) and (D27). Observe that the expression inside the logarithm only depends on the four spins e_{−1}, v_1, v_{L_V} and e_1, such that the sum over all other spins can be performed explicitly. We obtain:

I(V : E) = Σ_{e_{−1}, v_1, v_{L_V}, e_1} P(e_{−1}, v_1, v_{L_V}, e_1) log { (1 + e_{−1} v_1 G(L_B + 1)) (1 + v_{L_V} e_1 G(L_B + 1)) / [1 + e_{−1} e_1 G(L_V + 2L_B + 1)] },    (D35)

with

P(e_{−1}, v_1, v_{L_V}, e_1) = (1/16) (1 + e_{−1} v_1 G(L_B + 1)) (1 + v_1 v_{L_V} G(L_V − 1)) (1 + v_{L_V} e_1 G(L_B + 1)).    (D36)
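The quantity I(V : E) can also be checked by brute-force enumeration on a short chain. The sketch below uses an open chain without the outer region L_O (both simplifications relative to the setup above, so it illustrates the quantity rather than reproducing Eq.(D35) exactly); region sizes are illustrative defaults:

```python
import itertools
import math

def mi_V_E(K, L_E=2, L_B=1, L_V=2):
    """Brute-force I(V:E) in nats for a short OPEN 1D Ising chain laid out as
    E_L | B | V | B | E_R. Open boundary and no outer region: a simplification
    relative to the transfer-matrix setup in the text."""
    N = 2 * L_E + 2 * L_B + L_V
    pVE, pV, pE, Z = {}, {}, {}, 0.0
    for s in itertools.product([-1, 1], repeat=N):
        w = math.exp(K * sum(s[i] * s[i + 1] for i in range(N - 1)))
        Z += w
        V = s[L_E + L_B: L_E + L_B + L_V]
        E = s[:L_E] + s[-L_E:]
        pVE[(V, E)] = pVE.get((V, E), 0.0) + w
        pV[V] = pV.get(V, 0.0) + w
        pE[E] = pE.get(E, 0.0) + w
    return sum((v / Z) * math.log(v * Z / (pV[V] * pE[E]))
               for (V, E), v in pVE.items() if v > 0)
```

As expected, the result vanishes at K = 0 (independent spins) and grows with K as the correlations G(n) = tanh(K)^n across the buffers strengthen.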

The case of larger blocks
For the case of L V > 2 additional subtleties are present. These can be attributed to differently broken symmetries in the mutual information and in the effective Hamiltonian.
On the level of interactions, the translation symmetry is explicitly broken by the Hamiltonian decomposition in Eq.(11) and the subsequent cumulant expansion. This is not merely a feature of the method of evaluation but rather a consequence of using a block-spin RG scheme: interactions of spins in the same block are inherently treated differently from interactions of spins from different blocks. However, the full translational symmetry may sometimes be effectively restored. This happens, for instance, in the case of decimation, when for any block size L_V it does not matter which single spin exactly is chosen in the block: the same effective Hamiltonian results.
When computing the mutual information, on the other hand, the full symmetry is not restored for L_V > 2. The spins in the interior of the block are always coupled to the environment more weakly than the ones on the edges. Thus, we end up with two quantities, the renormalized Hamiltonian K and the mutual information I_Λ(H : E), which have different symmetry properties. For example, for L_V = 3 in the 1D Ising case, from the point of view of mutual information we have two equivalent optimal solutions (coupling to the left-most and right-most spins in the block), but it is intuitively clear that coupling to the centre spin is equally good. One important consequence is that the "rangeness", for instance, is not a monotonic function of mutual information in the full parameter space. Crucially though, any global maximum of mutual information corresponds to a global minimum of rangeness (but there could be additional equivalent solutions, just like the centre spin in the L_V = 3 decimation). The RSMI maximization is thus a sufficient criterion for a good RG transformation. However, further investigation of these effects for larger coarse-graining blocks might prove useful (see also some numerical results for the 2D Ising model case in the Supplementary Materials of Ref. [28]).