Renormalized Mutual Information for Artificial Scientific Discovery

We derive a well-defined renormalized version of mutual information that allows one to estimate the dependence between continuous random variables in the important case when one is deterministically dependent on the other. This is the situation relevant for feature extraction, where the goal is to produce a low-dimensional effective description of a high-dimensional system. Our approach enables the discovery of collective variables in physical systems, thus adding to the toolbox of artificial scientific discovery, while also aiding the analysis of information flow in artificial neural networks.

Introduction. -One of the most useful general concepts in the analysis of physical systems is the notion of collective coordinates. In many cases, ranging from statistical physics to hydrodynamics, the description of a complex many-particle system can be dramatically simplified by considering only a few collective variables like the center of mass, an order parameter, a flow field, or vortex positions. However, in new situations, it is not clear a priori which low-dimensional "feature" y = f(x) is best suited as a compact description of the high-dimensional data x. This is the domain of unsupervised feature extraction in computer science, where large datasets like images or time series are to be analyzed [1]. Future frameworks of artificial scientific discovery [2-5] will have to rely on general approaches like this, adding to the rapidly developing toolbox of machine learning for physics [6-8].
The simplest and best-known algorithm for obtaining such features is Principal Component Analysis (PCA) [9]. The idea is to project the input onto the directions of largest variance. Its power is limited, however, since it can only extract linear features. A general approach to assessing the quality of a proposed feature is given by mutual information [10,11]. In general, the mutual information I(x, y) answers the following question: if two random variables y and x depend on one another, and we are provided with the value of y, how much do we learn about x? Technically, it is defined via I(x, y) = I(y, x) = H(y) − H(y|x), where H(y|x) is the conditional entropy of y given x [11]. Maximization of mutual information can be used to extract "optimal" features [12], as sketched in Fig. 1. There exists, however, a well-known and important problem in evaluating the mutual information of continuous variables with a deterministic dependence [13,14], which is exactly the case relevant for feature extraction. In this case, I(x, y) diverges, and it is not clear how to properly cure this divergence without losing important properties of I. Specifically, reparametrization invariance turns out to be crucial: applying a bijective function to obtain y' = g(y) does not change the information content, and thus I(x, y') = I(x, y).

Figure 1. Feature extraction, where a high-dimensional "microscopic" description x (such as the configuration of a many-particle system) is mapped to a low-dimensional feature y = f(x). This is the case where the renormalized mutual information presented in this article is needed for feature optimization.
In this work, we introduce a properly renormalized version of mutual information for the important case of feature extraction with continuous variables:

Ĩ(x, y) = H(y) − ∫ dx P_x(x) ln √(det[∇f(x) · ∇f(x)])     (1)

where x ∈ R^N and y = f(x) ∈ R^K; we use ∇f(x) · ∇f(x) as a short-hand notation for (Σ_i ∂_i f_µ ∂_i f_ν)_{µν}, with 1 ≤ i ≤ N and 1 ≤ µ, ν ≤ K, i.e. the K × K matrix resulting from the product of the (K × N) Jacobian matrix ∇f(x) and its transpose. The quantity Ĩ is well-defined and finite. In addition, it preserves fundamental properties of mutual information, among which the invariance under reparametrization of the features:

Ĩ(x, g(f(x))) = Ĩ(x, f(x))     (2)

for a bijective function g: R^K → R^K. We will derive and discuss below the meaning and usefulness of the renormalized quantity Ĩ.
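As a simple illustration of Eq. (1) (a worked special case, assuming Gaussian data and a linear feature, not part of the derivation below), take x ∼ N(0, C) and y = f(x) = w · x with K = 1. Then H(y) = ½ ln(2πe wᵀCw), ∇f(x) = w, and det(∇f · ∇f) = wᵀw, so

Ĩ(x, y) = ½ ln(2πe wᵀCw) − ½ ln(wᵀw) = ½ ln(2πe wᵀCw / wᵀw).

This is unchanged under any rescaling w → αw (a special case of Eq. (2)) and is maximized when w points along the leading eigenvector of C: for linear features on Gaussian data, maximizing Ĩ singles out the first principal component, whereas H(y) alone could be made arbitrarily large by rescaling.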
Mutual information is used in many cutting-edge machine learning applications, helping to improve the intermediate layers of a neural network [15,16], to increase the interpretability of Generative Adversarial Networks [17], to analyze the behavior of neural networks during training [18,19] through the Information Bottleneck method [20,21], and to perform feature extraction via mutual information optimization [22]. It can also be used to characterize the variables in a renormalization group procedure [23]. Its practical estimation is not trivial [24], but recently derived bounds [25] permit its evaluation even in high-dimensional spaces, with the help of neural networks [26].
However, there is a problem with deterministically-dependent continuous features: the conditional entropy H(y|x) formally diverges as −ln δ(0) whenever y is a deterministic function of x. To understand why, it is enough to take its definition, H(y|x) = −∫ dx dy P_x(x) P(y|x) ln P(y|x), and plug in P(y|x) = δ(y − f(x)). This is specific to continuous variables: for discrete variables, the conditional entropy would be zero and the mutual information would coincide with the entropy of one of the variables. It is clear that, to deal with a deterministic continuous dependence, it is necessary to somehow redefine mutual information. Past remedies involved adding noise to the feature y or (equivalently) simply considering the non-diverging term H(y) [22,27], as briefly suggested in the seminal InfoMax paper [12]. However, they all lead to a very undesirable property: they break the fundamental reparametrization invariance of mutual information. In such a scheme, any two features can be made to have the same entropy H(y) simply by rescaling. Thus, in the context of feature optimization, they would be considered equally favorable, even if they represent very different information about x. The reason is that such a scheme completely ignores the diverging quantity H(y|x). In contrast, we show that H(y|x) contains a non-trivial finite dependence on the feature f(x), which must be taken into account to obtain consistent results.
Renormalized Mutual Information. -In any physical system, there are small pre-existing measurement uncertainties associated with extracting the microscopic observables x. Thus, loosely speaking, when trying to deduce information about x given the value of y, we have to be content with resolving x up to some spread ε. Motivated by this, we first consider a finite regularized quantity I_ε(x, y). It is defined as the mutual information between the observable x and the feature function applied to a noisy version of the observable: y = f(x + ελ), where ε ∈ R is the noise strength and λ ∈ R^N is a random multidimensional Gaussian of zero mean and unit covariance matrix. In the limit ε → 0 we recover the original definition of mutual information, which diverges logarithmically. Even in that limit, the nature of the adopted noise distribution (e.g. isotropy, independence of x) still matters, and corresponds to imposing some hypotheses about the observed quantities x (e.g. the same measurement uncertainty in all variables). We discuss these generalizations at the end of this work.

When ε ≪ 1, we can expand f(x + ελ) ≈ f(x) + ελ · ∇f(x). By explicit calculation, it can easily be found that P(y|x) is a Gaussian distribution of mean f(x) and covariance matrix ε² ∇f(x) · ∇f(x) = ε² (Σ_i ∂_i f_µ ∂_i f_ν)_{µν}. We can calculate the conditional entropy and get

H(y|x) = ∫ dx P_x(x) ln √(det[∇f(x) · ∇f(x)]) + K H_ε,     (3)

where H_ε = ½ ln(2πeε²) is the entropy of a one-dimensional Gaussian with variance ε². The first term only depends on the feature, and the second only on the noise; only the latter diverges when ε → 0. Therefore

Ĩ_ε(x, y) ≡ I_ε(x, y) + K H_ε     (4)

has a well-defined limit ε → 0 and still contains all the dependence on f(x). By performing the limit we obtain our main result, Eq. (1). We can easily show that Eq. (1) is invariant under feature reparametrization. Consider an invertible function z = g(y): R^K → R^K. We can rewrite the entropy of z as the entropy of y plus an extra term, which cancels against the term obtained from ln det(∇g(f(x))) in the Jacobian contribution, leading to Eq. (2). We emphasize the importance of this property: after an invertible transformation of the variable y, no information should be lost, and the new variable should have the same mutual information with x as the old one. In contrast, by adding Gaussian noise η to the feature y instead of to x, i.e. y = f(x) + εη, the final result would depend on the feature only via H(y), and reparametrization invariance would not hold under this alternative regularization. The price for a finite mutual information between two deterministically-dependent variables is that when there is no dependence, e.g. y = const., we get −∞ instead of 0. In addition, given the different roles that x and y play, renormalized mutual information is no longer symmetric in its arguments. From a different perspective [28], Eq. (1) can be expressed as a particular kind of Information Loss [29,30].
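To make the cancellation explicit in the simplest case (K = 1; a short check, equivalent to the general argument), let z = g(y) with g invertible. Then P_z(z) = P_y(y)/|g'(y)|, so

H(z) = H(y) + ∫ dx P_x(x) ln |g'(f(x))|,

while ∇z(x) = g'(f(x)) ∇f(x) gives ln √(det[∇z · ∇z]) = ln |g'(f(x))| + ln √(det[∇f · ∇f]). Subtracting the two terms according to Eq. (1), the ln |g'| contributions cancel and Ĩ(x, z) = Ĩ(x, y).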
Mutual information obeys inequalities like I(x, (y_1, y_2)) ≥ I(x, y_1), which translate to the regularized version I_ε.
However, naively taking ε → 0 results in an empty inequality, Ĩ(x, (y_1, y_2)) + ∞ ≥ Ĩ(x, y_1). By contrast, starting from I(x, (y_1, y_2)) ≥ I(x, y_1) + I(x, y_2) − I(y_1, y_2), we can take the same limit and obtain a useful finite result:

Ĩ(x, (y_1, y_2)) ≥ Ĩ(x, y_1) + Ĩ(x, y_2) − I(y_1, y_2),     (5)

where the last term is the ordinary mutual information between the two features. In the special case where the dimensions of y_1 and y_2 add up to the dimension of x, and the mapping x → (y_1, y_2) is bijective, reparametrization invariance produces

Ĩ(x, (y_1, y_2)) = Ĩ(x, x) = H(x),     (6)

and so

Ĩ(x, y_1) ≤ H(x) − Ĩ(x, y_2) + I(y_1, y_2).     (7)

If one constructs y_2 to be independent of y_1, the third term on the right-hand side vanishes. However, it would be impermissible to drop Ĩ(x, y_2), since it can have any sign.

Figure 2. Comparing renormalized mutual information Ĩ for several features in two representative physical scenarios. (a) Fluctuating 1D field on a lattice, with a randomly placed "wave packet" (one single sample is depicted). (b) Ĩ as a function of the size of the field fluctuations σ_ξ for several features. Let ⟨A_j⟩ = (1/N) Σ_{j=1}^{N} A_j. We consider: the average field f(x) = ⟨x_j⟩, the position j weighted by the field amplitude, ⟨j x_j⟩, or weighted by the field intensity, ⟨j x_j²⟩, as well as the "normalized" feature ⟨j x_j²⟩/⟨x_i²⟩ (similar to an expectation value in quantum mechanics) and the first PCA component. (c) Two-dimensional "drops" with elliptical shapes of fixed area but with fluctuating deformation amplitude δr and orientation θ (three samples are depicted). (d) Ĩ vs. max. deformation spread for the 2D feature given by PCA and for two nonlinear features sensitive to shape deformations (one of them denoted f_Var), where (x_j^(1), x_j^(2)) are the coordinates of particle j. In both (b) and (d): AE represents the bottleneck of a contractive autoencoder trained to reconstruct the input, and NN corresponds to the feature given by a neural network optimized to maximize Ĩ. In the insets, we show the entropy H(f(x)); this quantity is not reparametrization invariant.
Feature comparison. -The renormalized mutual information can be used to find out how useful any given "macroscopic" quantity (i.e. a feature y = f(x)) would be in characterizing the system. The result depends on the statistical distribution of x. It might be the Boltzmann distribution in equilibrium or a distribution of "snapshots" of the system configuration during some arbitrary time evolution. When control parameters such as temperature or external fields change the distribution of x, the optimal feature can change. Intuitively, observing a feature with higher Ĩ is more effective in narrowing down the set of underlying configurations x compatible with the observed value, thus yielding more information about the system. We show proof-of-concept examples in the most common domains of physics that deal with many degrees of freedom: fluctuating fields and many-particle systems. One important goal is to discover, without prior knowledge, that a given fluctuating field is dominated by certain localized excitations (like solitons and vortices) and to robustly estimate their properties (position, shape, velocity, etc.). The simplest example is a 1D field on a lattice with a wave packet of fixed shape at a random position (Fig. 2a,b) [31]. For now, we evaluate Ĩ for a variety of handcrafted features, turning to feature optimization further below. Because of reparametrization invariance (Eq. (2)), the scaling of any of them is irrelevant, as is any bijective nonlinear transformation. For comparison, we also consider PCA [9], which in our context corresponds to a feature f(x) = Σ_j x_j u_j, where u is the eigenvector associated with the largest eigenvalue of the covariance matrix ⟨x_i x_j⟩ − ⟨x_i⟩⟨x_j⟩, and the bottleneck of a contractive autoencoder [32].
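As an illustration of how such a comparison can be carried out numerically (a minimal NumPy sketch, not the published implementation [43]; the toy data generator and the two handcrafted features below are simplified stand-ins), Ĩ can be estimated from samples by combining a histogram estimate of H(y) with the sampled average of ln √(det ∇f · ∇f):

    import numpy as np

    def renormalized_mi(samples, feature, grad_feature, bins=100):
        """Monte Carlo estimate of Eq. (1) for a scalar feature:
        histogram entropy of y = f(x) minus the average of ln|grad f(x)|."""
        y = np.array([feature(x) for x in samples])
        grads = np.array([grad_feature(x) for x in samples])       # shape (S, N)
        counts, edges = np.histogram(y, bins=bins, density=True)   # H(y) from a histogram
        dy = edges[1] - edges[0]
        p = counts[counts > 0]
        H_y = -np.sum(p * np.log(p)) * dy
        # Jacobian term: ln sqrt(det(grad f . grad f)) = ln ||grad f|| for a 1D feature
        jac_term = np.mean(np.log(np.linalg.norm(grads, axis=1)))
        return H_y - jac_term

    # Toy data: 1D field with a wave packet at a random position (simplified stand-in)
    rng = np.random.default_rng(0)
    N, S = 64, 5000
    pos = rng.uniform(0, N, size=S)
    grid = np.arange(N)
    X = np.exp(-0.5 * ((grid[None, :] - pos[:, None]) / 3.0) ** 2) \
        + 0.1 * rng.normal(size=(S, N))

    # Handcrafted features: average field vs. intensity-weighted position
    f_avg  = lambda x: x.mean()
    g_avg  = lambda x: np.full_like(x, 1.0 / len(x))
    f_wpos = lambda x: np.sum(grid * x**2) / len(x)
    g_wpos = lambda x: 2 * grid * x / len(x)

    print(renormalized_mi(X, f_avg, g_avg), renormalized_mi(X, f_wpos, g_wpos))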
In a many-particle system (molecule, star cluster, plasma, etc.), the goal is to discover the most meaningful collective coordinates. A simple prototypical example is a liquid drop of fluctuating shape and orientation, made of atoms with known force fields (Fig. 2c,d).

Figure 4 (caption fragment). (b) Predicting the position of the wave packet (example from Fig. 2a). (c) Predicting the orientation and deformation of the drop (example from Fig. 2c). The optimized NN feature achieves the best performance in (b) and a performance very close to that of our best handcrafted feature in (c).
Feature optimization. -Instead of relying on a list of plausible handcrafted features, we can consider a class of parametrized features and optimize Ĩ over the parameters. We opted for a multilayer neural network [33], i.e. f(x) = f_θ(x), with θ representing the parameters of the network. Intuitively, meaningful features are those that provide the largest information without over-engineering. While handcrafted features, like in the previous section, are unarguably simple, the optimization of an excessively powerful feature function could encode additional (non-relevant) information by means of very nonlinear transformations. The tradeoff between the simplicity of the feature and the amount of preserved information can be adjusted both by the choice of network architecture and by adding a small additional regularization penalty (in practice, this can be achieved by penalizing features with large gradients). The optimization of Ĩ(x, f_θ(x)) can be implemented easily with gradient-ascent algorithms [33], as sketched below. The first term in Eq. (1) can be estimated with a histogram; for the second term, the required ∇f is immediately available, since neural networks are differentiable functions, and one can rely on statistical sampling of x. Note that the extra degree of freedom of feature space due to reparametrization invariance (Eq. (2)) can also be exploited to enforce additional constraints [34].
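A minimal sketch of such an optimization loop (TensorFlow; the architecture, kernel width, and optimizer settings are illustrative assumptions, and the entropy estimate here is a simple kernel-density variant rather than the binned scheme of Appendix E):

    import math
    import tensorflow as tf

    K_dim = 1                                    # feature dimension K
    net = tf.keras.Sequential([                  # f_theta: illustrative architecture
        tf.keras.layers.Dense(64, activation="tanh"),
        tf.keras.layers.Dense(64, activation="tanh"),
        tf.keras.layers.Dense(K_dim),
    ])
    opt = tf.keras.optimizers.Adam(1e-3)

    def entropy_kde(y, width=0.1):
        """Differentiable estimate of H(y) from a Gaussian kernel density estimate."""
        d2 = tf.reduce_sum((y[:, None, :] - y[None, :, :]) ** 2, axis=-1)
        log_p = (tf.reduce_logsumexp(-d2 / (2 * width**2), axis=1)
                 - tf.math.log(tf.cast(tf.shape(y)[0], tf.float32))
                 - 0.5 * K_dim * math.log(2 * math.pi * width**2))
        return -tf.reduce_mean(log_p)

    def train_step(x):                           # x: batch of samples, shape (B, N)
        with tf.GradientTape() as tape:
            with tf.GradientTape(persistent=True) as jac_tape:
                jac_tape.watch(x)
                y = net(x)
            J = jac_tape.batch_jacobian(y, x)    # Jacobians of f_theta, shape (B, K, N)
            jac_term = 0.5 * tf.reduce_mean(
                tf.linalg.logdet(tf.matmul(J, J, transpose_b=True)))
            loss = -(entropy_kde(y) - jac_term)  # C = -I~(x, f_theta(x))
        opt.apply_gradients(zip(tape.gradient(loss, net.trainable_variables),
                                net.trainable_variables))
        return -loss

    # usage: for each batch, call train_step(tf.constant(x_batch, tf.float32))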
In Fig. 3a we show the optimization of a nonlinear 1D feature for a 2D non-Gaussian distribution. Such a low-dimensional setting allows one to visualize the shape of the feature and to compare it with PCA. We apply the same technique also to the physical examples (see "NN" in Fig. 2b,d).
One way to assess the quality of features is by suitable visualization (see Fig. 3b,c). The optimized NN feature is clearly able to identify the relevant properties of the system, better than (or at least as well as) the other features. A more quantitative, well-known approach is to perform supervised training for a regression task with the feature as input and to analyze the resulting performance [35]; a minimal sketch of such a check is given below. In the physics examples shown here, one is naturally interested in predicting underlying parameters, like the wave packet location. Fig. 4b,c illustrate superior or very good performance of the network.
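A minimal sketch of such a supervised check (using scikit-learn; the regressor and its settings are illustrative choices), which could be applied, for instance, to the feature values and wave-packet positions generated in the earlier sketch:

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import train_test_split

    def regression_score(feature_values, targets):
        """R^2 for predicting an underlying parameter (e.g. the wave-packet position)
        from the scalar feature alone; higher means the feature retains more of it."""
        X = np.asarray(feature_values).reshape(-1, 1)
        X_tr, X_te, t_tr, t_te = train_test_split(X, targets, test_size=0.3,
                                                  random_state=0)
        reg = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000).fit(X_tr, t_tr)
        return reg.score(X_te, t_te)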
In our illustrative examples we only considered 1D or 2D features. For higher-dimensional features, the numerical estimation of Eq. (1) is more challenging, but in principle still feasible [36], for example through adversarial techniques [37].
Also, all the components x_j had the same physical meaning (e.g. particle coordinates). For components with different dimensions (e.g. positions and momenta), one needs to decide how to compare fluctuations along different components. A slight change in the regularization procedure is then required. Most generally, we can consider the noise distribution P(λ|x) to have an arbitrary covariance matrix Σ(x), even allowing for a location-dependent "resolution". We find that it is necessary to replace the matrix ∇f(x) · ∇f(x) in Eq. (1) with ∇f(x)Σ(x)∇f(x), thus effectively introducing a metric on x-space [38]. This changes the inequality mentioned above (Eq. (7)).
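Explicitly, repeating the derivation with a general noise covariance (a short consistency check rather than a new result) gives

Ĩ(x, y) = H(y) − ∫ dx P_x(x) ln √(det[∇f(x)Σ(x)∇f(x)]),

and reparametrization invariance (Eq. (2)) survives: for z = g(y) with g invertible, H(z) = H(y) + ⟨ln |det ∇g(f(x))|⟩_x, while det[∇(g∘f)Σ∇(g∘f)] = det[∇g]² det[∇f Σ ∇f], so the extra terms cancel exactly as before.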
Outlook. -Renormalized mutual information can be useful in many areas of statistical analysis, machine learning, and physics.
It can be directly applied in diverse physical scenarios, with many interesting variations and extensions. In statistical physics, one expects that different phases of matter yield different optimal features. Moreover, one could optimize for feature fields (order parameter fields) by using convolutional layers in the neural network. The locations of defects like domain walls and vortices could be discovered as relevant features. In general, an optimized low-dimensional description of a high-dimensional system can be used to make partial predictions for the time evolution. In dynamical systems the renormalized mutual information could help to discover the underlying regularities of the system. Even in the presence of chaos, the evolution of collective variables can be predictable (and still non-trivial) [39]. Quantum-mechanical systems could be analyzed as well, e.g. by sampling configurations x according to a many-body state, or sampling parameters in the Hamiltonian and looking at the expectation values x of a set of commuting observables in the corresponding ground state.
Renormalized mutual information can be used to analyze deterministic representations of a dataset. Here we illustrated the approach only in settings with at most two-dimensional features, but it should be feasible to efficiently evaluate Ĩ also with high-dimensional feature spaces. This approach could be used to study the behavior of a neural network from an information-theoretic perspective, for example by analyzing the renormalized mutual information between the input and an intermediate layer of a neural network. This could be helpful for concepts like the "information bottleneck" [20,40], which is known to be affected by the problems we discussed. Moreover, the important challenge of representation learning for high-dimensional datasets (like images) can benefit: our optimized features are purely defined by their information content and not by the capability to accomplish selected tasks. Thus, they could be useful in transfer learning scenarios, in which many classifiers are built from the same representation. We emphasize that the method advocated here should be especially useful when the dimensionality is so drastically reduced that autoencoders [32,41,42] would not plausibly work very well, since it would be impossible for a decoder to produce an approximation of the input from so few latent variables (see Fig. 3c). This is precisely the situation important for collective variables and similar strongly reduced descriptions.
The code of this paper is publicly available [43].
Acknowledgements. -We thank Andreas Maier for discussions.
Appendix A: Derivation of the Renormalized Mutual Information

In this section, we derive the renormalized mutual information equation in the general case in which the regularizing noise also depends on x. We consider the observable distribution x ∼ P_x(x), with x ∈ R^N. Let λ be the noise variable. It has a zero-mean Gaussian distribution with covariance matrix Σ(x). If we have no assumptions on the observables, we can just choose Σ(x) = I_N. Let ε ∈ R represent the strength of the noise. At the end of the calculation, we perform the limit ε → 0. First of all, we define the feature

y = f(x + ελ).

Its probability distribution is given by

P(y|x) = ∫ dλ P_λ(λ|x) δ(y − f(x + ελ)).

By definition, P_λ(λ|x) is a Gaussian distribution with zero mean. The contribution of large values of λ in the δ-function is suppressed by the factor P_λ(λ|x). As a consequence, when ε ≈ 0, we can consider the expansion of the feature function, f(x + ελ) ≈ f(x) + ε∇f(x) · λ. We employ the Fourier representation of the δ-function, δ(y) = 1/(2π)^K ∫ ds e^{is·y}, and plug in the expression of the distribution of the noise. Performing the Gaussian integral over λ, we get

P(y|x) = 1/(2π)^K ∫ ds exp[ i s · (y − f(x)) − (ε²/2) sᵀ ∇f(x)Σ(x)∇f(x) s ].

Now, we can also perform the Gaussian integral in s and get

P(y|x) = (2πε²)^{−K/2} det[∇f(x)Σ(x)∇f(x)]^{−1/2} exp[ −(y − f(x))ᵀ (ε²∇f(x)Σ(x)∇f(x))⁻¹ (y − f(x)) / 2 ].

This is a Gaussian distribution with mean f(x) and covariance matrix ε²∇f(x)Σ(x)∇f(x). By explicit calculation, the conditional entropy H(y|x) is given by

H(y|x) = ∫ dx P_x(x) ln √(det[∇f(x)Σ(x)∇f(x)]) + K H_ε,

with H_ε = ½ ln(2πeε²). We define

Ĩ(x, y) ≡ lim_{ε→0} [I_ε(x, y) + K H_ε] = H(y) − ∫ dx P_x(x) ln √(det[∇f(x)Σ(x)∇f(x)]).     (A1)
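The limit in Eq. (A1) can also be checked numerically in a case where all entropies are known in closed form, namely a linear feature of Gaussian data (a consistency check in NumPy, not part of the published code [43]):

    import numpy as np

    rng = np.random.default_rng(1)
    N = 5
    A = rng.normal(size=(N, N)); C = A @ A.T          # covariance of x ~ N(0, C)
    w = rng.normal(size=N)                            # linear feature y = w . x

    var_y = w @ C @ w                                 # variance of f(x)
    I_tilde = 0.5 * np.log(2 * np.pi * np.e * var_y) - 0.5 * np.log(w @ w)

    for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
        # y_eps = w.(x + eps*lambda): regularized quantities in closed form
        H_y_eps   = 0.5 * np.log(2 * np.pi * np.e * (var_y + eps**2 * (w @ w)))
        H_y_giv_x = 0.5 * np.log(2 * np.pi * np.e * eps**2 * (w @ w))
        I_eps     = H_y_eps - H_y_giv_x
        H_eps     = 0.5 * np.log(2 * np.pi * np.e * eps**2)
        print(eps, I_eps + H_eps, I_tilde)            # I_eps + K*H_eps -> I~ as eps -> 0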

Liquid Drop
The example in Fig. 2c is built by first choosing a uniform random deformation δr ∈ [0, 0.8] and orientation θ ∈ [0, π]. An ellipse with axes A = R + δr and B = R²/(R + δr), rotated by an angle θ, is considered. The axes are chosen so that the area of the ellipse is fixed to πR² for any δr. N particles are placed randomly inside the ellipse. Subsequently, we turn on a Lennard-Jones-type pair interaction (with exponent n) between the particles, where d_ij is the distance between two particles, d_coll is a cutoff at small distances, and d_eq represents the equilibrium distance between two particles. We also add a boundary potential of strength W that constrains the particles inside the drop, where x^(i) is the coordinate of particle i. We simulate the relaxation of the system by performing some gradient-descent steps; a stochastic term is added to emulate a finite temperature T: we update the position of each particle with a gradient-descent step on the total potential plus a noise term proportional to ξ_i, where ξ_i is a random Gaussian variable (zero mean, unit variance) and η represents the step size of the thermalization. In particular, we chose R = 1, N = 60, n = 6, d_eq = 0.27, d_coll = 0.06, W = 200, η = 10⁻⁵. We performed 2·10³ thermalization steps.
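A sketch of this sample-generation procedure (NumPy; the exact pair potential, boundary force, and noise scaling below are illustrative assumptions, since the corresponding formulas are not reproduced here, while the parameters follow the text):

    import numpy as np

    def make_drop_sample(rng, R=1.0, N=60, n=6, d_eq=0.27, d_coll=0.06,
                         W=200.0, eta=1e-5, steps=2000, T=1.0, dr_max=0.8):
        dr, theta = rng.uniform(0, dr_max), rng.uniform(0, np.pi)
        A, B = R + dr, R**2 / (R + dr)                   # fixed-area ellipse axes
        pts = []                                         # particles inside the ellipse
        while len(pts) < N:
            p = rng.uniform(-A, A, size=2)
            if (p[0] / A) ** 2 + (p[1] / B) ** 2 <= 1.0:
                pts.append(p)
        rot = np.array([[np.cos(theta), -np.sin(theta)],
                        [np.sin(theta),  np.cos(theta)]])
        x = np.array(pts) @ rot.T                        # rotate to lab frame

        def forces(x):
            d = x[:, None, :] - x[None, :, :]
            r = np.linalg.norm(d, axis=-1) + np.eye(N)   # avoid self-division
            r = np.maximum(r, d_coll)                    # short-distance cutoff
            # Lennard-Jones-like pair force with equilibrium distance d_eq (assumed form)
            dVdr = (2 * n / r) * ((d_eq / r) ** n - (d_eq / r) ** (2 * n))
            f_pair = -np.sum(dVdr[..., None] * d / r[..., None], axis=1)
            # boundary force pushing particles back inside the ellipse (assumed form)
            u = x @ rot                                  # back to the ellipse frame
            s = (u[:, 0] / A) ** 2 + (u[:, 1] / B) ** 2
            f_bnd = -W * np.maximum(s - 1.0, 0.0)[:, None] * x
            return f_pair + f_bnd

        for _ in range(steps):                           # noisy relaxation (thermalization)
            x += eta * forces(x) + np.sqrt(eta * T) * rng.normal(size=x.shape)
        return x.ravel()                                 # "microscopic" sample x

    sample = make_drop_sample(np.random.default_rng(0))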

Appendix E: Feature Extraction in a Low-Dimensional Setting
Here, we show how we implemented the optimization of Eq. (A1) to extract the features in the Feature optimization section of the main text. In particular, we extract a one- or two-dimensional feature. In a low-dimensional feature setting, it is feasible to discretize the feature space y on a lattice, with lattice constant ∆y in the 1D case or ∆y_1, ∆y_2 in the 2D case. We keep f(x) continuous and we parametrize it with a neural network, i.e. f(x) = f_θ(x), where θ are the weights and biases of the neurons of the network. In particular, we quickly recall that a neural network is made of many concatenated layers, and that each layer applies a nonlinear function σ (called the activation function) to a linear combination of the inputs: we can write the output of a layer with input x and parameters θ = (w, b) as σ(w·x + b). To implement Eq. (A1) in TensorFlow [44], we need to write it as a differentiable function. By looking at Eq. (A1), we see that the first term H(y) must be approximated in some way; the second term can be used directly.
The easiest way to approximate H(y) is to first estimate P_y(y), and then use it to compute the sum H(y) = −∆y Σ_k P_y(y_k) log P_y(y_k) (where ∆y = ∆y_1 ∆y_2 in the 2D feature case). We approximate the probability density P_y(y) with a Kernel Density Estimation procedure [45]: we apply a kernel K_j(y) centered on each point y_j = f(x_j) and write

P_y(y) = (1/N) Σ_j K_j(y),

where N is the number of points in the batch. Now we can discretize P_y(y) by assigning to each element y_k of the lattice

P_y(y_k) = (1/(N ∆y)) Σ_j ∫_{cell k} K_j(y') dy'.     (E1)

Figure 5. Samples from the example of Fig. 2a of the main text, one per row, ordered by increasing value of the feature. Without the regularizing term in Eq. (E2), the optimization can get stuck in local minima that have almost the same Ĩ as the optimal feature (compare with Fig. 3b of the main text). As a consequence, the network can assign different feature values to very similar samples, while still being able to distinguish them.
We can then compute the integral in Eq. (E1) as the difference of two Error functions. In practice, we use a Gaussian kernel with variance (s∆y)², i.e.

K_j(y) = (2π(s∆y)²)^{−d/2} exp( −|y − y_j|² / (2(s∆y)²) )

(where d is the dimension of y), and we empirically chose s = 1 in the 1D case and s = 2 in the 2D case. In the 1D case we discretize the feature into k_f = 180 bins; in the 2D example we use k_f = 100. We fix the bounds of the histogram so that they always include all the points. The optimization of Eq. (A1) is performed through gradient descent in the following way. We define the cost function that we want to minimize, C = −Ĩ(x, f_θ(x)). At each step, we consider a batch of samples, calculate f_θ(x) for all the points in the batch and use it to estimate P_y(y). We calculate Ĩ(x, y = f_θ(x)) and use backpropagation to update each parameter of the neural network, i.e. an update rule

θ_{n+1} = θ_n − η (∂C/∂θ)|_{θ_n},

where n is the training step and the learning rate η is a fixed parameter of the algorithm, trying to converge to the minimum of the cost function. In practice, one can obtain better performance by using more advanced algorithms like RMSprop or Adam [46]. We can improve the smoothness of the extracted feature by adding to the cost function an exponentially decaying term that penalizes large gradients,

A e^{−n/τ} ⟨||∇f_θ(x)||⟩_x,     (E2)

where ⟨a(x)⟩_x = ∫ dx P_x(x) a(x), and A and τ are hyperparameters that should be chosen conveniently. This can prevent extracting features that have almost the optimal information content but present discontinuities (see for example Fig. 5). A minimal sketch of the binned entropy estimate is given below.
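A minimal TensorFlow sketch of this binned entropy estimate for a 1D feature (the bin handling and numerical safeguards are illustrative choices, not the published code [43]):

    import math
    import tensorflow as tf

    def binned_entropy_1d(y, k_f=180, s=1.0):
        """H(y) from a Gaussian-KDE estimate of P_y, integrated over k_f bins (Eq. (E1))."""
        y = tf.reshape(y, [-1])                          # batch of feature values
        lo = tf.reduce_min(y) - 1e-3
        hi = tf.reduce_max(y) + 1e-3
        edges = tf.linspace(lo, hi, k_f + 1)             # bin edges covering all points
        dy = (hi - lo) / k_f
        sig = s * dy                                     # kernel std = s * bin width
        # integral of each Gaussian kernel over each bin = difference of two Error functions
        z = (edges[None, :] - y[:, None]) / (sig * math.sqrt(2.0))
        cdf = 0.5 * (1.0 + tf.math.erf(z))               # shape (batch, k_f + 1)
        mass = tf.reduce_mean(cdf[:, 1:] - cdf[:, :-1], axis=0)   # probability mass per bin
        p = mass / dy                                    # density P_y(y_k) on the lattice
        safe_log = tf.math.log(tf.where(p > 0, p, tf.ones_like(p)))
        return -tf.reduce_sum(mass * safe_log)           # -dy * sum_k P(y_k) ln P(y_k)

This estimator can replace the simpler kernel-density estimate used in the sketch of the main text, with the penalty of Eq. (E2) added to the cost function.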
In addition, we recall that because of reparametrization invariance (Eq. (B1)) we can arbitrarily choose the density distribution of the output feature. For example, we can enforce a Gaussian feature distribution (with zero mean and variance σ²) by adding to the cost function the Kullback-Leibler divergence [33] between the feature distribution and such a Gaussian G,

B KL(P_y || G),     (E3)

where B is a hyperparameter that can be chosen freely; a possible discretized form of this term is sketched below. This increases the accuracy of the feature-entropy estimate because it prevents the output distribution from condensing in very small regions, which would make our estimate unreliable. To sum up, the cost function that we minimize, including all the regularizations described above, reads

C = −Ĩ(x, f_θ(x)) + A e^{−n/τ} ⟨||∇f_θ(x)||⟩_x + B KL(P_y || G).     (E4)
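A possible discretized form of the regularizer of Eq. (E3), assuming the bin masses, bin centers, and bin width from the kernel-density binning above are exposed (an illustrative helper, not part of [43]):

    import math
    import tensorflow as tf

    def kl_to_gaussian(mass, centers, dy, sigma=1.0):
        """KL(P_y || G) for a binned P_y and a zero-mean Gaussian G of variance sigma^2."""
        p = mass / dy                                              # P_y on the lattice
        g = tf.exp(-centers**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))
        ratio = tf.where(p > 0, p / g, tf.ones_like(p))
        return tf.reduce_sum(mass * tf.math.log(ratio))            # ~ integral of P_y ln(P_y/G)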
Table I shows the parameters we used in the examples described in Fig. 3 of the main text.

Improving feature extraction
In this paper, we optimized renormalized mutual information to extract only a one- or two-dimensional feature. As shown in the previous section, we could use a histogram-like approximation of the entropy term in Eq. (A1). Generalizations to high-dimensional feature settings will clearly require a better way to estimate H(y). One more advanced option is to exploit adversarial techniques. One could introduce a reference distribution P_r(y) and rewrite

H(y) = −⟨ln P_r(y)⟩_{y∼P_y} − KL(P_y || P_r),

where the last term is the Kullback-Leibler divergence [33] between P_y(y) and P_r(y). This term can be estimated by means of adversarial techniques [37], i.e. by optimizing over an auxiliary "discriminator" neural network D(y):

H(y) = −⟨ln P_r(y)⟩_{y∼P_y} + min_D [ ⟨D(y)⟩_{y∼P_y} + ⟨e^{−D(y)}⟩_{y∼P_r} − 1 ].
For convenience, P r (y) can be chosen as a Gaussian distribution. This is related to the technique proposed in [26] to estimate mutual information, while properly allowing for the renormalization discussed here.
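A sketch of this adversarial entropy estimate (TensorFlow; the discriminator architecture is an illustrative assumption, and D must be trained separately to minimize the bracketed term on the same batches):

    import math
    import tensorflow as tf

    def make_discriminator():
        return tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(1),
        ])

    def adversarial_entropy(y, D):
        """H(y) ~ -<ln P_r(y)>_{P_y} + <D(y)>_{P_y} + <e^{-D(y)}>_{P_r} - 1,
        with P_r a standard-normal reference distribution; the estimate becomes tight
        once D has been trained to minimize the last three terms."""
        K_dim = y.shape[-1]
        y_ref = tf.random.normal(tf.shape(y))                      # samples from P_r
        log_pr = (-0.5 * tf.reduce_sum(y**2, axis=-1)
                  - 0.5 * K_dim * math.log(2 * math.pi))           # ln P_r(y) for y ~ P_y
        cross = -tf.reduce_mean(log_pr)                            # -<ln P_r(y)>_{P_y}
        kl_bound = (tf.reduce_mean(D(y)) + tf.reduce_mean(tf.exp(-D(y_ref))) - 1.0)
        return cross + kl_bound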