Making sense of complex systems through resolution, relevance, and mapping entropy

Complex systems are characterised by a tight, nontrivial interplay of their constituents, which gives rise to a multi-scale spectrum of emergent properties. In this scenario, it is practically and conceptually difficult to identify those degrees of freedom that mostly determine the behaviour of the system and to separate them from less prominent players. Here, we tackle this problem making use of three measures of statistical information: resolution, relevance, and mapping entropy. We address the links existing among them, starting from the established relation between resolution and relevance and further developing novel connections between resolution and mapping entropy; by these means we can identify, in a quantitative manner, the number and selection of degrees of freedom of the system that preserve the largest information content about the generative process underlying an empirical dataset. The method, which is implemented in freely available software, is fully general, as shown through its application to three very diverse systems, namely a toy model of independent binary spins, a coarse-grained representation of the financial stock market, and a fully atomistic simulation of a protein.


I. INTRODUCTION
Complex systems challenge our understanding as they resist the reductionist breakdown. A complicated system can be decomposed into simpler parts and comprehended in terms of their behaviour; on the contrary, a complex system features a degree of interplay among its constituents that makes its emergent properties impossible to deduce from the study of the irreducible elements it is made of [1][2][3]. In principle, then, these elements should be investigated altogether, simultaneously accounting for their individual behaviour as well as their mutual interactions, correlations, and cooperations. Nonetheless, a system composed of a large number of degrees of freedom can rarely be understood through a holistic inspection of all of them (it is sufficient to have ≥ 2 degrees of freedom to have chaotic behaviour [4,5]). A substantial decrease of the amount of detail is necessary to attain two goals: on the one hand, the reduction in the sheer number of variables a human mind has to simultaneously cope with; on the other hand, the separation of the relevant information from the irrelevant noise, that is, those properties whose knowledge does not contribute significantly to comprehension. These operations constitute the core business of methods devoted to dimensionality reduction.
Many examples of dimensionality reduction algorithms exist [6,7], such as principal component analysis (PCA), clustering, diffusion maps, intrinsic dimension, and machine learning (ML) approaches. All these provide information about the properties of the system by "condensing" the available data about it into a smaller number of variables that are easier to read, visualise, and interpret. The aforementioned methods are very general in their applicability, and hence the kind of information they provide is similarly general and naturally requires some degree of interpretation to be understood. It is clearly desirable to have methods that are as parameter-free as possible, so as to minimise the amount of antecedent knowledge of the system one has to employ to aptly guide the procedure of simplification; however, "one-size-fits-all" approaches are either very hard to conceive or plainly inadequate to tackle the remarkable variety of complex systems that nature offers to those who aim at understanding them. A balance between generality and specificity has then to be found.
A very specific dimensionality reduction strategy is provided by coarse-graining (CG'ing) [8][9][10], which combines unsupervised feature extraction with a case-specific, easily intelligible analysis of a given system. Originating in the context of critical phenomena [11,12], coarse-graining was subsequently extended to soft matter modelling [8,13]. Here, one aims at constructing simplified representations of molecular systems in which a single super-atom, or bead, is representative of a number of physical atoms; taking advantage of the reduced number of degrees of freedom, the fewer interactions, and the simpler functional form of the latter, it is possible to build computationally efficient models that retain the essential qualities of the original system of interest and allow the study of larger molecules for longer times.
Recently, techniques developed in the context of coarse-graining have been employed as instruments not only to model a system, but also to analyse a high-resolution model of it, leveraging the fact that the effectiveness of the model largely depends on the appropriate selection of its fundamental constituents. It is in this context that an information-theoretic measure, dubbed mapping entropy [13][14][15][16][17][18], turned out to be a valuable tool to make sense of a high-resolution model by inspecting lower-resolution representations of it and ranking them according to their mapping entropy value. This quantity, in fact, measures the distance between the reference probability distribution of high-resolution configurations and the one obtained by looking at the system in coarse-grained terms: the lower the mapping entropy, the higher the amount of information retained by a reduced description of the system.
Another approach for studying complex systems is the resolution and relevance framework [19][20][21][22][23][24][25]: here, for a given set of features used to describe the system, the first quantity measures the level of detail this representation provides, while the second quantifies its useful information content. Together, resolution and relevance allow one to pinpoint the level of coarseness that optimally balances data parsimony and informativeness.
In this work we address the problem of identifying novel connections between these distinct measures of information content that have been developed independently in different contexts. We show that these quantities can be employed to differentiate between informative and non-informative features in a sensitive and unsupervised manner, with impactful implications for the comprehension of a large class of complex systems. In particular, we demonstrate that resolution and mapping entropy are strictly connected with one another, and that the combined usage of resolution-relevance first, and mapping entropy later, can constitute a useful data processing pipeline to extract information from empirical data sets.
The paper is organised as follows. In Sec. II we present a synthetic overview of the resolution and relevance framework, discuss the derivation and interpretation of mapping entropy, and report novel analytical results on the relation between resolution and mapping entropy. In Sec. III we present the results of applying the analysis based on resolution, relevance, and mapping entropy to three distinct systems of increasing complexity. Finally, in Sec. IV we sum up the results and discuss future perspectives.

II. THEORETICAL BACKGROUND
A. The resolution-relevance framework

Consider a system composed of n degrees of freedom, e.g. n spins σ_1, . . . , σ_n, whose overall state is specified by the state of each spin. A specific realisation of these spins constitutes an element x of an n-dimensional vector space. A specific dataset of L configurations, {x_1, x_2, . . . , x_L}, constitutes the empirical sample that we aim to investigate.
The elements x_i of the dataset can be categorised in terms of some labelling s_i = s(x_i), where the labels s take values from a discrete set S of size |S| = C. Depending on the classification scheme induced by s(x), the same label can occur more than once in the same dataset; think, for example, of Ising spin strings classified in terms of their average magnetisation M = Σ_j σ_j: the value M = 0 appears for each string in which half of the spins are up and the other half are down. The number of realisations x_i corresponding to the label value s is denoted by k_s. The following constraints apply:

0 ≤ k_s ≤ L,   Σ_s k_s = L,

meaning that each label can occur a number of times between zero (it never appears) and the size of the empirical dataset (the same label is associated to all data points); furthermore, the occurrences of each label have to sum to the size of the dataset. The choice of the set of labels induces an empirical probability distribution over the sample, given by

p̂(s) = k_s / L.

The Shannon entropy of this distribution,

H[s] = - Σ_s p̂(s) ln p̂(s),   (3)

is termed the resolution [20], as it provides a measure of the level of detail employed in the description of the sample. Indeed, a description given by a few labels corresponds to low resolution, as the number of terms in the sum in Eq. 3 is small. In contrast, the limiting case where each state has a different label corresponds to a uniform empirical probability p̂(s) = 1/L, leading to the maximal value of the resolution for a sample of L realisations, H[s] = ln L. Intuitively, these two extremes of very gross and very fine descriptions, corresponding to low and high resolution values, do not provide an informative view over the empirical sample; additionally, we observe that the resolution H[s], on average, grows monotonically with the number of labels C.
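As a minimal illustration of the definition above (not the authors' implementation; the function name and toy data are ours), the resolution of a labelled sample is just the Shannon entropy of the empirical label frequencies:

```python
from collections import Counter
import math

def resolution(labels):
    """Empirical Shannon entropy H[s] of a sequence of labels (in nats)."""
    L = len(labels)
    counts = Counter(labels)          # k_s for each observed label s
    return -sum((k / L) * math.log(k / L) for k in counts.values())

# The two extremes discussed in the text:
assert resolution(["a"] * 100) == 0.0                       # a single label
assert abs(resolution(list(range(100))) - math.log(100)) < 1e-12  # all distinct: ln L
```

The two assertions reproduce the limiting cases: a single label gives zero resolution, while a distinct label per sample gives the maximal value ln L.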
In order to quantify the informativeness of the description given by the classification s(x), Marsili and coworkers [19,20,[22][23][24] proposed to employ the relevance: this is given by the Shannon entropy of the distribution of frequencies of labels s. Defining m_k as the number of labels that have frequency k, namely

m_k = Σ_s δ_{k_s,k},   (4)

the relevance is then given by

H[k] = - Σ_k (k m_k / L) ln (k m_k / L).   (5)

Note that we omit from the sum those terms for which m_k = 0, so as to avoid zeros in the logarithm.
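A sketch of the relevance computation (again an illustration of ours, not the paper's code): count how often each label occurs, then how many labels share each frequency, and take the entropy of the resulting frequency distribution:

```python
from collections import Counter
import math

def relevance(labels):
    """Shannon entropy H[k] of the label-frequency distribution (in nats).

    m_k is the number of labels occurring exactly k times; each such label
    carries empirical weight k/L, so the frequency bin k has weight k*m_k/L.
    Terms with m_k = 0 never appear in the Counter, so no log(0) occurs.
    """
    L = len(labels)
    k_s = Counter(labels)             # label -> frequency k_s
    m_k = Counter(k_s.values())       # frequency k -> multiplicity m_k
    return -sum((k * m / L) * math.log(k * m / L) for k, m in m_k.items())

# Both resolution extremes give zero relevance:
assert relevance(["a"] * 50) == 0.0        # single label: k_s = L
assert relevance(list(range(50))) == 0.0   # every label unique: k_s = 1
```

The two assertions anticipate the observation made below that both extreme values of the resolution correspond to vanishing relevance.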
The description of an empirical sample in terms of the frequencies of labels k_s provides a minimally sufficient representation of the sample [23]. This can be seen by decomposing the information content of the sample, the resolution H[s], in two parts:

H[s] = H[k] + H[s|k].

The first term is the relevance H[k], and the second term is a measure of the noise:

H[s|k] = Σ_k (k m_k / L) ln m_k.

An intuitive view on this decomposition is the following. The frequency k_s contains information about the label s; hence, so does the relevance, which is the entropy of the frequency distribution. Consider now two labels, s_1, s_2, having the same frequency k_{s_1} = k_{s_2}; in view of the relevance, these labels are equivalent, and thus H[k] alone cannot provide any information allowing one to tell them apart. Because of this ambiguity, the term H[s|k] quantifies the degeneracy of the choices of classification scheme s(x) that produce a specific frequency distribution, and hence it is a measure of noise.
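The decomposition can be checked numerically. The following sketch (our illustration, with a made-up sample) computes all three entropies from the multiplicities m_k alone and verifies the identity H[s] = H[k] + H[s|k]:

```python
from collections import Counter
import math

def entropies(labels):
    """Return (H[s], H[k], H[s|k]) for an empirical sample of labels (nats)."""
    L = len(labels)
    m_k = Counter(Counter(labels).values())   # frequency k -> multiplicity m_k
    H_s = -sum(m * (k / L) * math.log(k / L) for k, m in m_k.items())
    H_k = -sum((k * m / L) * math.log(k * m / L) for k, m in m_k.items())
    H_s_given_k = sum((k * m / L) * math.log(m) for k, m in m_k.items())
    return H_s, H_k, H_s_given_k

H_s, H_k, noise = entropies(["a", "a", "a", "b", "b", "c", "d", "d"])
assert abs(H_s - (H_k + noise)) < 1e-12       # H[s] = H[k] + H[s|k]
assert noise >= 0.0                            # the noise term is non-negative
```

The noise term vanishes whenever every observed frequency belongs to exactly one label (m_k = 1 for all occupied bins), in which case relevance equals resolution.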
It is now possible to rationalise the intuition for the non-informativeness associated with both extreme values of resolution showcased above. In fact, they both correspond to zero relevance: in particular, when the resolution is zero, all configurations correspond to a single label s, and thus k_s = L and m_i = δ_{i,L}; analogously, the maximum value of the resolution, ln L, corresponds to a single state per label, namely k_s = 1 ∀s and thus m_i = L δ_{i,1}; making use of these values of m_i for the relevance, Eq. 5, gives zero. The non-negativity of the entropy combined with Rolle's theorem implies that the relevance must have a maximum.
Resolution and relevance depend on the specific set of labels s as well as on their number C. In general, for small C the resolution is low, and each label has a unique empirical frequency k_s different from that of the other labels. Therefore, knowledge of the frequency implies that of the label, and thus the noise H[s|k] vanishes: in this regime relevance and resolution coincide. It is useful to consider the description of the system from the opposite direction, namely going from the maximal resolution and lowering it. This shows that, by reducing the resolution, we actually increase the relevance. The slope of the resolution-relevance curve, µ = µ(H[s]), tells us how many bits of relevance we gain by lowering the resolution by one bit. The behaviour of the resolution-relevance curve is extensively discussed by Marsili and coworkers in their analysis of maximally informative samples [22,23], i.e., those sets of realisations of a complex system that maximise the relevance at each value of the resolution. In particular, they identify the threshold point with µ = −1 in these samples as especially interesting, since it provides the optimal tradeoff between the two entropies. In the right part of the resolution-relevance plot, the slope µ(H[s]) is generically a decreasing (negative) function of the resolution. Thus, reducing the resolution beyond the µ = −1 point, which corresponds to moving from right to left in the resolution-relevance plot, means gaining less in relevance than what is lost in resolution. The point µ = −1 has also been put in relation with a scale-free distribution of frequencies, m_k ∼ k^{−2}, also known as Zipf's law [23].
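In practice the slope µ can be estimated by finite differences along a sampled resolution-relevance curve. The sketch below is our illustration on a made-up toy curve (the numbers are not from the paper); it locates the segment where the µ = −1 tradeoff is reached:

```python
def slopes(points):
    """Finite-difference slopes mu = dH[k]/dH[s] along a resolution-relevance
    curve, given as (H_s, H_k) pairs sorted by increasing resolution H_s."""
    return [(k2 - k1) / (s2 - s1)
            for (s1, k1), (s2, k2) in zip(points, points[1:])]

# Toy curve: relevance rises, peaks, then falls as resolution grows.
curve = [(0.0, 0.0), (1.0, 0.8), (2.0, 1.2), (3.0, 1.0), (4.0, 0.0)]
mus = slopes(curve)
assert mus[0] == 0.8        # gentle initial gain of relevance
assert mus[-1] == -1.0      # rightmost segment sits at the mu = -1 tradeoff
```

Beyond the point where µ drops below −1, each bit of resolution given up buys less than one bit of relevance, which is the tradeoff criterion discussed above.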

B. Mapping Entropy
One of the goals of coarse-graining is to identify a reduced representation, called mapping, of a high-resolution system that retains as much information as possible about it [13]. In general, the mapping consists of defining a number N < n of coarse-grained sites in terms of a linear combination of the n original degrees of freedom. For the sake of simplicity we here limit ourselves to decimation mappings [11-13, 17, 26], in which a degree of freedom σ_j can be either retained or removed from the high-resolution description.
As in the previous section, it is possible to label the different realisations of the system. In this case, we possess a fine-grained label x, associated to a state of the high-resolution system (σ_1, . . . , σ_n), and a coarse-grained one, s = s(x), referring to the same configuration but observed at low resolution. The decimation mapping is

M(σ_1, . . . , σ_n) = (σ_{j_1}, . . . , σ_{j_N}),   (8)

for some specific choice of N indices j_1, . . . , j_N, so that the labels read

x = (σ_1, . . . , σ_n),   s(x) = M(x).   (9)

Our label s in this case is thus the (N-dimensional) string of spins that we retain from the whole. Given this prescription and a coarse-grained mapping M (Eq. 8), we can now associate the configuration x to the corresponding, unique label in the mapped space, s(x); assuming that the high-resolution states are distributed according to a probability p(x), we can define a mapped probability distribution in the coarse-grained space, p(s), that is, the probability of observing the CG label s, as:

p(s) = Σ_x p(x) δ_{s(x),s}.

At this point one can introduce the mapping entropy [13][14][15][16][17][18], a Kullback-Leibler divergence measuring the quality of a CG mapping by comparing the high-resolution probability p(x) with a "smeared" distribution p̄(x):

S_map = Σ_x p(x) ln [ p(x) / p̄(x) ],

where

p̄(x) = p(s(x)) / Ω_1(s(x)),   (13)

and Ω_1(s) is the number of fully detailed, fine-grained configurations x mapping onto s:

Ω_1(s) = Σ_x δ_{s(x),s}.

Specifically, the mapping entropy compares the reference, high-resolution probability, p(x), against the distribution p̄(x) [15,17], which assigns equal probability weight to all the fine-grained configurations that map onto the same CG one.
As not all of these configurations are equally probable, the two distributions p(x) and p̄(x) are not equivalent. Ideally, an "optimal" mapping M minimises the impact of the process of dimensionality reduction by aggregating high-resolution configurations with similar probability weight p(x) inside the same s. The mapping entropy is related to the resolution through (the detailed derivation is provided in Appendix V A):

S_map = H[s] − H[x] + Σ_s p(s) ln Ω_1(s).   (14)

This equation holds because s = s(x), that is, the quantity H[s|x] = 0, since the knowledge of the configuration x implies the exact knowledge of the corresponding value of the label s. By definition of conditional entropy, the following holds (see Appendix V A for further details on the derivation of this result):

H[x|s] = H[x] − H[s] = − Σ_s p(s) Σ_x p(x|s) ln p(x|s),

so that the mapping entropy can be rewritten as

S_map = Σ_s p(s) ln Ω_1(s) − H[x|s].
A specific category of classifications s exists, called sufficient representations [23]: these are those for which all configurations x mapping onto a given label s have the same probability, that is,

p(x) = p(x′)  ∀ x, x′ such that s(x) = s(x′).

Consequently, the conditional probability p(x|s) of observing a given data point x given the value s of the label is just the inverse of the number of high-resolution configurations mapping onto that label:

p(x|s) = δ_{s(x),s} / Ω_1(s),

where the Kronecker delta is needed to enforce the fact that the conditional probability is different from zero only for those configurations x that map onto s. In this case H[x|s] = Σ_s p(s) ln Ω_1(s), and the mapping entropy vanishes.
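The mapping entropy of a decimation mapping can be computed directly from its definition. The sketch below is our illustration (function names and the toy distribution are ours): it accumulates p(s) and Ω_1(s) over the fine-grained states, then evaluates the Kullback-Leibler divergence between p(x) and the smeared distribution p̄(x); for a sufficient representation the result is zero:

```python
from collections import defaultdict
import math

def mapping_entropy(p_x, mapping):
    """S_map: KL divergence between p(x) and pbar(x) = p(s(x)) / Omega_1(s(x)).

    p_x     : dict mapping fine-grained configuration x -> probability p(x)
    mapping : function x -> CG label s(x)
    """
    p_s = defaultdict(float)      # p(s): total weight mapping onto label s
    omega = defaultdict(int)      # Omega_1(s): number of x mapping onto s
    for x, p in p_x.items():
        p_s[mapping(x)] += p
        omega[mapping(x)] += 1
    return sum(p * math.log(p * omega[mapping(x)] / p_s[mapping(x)])
               for x, p in p_x.items() if p > 0)

def keep_first(x):
    """Decimation mapping retaining only the first spin."""
    return x[:1]

# Sufficient representation: states sharing a label have equal probability.
p_suff = {(0, 0): 0.4, (0, 1): 0.4, (1, 0): 0.1, (1, 1): 0.1}
assert abs(mapping_entropy(p_suff, keep_first)) < 1e-12
```

For a non-sufficient mapping (states with unequal weights lumped into one label), the same function returns a strictly positive value, consistently with S_map being a Kullback-Leibler divergence.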
This demonstrates that the mapping entropy is the difference between the conditional entropy of the high-resolution data subject to the labelling and the largest value that it can have, which corresponds to s being a sufficient representation. In general, however, resolution and mapping entropy have a nontrivial relation due to the last term in Eq. 14.
Changing the mapping changes the definition of s, and hence the resolution H[s]. A mapping that induces a sufficient representation will then have zero mapping entropy; however, the distribution of label frequencies associated to such a mapping might not be unique to it, in the sense that other (sufficient) representations might generate the same distribution and, hence, the same relevance. Irrespective of the mapping entropy being zero, then, the value of the relevance can be smaller or larger depending on the degeneracy of the classifications that produce a given frequency distribution.

III. RESULTS
In this work we investigate the behaviour of relevance, resolution, and mapping entropy on distinct systems at varying levels of complexity and abstraction, with the aim of devising a pipeline to process empirical data and extract information from the dataset. To this end, we concentrated on three different case studies, each of which clarifies specific aspects of the relation among, or possible usages of, these quantities.
First, we made use of a simple toy model to inspect resolution, relevance, and mapping entropy altogether. The system consists of a string of non-interacting binary spins; while its properties are trivial to understand once the underlying single-spin probabilities are known, the behaviour of resolution, relevance, and mapping entropy computed on various coarse-grained representations of it is not, as they critically depend on the empirical sample on which they are computed. This is the ideal situation to (begin to) grasp the essence of these quantities, in that all the non-trivial features that emerge are only marginally due to the complexity of the system itself, and mainly arise as a consequence of the finiteness of the dataset.
Second, we tackled a real-world case, namely a simplified model of the stock market based on real data. Here, we focus on the relationship between resolution, directly employed as a measure of the detail retained in a given low-detail description of the system, and the mapping entropy, which serves to identify nontrivial correlations within the dataset.
Third, we employed the resolution-relevance framework to reconstruct an empirical probability distribution to be investigated by means of the mapping entropy minimisation method. The latter, in fact, relies on the knowledge of a reference probability distribution of the high-resolution data, against which the low-resolution one is compared. Here we explored the possibility of reconstructing the reference probability from a dataset of protein conformations sampled in a molecular dynamics simulation; to this end, we coarsened the configurational space and identified the reference distribution as the one corresponding to the optimal resolution-relevance threshold (µ ∼ −1).
In the following sections, the results obtained in each of these three systems are presented and discussed.
A. Discrete, non-interacting case: a simple spin system

The first model system is composed of n = 20 non-interacting spins, each characterised by its probability to be in the "up" state. These spins are partitioned into two subsets of biased and unbiased spins. The first 10 spins are biased in a linear descending order according to p_i(σ_i = 1) = 1 − (i − 1)/20 for 1 ≤ i ≤ 10, while the last 10 spins are unbiased, namely p_i(σ_i = 1) = 0.5 for 11 ≤ i ≤ 20, see Fig. 1.
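A sample from this toy model is straightforward to generate. The sketch below is our illustration of the setup described above (the seed and variable names are ours): each spin is drawn independently with its own "up" probability, and the first spin, having p_1 = 1, is always up:

```python
import random

random.seed(0)
n, L = 20, 10**5
# p_i(sigma_i = 1): linearly decreasing bias for spins 1..10, 0.5 for 11..20.
p_up = [1.0 - i / 20 for i in range(10)] + [0.5] * 10

sample = [tuple(1 if random.random() < p else 0 for p in p_up)
          for _ in range(L)]

assert all(s[0] == 1 for s in sample)   # spin 1 has p_1 = 1: always "up"
```

Resolution, relevance, and mapping entropy of any decimation mapping can then be evaluated on `sample` with the routines sketched in Sec. II.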
The number of states of the system is, in principle, 2^20 ≈ 10^6. However, not all of these are realisable, as the first spin, σ_1, has zero probability to be in the "down" state. To study this system, we generated a sample of L = 10^5 states, {σ^(j)}_{j=1,...,L}. The sample provides an empirical probability distribution of system configurations:

p̂(σ) = (1/L) Σ_{j=1}^{L} δ_{σ,σ^(j)}.   (21)

In the limit of an infinite sample, the empirical distribution p̂(σ) coincides with the underlying distribution p(σ), that is:

lim_{L→∞} p̂(σ) = p(σ) = Π_{i=1}^{n} p_i(σ_i),

where the last equality is due to the independence of the spins. Let us next discuss the properties of the coarse-grained representations of this spin system, that is, those selections of N specific spins out of the total n. Such a coarse-grained representation is given by a mapping M : {0,1}^n → {0,1}^N which takes the state σ = (σ_1, . . . , σ_n) and returns the CG state s(σ_1, . . . , σ_n) = (σ_{j_1}, . . . , σ_{j_N}), for some specific choice of N indices j_1, . . . , j_N. Each choice of N spins corresponds to a different empirical probability of the CG system, p̂(s), which comes about from marginalising over the spins that are not retained. The resolution (Eq. 3) and the relevance (Eq. 5) can be readily calculated: the former directly from the probability p̂(s), the latter through the computation of the frequency distribution, Eq. 4. To calculate the mapping entropy one needs to compare the full empirical probability p̂(σ) with the "smeared" coarse-grained one, p̄(σ), see Eq. 13. For each decimation-based CG representation, the corresponding resolution, relevance, and mapping entropy are computed and reported in Fig. 2(a-d).
Specifically, the resolution-relevance values for all possible coarse-grainings of N = 1, . . . , 20 spins are reported in Fig. 2(a). The first observation we make is that the data follow the expected behaviour in spite of the system being composed of uncorrelated degrees of freedom. The reason for this is that, even if the probability of each spin being "up" is independent of the others, the pool of configurations on which resolution and relevance are computed is finite and smaller than the cardinality of possible states (10^5 randomly sampled strings vs. 2^19 ≈ 5 × 10^5 possible ones, recalling that the first spin is always "up"); hence, for about half of the resolution range we are in the under-sampling regime. Because of this, when the resolution is too high, we deal with too few data points to accurately reconstruct the underlying reference probability, and the relevance is lower than the resolution. In the intermediate regime, however, the finiteness of the sample enhances the relevance, and indicates the appropriate resolution level to describe the dataset in a synthetic manner that, nonetheless, allows one to extract nontrivial information about the generative process.
This result is inherently due to the finiteness of the dataset. In fact, if we were to compute resolution and relevance on an exhaustive list of configurations with the exact probability associated to them, the curves would turn out as a band of straight lines, trivially linking resolution and relevance, and with the latter having values below those that are observed in the finite-sampling case (see Sec. V B and Fig. 10 in the Appendix).
Another interesting aspect revealed by Fig. 2(a) is the range of resolution and relevance values for different numbers N of retained spins. CG mappings such that N is close to n display little variation in resolution and relevance, while an intermediate coarse-graining is associated with a wide range of values. Figure 2(b) reports the results for the CG representations obtained retaining N = 10 sites. Such CG mappings are distributed in a clustered structure that can be captured by introducing a rank for each mapping, which quantifies the balance between biased and unbiased spins. The rank of a single spin σ_j is given by

r̃(σ_j) = +1 for biased spins (1 ≤ j ≤ 10), −1 for unbiased spins (11 ≤ j ≤ 20),

and the rank of a CG representation M(σ_1, . . . , σ_n) = (σ_{j_1}, . . . , σ_{j_N}) is given by the average of the rank over all retained spins, that is

r(M) = (1/N) Σ_{i=1}^{N} r̃(σ_{j_i}).

For any choice of N, the rank takes a value between −1 and 1, measuring the proportion between biased and unbiased spins in the CG state: when r(M) = 1 all retained spins are biased; when r(M) = −1 all retained spins are unbiased; when r(M) = 0 there is an equal number of biased and unbiased spins. Figure 2(b) shows that CG configurations with positive rank provide higher relevance values, whereas negative-rank CG configurations have lower relevance and higher resolution. Thus, high relevance values correspond to CG mappings that retain more biased than unbiased spins; the relevance, however, is not very sensitive to the rank: once an equal number of biased and unbiased spins is reached, the relevance saturates, and replacing a further unbiased spin with a biased one does not increase it. Therefore, with regard to the question "which spins are more informative?", the relevance answers in an ambiguous manner: one should retain just enough biased spins (in this case five), and adding more does not change the outcome appreciably.
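The rank is simple to compute for any decimation mapping. The following sketch is our illustration of the definition above (the exact displayed form of the single-spin rank was reconstructed from the text; function names are ours):

```python
def spin_rank(j):
    """r~(sigma_j): +1 for the biased spins (j = 1..10), -1 otherwise."""
    return 1 if j <= 10 else -1

def mapping_rank(retained):
    """Average rank r(M) of the retained spin indices (1-based)."""
    retained = list(retained)
    return sum(spin_rank(j) for j in retained) / len(retained)

assert mapping_rank(range(1, 11)) == 1.0      # all biased spins
assert mapping_rank(range(11, 21)) == -1.0    # all unbiased spins
assert mapping_rank([1, 2, 3, 4, 5, 16, 17, 18, 19, 20]) == 0.0  # balanced
```

Grouping the N = 10 mappings by this quantity reproduces the clustered structure of Fig. 2(b).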
This result is a consequence of the marginalised empirical probability of the retained spins. Consider the case of retaining all the unbiased spins: this would provide an empirical sample of labels with a roughly uniform distribution, resulting in a large entropy of the sample and thus a high resolution. As for the relevance, one needs to consider the distribution of frequencies, which in this case would be narrow; as the relevance is the entropy of this distribution, it would correspond to low values. Replacing unbiased spins with biased spins would make the distribution of the sample less uniform, thereby decreasing the resolution. The frequency distribution would become broader, and so the relevance would increase. However, the relevance saturates at rank zero, i.e. when the number of biased and unbiased spins is equal. This indicates a qualitative feature of the relevance: it is highest when the probabilities of the constituents are slightly, rather than extremely, biased, and it increases when constituents with different probabilities are retained. Indeed, for a finite sample, the unbiased spins are sampled with finite precision, and therefore, from the empirical point of view, they are slightly biased. Since statistically they are all biased in the same manner, retaining too many of them would result in a narrow distribution of frequencies and thus low relevance. However, retaining some of them already provides enough variability in the frequency distribution to result in high relevance. For further discussion on the differences between infinite and finite samples see Appendix V B.
In Figs. 2(c,d) the dependence of the mapping entropy on the resolution is reported. In contrast to the relevance, which tends to zero in the two limiting cases of low and high resolution, the mapping entropy is monotonically decreasing (on average) with the resolution; when all spins are retained, i.e. N = n, the smeared probability p̄(σ) (Eq. 13) is exactly equal to the distribution p(σ) and no coarse-graining is performed; on the other hand, if only one spin is retained, the resulting CG probability is as far as it can be from the full-system probability. For some intermediate values of N it is possible to observe a large range of mapping entropy values, which depend on the specific choice of the CG representation. Figure 2(d) shows that, for a given N, minimal values of the mapping entropy are obtained for high-rank CG configurations, that is, those displaying non-uniform probabilities. A closer look into the minimal values of Fig. 2(d) reveals that the CG mapping with maximum rank (denoted by M_biased in Fig. 2(b,d)), s = (σ_1, σ_2, . . . , σ_10), is not the absolute minimum of the mapping entropy. All of the mappings in which the first spin is replaced by one of the unbiased spins, namely s = (σ_2, σ_3, . . . , σ_10, σ_l), correspond to lower values of mapping entropy (these are denoted by M_opt in Fig. 2(b,d)). This is a consequence of the fact that the first spin, having p_1 = 1, is not informative at all (keeping track of its value does not carry any information, since it is always "up"), while each of the unbiased spins provides a minimal advantage due to the finite sample size. In contrast, in the case of fully analytical calculations (which is equivalent to infinite sampling, see Eq. 22) the values of the mapping entropy obtained by retaining spins 2, . . . , 10 plus any one of the other eleven spins would be exactly equal.
These considerations allow one to rationalise a feature of Fig. 2(c), namely the fact that the minimum value of the mapping entropy remains approximately constant for a wide range of CG spin numbers, that is, for N = 9, . . . , 16. When N = 9, the minimum of this quantity is obtained for the CG mapping that retains the spins with indices 2 ≤ j ≤ 10, and adding other spins to this representation does not guarantee a substantial decrease in the mapping entropy, which is only obtained when the mapping gets closer to the fully detailed representation (when N ≥ 17). At the same time, some mappings with N = 18 exist, whose associated mapping entropy is higher than the minimum value obtained when N = 9: these are coarse-grained representations that do not retain two of the biased spins.
To conclude this section: a discrete system whose constituents are completely independent was analysed with the help of resolution, relevance, and mapping entropy. These three quantities shed light on some intrinsic features of the model at hand, thus making them promising candidate analysis tools for more complex systems. In particular, we find that the kind of information highlighted by relevance and mapping entropy is oriented to different goals. The relevance is focused on reconstructing the statistics of the specific empirical sample, and thus it is more "compression-oriented". In contrast, the mapping entropy is aimed at marginalising degrees of freedom which do not change the probabilistic description of the sample, and thus it is more "generation-oriented". This different sensitivity results in the fact that the mapping entropy favours the biased spins (except σ_1) over the unbiased spins, while the relevance treats all mappings with zero rank as roughly equal.

B. Discrete interacting case: a model of a financial market
The second model considered here concerns a simplified description of a financial market, whose constituents certainly interact, with a functional form that is not only unknown a priori, but also not representative of statistical equilibrium.
Common stock market indices, such as NASDAQ-100, FTSE MIB, DAX 30, are usually defined in terms of the value of the most traded stocks, or the ones with the highest market capitalisation. As an example, the NASDAQ-100 index considers the largest non-financial companies listed on the Nasdaq stock market [27]. It is well-known [28,29] that changes in the composition of such indices have an impact on the stock prices, temporarily favouring the stocks that are added to the index.
These indices can be considered as coarse-grained mappings of the high-resolution system, i.e., the full stock market, onto a lower number of degrees of freedom. The natural question that arises is the following: are these indices always appropriate to coarse-grain the full market? Can one find a different subset of stocks that brings more information about the high-resolution system? Throughout this section, we consider two "high-resolution" systems, namely m1 and m2, defined as the ten (for m1) and twelve (for m2) stocks with the highest market capitalisation (as of 1/10/2021) in the NASDAQ-100 index, which are described in Tab. I. The values of these stocks are investigated over a ten-year time window, for a total of 2225 days of sampling. For each day, a stock can assume three discrete values (see Fig. 3): +1 if the stock value increases during the day, 0 if it is stationary, and −1 if it decreases. In this way the full market is mapped onto a system of interacting, three-state spins with 3^10 (3^12) available realisations. As in the non-interacting case discussed in Sec. III A, many of these are impossible to observe in
a pool of real configurations: imagine for example how unlikely it is that 12 stocks of this importance are all stationary on the same day. Indeed, only 630 (1148) configurations of the system can be observed in the available sampling. As in Sec. III A, we use the set of degrees of freedom as the high-resolution labelling x, see Eq. 9, whose probability p(x) is defined as the number of times a full-system configuration {σ_1, ..., σ_n} is observed divided by the number of days (Eq. 21). Next, we analyse the behaviour of resolution, relevance, and mapping entropy for all the 2^9 (2^11) CG decimation mappings that can be defined for the two models. The analysis follows Fig. 4, which reports the values of these three quantities for all possible CG mappings, as well as Fig. 5, where we show the probability that a stock is retained in a mapping that minimises the mapping entropy, as a function of the number of retained stocks.
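To make the enumeration concrete, the following Python sketch computes resolution H[s] and relevance H[k] for a decimation mapping of a ternary-spin sample. The function name and the toy data are our own; the two estimators follow the standard frequency-based definitions used in the text, with H[s] built from the label frequencies k_s/L and H[k] from the degeneracy m_k of labels observed exactly k times.

```python
import numpy as np
from collections import Counter
from itertools import combinations

def resolution_relevance(sample, retained):
    """Resolution H[s] and relevance H[k] of a CG decimation mapping.

    sample:   (L, n) integer array, one ternary spin configuration per day
    retained: indices of the columns (stocks) kept by the CG mapping
    """
    L = sample.shape[0]
    labels = [tuple(row) for row in sample[:, retained]]
    k_s = Counter(labels)                    # frequency k_s of each CG label s
    f = np.array(list(k_s.values())) / L
    H_s = -np.sum(f * np.log(f))             # resolution H[s]

    m_k = Counter(k_s.values())              # m_k: number of labels observed k times
    H_k = 0.0
    for k, m in m_k.items():
        w = k * m / L
        H_k -= w * np.log(w)                 # relevance H[k]
    return H_s, H_k

# toy usage: enumerate all decimation mappings of a 3-"stock" random sample
rng = np.random.default_rng(0)
sample = rng.integers(-1, 2, size=(2225, 3))
results = {subset: resolution_relevance(sample, list(subset))
           for N in (1, 2, 3) for subset in combinations(range(3), N)}
```

For the full models one would loop over all subsets of the 10 (12) stocks, exactly as the exhaustive analysis of Fig. 4 does.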
First, looking at the behaviour of the relevance, we observe the expected bell shape, with a linear resolution-relevance trend for 1 to 4 retained stocks. This suggests that the model is in the well-sampled regime, and the information content of the dataset is fully captured; for larger numbers of retained sites (N = 5, 6, 7), on the contrary, we find a regime where the empirical dataset is noisy, but the coarse representation gathers the largest amount of available information about the underlying statistics. Finally, for N > 8, the data are too noisy and the low-resolution representation is not informative.
We then investigate the behaviour of the resolution observed in all panels of Fig. 4. For each value of 1 < N < n there exist two clouds of points separated by a gap in resolution. A direct inspection of the data shows that, at fixed N, the lower-resolution clouds of mappings are characterised by a common trait: all these representations retain both GOOG and GOOGL. As expected, these two stocks are highly interacting and correlated, displaying the same value in 94.3% of the selected time window. It is therefore reasonable that a mapping containing both Google stocks provides a low-resolution coarse-graining of the system, comparable in resolution to a coarse-grained system with N − 1 stocks. In Fig. 4(c-d) it is possible to appreciate how the choice of the model influences the average value of the mapping entropy of the two clouds. For model m1 (Fig. 4(c)), the mappings containing both Google stocks (corresponding to the left cloud for each N) display an average mapping entropy equal to or lower than that of the mappings containing only one Google stock (corresponding to the right cloud for each N). This is not the case for m2, shown in Fig. 4(d): since two additional stocks are included in m2, p(s) is less biased by the presence of Google instances, and the mapping entropy of representations (i.e. mappings) containing both GOOG and GOOGL is consistently higher than that of the other mappings. Intuitively, one of the two Google stocks possesses a high level of information about the system, but the inclusion of both of them in a coarse-grained description of the full market is redundant.
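The pairwise redundancy between two stocks can be quantified directly from the ternary sample; a minimal sketch (the function name is ours):

```python
import numpy as np

def coincidence_fraction(spins_a, spins_b):
    """Fraction of days on which two stocks' ternary spins take the same value."""
    a, b = np.asarray(spins_a), np.asarray(spins_b)
    return float(np.mean(a == b))
```

Applied to the GOOG and GOOGL columns of the sample, this measure would reproduce the 94.3% figure quoted above.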
A further interesting aspect revealed by an inspection of Figs. 4(c-d) and 5 is that all the mappings retaining GOOG, MSFT, and NVDA display a value of mapping entropy lower than the average. In particular it is possible to observe that, in both models, when 3 ≤ N ≤ n − 1, the mappings displaying the lowest value of mapping entropy at fixed N always include the combination of these three stocks. The reason behind the high informativeness of these companies can be attributed to their long-time, dominant presence in the stock market.
As for particularly uninformative mappings, that is, those with high mapping entropy, we observe that TSLA and NFLX (for m1) and TSLA and NTES (for m2) are always retained in such representations. In particular, we note that, for m1 (resp. m2), (i) when N = n − 2 the mapping with lowest mapping entropy is the one that contains neither TSLA nor NFLX (resp. neither TSLA nor NTES); (ii) when 2 ≤ N ≤ n − 2 the mapping with highest mapping entropy retains TSLA and NFLX (resp. TSLA and NTES). A possible explanation for this behaviour is the marginal importance of these companies to the market for the vast majority of the sampling time (10 years), having experienced an exponential growth only in the most recent years. In the case of TSLA, the corresponding company operates in a field that is clearly separated from those of the other stocks reported in Tab. I.
Lastly, we note that the "interacting" system considered in this section does not display the flatness in the mapping entropy minima that was observed in Fig. 2(c) describing the non-interacting spins system in Sec. III A. In fact, for an interacting system the addition of a new site to an optimal coarse-grained mapping is likely to result in a gain of information about the high-resolution system and, hence, in a decrease of the mapping entropy.
In summary, the information measures under examination, and in particular their joint usage, proved to constitute an informative instrument for the analysis of our simple description of a subset of the Nasdaq financial market. Specifically, the resolution-relevance curve was shown to highlight distinct regimes of the low-resolution description, providing a guide for assessing, qualitatively and semi-quantitatively, the amount of useful information that a coarse picture of the system can retain; the mapping entropy, on the other hand, allowed us to rationalise the features observed in the resolution and to identify those specific stocks that contributed the most (or the least) to the overall behaviour of the model market. The proposed strategy can thus be generalised to the full stock market with the aim of selecting the most appropriate low-resolution index, identified as the set of stocks with minimal mapping entropy at a fixed degree of coarse-graining N, the latter being determined with the help of the resolution-relevance curve.

FIG. 4: Resolution, relevance (a-b) and mapping entropy (c-d) for the two models. Mappings in m2 can reach higher values of resolution because adding information (two stocks) allows one to define a higher number of high-resolution labels s out of the available sampling. In (a-b) there exists a CG mapping with N = 1 possessing a very low value of relevance (H[k] ∼ 0); this is the mapping that retains the TSLA stock: by chance, the numbers of spins in the up and down configurations coincide (see Tab. I).

C. Continuous system: a small protein in solution
As illustrated in the previous sections, the mapping entropy is a measure of how much information about a reference, high-resolution system (and its configurational probability distribution) can be retrieved or inferred from a low-resolution representation of it. In particular, one computes the Kullback-Leibler divergence between the reference distribution p(x) and the reconstructed one, p̄(x), which is obtained from the former by assuming that all configurations x mapping onto the same coarse-grained label s have the same probability, defined as the average of p(x) over the group G of configurations that map onto s.

FIG. 5: Probability P_cons of finding a given stock in a selection that minimises the mapping entropy, for the model m1 (left panel) and m2 (right panel). At a given value of N, P_cons is calculated as the probability of each stock to be present in the 10% of the mappings with lowest S_map. Those stocks whose knowledge brings the least information about the overall behaviour of the model market appear in darker colour: the presence of dark bars that extend over a broad range of retained-stock numbers indicates that these specific stocks are consistently identified as little informative.

In this section we address the practical aspect of investigating systems with continuous degrees of freedom, whose reference empirical probability distribution p(x) has to be determined. The problem lies in the fact that, while systems with discrete degrees of freedom (such as the stock market model) naturally lend themselves to a histogramming procedure, systems described in terms of continuous variables do not: arbitrarily small discrepancies in the coordinates would make two configurations look different, and whether they really are or not is a
matter to be settled before addressing the computation of the mapping entropy.
Here, our objective is to employ the resolution-relevance framework to perform an optimal clustering of the high-resolution configurations of the system, based on which we determine the reference empirical probability p(x). This is a key step for the calculation of the mapping entropy: in specific cases, e.g. molecular systems at thermal equilibrium, the mapping entropy can be computed by means of a cumulant expansion of the Kullback-Leibler divergence that relies on the assumption that the system follows Boltzmann statistics, so that the underlying probability density of the micro-states is the well-known exp(−βH); this strategy was indeed employed by Giulini and coworkers (see Ref. [17] as well as Eq. 39 in the Appendix) to identify the representations of least mapping entropy for a set of proteins. This assumption, however, does not hold in general, and one may be faced with a dataset of configurations defined over a continuous range of values, whose underlying probability density is not known. The computation of the mapping entropy in these cases has to rely on the definition based on the Kullback-Leibler divergence, which, in turn, assumes knowledge of a reference, high-resolution probability density. Hereafter, we show how to obtain such a probability distribution for a dataset of configurations defined on the continuum, and demonstrate that the results so obtained are consistent with those derived from the cumulant expansion.
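For a discrete system with a known empirical p(x), the Kullback-Leibler definition of the mapping entropy described above can be sketched as follows (the names are ours; p̄(x) is built by uniformly redistributing the probability of each CG label over its group of high-resolution configurations, as in the text):

```python
import numpy as np
from collections import defaultdict

def mapping_entropy(p_x, cg_label):
    """Mapping entropy S_map = D_KL(p || pbar) for a discrete system.

    p_x:      dict {x: p(x)}, empirical high-resolution distribution
    cg_label: function x -> s implementing the CG mapping M
    """
    groups = defaultdict(list)
    for x in p_x:
        groups[cg_label(x)].append(x)
    # pbar(x): average of p over the group G_s sharing the CG label s
    pbar = {}
    for s, xs in groups.items():
        avg = sum(p_x[x] for x in xs) / len(xs)
        for x in xs:
            pbar[x] = avg
    return sum(p * np.log(p / pbar[x]) for x, p in p_x.items() if p > 0)
```

By construction S_map vanishes when the mapping only lumps together equiprobable configurations, and is strictly positive otherwise.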
The system under examination here is a small protein in water, whose time evolution is obtained by means of a plain, all-atom molecular dynamics (MD) [31,32] simulation. Specifically, we consider 6D93 [33], a mutant of the tamapin protein, a toxin of the Indian red scorpion [34]. This small protein (230 heavy atoms, 31 amino acids) is simulated in the canonical ensemble at 300 K for 200 nanoseconds. The Cartesian coordinates of the atoms are saved once every 20 picoseconds, thus creating a data sample (trajectory) of L = 10001 configurations. Details on the GROMACS 2018 [35,36] simulation can be found in the Supplementary Material of Ref. [17].
In this context, the state of the system is encoded in a vector r containing the positions of its n constituent atoms. In contrast to the discrete models, the distribution of the labels x cannot be identified through a simple counting over the states of these 3n degrees of freedom, due to their continuous nature. Hence, the labels x have to be defined by lumping several, in principle different, configurations r of the sample into the same (high-resolution) state, thus defining a non-uniform probability p(x) of observing it.
FIG. 6: The L realisations of a continuous system can be clustered in a variable number C of labels x, ranging from C = 1 to C = L. These two extreme discretisations are not informative about the system, as they induce a trivial frequency distribution and, consequently, a uniform probability p(x) of observing the label x over the sample. We identify (see main text) the threshold C̄ as the number of labels that separates the regimes of lossless and lossy compression.

To this end, we apply the UPGMA clustering algorithm with average linkage [37] to the fully atomistic, pairwise RMSD matrix between all the elements of the sample:

RMSD(r, r′) = [ (1/n) Σ_{i=1}^{n} |r_i − RT(r′)_i|² ]^{1/2},    (25)

where RT is the roto-translation that superimposes r′ onto r according to the Kabsch optimality criterion [38,39], thus minimising the overall displacement. By changing the threshold used to cut the dendrogram (see Fig. 6) resulting from the UPGMA clustering of this matrix, we can obtain arbitrary values of the number C of fine-grained labels x. A criterion is thus needed to establish the appropriate value of C that should be used to create this histogram of atomistic structures. A very coarse (resp. detailed) discretisation of the sample corresponds to C ∼ 1 (resp. C ∼ L) clusters, as shown in Fig. 6. Neither of these choices is "relevant" for the comprehension of the system, since both result in a uniform probability p(x) over the labels.
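In practice, the UPGMA construction and the dendrogram cuts can be carried out with SciPy's hierarchical clustering, for which average linkage is exactly UPGMA. In the sketch below the distance matrix is a random stand-in for the Kabsch-superimposed RMSD matrix of the text, and the threshold value is arbitrary:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# stand-in for the (L, L) pairwise RMSD matrix of the trajectory
rng = np.random.default_rng(1)
pts = rng.normal(size=(50, 3))
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

Z = linkage(squareform(dist, checks=False), method="average")  # UPGMA
labels = fcluster(Z, t=1.5, criterion="distance")  # cut the dendrogram at a threshold
C = labels.max()  # number of fine-grained labels x induced by this cut
```

Sweeping the threshold t from 0 up to the tree height moves C from L down to 1, spanning the two trivial limits discussed above.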
Hence, resolution and relevance are employed to determine the optimal number of fine-grained labels, which we denote by C̄, used to partition the collected protein structures. In Fig. 7(a) we report the H[s]-H[k] dependence for the 10001 realisations of the system; the considered trajectory displays a flat maximum of the relevance, which remains constant over a wide range of values of H[s] and C. The nature of this behaviour is certainly related to the hidden structure of the sample and to the properties of the clustering algorithm used to label its constituent elements.
The separation between the regimes of lossless and lossy compression [23] operated by the relevance is exploited to select C̄. Indeed, C̄ is chosen as the value of C corresponding to the critical point where the slope µ of the resolution-relevance curve is −1. The probability of each label x is then given by the number of times it is observed in the sample (k_x/L, see Eq. 2).
The calculation of the optimal C̄ that separates the region with µ < −1 from that with µ > −1 is shown in Fig. 7(b). In this context, we choose the first value of C after which µ < −1 for a consistent set of values of C, meaning that the induced discretisation stably remains in the regime of lossy compression. Finally, the original trajectory of L snapshots is converted into its reduced counterpart of C̄ protein structures by choosing the first configuration of the sample belonging to each label x. This procedure allows us to determine the reference, empirical probability distribution p(x) as the frequency with which each of the C̄ sampled high-resolution configurations appears.
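A possible numerical recipe for this choice is sketched below; the finite-difference slope estimate and the persistence window are our own assumptions, not the authors' exact protocol:

```python
import numpy as np

def find_C_bar(H_s, H_k, Cs, window=3):
    """Pick the first number of clusters after which the local slope
    mu = dH[k]/dH[s] stays below -1 (lossy-compression regime).

    H_s, H_k: arrays of resolution/relevance values, ordered by increasing C
    Cs:       corresponding numbers of clusters
    window:   how many consecutive cuts must satisfy mu < -1 (persistence)
    """
    mu = np.gradient(H_k, H_s)                 # slope along the curve
    below = mu < -1.0
    for i in range(len(Cs) - window + 1):
        if below[i:i + window].all():          # slope < -1 for `window` cuts in a row
            return Cs[i]
    return Cs[-1]
```

The persistence window implements the requirement that the discretisation "remains in the regime of lossy compression" rather than crossing µ = −1 only momentarily.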
Next, we consider low-resolution representations of the protein structure, in order to identify the one that provides the most informative picture of the system with respect to the all-atom reference. A CG decimated representation of a protein is a selection of N out of n atoms, which amounts to keeping N triplets of the original degrees of freedom. The coarse-grained labelling s = M(r_1, ..., r_3n) lumps the C̄ high-resolution labels x into K CG labels s. Following Ref. [17], we here select 5 different values of K to cut the dendrogram, thus creating 5 different probability distributions p(s). In Fig. 8 we provide a schematic depiction of this procedure: first, the C̄ mapped configurations M(r_1, ..., r_3n) are compared using the coarse-grained RMSD

RMSD_CG(M(r), M(r′)) = [ (1/N) Σ_i |M(r)_i − RT(M(r′))_i|² ]^{1/2},    (26)

where i runs over the indices of the retained degrees of freedom, and then the corresponding dendrogram is constructed with the UPGMA algorithm. Subsequently, different thresholds are employed to define the CG labels, and the resulting average mapping entropy is calculated as [17]

Σ = (1/|K|) Σ_{K ∈ {K}} S_map(K),    (27)

where {K} is the set of values of K and S_map(K) is the corresponding mapping entropy, arising from the clustering of the C̄ high-resolution labels into K CG labels; |K| = 5 is the number of K values. Now that we possess a method to calculate the mapping entropy (Eq. 11) for a continuous system, we follow Ref. [17] and run 48 mapping optimisations for the protein, employing N = 31 and sticking to the same minimisation protocol. As in Ref. [17], one can perform basic statistics over the pool of low-S_map mappings by using the conservation probability P_cons of each atom, defined as the fraction of times it is included in an optimised solution.
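The average over dendrogram cuts in Eq. 27 can be sketched as follows; this is an illustrative implementation under our own naming, in which the CG dendrogram is cut with SciPy's maxclust criterion and each S_map(K) is the Kullback-Leibler divergence obtained by uniformly redistributing probability within each CG group:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def average_mapping_entropy(dist_cg, p_x, Ks):
    """Sigma = (1/|K|) * sum_K S_map(K), cf. Eq. 27.

    dist_cg: pairwise RMSD_CG matrix between the mapped configurations
    p_x:     empirical probabilities of the fine-grained labels (array)
    Ks:      numbers K of CG labels at which the dendrogram is inspected
    """
    Z = linkage(squareform(dist_cg, checks=False), method="average")  # UPGMA
    sigma = 0.0
    for K in Ks:
        s = fcluster(Z, t=K, criterion="maxclust")    # cut into (at most) K CG labels
        S_map = 0.0
        for lab in np.unique(s):
            grp = p_x[s == lab]
            pbar = grp.mean()                          # uniform redistribution over the group
            S_map += np.sum(grp * np.log(grp / pbar))
        sigma += S_map
    return sigma / len(Ks)
```

With a uniform fine-grained distribution every group term vanishes and Σ = 0, consistently with the fact that in that case no information is lost by any lumping.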
FIG. 8: Schematic depiction of the procedure used to compute the mapping entropy of a continuous system. Given the fine-grained labels (see Fig. 6), the corresponding configurations r_{j_k} are coarse-grained by a mapping operator M; here j_k is the index (in the original sample) of the configuration associated to the fine-grained label k. The low-resolution projections M(r) are first compared and then clustered in a coarse-grained dendrogram. The latter is inspected at several points K_1, K_2, ..., K_5, so as to identify different selections of CG labels s, over which the mapping entropy is calculated using Eq. 27.
FIG. 9: Probability P_cons of conserving each atom in an optimal mapping built minimising Σ (see Eq. 27). Five residues are highlighted, namely the three arginines and two other solvent-exposed, charged residues. While the former are retained with a good level of detail inside optimal mappings (one atom per side chain, see main text), the latter are highly coarse-grained (see also Tab. III).
Once projected onto the high-resolution protein, this probability distribution appears to be broadly spread throughout the polypeptide chain, with a few notable peaks in correspondence with the terminal atoms of the three arginine residues of the protein (ARG6, ARG7, ARG13), which are well known [40][41][42] to play a crucial role in the binding of tamapin to its substrate. Let us focus on the side chain of ARG6: here, the atom with the highest importance is NH2 (P_cons(NH2, ARG6) = 0.60), but all the other atoms in the terminal region of the arginine display a non-negligible value of P_cons, namely 0.10, 0.23, and 0.08 for NE, CZ, and NH1, respectively. The sum of these probabilities with the one associated with NH2 gives 1.02: except for two (resp. one) cases in which there are two (resp. zero) atoms of this region in the optimal mapping, all the remaining 45 optimal solutions contain exactly one atom in the terminal region of ARG6. In other words, the optimisation procedures are informing the modeller that the side chain of this arginine must be treated with exactly one atom, with a preference for NH2. As for ARG7 and ARG13, they display a similar behaviour, with the majority of the optimisations retaining one atom of their side chain terminus. In particular, the NH2 atom of ARG7 shows the highest value of P_cons (P_cons(NH2, ARG7) = 0.67). These results are consistent with those found through the cumulant expansion approximation [17], thus supporting the viability and robustness of this procedure.
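The conservation probability employed above reduces to a simple frequency count over the pool of optimised mappings; a minimal sketch (the function name is ours):

```python
import numpy as np

def conservation_probability(optimal_mappings, n_atoms):
    """P_cons: fraction of the optimised mappings in which each atom is retained.

    optimal_mappings: list of index arrays, one per optimisation run,
                      each listing the atoms kept by that optimal mapping
    """
    counts = np.zeros(n_atoms)
    for mapping in optimal_mappings:
        counts[np.asarray(mapping)] += 1
    return counts / len(optimal_mappings)
```

For the protein analysis above one would pass the 48 optimised selections of N = 31 atoms and read off, e.g., P_cons(NH2, ARG6).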
In summary, the properties of relevance and resolution are here exploited to extract a set of fine-grained labels out of a molecular dynamics trajectory, each one weighted with its own approximated probability. This step is key to computing the mapping entropy of a system defined in terms of continuous degrees of freedom: in fact, while for a molecular system in thermal equilibrium one can rely on approximations based on the assumption of Boltzmann statistics and on the cumulant expansion of the mapping entropy (as was done in Ref. [17]), in general the underlying probability density of the system's micro-states is not known, and/or it is not an equilibrium distribution. The approach illustrated here is general and unsupervised, and it can also be applied to tasks other than the calculation of the mapping entropy. We deem it important to remark that the specific choices of distance (RMSD) and clustering algorithm (UPGMA) played no special role in the analysis presented in this section, which broadens the generality of the proposed approach.

IV. CONCLUSIONS
In this manuscript we investigated the properties of coarse-grained representations by studying the behaviour of the associated resolution, relevance, and mapping entropy, computed over empirical samples of three complex systems. These three quantities offer distinct and complementary perspectives on the properties of a dataset, allowing one to extract crucial information about its underlying generative process, the nonlinear correlations among its degrees of freedom, and the levels of significance of the latter. The mapping entropy, in particular, is employed to characterise a system by quantifying the amount of information retained in a low-dimensional representation of it, thus highlighting those reduced models that preserve relevant details while discarding noise or otherwise trivial features. When coupled, resolution and mapping entropy show, in a clear and easily intelligible way, how the information content of a representation increases together with the detail with which such a representation describes the data set.
In the case of the non-interacting spin system, the mapping entropy pinpoints as ideal mappings those subsets of features that match our intuition of the most informative representations. In contrast, when the system's constituents are interacting, as is the case for the model of the Nasdaq stock market, the interpretation of maximally informative coarse-grained mappings is less immediate. Still, the mapping entropy efficiently and consistently separates the stocks that have been influential for the majority of the sampling time from those whose importance has been limited to the last portion of the selected time window [43]. In both cases, the resolution-relevance framework proved capable of highlighting the optimal level of detail at which a coarse representation of the system provides the largest amount of nontrivial information about the underlying generative process.
This feature was explicitly employed in addressing the problem of dimensionality reduction for a biomolecule, namely a small protein; in this case, the mapping entropy minimisation requires the knowledge of an underlying, high-resolution probability density that cannot be naively reconstructed from a sample of configurations. To tackle this issue, we proposed a method, based on the optimal trade-off between resolution and relevance, to identify unambiguous high-resolution labels defining a non-uniform probability distribution in the fine-grained space; these correspond to clusters of configurations whose relative discrepancies are classified as noise by the relevance, thereby allowing the construction of a dataset of high-resolution configurations, each associated with its empirical probability. Making use of this protocol, we then carried out several minimisations of the mapping entropy: the resulting optimal representations tend to display an uneven level of detail throughout the protein, treating with higher accuracy the three arginine residues that are fundamental for its binding to the substrate, consistently with data obtained through an independent procedure.
These results, obtained from a relevant set of distinct test cases, show that the combined usage of resolution, relevance, and mapping entropy is capable of quantifying the information content proper to different combinations of features of a high-dimensional, large-sized data set. In particular, it is our opinion that the multi-body nature of the mapping entropy, together with its simplicity of interpretation, can make its application in data science extremely fruitful, either as a feature selection algorithm or as a novel instrument of analysis of complex data sets. The first use is analogous to the mapping definition in CG, that is, a smart prescription to be implemented prior to the modelling. The second application is even more intriguing, as it suggests that the process of dimensionality reduction per se can provide information on high-dimensional data sets.
Looking at resolution, relevance, and mapping entropy from this multi-disciplinary perspective, it is our opinion that their application in diverse contexts would provide a powerful instrument to make sense of data in a world increasingly full of them.

V. APPENDIX
A. Explicit derivation of the relation between resolution and mapping entropy

Hereafter we provide the full derivation of Eq. 14, in which each step is made explicit. Starting from the definition of the mapping entropy, S_map = Σ_x p(x) ln [p(x)/p̄(x)] with p̄(x) = p(s(x))/Ω1(s(x)), we have

S_map = Σ_x p(x) ln p(x) − Σ_x p(x) ln p(s(x)) + Σ_x p(x) ln Ω1(s(x))
      = −H[x] + H[s] + Σ_x p(x) ln Ω1(s(x)),

where the second line follows from Σ_x p(x) ln p(s(x)) = Σ_s p(s) ln p(s) = −H[s]. As for Eq. 16, which relates the conditional entropy H[x|s] to the conditional probability p(x|s), we have

H[x|s] = −Σ_s p(s) Σ_x p(x|s) ln p(x|s),

so that, using H[x] = H[s] + H[x|s], the mapping entropy can equivalently be written as S_map = Σ_x p(x) ln Ω1(s(x)) − H[x|s].
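The relation between resolution and mapping entropy (Eq. 14) can be checked numerically on any discrete distribution. The sketch below uses a hypothetical 8-state system and an arbitrary surjective mapping, both invented purely for illustration, and compares the direct Kullback-Leibler form of S_map with the resolution-based decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 8 fine-grained labels x mapped onto 3 CG labels s.
p_x = rng.dirichlet(np.ones(8))               # random fine-grained distribution p(x)
mapping = np.array([0, 0, 0, 1, 1, 2, 2, 2])  # s(x): CG label of each x

# p(s): marginal of p(x); Omega_1(s): number of labels x mapping onto s
p_s = np.array([p_x[mapping == s].sum() for s in range(3)])
omega = np.array([(mapping == s).sum() for s in range(3)])

# Direct definition: S_map = sum_x p(x) ln[p(x)/pbar(x)], pbar(x) = p(s(x))/Omega_1(s(x))
pbar = p_s[mapping] / omega[mapping]
s_map_direct = np.sum(p_x * np.log(p_x / pbar))

# Decomposition of Eq. 14: S_map = H[s] - H[x] + sum_x p(x) ln Omega_1(s(x))
H_x = -np.sum(p_x * np.log(p_x))
H_s = -np.sum(p_s * np.log(p_s))
s_map_decomp = H_s - H_x + np.sum(p_x * np.log(omega[mapping]))

assert np.isclose(s_map_direct, s_map_decomp)
assert s_map_direct >= 0.0  # Kullback-Leibler divergences are non-negative
```

The same check works for any p(x) and any mapping, since the decomposition is an exact algebraic identity.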

B. Infinite sampling assumption
Sections III A and III B discuss the case in which the fine-grained and coarse-grained labels x and s( x) are determined through a marginalisation over the retained degrees of freedom. We now discuss how, in such scenarios, we can apply the assumption of infinite sampling and how it changes the overall results.
First we investigate the impact on the relevance of having such a large sampling that the empirical probability p̂(s) is arbitrarily close to the true one obtained by marginalising over the exact p(x). To this end, we consider the toy model of Sec. III A and compute resolution and relevance summing over the complete list of all possible states of the system, whose probability is known from Eq. 22. In this case, we obtain the resolution-relevance curves reported in Fig. 10. These are quite different from the empirical ones emerging from a finite sample (see Fig. 2(a)), demonstrating that the relevance shows a non-trivial behaviour even in the case of a simple system.
Let us explain how a finite sample size can create such a qualitatively different behaviour. This happens when the system has some equiprobable configurations, p(s_i) = p(s_j) for two distinct CG labels s_i, s_j. This means that, for an infinite sample, the frequencies of these configurations are equal, i.e. k_si = k_sj.
Let us assume that the frequency k_si appears exactly twice in the sample, m_{k_si} = 2. This implies p̂(k_si) = 2 k_si / L. Recall that the relevance is given by summing over the frequencies k; therefore, if k_si = k_sj there is only one term contributing, whereas if k_si ≠ k_sj there are two terms. The contribution of the frequency k_si to the relevance in the former, infinite-sampling case is

−(2 k_si / L) ln(2 k_si / L).

FIG. 10: Relevance vs. resolution for the toy model of binary spins with exhaustive configurational sampling and exact underlying probabilities. The colours refer to the number of retained sites. These results should be compared with the finite-sample case in Fig. 2(a).
When the empirical sample is finite, the two frequencies can be distinct, k_si ≠ k_sj, but close, k_si ≈ k_sj; in this case the contribution to the relevance is

−2 (k_si / L) ln(k_si / L).

Comparing these two cases, we observe that the contribution to the relevance of the infinite sample is lower than the finite-sample one by roughly (2 k_si / L) ln 2 = 2 p(s_i) ln 2. Therefore, in the presence of equiprobable configurations, observing high values of the relevance relies on finite, imperfect sampling; sampling "too well" can reduce the relevance substantially.
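A back-of-the-envelope check of this effect, assuming two CG states that each occur k = 50 times in a sample of size L = 1000 (numbers chosen purely for illustration):

```python
import numpy as np

k, L = 50, 1000  # illustrative frequency and sample size

# Infinite sampling: the two equiprobable states share the same frequency k
# (m_k = 2), contributing a single term with weight 2k/L to the relevance.
infinite = -(2 * k / L) * np.log(2 * k / L)

# Finite sampling: the two frequencies are distinct but close (k and k + 1),
# so each contributes its own term.
finite = -(k / L) * np.log(k / L) - ((k + 1) / L) * np.log((k + 1) / L)

# The infinite-sample contribution is lower by roughly (2k/L) ln 2 = 2 p(s_i) ln 2.
assert infinite < finite
assert np.isclose(finite - infinite, (2 * k / L) * np.log(2), rtol=0.05)
```

The gap scales with p(s_i), so systems with many equiprobable, well-populated states lose the most relevance when sampled exhaustively.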
We now consider the effect of infinite sampling on the mapping entropy. When the configuration of the complex system of interest is sampled an infinite number of times, the multiplicity of labels x mapping onto the same CG label s is given by the analytical degeneracy

Ω1(s) = V^(n−N),

where V is the phase space volume accessible to each degree of freedom; here, for simplicity, we assume that all degrees of freedom have the same accessible phase space volume. In this case, the mapping entropy can be expressed in terms of two Kullback-Leibler divergences, in which the probabilities p(x) and p(s) are compared to the uniform distributions V^(−n) and V^(−N), respectively:

S_x = D_KL(p(x) || V^(−n)) = n ln V − H[x],
S_s = D_KL(p(s) || V^(−N)) = N ln V − H[s].

Here, S_x and S_s quantify the gain in information guaranteed by employing p(x) and p(s), respectively, to sample the phase space in place of the uniform probability. The "infinite-sampling" mapping entropy can be expressed as the difference between these quantities [15][16][17][18]:

S^∞_map = S_x − S_s = (n − N) ln V + H[s] − H[x],

which is still a strictly non-negative Kullback-Leibler divergence. As an example, let us focus on the case of the approximate financial market discussed in Sec. III B, where each stock can assume three different values (V = 3, see Fig. 3). In this case S^∞_map reads

S^∞_map = (n − N) ln 3 + H[s] − H[x],

which can be decomposed into a constant term, proportional to n − N, accounting for the inherent loss of information arising from retaining fewer stocks, and a difference of resolutions. Since the fine-grained entropy is fixed across all CG mappings, the only term that varies with the mapping M is the coarse-grained resolution H[s]: in order to decrease S^∞_map, the mapping must induce a CG distribution p(s) with low entropy. It is useful to note that H[x] (resp. H[s]) cannot exceed n ln 3 (resp. N ln 3), which corresponds to the maximum entropy over the 3^n possible fine-grained (resp. 3^N coarse-grained) labels. Figure 11 reports the comparison between the values of S^∞_map (Eq. 38) and those of S_map (Eq. 11) for the two models m1 and m2 considered in Sec. III B.
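The infinite-sampling mapping entropy can be verified numerically. The sketch below (sizes V = 3, n = 4, N = 2 are arbitrary choices for illustration) builds an exact distribution over a small three-state system, applies a decimation mapping that retains the first N variables, and checks that (n − N) ln V + H[s] − H[x] equals the difference S_x − S_s and is non-negative.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
V, n, N = 3, 4, 2  # illustrative sizes: 3-state variables, n = 4 reduced to N = 2

states = np.array(list(product(range(V), repeat=n)))  # all V**n fine-grained labels
p_x = rng.dirichlet(np.ones(V**n))                    # exact fine-grained distribution

# Decimation mapping: keep the first N variables, marginalise over the rest.
s_index = np.ravel_multi_index(states[:, :N].T, (V,) * N)
p_s = np.bincount(s_index, weights=p_x, minlength=V**N)

H_x = -np.sum(p_x * np.log(p_x))
H_s = -np.sum(p_s * np.log(p_s))

# S_x, S_s: KL divergences of p(x), p(s) from the uniform distributions V**-n, V**-N
S_x = n * np.log(V) - H_x
S_s = N * np.log(V) - H_s

s_map_inf = (n - N) * np.log(V) + H_s - H_x
assert np.isclose(s_map_inf, S_x - S_s)
assert s_map_inf >= 0.0  # still a Kullback-Leibler divergence
```

Non-negativity follows because H[x] − H[s] = H[x|s] can never exceed ln Ω1 = (n − N) ln V.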
Since S^∞_map discriminates coarse-grained mappings according to their value of the CG resolution H[s], there are two clouds of points for each value of N, separating those representations containing both GOOG and GOOGL stocks from the others. However, restricting the analysis to a specific cloud of points, we observe that a distinct, positive correlation exists between the collected values, which is quantified in Tab. II in terms of the Pearson correlation and linear coefficients. The correlation is weak for N = 2, growing up to values higher than 0.8 for mappings with N ∼ n. Equation 14 sheds light on the presence of this correlation, showing that a reduction of the coarse-grained resolution is beneficial for a CG mapping if and only if it is not counterbalanced by an increase in Σ_x p(x) ln Ω1(s(x)). The latter situation is experienced by mappings containing both GOOG and GOOGL stocks, which possess a low-H[s] probability distribution not because of their informativeness, but only because the number of resolved CG labels s is limited.

TAB. II: Correlation between the values of S^∞_map and those of S_map (see Fig. 11) for the two models considered. At each value of the number of CG sites N, the coefficients are calculated considering mappings containing both Google stocks (r_G, q_G) and representations in which at most one of the two Google stocks is present (r_Ḡ, q_Ḡ).
C. Comparison between S_map and its energy-based approximation

For a molecular system in thermal equilibrium, the assumption of Boltzmann statistics and a cumulant expansion truncated at the second order (as done in Ref. [17]) give rise to the approximated mapping entropy

S^β_map = (k_B β² / 2) Σ_s p(s) ⟨(U − ⟨U⟩_s)²⟩_s,

where the sum runs over CG labels s, U is the potential energy of the system, and the subscript s denotes an average conditioned on the CG label s. In other words, the approximate mapping entropy of a CG label s is given by the variance of the potential energy of those high-resolution configurations x mapping onto it. Such approximations are necessary due to the impossibility of directly calculating the probability distributions involved in the definition of the mapping entropy, which is the canonical average of the logarithm of p(x)/p̄(x) (see Eq. 11). Both p(x) and p̄(x) are complicated to extract because of their high dimensionality and the numerical instabilities associated with the explicit calculation of the exponentials.
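The variance-based approximation can be sketched as follows, given only the potential energies and CG labels of a set of sampled configurations. This is a minimal illustration, not the paper's implementation: the β²/2 prefactor (in units of k_B) is assumed from the second-order cumulant expansion, and all data below are synthetic.

```python
import numpy as np

def smap_beta(energies, labels, beta):
    """Energy-variance sketch of the approximated mapping entropy: each CG
    label s contributes p(s) times the conditional variance of the potential
    energy U over the configurations mapping onto it (prefactor beta**2/2,
    in units of k_B, assumed from the second-order cumulant expansion)."""
    energies, labels = np.asarray(energies, float), np.asarray(labels)
    total = 0.0
    for s in np.unique(labels):
        u = energies[labels == s]
        p_s = len(u) / len(labels)  # empirical probability of CG label s
        total += p_s * np.var(u)    # conditional variance <(U - <U>_s)^2>_s
    return 0.5 * beta**2 * total

# Hypothetical usage: 1000 sampled configurations, 4 CG labels, synthetic energies
# whose mean depends on the label (so the conditional variances stay small).
rng = np.random.default_rng(2)
labels = rng.integers(0, 4, size=1000)
energies = rng.normal(loc=labels * 10.0, scale=1.0)
print(smap_beta(energies, labels, beta=0.4))
```

If every configuration within a CG label has the same energy, the conditional variances vanish and the approximation returns zero, consistently with a lossless mapping under this criterion.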
Analogously to Eq. 27, we can define an average mapping entropy Σ_β, the analogue of Σ with S_map replaced by S^β_map. In Fig. 12 we report the comparison of the values of Σ and Σ_β calculated for the data set of 4968 mappings with N = 31 employed in Ref. [44]. The scatter plot shows that a good but not perfect correspondence exists between the two sets of values. It is important to underline that the nature of the energy considered in the calculation of S^β_map can play a role in this difference: indeed, S^β_map is computed employing only the protein-protein interaction energy, thus neglecting protein-solvent and solvent-solvent effects. This approximation can give rise to a bias towards exposed regions, where the interactions are not properly screened. One of the strengths of S_map is that the solvent contribution is taken into account more accurately by the probability distributions entering its definition. Overall, further work is needed to assess the nature of this discrepancy.

FIG. 12: Comparison between the values of Σ and Σ_β (Eq. 39) [17], expressed in units of kJ/mol/K. The protein displays a clear correlation between the two expressions, resulting in a Pearson correlation coefficient equal to 0.62.
FIG. 13: Probability P^β_cons of conserving each atom in an optimal mapping built by minimising Σ_β. With respect to P_cons, P^β_cons is more concentrated in the terminal regions of charged residues, showing more pronounced peaks at particular atoms.
Analogously to Sec. III C, we now analyse the 48 optimal mappings obtained by minimising Σ_β (see Ref. [17]), with the aim of comparing the resulting conservation probability P^β_cons to the one considered in the main text (P_cons). Figure 13 shows how an optimal CG mapping of 6D93 must contain the NH1 atom (P^β_cons(NH1, ARG6) = 0.92), while P_cons is more evenly distributed throughout the variable region of this amino acid (see Fig. 9 and Sec. III C). Another interesting difference emerging from a comparison between Fig. 9 and Fig. 13 concerns the reduced values of the conservation probabilities assigned to terminal atoms of the variable regions of GLU24 and LYS27: while these atoms were usually part of low-Σ_β mappings, they are almost never present in the CG representations built by minimising Σ. GLU24 and LYS27 are charged residues, and the energetic fluctuations of their terminal atoms can be huge, especially when the considered energies are not screened by the solvent. This is further evidence that S_map is less biased towards solvent-exposed, charged residues than S^β_map. Overall, it is possible to conclude that Fig. 9 and Fig. 13 are quite similar, with P_cons being, on average, more evenly distributed over the full structure and displaying a tendency to reduce the probability weight assigned to terminal atoms of charged residues with respect to P^β_cons.

TAB. III: Differences between the values of the conservation probabilities for the terminal atoms of residues GLU24 and LYS27. The difference is striking especially for GLU24, as its terminal atoms are never conserved in the Kullback-Leibler-based optimisation.

VI. DATA AVAILABILITY
The program and the data employed for the two models presented in Sec. III A and III B, as well as the results shown in Sec. III C, are available from the GitHub repository at https://github.com/mgiulini/pymap and from the Zenodo repository at https://zenodo.org/record/6284439#.YhkWnO7MLUI.
VII. ACKNOWLEDGMENTS

VIII. AUTHOR CONTRIBUTIONS
RP conceived the study and proposed the method. RH and MG wrote the software, performed the simulations, and collected and analysed the data. All authors contributed to the analysis and interpretation of the data.