TMI: Thermodynamic inference of data manifolds

The Gibbs-Boltzmann distribution offers a physically interpretable way to massively reduce the dimensionality of high dimensional probability distributions, where the extensive variables are 'features' and the intensive variables are 'descriptors'. However, not all probability distributions can be modeled using the Gibbs-Boltzmann form. Here, we present TMI (Thermodynamic Manifold Inference), a thermodynamic approach to approximate a collection of arbitrary distributions. TMI simultaneously learns intensive and extensive variables from data and achieves dimensionality reduction through a multiplicative, positive valued, and interpretable decomposition of the data. Importantly, the reduced dimensional space of intensive parameters is not homogeneous: the Gibbs-Boltzmann distribution defines an analytically tractable Riemannian metric on the space of intensive variables, allowing us to calculate geodesics and volume elements. We discuss applications of TMI to multiple real and artificial data sets. Possible extensions are discussed as well.

Introduction: Scientific data often comprise positive numbers. Examples include pixel intensities of grayscale images 1, abundances of bacteria in microbial ecosystems 2, electrical activities of brain regions 3, or, more generally, any collection of probability distributions; all of which, once suitably normalized, can be manipulated as probability distributions.
Over the past few years, our ability to collect high quality, high dimensional data has improved substantially, and this improvement has been accompanied by a flurry of dimensionality reduction methods. These methods usually belong to one of two broad classes. Methods such as principal component analysis (PCA), singular value decomposition (SVD), and non-negative matrix factorization (NMF) 4,5 are examples of matrix factorization based methods. Here, the high dimensional data (in the form of a matrix) is expressed as a product of two or more simpler (for example, sparse or low rank) matrices. In contrast, methods such as diffusion maps 6, Laplacian Eigenmaps 7, Isomaps 8, tSNE (t-stochastic neighborhood embedding) 9, and UMAP (uniform manifold approximation and projection) 10 are based on manifold learning. These methods rely on the assumption that the high dimensional data lies on a much lower dimensional embedded manifold. They infer the manifold by estimating the local density of data points in the high dimensional space using kernel based approaches.
Orthogonal to these modern approaches, statistical physics offers a physically interpretable solution to dimensionality reduction, albeit for a restricted class of distributions. Consider a system at thermodynamic equilibrium with a surrounding bath that can exchange K types of extensive variables with it. Let the number of states in the system be d. Typically, d ≫ 1 (d ∼ 10^23 for a mole of ideal gas) and K ∼ o(1) ≪ d (K = 1, 2 for the canonical and the grand canonical ensemble respectively). Imagine that there are N different realizations of the bath, characterized by N × K Lagrange multipliers λ_k^(α). At thermodynamic equilibrium, any realization α can be described by the K intensive variables; the probability q_a^(α) of observing the system in state 'a' is given by the Gibbs-Boltzmann distribution

q_a^(α) = (1/Z^(α)) exp( − Σ_k λ_k^(α) Y_ka ).    (1)

In Eq. 1, λ_k^(α) are realization-specific Lagrange multipliers, Y_ka are state-dependent extensive variables, Z^(α) = Σ_a exp(−Σ_k λ_k^(α) Y_ka) is the partition function, and the factors e^{−λ_k^(α)} act as generalized activity coefficients. Importantly, recent work has shown that the Gibbs-Boltzmann form has much broader applicability, even beyond thermal systems at thermodynamic equilibrium. Notably, using the maximum entropy principle 11, it has been employed to model probabilities in a variety of complex systems such as ensembles of protein sequences 12, collective firing of neurons 13, and collective motions of birds 14. Unfortunately, however, not every collection {x^(α)}, α ∈ [1, N] of N arbitrary probability distributions can be described using the exponential Gibbs-Boltzmann form. Here, we ask the following question: given data in the form of N arbitrary distributions {x^(α)}, can we infer approximate extensive variables Y_ka and intensive variables λ_k^(α) such that the Gibbs-Boltzmann form in Eq. 1 approximates the given distributions {x^(α)}?

a) email: pdixit@ufl.edu
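To make Eq. 1 concrete, here is a minimal pure-Python sketch (the function name `gibbs_probs` and the toy numbers are ours, not from the paper) that evaluates the Gibbs-Boltzmann probabilities for given intensive variables λ and extensive variables Y:

```python
import math

def gibbs_probs(lam, Y):
    """Gibbs-Boltzmann distribution q_a = exp(-sum_k lam[k]*Y[k][a]) / Z.

    lam : list of K intensive variables (Lagrange multipliers)
    Y   : K x d list of lists of extensive variables
    """
    d = len(Y[0])
    # unnormalized Boltzmann weights exp(-sum_k lam_k * Y_ka)
    w = [math.exp(-sum(lam[k] * Y[k][a] for k in range(len(lam))))
         for a in range(d)]
    Z = sum(w)                      # partition function
    return [wa / Z for wa in w]

# toy example: K = 2 descriptors on a d = 4 state space
Y = [[0.0, 1.0, 2.0, 3.0],          # an 'energy'-like extensive variable
     [1.0, 0.0, 1.0, 0.0]]          # a second extensive variable
q = gibbs_probs([0.5, -0.2], Y)
```

The returned vector is a normalized probability distribution; this is the building block that all of the TMI updates below manipulate.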
To that end, we introduce TMI: Thermodynamic Manifold Inference. In TMI, we simultaneously infer from the data extensive variables ('energies') and intensive variables ('temperatures'). The extensive variables represent features on the state space while the intensive variables embed the data points in a lower dimensional space. TMI achieves several objectives. First, by enforcing the number of extensive variables to be much smaller than the data dimension, it achieves dimensionality reduction. Notably, unlike principal component analysis (PCA) or singular value decomposition, but similar to non-negative matrix factorization 4,5, the TMI-based approximation of the data leads to an interpretable, positive valued factorization (see Eq. 1). Second, TMI defines a Riemannian manifold with an analytically tractable distance metric on the space of intensive variables where the data points reside. Importantly, this metric allows us to define geodesic distances between arbitrary points in the space of intensive variables as well as volume elements. Third, due to the convexity of the Gibbs-Boltzmann equation, TMI provides a unique out-of-sample extension 15 procedure. We illustrate TMI using several real and artificial datasets.
TMI approximates arbitrary distributions: Consider data in the form of discrete distributions {x^(α)}, α ∈ [1, N], defined on a d dimensional state space. We assume that the x^(α) are normalized: Σ_a x_a^(α) = 1 ∀ α. Our goal is to infer extensive variables {Y_ka} and intensive variables {λ_k^(α)} such that the Gibbs-Boltzmann distributions in Eq. 1 approximate the original distributions x^(α).
In TMI, we enforce K ≪ N to obtain an approximate lower dimensional representation of each distribution. To that end, for a fixed K, we minimize the sum of Kullback-Leibler divergences between x^(α) and q^(α):

C = Σ_α D_KL( x^(α) || q^(α) ) = Σ_α Σ_a x_a^(α) log x_a^(α) − Σ_α Σ_a x_a^(α) log q_a^(α).

The first term in the expanded KL divergence depends only on the distributions x^(α) and can be dropped. We have

C = Σ_α [ Σ_a x_a^(α) Σ_k λ_k^(α) Y_ka + log Z^(α) ].    (7)

There are several indeterminacies in the cost function in Eq. 7. First, for a fixed k, the cost is invariant to an additive shift Y_ka → Y_ka + c_k. This corresponds to the translational invariance of energies in a physical system. Second, the cost is invariant with respect to a scaling λ_k^(α) → λ_k^(α)/c_k, Y_ka → c_k Y_ka. Physically, this corresponds to the fact that extensive variables (for example, energies) are always multiplied by the corresponding intensive variables (for example, inverse temperatures) when computing probabilities. More generally, if we multiply the d × K matrix of extensive variables by a K × K matrix B and multiply the N × K matrix of intensive variables by (B^{-1})^T, the Gibbs-Boltzmann probabilities do not change. Finally, the cost is invariant to permutations of k, the label of the extensive variables.
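The two gauge freedoms of the cost function can be checked numerically. The sketch below (our own toy example) verifies that an additive shift of one row of Y, and a compensating rescaling of a λ-Y pair, leave the Gibbs-Boltzmann probabilities unchanged:

```python
import math

def gibbs(lam, Y):
    """Gibbs-Boltzmann probabilities for intensive lam and extensive Y."""
    w = [math.exp(-sum(l * Yk[a] for l, Yk in zip(lam, Y)))
         for a in range(len(Y[0]))]
    Z = sum(w)
    return [v / Z for v in w]

Y   = [[0.0, 1.0, 2.0], [1.0, -1.0, 0.5]]
lam = [0.7, -0.3]
q0 = gibbs(lam, Y)

# additive shift Y_1a -> Y_1a + s is absorbed by the partition function
s = 5.0
q1 = gibbs(lam, [[y + s for y in Y[0]], Y[1]])

# rescaling lam_1 -> lam_1 / c together with Y_1a -> c * Y_1a
c = 3.0
q2 = gibbs([lam[0] / c, lam[1]], [[c * y for y in Y[0]], Y[1]])
```

Both transformed parameter sets reproduce the original distribution to machine precision, which is why the text fixes these gauges before comparing inferred and true variables.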
We resolve the first indeterminacy by first finding a converged set of variables Y and then setting the lowest one to zero. We resolve the second indeterminacy by constraining the L2 norm of the extensive variables. We do this by introducing constraints in the cost using Lagrange multipliers; the modified cost function is given by

C̃ = C + Σ_k β_k Σ_a Y_ka².    (8)

Finally, we resolve the third indeterminacy by rank-ordering the parameters λ. The cost is convex with respect to the λs when the Y s are fixed and vice versa. However, similar to non-negative matrix factorization 4, it is not guaranteed to be globally convex (see appendix A2). We can minimize C with respect to the intensive and the extensive variables to find a local minimum. Differentiating with respect to λ_k^(α) and setting the derivative to zero, we obtain

⟨Y_k⟩_{q^(α)} = ⟨Y_k⟩_{x^(α)}.    (9)

Eq. 9 has a simple interpretation: when the values of the extensive variables Y are fixed, the intensive variables λ_k^(α) describing any particular distribution are determined by matching the averages of the extensive variables predicted using q^(α) with their empirical average values computed using the actual distributions x^(α).
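The moment-matching condition of Eq. 9 suggests a simple way to recover the λs for fixed Y: descend the convex KL cost, whose gradient is the difference between the empirical and model averages of the Y_k. A minimal sketch on a small toy state space (all names and numbers are ours):

```python
import math

def gibbs(lam, Y):
    w = [math.exp(-sum(l * Yk[a] for l, Yk in zip(lam, Y)))
         for a in range(len(Y[0]))]
    Z = sum(w)
    return [v / Z for v in w]

def avg(Yk, p):
    """Average of extensive variable Yk under distribution p."""
    return sum(y * pa for y, pa in zip(Yk, p))

def fit_lambda(x, Y, eta=0.5, steps=5000):
    """Fit intensive variables by matching <Y_k>_q to <Y_k>_x (Eq. 9).

    Gradient of the KL cost with respect to lam_k is <Y_k>_x - <Y_k>_q.
    """
    lam = [0.0] * len(Y)
    for _ in range(steps):
        q = gibbs(lam, Y)
        lam = [l - eta * (avg(Yk, x) - avg(Yk, q)) for l, Yk in zip(lam, Y)]
    return lam

# generate a target from known lambdas, then re-infer them
Y = [[0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 1.0, 0.0]]
x = gibbs([0.8, -0.4], Y)
lam_hat = fit_lambda(x, Y)
q_hat = gibbs(lam_hat, Y)
```

Because the cost is convex in the λs at fixed Y, the matched moments pin down a unique embedding for each data point.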
For a fixed value of the λs, the values Y_ka are the fixed points of a non-linear equation. Differentiating C with respect to Y_ka and setting the derivative to zero yields

Y_ka = (1/(2β_k)) Σ_α λ_k^(α) ( q_a^(α) − x_a^(α) ).    (11)

Note that both sides of Eq. 11 depend on Y_ka since the distributions q^(α) depend on Y_ka. Above, for any fixed k, the inference of the extensive variables Y_ka is invariant to a permutation of the labeling of the state space indices {a}. However, it is possible to incorporate information about the geometrical structure of the state space in the inference as well. One such structure is smoothness. Consider the example of grayscale images. Here, the distributions represent normalized pixel intensities of a digitized image. In the images, any state 'a' is identified by planar two dimensional coordinates a ≡ (i, j), which define adjacency in the state space. Let us consider two adjacent states a ≡ (i, j) and b ≡ a + ê (ê ∈ {(1, 0), (0, 1), (−1, 0), (0, −1)}). We can ensure that the extensive variables Y_ka and Y_kb corresponding to neighboring states 'a' and 'b' are similar to each other by introducing regularizing constraints of the form

Σ_{a,b} n_ab ( Y_ka − Y_kb )²,

where n_ab = 1 when a and b are adjacent and zero otherwise. Such constraints will limit the ruggedness of the landscape of the extensive variables. Other constraints on the extensive variables, such as orthogonality, can also be imposed.
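The smoothness regularizer for image-like state spaces can be sketched as a sum of squared differences over 4-neighbour pixel pairs (n_ab = 1 for adjacent pixels). A toy version on a tiny grid (our own example, not the paper's implementation):

```python
def smoothness_penalty(Yk, rows, cols):
    """Sum of (Y_ka - Y_kb)^2 over 4-neighbour pairs on a rows x cols grid.

    Yk is stored row-major: state a = (i, j) lives at index i * cols + j.
    Each undirected edge is counted once.
    """
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            a = i * cols + j
            if i + 1 < rows:                      # vertical neighbour
                total += (Yk[a] - Yk[(i + 1) * cols + j]) ** 2
            if j + 1 < cols:                      # horizontal neighbour
                total += (Yk[a] - Yk[a + 1]) ** 2
    return total

# a column-wise ramp is penalised less than a checkerboard on a 2 x 2 grid
smooth = [0.0, 1.0, 0.0, 1.0]
rough  = [0.0, 1.0, 1.0, 0.0]
p_smooth = smoothness_penalty(smooth, 2, 2)
p_rough  = smoothness_penalty(rough, 2, 2)
```

A smooth landscape incurs a lower penalty than a rugged one, which is precisely the behaviour the constraint is meant to enforce.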
Similarly, constraints on the intensive variables can be imposed as well. The formulation developed above will lead to intensive variables that are both positive and negative. However, while the TMI-based factorization of the data still remains positive, nonnegativity constraints on the intensive parameters λ may be desirable in order to interpret the extensive variables as potential energy minima. These can be imposed indirectly by employing the multiplicative update algorithm, as is done in nonnegative matrix factorization 4, to infer the λs, as opposed to a gradient or Hessian descent algorithm. The details of the numerical algorithms to learn both the Y s and the λs are in appendix A3.
Finally, we note that though the above discussion was restricted to data in the form of normalized distributions, TMI can also be applied to unnormalized positive valued data. Notably, the equations that determine the Y s and the λs are identical to those presented above (Eq. 9 and Eq. 11). We present this development in detail in appendix A4.
TMI provides a unique out-of-sample extension procedure: A common situation in data analysis is as follows. Suppose that we have inferred λs and Y s from N data points using TMI. Now imagine that an (N + 1)-st data point arrives. Can we approximately embed this data point in the lower dimensional space? This problem is commonly known as the out-of-sample extension 15 and usually does not have a unique solution 16.
Notably, in TMI a new data point x^(ν) can be embedded rapidly by determining the K Lagrange multipliers λ_k^(ν) by solving for λ^(ν):

⟨Y_k⟩_{q^(ν)} = ⟨Y_k⟩_{x^(ν)} ∀ k.

Moreover, the quality of the embedding can be assessed by evaluating the KL divergence D_KL( x^(ν) || q^(ν) ).

TMI introduces a Riemannian distance metric: Several functionals can quantify the differences between distributions. These include traditional quantifiers such as the Kullback-Leibler divergence 17, Bhattacharyya distance 18, and Hellinger distance 19, which are invariant with respect to permutations of state space indices. In contrast, the optimal transport distance (also known as the Wasserstein distance) 20-22 is a distance metric that takes into account the geometry of the state space.
TMI defines a Riemannian geometry and a distance metric on the space of intensive variables. Consider two different distributions approximated by intensive parameters λ^(1) and λ^(2). Consider a smooth and differentiable path γ(t) between the two distributions such that γ(t = 0) = λ^(1) and γ(t = T) = λ^(2). In the linear response regime, the excess work, i.e. the work done above the difference in thermodynamic potentials, along this path can be computed [23-26]:

W_ex = ∫_0^T (dλ/dt)ᵀ g(λ(t)) (dλ/dt) dt,    (15)

where the elements of the friction tensor g are given by 24

g_ij(λ) = ∫_0^∞ ⟨ δY_i(0) δY_j(t) ⟩_λ dt.    (16)

In Eq. 16, δY_i = Y_i − ⟨Y_i⟩, where ⟨Y_i⟩ is the ensemble average value of the extensive variable Y_i when the intensive parameters are fixed at λ(t). We note that a similar derivation exists for transformations between two non-equilibrium steady state (NESS) distributions 27. However, NESS distributions cannot be expressed in the parametric Gibbs-Boltzmann form and therefore we do not pursue that direction here. The friction tensor depends on the dynamics on the state space {a} at a fixed λ. When the transition rate matrix κ_{a→b}(λ) is provided, the friction tensor can be computed in a straightforward manner (see appendix A5). What are reasonable choices for the dynamics? We want an 'equilibrium' (detailed balanced) transition rate matrix that is constrained to reproduce the Gibbs-Boltzmann distribution q(λ). One way to incorporate information about the underlying geometry is to require that the rates penalize transitions between geometrically 'distant' states a and b. A simple choice is the transition rate matrix that maximizes the path entropy 28:

κ_{a→b} ∝ sqrt( q_b / q_a ) e^{−ε d(a,b)}.    (17)

Another choice is the so-called Glauber dynamics 29:

κ_{a→b} ∝ e^{−ε d(a,b)} q_b / ( q_a + q_b ).    (18)

In Eq. 17 and Eq. 18, d(a, b) is a measure of separation between states a and b (for example, the Euclidean distance) and ε > 0 plays the role of an inverse diffusion constant. Finally, we note that any choice of the dynamics will define a well-behaved friction tensor as long as the dynamics is reversible and reproduces the stationary distribution q(λ).
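For a small state space, a distance-penalized, detailed-balanced rate matrix with stationary distribution q can be built and checked directly. The sketch below uses a square-root form in the spirit of the maximum path entropy construction (our own simple choice, not necessarily the paper's exact Eq. 17):

```python
import math

def rate_matrix(q, dist, eps=1.0):
    """Detailed-balanced rates kappa[a][b] ~ sqrt(q_b/q_a) * exp(-eps*d(a,b)).

    Since q_a * kappa_ab = sqrt(q_a q_b) * exp(-eps*d(a,b)) is symmetric in
    (a, b), detailed balance holds by inspection; rows sum to zero.
    """
    n = len(q)
    kappa = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a != b:
                kappa[a][b] = math.sqrt(q[b] / q[a]) * math.exp(-eps * dist[a][b])
        kappa[a][a] = -sum(kappa[a][b] for b in range(n) if b != a)
    return kappa

q = [0.5, 0.3, 0.2]                                   # target Gibbs distribution
dist = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
kappa = rate_matrix(q, dist)
```

Any such reversible rate matrix with stationary distribution q yields a well-behaved friction tensor via appendix A5.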
From the dynamics, the friction tensor can be calculated in a straightforward manner as shown in appendix A5. The distance computed using this friction tensor is a proper distance metric that respects the underlying geometry of the state space. We note that, unlike the Wasserstein distance, TMI defines a distance metric even when the measure d(a, b) is not itself a proper distance metric. Moreover, a significant advantage of this geodesic approach is that it can be used to compute an optimal path of transformation between a pair of intensive variables.
Notably, when the dynamics is fast, the friction tensor reduces (up to a proportionality constant) to the Fisher information matrix [23-26], which in the case of Gibbs-Boltzmann distributions is the matrix of fluctuations 30. Moreover, if we assume that the rate of change of λ along a trajectory is kept constant, the paths that minimize excess work are also the paths that minimize the geodesic distance [23-26]. Hence, the length of the path of minimum excess work between two distributions, described by λ^(1) and λ^(2) respectively, also defines a metric distance between them. We note, however, that the Fisher information matrix is invariant to a permutation of the state space indices. Therefore, geodesic distances evaluated using the Fisher information matrix do not by themselves take into account the geometry of the state space.
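In this fast-dynamics limit, the metric is simply the covariance of the extensive variables under q. A minimal sketch computing this Fisher information matrix for a toy two-descriptor model (function names and numbers are ours):

```python
import math

def gibbs(lam, Y):
    w = [math.exp(-sum(l * Yk[a] for l, Yk in zip(lam, Y)))
         for a in range(len(Y[0]))]
    Z = sum(w)
    return [v / Z for v in w]

def fisher_info(lam, Y):
    """Fisher information of the Gibbs family: Cov(Y_i, Y_j) under q(lam)."""
    q = gibbs(lam, Y)
    means = [sum(y * p for y, p in zip(Yk, q)) for Yk in Y]
    K = len(Y)
    return [[sum((Y[i][a] - means[i]) * (Y[j][a] - means[j]) * q[a]
                 for a in range(len(q)))
             for j in range(K)] for i in range(K)]

Y = [[0.0, 1.0, 2.0], [1.0, -1.0, 0.0]]
g = fisher_info([0.3, 0.1], Y)
```

Being a covariance matrix, g is symmetric and positive semidefinite at every point of the λ space, which is what makes the geodesic and volume-element computations well posed.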
Finally, we note that the distance metric is defined on the space of intensive variables and not the distributions themselves.
Learning an Ising model from data: As a test case, we show that TMI can infer the energy landscape of an Ising model from sampled distributions. We consider a nearest-neighbor Ising model with n_s = 8 spins arranged as shown in panel (a) of Fig. 1. Each spin σ can take values 1 or −1. The probability of observing any spin configuration σ(a) is given by

p(σ(a)) = (1/Z(H, J)) exp( −H E_mag(a) − J E_int(a) ),    (19)

where E_mag(a) = Σ_i σ_i(a) (Eq. 20) and E_int(a) = Σ_{⟨ij⟩} σ_i(a) σ_j(a) (Eq. 21). In Eq. 21, the summation is taken over the nearest neighbors of the graph shown in panel (a) of Fig. 1 and Z(H, J) is the partition function.
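For a system this small, the Ising distributions can be generated exactly by enumerating all 2^8 spin configurations. Since the connectivity graph of Fig. 1(a) is not reproduced here, the sketch below assumes a ring topology (our assumption), with E_mag = Σ_i σ_i and E_int = Σ_{⟨ij⟩} σ_i σ_j:

```python
import math
from itertools import product

def ising_ring_distribution(H, J, n=8):
    """Exact Gibbs distribution of an n-spin Ising ring (assumed topology).

    p(s) = exp(-H * E_mag(s) - J * E_int(s)) / Z, matching the form of Eq. 19.
    """
    states = list(product([-1, 1], repeat=n))
    weights = []
    for s in states:
        e_mag = sum(s)                                       # magnetization
        e_int = sum(s[i] * s[(i + 1) % n] for i in range(n)) # ring couplings
        weights.append(math.exp(-H * e_mag - J * e_int))
    Z = sum(weights)
    return [w / Z for w in weights]

p = ising_ring_distribution(H=0.3, J=-0.5)
```

With H = 0 the distribution is invariant under a global spin flip, a quick sanity check that the enumeration and energies are consistent.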
We randomly sampled 50 pairs of (H, J) values from a uniform distribution with H ∈ [−1, 1] and J ∈ [−1, 1] and generated 50 Ising model distributions (see Fig. A1). Next, we approximated these input distributions using TMI with K = 2 extensive variables Ȳ1 and Ȳ2. We simultaneously inferred 50 pairs of Lagrange multipliers representing each of the 50 distributions.
As noted above, multiplication by a matrix, Y → Y B and Λ → Λ (B^{-1})^T, does not change TMI predictions. Thus, in order to directly compare TMI predictions with the ground truth, we need to reorient the TMI-inferred variables. To that end, we find a matrix B such that (1) Ȳ1 and Ȳ2 have the same dot product as the vectors Ē_int and Ē_mag and (2) Ȳ1 is orthogonal to Ē_int. In Fig. 1, panels (b) and (c), we show that the reoriented extensive variables Ȳ1 and Ȳ2 closely approximate the true extensive variables E_mag and E_int respectively from only 50 sampled distributions. Notably, no symmetry or any other constraint was imposed on the inferred extensive variables.
Analysis of handwritten digits: We illustrate the application of TMI using the MNIST dataset 1. We randomly selected 500 digits from the set of all '6's and '9's in MNIST. The digits were represented as 28 × 28 arrays of positive numbers. Each data point was normalized and treated as a distribution represented by a 784 dimensional probability vector. Given that there were two types of digits, we set out to infer K = 2 sample-independent extensive variables. We simultaneously inferred the corresponding intensive variables for individual data points. We imposed the positivity constraint on the intensive variables (see appendix A3). In panels (a) and (b) of Fig. 2 we show the two inferred extensive variables Ȳ1 and Ȳ2. Notably, TMI correctly identifies two extensive variables (potential energy functions) that correspond to a generic digit '9' and a generic digit '6' respectively. These represent the two potential energy minima in the data.
Moreover, as shown in panel (c), the two digits can also be classified by two different regions of the space of intensive variables; '9's are characterized by a high λ1 and a low λ2 while '6's are characterized by a low λ1 and a high λ2. Importantly, the Fisher-Rao metric on the space of intensive variables defines a notion of distance between the distributions as well as the "number of points" in any given volume element 31. The heatmap in panel (c) represents the logarithm of the volume element given by the square root of the determinant of the Fisher information matrix. It is clear that the reduced dimensional space is highly inhomogeneous; the same small change in λ1 and λ2 may have very different effects on the resulting distributions depending on the region of the space.
Finally, the Fisher-Rao metric allows us to construct geodesics between pairs of data points. As shown in panel (c) of Fig. 2, the geodesic (dashed green line) between a '6' (green circle, top left) and a '9' (green circle, bottom right) is substantially different from the straight line (dashed red line). The geodesic can be used to perform a smooth transformation between the two distributions. For any transformation curve γ(τ), we can compute ∆γ(τ) as the symmetrized Kullback-Leibler divergence between successive distributions along the curve. A uniform ∆ implies a net transformation that is equally spread out over the entire trajectory. In contrast, a varying ∆ implies a 'rough' transformation. Interestingly, as shown in panel (d) of Fig. 2, the geodesic leads to a uniform ∆, in contrast to the straight line transformation.
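The roughness diagnostic of panel (d) is easy to sketch: discretize a path in λ-space, map each point to its Gibbs-Boltzmann distribution, and compute the symmetrized KL divergence between successive distributions (toy example; all names and numbers are ours):

```python
import math

def gibbs(lam, Y):
    w = [math.exp(-sum(l * Yk[a] for l, Yk in zip(lam, Y)))
         for a in range(len(Y[0]))]
    Z = sum(w)
    return [v / Z for v in w]

def sym_kl(p, q):
    """Symmetrized KL divergence D(p||q) + D(q||p)."""
    return sum((pa - qa) * math.log(pa / qa) for pa, qa in zip(p, q))

def path_roughness(lam_path, Y):
    """Symmetrized KL between successive distributions along a discretized path."""
    dists = [gibbs(lam, Y) for lam in lam_path]
    return [sym_kl(p, q) for p, q in zip(dists, dists[1:])]

# straight line in lambda-space between two endpoints, 5 segments
Y = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0]]
start, end = [1.0, -1.0], [-1.0, 1.0]
path = [[s + (e - s) * t / 5 for s, e in zip(start, end)] for t in range(6)]
deltas = path_roughness(path, Y)
```

A geodesic path would make the entries of `deltas` nearly equal, whereas a straight line in λ-space generally does not.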
TMI outperforms NMF in data reconstruction and classification: We compared the overall performance of TMI with a mathematically related technique, non-negative matrix factorization (NMF) 4,5. While TMI represents the thermodynamic potential of any state as a matrix product, NMF approximates the probabilities themselves as a matrix product. Briefly, in NMF, positive valued data is expressed as a product of two matrices:

q_a^(α) = Σ_k l_k^(α) y_ka.

The matrices l and y are determined by minimizing either the L2 norm or the Kullback-Leibler divergence between the data {x_a^(α)} and the approximation {q_a^(α)}. NMF is a widely used technique to model positive valued data as it leads to an interpretable positive valued decomposition (see 32 for a review). We note that the NMF-based decomposition of the data is a linear superposition of positive valued 'feature vectors' ȳ_k with positive valued 'coefficients' l^(α). In contrast, TMI expresses the data as a multiplicative decomposition (see Eq. 1).
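For reference, the NMF baseline can be sketched with the standard Lee-Seung multiplicative updates for the Kullback-Leibler cost (our own minimal pure-Python version, not the implementation used in the paper):

```python
import math
import random

def gen_kl(X, WH):
    """Generalized KL divergence D(X || WH) used as the NMF cost."""
    return sum(x * math.log(x / y) - x + y
               for xr, yr in zip(X, WH) for x, y in zip(xr, yr))

def approx(W, H):
    """Matrix product W (n x K) times H (K x m)."""
    return [[sum(W[i][k] * H[k][j] for k in range(len(H)))
             for j in range(len(H[0]))] for i in range(len(W))]

def nmf_kl_step(X, W, H):
    """One round of Lee-Seung multiplicative updates for the KL cost."""
    n, m, K = len(X), len(X[0]), len(H)
    WH = approx(W, H)
    W = [[W[i][k] * sum(H[k][j] * X[i][j] / WH[i][j] for j in range(m))
          / sum(H[k]) for k in range(K)] for i in range(n)]
    WH = approx(W, H)
    H = [[H[k][j] * sum(W[i][k] * X[i][j] / WH[i][j] for i in range(n))
          / sum(W[i][k] for i in range(n)) for j in range(m)] for k in range(K)]
    return W, H

random.seed(0)
X = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.1], [0.5, 1.0, 1.6]]   # small positive matrix
W = [[random.random() + 0.5 for _ in range(2)] for _ in range(3)]
H = [[random.random() + 0.5 for _ in range(3)] for _ in range(2)]
kl_before = gen_kl(X, approx(W, H))
for _ in range(200):
    W, H = nmf_kl_step(X, W, H)
kl_after = gen_kl(X, approx(W, H))
```

Each full round of updates is guaranteed not to increase the generalized KL divergence, and both factors remain elementwise positive throughout, which is the source of NMF's interpretability.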
To compare the ability of TMI and NMF to approximate data, we chose four datasets of very different origins. The first was the MNIST dataset of handwritten digits 1. From the MNIST dataset, we randomly selected 500 samples comprising digits from 0 to 9. As above, each digit was represented by a 28 × 28 array of pixel intensities which was normalized to 1. The second was time series data collected on the gut microbiome of a human 33. The microbiome data consisted of 318 samples collected approximately daily over a period of a year from the feces of one human individual. Each sample was represented by the relative abundances of the 70 most abundant bacterial operational taxonomic units (OTUs). The third dataset comprised a 'bag of words' description 34 of papers submitted to the Neural Information Processing Systems conference (downloaded from 35). Each paper was represented as a collection of words wherein each word was assigned a frequency in each submitted article. The fourth dataset comprised 472 grayscale images of human faces stored as arrays of 19 × 19 pixels (the CBCL database of faces 36) (see appendix A6 for details of the datasets).
We approximated each of the datasets using TMI and NMF with several different values of K. For each K, we compared the Kullback-Leibler divergence between the data {x_a^(α)} and the reconstruction {q_a^(α)}. As seen in Table I, TMI consistently performed better than NMF at reconstructing the data for every value of K. One possible reason behind this is that real data sets often have widely varying amplitudes. For example, the intensity of any given pixel in a set of images can vary substantially from image to image 37. The exponential tuning of probabilities using the intensive variables in TMI may be better suited to capture such variability compared to the linear superposition in NMF.

TABLE II. Comparison of classification success rate (fraction) of randomly chosen handwritten digits from the MNIST dataset using TMI and NMF. The error bars are standard deviations estimated using 20 equal subsamples of the test set.

K | TMI | NMF
5 | 0.62 ± 0.05 | 0.49 ± 0.04
10 | 0.75 ± 0.04 | 0.63 ± 0.05
15 | 0.77 ± 0.03 | 0.67 ± 0.05
20 | 0.80 ± 0.04 | 0.69 ± 0.05
Next, we tested how TMI performed in data classification using the MNIST dataset. To that end, we used the 500 MNIST digits as above and inferred intensive and extensive variables across a range of K values. We used these intensive variables and the known identities of the digits to train a support vector machine (SVM) classifier. Next, we randomly selected 2000 digits from the dataset and predicted their identities. Similarly, we fitted the same data with NMF and trained an SVM classifier with the same hyperparameters. The accuracy of the two classifications is shown in Table II. Similar to its ability to fit the data accurately, TMI performs significantly better than NMF at classifying the data.
Discussion: The manifold assumption 7, commonly invoked in modern data analysis, posits that high dimensional data is generated by a few governing parameters and, as a result, can be represented by a lower dimensional manifold residing in the higher dimensional space. Several manifold inference methods such as diffusion maps 6, Laplacian Eigenmaps 7, Isomaps 8, tSNE (t-stochastic neighborhood embedding) 9, and UMAP (uniform manifold approximation and projection) 10 have been developed to approximately reconstruct these manifolds from the data.
While the manifold-based methods achieve dimensionality reduction, unlike other approaches such as principal component analysis (PCA) or nonnegative matrix factorization (NMF) 4,5, they cannot obtain an approximate reconstruction of the original data using lower dimensional 'features'. At the same time, these methods do not obtain an analytical description of the manifold but only visualize it using a non-linear embedding of the data points in the lower dimensional space. As a result, analytical manipulations such as computation of geodesics and volume elements are not possible.
We presented TMI, an approach rooted in statistical physics to approximately embed positive valued high dimensional data points in lower dimensions. TMI possesses advantages of both manifold approximation methods and matrix-based dimensionality reduction methods. (1) Similar to matrix-based methods such as PCA, SVD, and NMF, TMI can approximate data using lower dimensional features. Notably, similar to NMF, these features are positive valued (see Eq. 1) and thus interpretable. Moreover, given the multiplicative nature of the decomposition, TMI appears to be better suited to model real data compared to NMF. (2) Similar to manifold approximation methods, TMI can infer an approximate lower dimensional manifold on which the data resides. Importantly, unlike previously developed methods (discussed above), TMI defines an analytically tractable and readily computable Riemannian manifold (with an associated distance metric) in the lower dimension. This in turn allows us to compute geodesics and volume elements in the reduced dimensional description.
While TMI outperformed NMF in modeling and classifying data, in the current implementation TMI was slower than NMF. Therefore, in the future, it will be important to optimize the numerical algorithms in TMI. Similarly, the calculation of the geodesic can be time consuming given that it requires solving non-linear boundary value differential equations. However, numerically efficient techniques have been developed 26,38,39 which will be useful in situations where K > 2 extensive variables are needed. Another potential way to avoid solving the non-linear differential equations is to rely on the observation that geodesics pass through the data rich regions of the λ space. Consequently, we can potentially approximate the geodesic as the shortest path on a graph connecting the data points themselves.
Finally, we comment on another potential approach to quantifying differences between distributions using nonequilibrium statistical physics. The approach presented in this work relies on the excess work in a nonadiabatic transformation that takes a system from thermodynamic equilibrium with a bath λ^(1) to thermodynamic equilibrium with a bath λ^(2). In contrast, we can also set up a system that is simultaneously in contact with the two baths λ^(1) and λ^(2). Such a system will reach a nonequilibrium steady state and will constantly dissipate heat from one bath to the other. The steady state entropy production rate, which is always positive and zero only when λ^(1) = λ^(2), can also be used as a quantifier of the difference between distributions. This rate is by definition positive and can be constructed to be symmetric. However, it remains to be seen whether it defines a distance metric.

A1. HESSIAN WITH RESPECT TO λS
We have the cost (Eq. 7)

C = Σ_α [ Σ_a x_a^(α) Σ_k λ_k^(α) Y_ka + log Z^(α) ].    (A1)

Let us derive the Hessian with respect to the λs. We have the gradient:

∂C/∂λ_k^(α) = ⟨Y_k⟩_{x^(α)} − ⟨Y_k⟩_{q^(α)}.    (A2)

Differentiating Eq. A2 with respect to λ_j^(γ), we find that the Hessian is simply the covariance matrix:

∂²C/∂λ_k^(α) ∂λ_j^(γ) = δ_{αγ} ( ⟨Y_k Y_j⟩_α − ⟨Y_k⟩_α ⟨Y_j⟩_α ).    (A3)

In Eq. A3, the angular brackets ⟨•⟩_α denote an average with respect to the α-th approximation q^(α), and δ_{αγ} is the Kronecker delta function.

A2. COST FUNCTION IS CONVEX IN λS AND Y S
In this appendix, we show that (1) the cost function in Eq. 8 is convex with respect to λ_k^(α) ∀ k ∈ [1, K] when all the other λs and all the Y s are fixed, and (2) the cost function is convex in Y_ka ∀ a ∈ [1, d] when the λs and all other Y s are fixed. The second derivative of the cost function for a fixed α is given by Eq. A5:

∂²C/∂λ_k^(α) ∂λ_j^(α) = ⟨Y_k Y_j⟩_α − ⟨Y_k⟩_α ⟨Y_j⟩_α.    (A5)

The matrix in Eq. A5 is a covariance matrix. Given that covariance matrices are positive semidefinite, the Hessian in Eq. A5 is positive semidefinite as well.
Next, we look at the Hessian with respect to the Y_ka s for a fixed k when the λs and all other Y s are fixed. Differentiating the cost in Eq. 8, we have the gradient

∂C̃/∂Y_ka = Σ_α λ_k^(α) ( x_a^(α) − q_a^(α) ) + 2 β_k Y_ka.    (A6)

Using ∂q_a^(α)/∂Y_kb = λ_k^(α) q_a^(α) ( q_b^(α) − δ_ab ) and putting everything together, the elements of the Hessian are given by

∂²C̃/∂Y_ka ∂Y_kb = Σ_α (λ_k^(α))² q_a^(α) ( δ_ab − q_b^(α) ) + 2 β_k δ_ab.    (A15)

Given that q_a^(α) (1 − q_a^(α)) = Σ_{b≠a} q_a^(α) q_b^(α) and β_k > 0, the sum of the absolute values of the off-diagonal entries in each row of the Hessian matrix is smaller than the corresponding diagonal entry. By Gershgorin's disc theorem, the Hessian matrix in Eq. A15 is therefore positive semidefinite.

A3. NUMERICAL ALGORITHMS FOR PARAMETER INFERENCE
We numerically infer the λs and the Y s using a combination of gradient descent and Hessian descent. For a fixed value of the Y s, the Hessian matrix H with respect to the λs is given by Eq. A3 and the gradient with respect to the λs is given by

gr( λ^(α) )_k ≡ ∂C/∂λ_k^(α) = ⟨Y_k⟩_{x^(α)} − ⟨Y_k⟩_{q^(α)}.    (A16)

The Hessian descent algorithm updates the Lagrange multipliers as

λ^(α) ← λ^(α) − η_λ H( λ^(α) )^{-1} gr( λ^(α) ),    (A17)

where η_λ is a learning rate chosen between 0 and 0.05 and H is the Hessian matrix. In a single λ-iteration, we update the individual λ^(α) for a fixed value of the Y s. Similarly, the Hessian with respect to the Y s is given by Eq. A15 and the gradient by Eq. A6. The Y s are also updated for each k individually using a Hessian descent scheme:

Ȳ_k ← Ȳ_k − η_Y H( Ȳ_k )^{-1} gr( Ȳ_k ).    (A18)

The learning rate η_Y is also chosen between 0 and 0.05.
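The Hessian descent update of Eq. A17 can be sketched for K = 2, where the covariance Hessian (Eq. A3) is invertible in closed form. This toy uses η_λ = 0.5 for fast convergence on a four-state example (the paper suggests smaller rates for real data); all names and numbers are ours:

```python
import math

def gibbs(lam, Y):
    w = [math.exp(-sum(l * Yk[a] for l, Yk in zip(lam, Y)))
         for a in range(len(Y[0]))]
    Z = sum(w)
    return [v / Z for v in w]

def avg(Yk, p):
    return sum(y * pa for y, pa in zip(Yk, p))

def hessian_descent_lambda(x, Y, eta=0.5, steps=200):
    """Damped Newton ('Hessian descent') updates for the lambdas.

    Gradient: <Y_k>_x - <Y_k>_q (Eq. A16); Hessian: covariance of the Y_k
    under q (Eq. A3). K = 2, so the 2x2 Hessian is inverted directly.
    """
    lam = [0.0, 0.0]
    d = len(Y[0])
    for _ in range(steps):
        q = gibbs(lam, Y)
        m = [avg(Y[0], q), avg(Y[1], q)]
        g = [avg(Y[0], x) - m[0], avg(Y[1], x) - m[1]]
        c11 = sum((Y[0][a] - m[0]) ** 2 * q[a] for a in range(d))
        c22 = sum((Y[1][a] - m[1]) ** 2 * q[a] for a in range(d))
        c12 = sum((Y[0][a] - m[0]) * (Y[1][a] - m[1]) * q[a] for a in range(d))
        det = c11 * c22 - c12 * c12
        # lam <- lam - eta * H^{-1} g
        lam[0] -= eta * (c22 * g[0] - c12 * g[1]) / det
        lam[1] -= eta * (c11 * g[1] - c12 * g[0]) / det
    return lam

Y = [[0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 1.0, 0.0]]
x = gibbs([0.6, -0.3], Y)          # synthetic target with known lambdas
lam_hat = hessian_descent_lambda(x, Y)
```

Because the λ-subproblem is convex (appendix A2), the damped Newton iteration recovers the generating Lagrange multipliers in this toy setting.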

A. Enforcing positivity in the inference
Positivity constraints on the λs are enforced using a multiplicative gradient descent algorithm 40. Noting that TMI-based predictions do not depend on a translational shift in the extensive variables Y s, when learning the λs we frame-shift the extensive variables to ensure that all Y s are positive. We then decompose the gradient as gr = gr⁺ − gr⁻ with gr⁺ and gr⁻ both positive, start with positive valued λs, and update them multiplicatively:

λ_k^(α) ← λ_k^(α) ( gr⁻_k / gr⁺_k )^η,

where η > 0 is a learning rate. Since all factors are positive, the λs remain positive throughout the iteration.

A4. TMI FOR UNNORMALIZED DATA
In this section, we show how TMI can be applied to unnormalized data. The cost function for unnormalized distributions can be written as the generalized Kullback-Leibler divergence

C = Σ_α Σ_a [ x_a^(α) log( x_a^(α) / q_a^(α) ) − x_a^(α) + q_a^(α) ],

where {x_a^(α)} is the unnormalized positive valued data and q_a^(α) = exp( −Σ_k λ_k^(α) Y_ka ) is now an unnormalized Gibbs-Boltzmann weight. We rewrite C after dropping terms that do not depend on the λs and the Y s:

C = Σ_α Σ_a x_a^(α) Σ_k λ_k^(α) Y_ka + Σ_{a,α} q_a^(α) + const.

FIG. 1. Panel (a): the connectivity graph of an 8-spin Ising model. Panel (b): inferred extensive variable Ȳ1 (red) compared to the true extensive variable E_mag (black). Panel (c): inferred extensive variable Ȳ2 (red) compared to the true extensive variable E_int (black).

FIG. 2. Panel (a): a heatmap of the inferred extensive variable Ȳ1 representing a generic '9'. Panel (b): a heatmap of the inferred extensive variable Ȳ2 representing a generic '6'. Panel (c): a scatter plot of the intensive bath parameters of the 500 data points. The data labeled '6' are colored blue while the data labeled '9' are colored magenta. The heatmap represents the volume element (square root of the determinant of the metric tensor). The dashed red line is a straight line transformation between the two data points shown at the top left and bottom right. The dashed green line is the geodesic computed using the Fisher-Rao metric. Panel (d): the symmetrized Kullback-Leibler divergence ∆ between successive intermediate transformations along a discretization of the straight line trajectory (red) and the geodesic trajectory (green) of the transformation shown in panel (c).



TABLE I. Comparison of KL divergences between the data {x_a^(α)} and the approximate reconstruction {q_a^(α)} using TMI and NMF respectively. K indicates the number of extensive variables used to model the data. The divergences are reported as an average per data point.