Selection entropy: The information hidden within neuronal patterns

A. Boltzmann entropy

Boltzmann entropy is a measure of hidden information [5] that is frequently used to quantify repertoires of neuronal states. With regard to this concept, we consider a simple example of a neuroimaging modality that collects just two observations, each of which falls into one of two histogram bins, as shown in Fig. 1(a). We then ask the following question: in how many ways can we rearrange the observations in Fig. 1(a) so as to maintain the same distribution, i.e., one observation in each of the two bins? The answer is two, with the second possibility shown in Fig. 1(b).
In other words, if we only have access to the shape of the distribution (the "macrostate"), we are subject to an intrinsic ignorance that arises from the two indistinguishable ways (the "microstates") in which the observations can be rearranged. This ignorance is precisely what the Boltzmann entropy quantifies.
Let us now generalize the example in Fig. 1 such that there are a total of N observations. We then ask: in how many ways (W) can these N observations be rearranged to yield the same histogram shape, such that the first bin contains $n_1$ observations, the second bin contains $n_2$ observations, and so forth? The answer is given by the standard [27] combinatoric expression

$$W = \frac{N!}{n_1!\,n_2!\,\cdots} = \frac{N!}{\prod_i n_i!}, \tag{1}$$

where $n_i$ is the number of observations (the occupation number) of the ith bin.
For the example in Fig. 1, we can use Eq. (1) to see that there are indeed $W = \frac{2!}{1!\,1!} = 2$ ways of rearranging a histogram of N = 2 observations such that the first bin contains one observation and the second bin contains one observation.
Standard methods [28] can then be used to show that W in Eq. (1) leads directly to the following expression for the Boltzmann entropy, S, associated with a single observation:

$$S = \frac{\log W}{N} = -\sum_i p_i \log p_i, \tag{2}$$

where $\log W \approx NS$ and $p_i = n_i/N$ is the probability of occurrence of the ith observation. Note that the expression in Eq. (2) only holds when N and all $n_i$ are large. See the Appendix for the details of the steps leading from Eq. (1) to Eq. (2).
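For readers who prefer not to turn to the Appendix, the following block sketches the standard Stirling-approximation steps leading from Eq. (1) to Eq. (2); this is a compact reconstruction of the usual argument, and the paper's Appendix may differ in detail:

```latex
\begin{align*}
\log W &= \log N! - \sum_i \log n_i! \\
       &\approx (N \log N - N) - \sum_i \left( n_i \log n_i - n_i \right)
          && \text{Stirling: } \log y! \approx y \log y - y \\
       &= -\sum_i n_i \log \frac{n_i}{N}
          && \text{since } \textstyle\sum_i n_i = N \\
       &= -N \sum_i p_i \log p_i,
          && p_i = n_i / N,
\end{align*}
```

so that $S = \log W / N = -\sum_i p_i \log p_i$, which is Eq. (2).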

B. Selection entropy
Let us now extend the toy model in Fig. 1 such that we deal with two histograms, each of which is populated by three observations that are distributed across two bins, as shown in Fig. 2(a).
We now ask an entirely different question from the one considered with regard to the Boltzmann entropy: in how many ways can the observations contained within like bins of the two histograms be selected between one another? By inspecting Fig. 2(a), we see that, for each of the two ways in which the bin 1 observations in system 1 can be selected by system 2, there are two ways in which the bin 2 observations in system 2 can be selected by system 1 [Fig. 2(b)].

We can generalize the example in Fig. 2 by comparing two histograms constructed using data from a total of $N_1$ observations of system 1 and $N_2$ observations of system 2, respectively. Suppose that $n^{(1)}_i$ of the $N_1$ measurements of system 1 and $n^{(2)}_i$ of the $N_2$ measurements of system 2 fall into bin i. Let $n_i = \max\big(n^{(1)}_i, n^{(2)}_i\big)$ be the larger of these two numbers and $m_i = \min\big(n^{(1)}_i, n^{(2)}_i\big)$ be the smaller, and denote the corresponding totals by $N = \sum_i n_i$ and $M = \sum_i m_i$. The number of ways of selecting the $m_i$ observations from the $n_i$ observations in bin i is the binomial coefficient $\binom{n_i}{m_i}$. The total number of ways ($W_{\text{sel}}$) of selecting the smaller from the larger number of observations in all bins is given by the product of these binomial coefficients:

$$W_{\text{sel}} = \prod_i \binom{n_i}{m_i} = \prod_i \frac{n_i!}{m_i!\,(n_i - m_i)!}. \tag{3}$$

Using Eq. (3) for the example in Fig. 2(b), we see that there are indeed $\frac{2!}{1!\,(2-1)!} \times \frac{2!}{1!\,(2-1)!} = 4$ ways in which observations can be selected from like bins in the two histograms.
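As a quick numerical sanity check of Eq. (3), the following MATLAB snippet (a minimal sketch of ours, not the accompanying code) reproduces the count of four selections for the Fig. 2 example:

```matlab
% Bin counts for the Fig. 2 example: system 1 holds [2 1], system 2 holds [1 2].
h1 = [2 1];
h2 = [1 2];

% Per-bin larger (n_i) and smaller (m_i) occupation numbers.
n = max(h1, h2);   % [2 2]
m = min(h1, h2);   % [1 1]

% Eq. (3): product of the per-bin binomial coefficients.
W_sel = prod(arrayfun(@(k) nchoosek(n(k), m(k)), 1:numel(n)));
disp(W_sel)        % prints 4
```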
Following the same logic used to derive the Boltzmann entropy in Eq. (2) from the expression for rearrangements in Eq. (1) (see the Appendix), we begin by taking the logarithm of the expression for selections in Eq. (3):

$$\log W_{\text{sel}} = \sum_i \big[\log(n_i!) - \log(m_i!) - \log\big((n_i - m_i)!\big)\big]. \tag{4}$$

We then assume that the numbers of observations in all bins are sufficiently large to allow us to use Stirling's approximation [i.e., $\log(y!) \approx y\log y - y$], such that

$$\log W_{\text{sel}} \approx \sum_i \big[n_i \log n_i - m_i \log m_i - (n_i - m_i)\log(n_i - m_i)\big], \tag{5}$$

where the linear terms from Stirling's approximation cancel within each bin. We can then express the numbers of observations within the ith bins of the two systems ($n_i$ and $m_i$) in terms of the probabilities of finding the system in the ith bin [$p_N(n_i)$ and $p_M(m_i)$] and the total numbers of observations in the two systems (N and M) as follows:

$$n_i = N\,p_N(n_i), \tag{6}$$

$$m_i = M\,p_M(m_i). \tag{7}$$

We then define a ratio of the numbers of observations,

$$r \equiv \frac{M}{N}, \tag{8}$$

which we use together with Eqs. (6) and (7) to obtain

$$\log W_{\text{sel}} \approx N \sum_i \big[p_N(n_i)\log p_N(n_i) - r\,p_M(m_i)\log\big(r\,p_M(m_i)\big) - \big(p_N(n_i) - r\,p_M(m_i)\big)\log\big(p_N(n_i) - r\,p_M(m_i)\big)\big], \tag{9}$$

where the terms proportional to $\log N$ cancel within each bin. Dividing by N, we arrive at the following quantity, which we term the "selection entropy" associated with a single observation in the N-length time series:

$$S_{\text{sel}} = \frac{\log W_{\text{sel}}}{N} = \sum_i \big[p_N(n_i)\log p_N(n_i) - r\,p_M(m_i)\log\big(r\,p_M(m_i)\big) - \big(p_N(n_i) - r\,p_M(m_i)\big)\log\big(p_N(n_i) - r\,p_M(m_i)\big)\big]. \tag{10}$$

It should be noted that, in the general case when $N \neq M$, the selection entropy depends on the numbers of observations made (N and M). In this sense, the selection entropy is not a feature of the two probability distributions alone, as it depends on how many measurements are made of each system.
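Because Eq. (10) rests on Stirling's approximation, it can be checked directly against the exact count in Eq. (3). The following MATLAB snippet (illustrative bin counts of our own choosing, with gammaln used to evaluate the log factorials without overflow) shows that the two approximately agree for moderately large occupation numbers:

```matlab
% Exact log(W_sel)/N from Eq. (3) vs. the Stirling form of Eq. (10),
% for illustrative bin counts (not data from the paper).
n = [600 350 200];                 % per-bin larger occupation numbers
m = [450 300 100];                 % per-bin smaller occupation numbers
N = sum(n);  M = sum(m);  r = M / N;

% Exact: log of the product of binomial coefficients, via gammaln.
logW = sum(gammaln(n+1) - gammaln(m+1) - gammaln(n-m+1));

% Approximate: Eq. (10) with p_N = n/N and p_M = m/M.
pN = n / N;  pM = m / M;
S10 = sum(pN.*log(pN) - r*pM.*log(r*pM) ...
          - (pN - r*pM).*log(pN - r*pM));

fprintf('exact: %.4f   Eq. (10): %.4f\n', logW/N, S10);
```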
In summary, just as the Boltzmann entropy in Eq. (2) quantifies the hidden information contained within the indistinguishable rearrangements of a single histogram counted by Eq. (1), the selection entropy in Eq. (10) quantifies the hidden information contained within the indistinguishable selections between like bins of histogram pairs counted by Eq. (3).

C. KL divergence
The KL divergence quantifies the extent to which two probability distributions differ from one another. The KL divergence has frequently been used in the context of neuroscience [29,30] and, as such, together with the Boltzmann entropy, will form our main point of reference in characterizing the selection entropy.
Using the same notation as in Eq. (10), the KL divergence from $p_M(m_i)$ to $p_N(n_i)$ is defined as follows:

$$D_{\mathrm{KL}} = \sum_i p_N(n_i) \log \frac{p_N(n_i)}{p_M(m_i)}. \tag{11}$$

When two distributions are identical, the KL divergence attains its lower bound of zero.

A. Equal-length time series
We will henceforth calculate the selection entropy between pairs of equal-length neuroimaging time series. This means that the resultant histograms are constructed from equal numbers of data points, and hence, from Eq. (8), $r \equiv M/N = 1$, which means that we use the following simplified version of Eq. (10):

$$S_{\text{sel}} = \sum_i \big[p_N(n_i)\log p_N(n_i) - p_M(m_i)\log p_M(m_i) - \big(p_N(n_i) - p_M(m_i)\big)\log\big(p_N(n_i) - p_M(m_i)\big)\big]. \tag{12}$$
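For concreteness, the following MATLAB sketch evaluates the equal-length selection entropy of Eq. (12) together with the KL divergence of Eq. (11) for two probability vectors. It is a minimal sketch of ours rather than the accompanying code; in particular, the treatment of empty bins uses the limit $x \log x \to 0$ as $x \to 0^+$, and bins where either distribution is empty are excluded from the KL sum.

```matlab
function [S_sel, D_kl] = selection_entropy(p1, p2)
% SELECTION_ENTROPY  Equal-length selection entropy [Eq. (12)] and
% KL divergence [Eq. (11)] between two probability vectors.
% Minimal sketch under our own conventions (not the paper's code):
% terms with zero arguments are dropped, using the limit x*log(x) -> 0.

    % Per-bin larger and smaller probabilities (n_i = max, m_i = min).
    pN = max(p1, p2);
    pM = min(p1, p2);

    xlogx = @(x) x .* log(x + (x == 0));   % returns 0 where x == 0

    % Eq. (12): per-bin contributions to the selection entropy.
    S_sel = sum(xlogx(pN) - xlogx(pM) - xlogx(pN - pM));

    % Eq. (11): KL divergence from p_M to p_N, over bins where both are nonzero.
    ok   = (pN > 0) & (pM > 0);
    D_kl = sum(pN(ok) .* log(pN(ok) ./ pM(ok)));
end
```

For example, [S, D] = selection_entropy([0.5 0.5], [0.25 0.75]) compares two two-bin distributions; identical inputs return zero for both quantities.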

B. Gaussian distributions
Equipped with this intuitive measure of hidden information, we can now use numerical analyses to better understand how the selection entropy behaves in relation to the KL divergence. We will use samples from known distributions and empirical time series from resting-state functional magnetic resonance imaging (fMRI). To create samples from known distributions, we perform the following steps (a MATLAB sketch of these steps is given after step 5):

Step 1: We sample from a series of Gaussian distributions with mean zero and standard deviations that vary from one to two in small increments.
Step 2: We sample from a second set of Gaussian distributions that also have mean zero, but for which the standard deviations decrease from two to one in increments of the same size.
Step 3: We create histograms from the two sets of samples in steps 1 and 2, by entering each data point into one of 100 bins.
Step 4: We divide each of the histogram entries in step 3 by the total number of data points in each histogram to obtain (sample) probability distributions.
Step 5: We calculate the selection entropy and KL divergence for each pair of probability distributions in step 4.
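The following MATLAB sketch implements steps 1 through 5. The sample size, the increment, and the bin range are illustrative assumptions of ours, and selection_entropy is the helper function sketched above:

```matlab
% Steps 1-5 for the Gaussian comparison (illustrative parameter choices).
nSamp  = 1e5;                    % samples per distribution (assumed)
sigma1 = 1:0.01:2;               % step 1: SDs increasing from one to two
sigma2 = 2:-0.01:1;              % step 2: SDs decreasing from two to one
edges  = linspace(-8, 8, 101);   % step 3: 100 bins (assumed range)

S_sel = zeros(1, numel(sigma1));
D_kl  = zeros(1, numel(sigma1));
for k = 1:numel(sigma1)
    x1 = sigma1(k) * randn(nSamp, 1);       % zero-mean Gaussian samples
    x2 = sigma2(k) * randn(nSamp, 1);
    p1 = histcounts(x1, edges) / nSamp;     % steps 3-4: sample probabilities
    p2 = histcounts(x2, edges) / nSamp;
    [S_sel(k), D_kl(k)] = selection_entropy(p1, p2);   % step 5
end
```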

C. Gradient-based resting-state fMRI
All the empirical time series were obtained from the 1200-subject release of the Human Connectome Project [31]. We parcellated these data using the Schaefer 100 atlas [32] and then computed pairwise correlations between all cortical and subcortical time series to obtain normative functional connectivity matrices. These 100 regions were then reorganized according to the principal gradient identified by Margulies et al. [3,4], with sensorimotor regions at one end (region 1) and regions associated with higher cognitive function at the other end (Fig. 3).

D. fMRI data analysis
We compare the behavior of the selection entropy and the KL divergence for resting-state fMRI data via the following steps (a MATLAB sketch of these steps is given after step 5):

Step 1: We create histograms from each of the 100 regional time series for the 30 subjects by entering each data point into one of 100 bins.
Step 2: We divide each of the histogram entries in step 1 by the total number of data points in each histogram to obtain sample probability distributions.
Step 3: We calculate the selection entropy and KL divergence for all pairs of probability distributions in step 2, thus obtaining matrices of size 100 × 100 × 30 (regions × regions × subjects).
Step 4: We smooth and average the matrices in step 3 across subjects to obtain two matrices of size 100 × 100 each, summarizing the selection entropy and KL divergence between all pairs of regions.
Step 5: We calculate the mean and standard errors of the first, second, etc., diagonals of the matrices in step 4 to obtain the change in selection entropy and KL divergence with increasing separation along the principal gradient.
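A minimal MATLAB sketch of steps 1 through 5 follows. The array ts (time points × 100 regions × 30 subjects, with regions already ordered along the principal gradient), the shared bin edges, and the omission of smoothing are assumptions of ours; selection_entropy is the helper function sketched earlier:

```matlab
% Steps 1-5 for the fMRI comparison. Assumed input:
% ts is time points x 100 regions x 30 subjects, gradient-ordered.
[T, nReg, nSub] = size(ts);
edges = linspace(min(ts(:)), max(ts(:)), 101);   % 100 shared bins (our choice)

SE = zeros(nReg, nReg, nSub);
KL = zeros(nReg, nReg, nSub);
for s = 1:nSub
    P = zeros(nReg, 100);
    for i = 1:nReg
        P(i, :) = histcounts(ts(:, i, s), edges) / T;   % steps 1-2
    end
    for i = 1:nReg
        for j = 1:nReg
            [SE(i, j, s), KL(i, j, s)] = ...
                selection_entropy(P(i, :), P(j, :));    % step 3
        end
    end
end

SEm = mean(SE, 3);   % step 4: subject average (smoothing omitted here)
KLm = mean(KL, 3);

sep   = 1:nReg - 1;  % step 5: kth diagonal <-> separation k along the gradient
SEsep = arrayfun(@(k) mean(diag(SEm, k)), sep);
KLsep = arrayfun(@(k) mean(diag(KLm, k)), sep);
```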

All results described in what follows can be reproduced with the accompanying MATLAB code.
We first establish the behavior of the selection entropy and the KL divergence revealed by the synthetic time series (sampled from Gaussian distributions with varying standard deviations). We then review the corresponding results based upon the empirical (fMRI) time series.

A. Gaussian distributions
As two Gaussian distributions become increasingly similar [Fig. 4(a)], we find that the selection entropy and the KL divergence approach the origin in concave downward and concave upward curves, respectively [Fig. 4(b)].
Both the selection entropy and the KL divergence quantify the difference between the two distributions from which the samples are taken. However, the selection entropy has an inflection point near the origin as the two distributions become arbitrarily close to one another. This indicates that the selection entropy is more sensitive to small differences between distributions.

B. Gradient-based resting-state fMRI
We base this section on probability distributions obtained from resting-state fMRI time series [Fig. 5(a)] collected in 100 regions organized according to a principal sensorimotor-to-transmodal gradient. We calculate the subject-averaged KL divergence and selection entropy for all pairs of these regions [Fig. 5(b)]. We then calculate the average KL divergence and selection entropy with increasing distance along the principal gradient [Fig. 5(c)].
We see from Fig. 5(c) that the selection entropy displays a more pronounced monotonic increase with increasing separation along the principal gradient, as compared with the KL divergence.

IV. DISCUSSION
We have shown that there are scenarios in which the selection entropy reveals information different from that provided by the commonly used KL divergence, as demonstrated by Figs. 4 and 5. We can characterize the relationship between the selection entropy and the KL divergence mathematically by expressing Eq. (12) as follows:

$$S_{\text{sel}} = \sum_i p_N(n_i)\log\frac{p_N(n_i)}{p_M(m_i)} + \sum_i \big(p_N(n_i) - p_M(m_i)\big)\log\frac{p_M(m_i)}{p_N(n_i) - p_M(m_i)}, \tag{13}$$

where we recall from Eq. (11) that the first sum of terms is equal to the KL divergence, such that we can write the selection entropy as follows:

$$S_{\text{sel}} = D_{\mathrm{KL}} + \sum_i \big(p_N(n_i) - p_M(m_i)\big)\log\frac{p_M(m_i)}{p_N(n_i) - p_M(m_i)}; \tag{14}$$

i.e., the selection entropy can be viewed as a sum of terms that augment the KL divergence.
Similarly, we can characterize the relationship between the Boltzmann entropy and the selection entropy mathematically by rewriting Eq. (12) as follows:

$$S_{\text{sel}} = \sum_i p_N(n_i)\log p_N(n_i) - \sum_i p_M(m_i)\log p_M(m_i) - \sum_i \big(p_N(n_i) - p_M(m_i)\big)\log\big(p_N(n_i) - p_M(m_i)\big), \tag{15}$$

which comprises the following three terms: (i) the Boltzmann entropy for system N,

$$S_N = -\sum_i p_N(n_i)\log p_N(n_i), \tag{16}$$

(ii) the Boltzmann entropy for system M,

$$S_M = -\sum_i p_M(m_i)\log p_M(m_i), \tag{17}$$

and (iii) the Boltzmann entropy for the difference between systems N and M,

$$S_{N-M} = -\sum_i \big(p_N(n_i) - p_M(m_i)\big)\log\big(p_N(n_i) - p_M(m_i)\big). \tag{18}$$

Using Eqs. (15) through (18), we can therefore write the selection entropy as the following linear combination of Boltzmann entropies:

$$S_{\text{sel}} = S_M - S_N + S_{N-M}. \tag{19}$$

This shows that, in addition to the hidden information in the two systems ($S_M - S_N$), the selection entropy also contains an interaction term ($S_{N-M}$) that depends upon the difference in probabilities between the two systems.
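As a quick numerical check of the decompositions in Eqs. (14) and (19), the following MATLAB snippet (our own sketch, reusing the selection_entropy helper from above) verifies both identities on a random pair of strictly positive distributions; both residuals should be at machine precision:

```matlab
% Numerical check of Eqs. (14) and (19) on a random pair of distributions.
rng(1);
p1 = rand(1, 100);  p1 = p1 / sum(p1);   % strictly positive probabilities
p2 = rand(1, 100);  p2 = p2 / sum(p2);

pN = max(p1, p2);   pM = min(p1, p2);    % per-bin larger/smaller values
[S_sel, D_kl] = selection_entropy(p1, p2);

% Eq. (14): selection entropy = KL divergence + augmentation terms.
aug = sum((pN - pM) .* log(pM ./ (pN - pM)));
fprintf('Eq. (14) residual: %g\n', S_sel - (D_kl + aug));

% Eq. (19): linear combination of Boltzmann entropies.
S_N  = -sum(pN .* log(pN));
S_M  = -sum(pM .* log(pM));
S_NM = -sum((pN - pM) .* log(pN - pM));
fprintf('Eq. (19) residual: %g\n', S_sel - (S_M - S_N + S_NM));
```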
The relationship between the Boltzmann entropy and the selection entropy can be expanded upon in terms of the disparity between measurements and indistinguishable configurations. The Boltzmann entropy quantifies the hidden information associated with an ambiguity between a macroscopic measurement ("macrostate") and multiple indistinguishable microscopic rearrangements ("microstates") of the underlying system. We can view this ambiguity in terms of the multiple ways of shuffling the constituent masses of a probability distribution while keeping its shape unchanged (see Fig. 1). The selection entropy, on the other hand, quantifies the hidden information associated with an ambiguity between what we can call a "macroselection" (i.e., the knowledge that the states of one system are selected from another) and what we can call "microselections" (i.e., the multiple ways in which like state components of two systems can be cross-selected) (see Fig. 2).
As opposed to the KL divergence, the selection entropy displays an inflection point as probability distributions become very similar, as shown with the synthetic data (Fig. 4). This suggests that the selection entropy is more sensitive to small changes in distributions. Furthermore, the selection entropy is more sensitive to changes between time series-derived probability distributions collected from neuronal regions as a function of their location along a principal gradient of cortical organization [Fig. 5(c)]. As such, we suggest that the selection entropy could be useful in the study of functional segregation, owing to its ability to reveal novel information within neuroimaging time series.

FIG. 1. (a) A histogram showing that observations (obser.) 1 and 2 (black and white) appear with frequency (fr.) 1 in the first and second bins, respectively. (b) The other rearrangement of (a) that yields the same distribution, with one observation in each of the two bins.

FIG. 2. (a) Two histograms for systems 1 and 2, showing that there are two observations in the first bin and one observation in the second bin for system (sys.) 1, and vice versa for system 2. (b) The same two histograms as in (a), showing that there are a total of four ways in which observations can be selected between like bins, as indicated by the arrows.

FIG. 3. The 100 regions of the atlas organized according to the sensorimotor-to-transmodal principal gradient. Images are shown in groups of 10, with image 1 displaying regions 1 through 10 in black, image 2 displaying regions 2 through 11, and so forth.

FIG. 4. (a) Two Gaussian curves with mean zero, with standard deviations that range between 1 and 2 (dashed line) and between 2 and 1 in small increments. The curves are shown at the minimum, zero, and maximum differences in the standard deviations ($\Delta\sigma$). (b) Normalized values of the KL divergence (thin line) and the selection entropy (sel. ent.; thick line) for the same range of differences in the standard deviations shown in (a).

FIG. 5. (a) The time course for the first region of the first subject (left), with its associated probability distribution (right). (b) The smoothed KL divergence (left) and selection entropy (SE, right) for all 100 × 100 region pairs. The outer BrainSpace images indicate quartiles of the principal gradient-ordered regions, i.e., the first image shows regions 1 through 25 in black, the second shows regions 26 through 50, etc. (c) Mean and standard error of the KL divergence (left) and selection entropy (right) for increasing separation between regions along the principal gradient.
All data and accompanying MATLAB code are available at the following public repository [33].

Data were provided by the Human Connectome Project, WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research; and by the McDonnell Center for Systems Neuroscience at Washington University.