Entropy of leukemia on multidimensional morphological and molecular landscapes

Leukemia epitomizes the class of highly complex diseases that new technologies aim to tackle by using large sets of single-cell level information. Achieving such a goal depends critically not only on experimental techniques but also on approaches to interpret the data. A most pressing issue is to identify the salient quantitative features of the disease from the resulting massive amounts of information. Here, I show that the entropies of cell-population distributions on specific multidimensional molecular and morphological landscapes provide a set of measures for the precise characterization of normal and pathological states, such as those corresponding to healthy individuals and acute myeloid leukemia (AML) patients. I provide a systematic procedure to identify the specific landscapes and illustrate how, applied to cell samples from peripheral blood and bone marrow aspirates, this characterization accurately diagnoses AML from just flow cytometry data. The methodology can generally be applied to other types of cell populations and establishes a straightforward link between the traditional statistical thermodynamics methodology and biomedical applications.


I. INTRODUCTION
Many complex diseases require the measurement of multiple molecular factors at the single-cell level over large populations of cells for their precise characterization and detailed understanding [1][2][3]. Current technologies, such as flow cytometry (FCM), allow for such measurements, including simultaneous quantification of several morphological and molecular properties [4,5].
These technologies can currently provide the simultaneous single-cell measurement of tens of surface and intracellular markers of up to thousands of cells per second [6,7]. Such rapid technological development, however, is yet to be matched by traditional analysis tools to interpret the data [8,9]. It is now clear that all this data by itself, without the analytical tools to extract the relevant information, is not enough to faithfully understand the underlying cellular processes and their dysregulation in diseases such as cancer.
A prototypical example in which large amounts of data are generated is the cytometric analysis of acute myeloid leukemia (AML), a type of cancer produced by the dysregulated growth of the myeloid line of blood cells [10]. AML causes abnormal white blood cells, red blood cells, or platelets to accumulate in the bone marrow and blood, which interferes with the production of normal blood cells. The presence of abnormal cells can potentially be detected by analyzing the changes in the distribution of morphological and molecular attributes of cell samples from blood or bone marrow. There are important difficulties associated with this approach for diagnosing AML. Besides abnormal cells being mixed with normal cells in the population samples, there are several AML subtypes and there is no obvious, well-defined set of changes in the molecular attributes that can fully characterize AML. In addition, there is the natural variation between individuals that overlaps with the changes induced by AML. Thus, the main challenge is to identify the quantities that can best be used to distinguish between AML patients and Normal (AML-free) individuals based on the single-cell statistical properties of their blood or bone marrow cell populations.
Here, I show that entropy, as traditionally used in statistical thermodynamics [11], provides a measure for the precise diagnosis of AML. Cell populations are characterized by their entropies in multidimensional landscapes constructed from the distributions of single-cell morphological and molecular attributes of flow cytometry data. Diagnosis is achieved by comparing how far the cell population distribution of an individual is from the AML and Normal prototypical maximum-entropy distributions.
The approach presented here was applied to samples from peripheral blood and bone

II. GENERAL APPROACH
Leukemia, as any other type of cancer, results from dysregulated growth caused by genetic and epigenetic changes that alter the cellular state [17]. Therefore, the first step towards distinguishing cancerous from normal cell populations is to characterize the state of each cell. To take into account that there is only limited information, the approach considers measured, $x$, and internal (non-measured), $q$, quantities separately. In the case of FCM, light scattering and the intensity of fluorescent reporters are the prototypical examples of measured quantities. Internal quantities are much more numerous and include, among others, non-measured protein levels and specific DNA mutations. The characterization of a population $i$ considers the probability distribution $P_i(x, q)$ of these two types of attributes among its cells.
The goal is to discriminate among different cell-population types based on the statistical properties of their measurable quantities. To be able to take into account the effects of the internal quantities $q$, one must estimate their effects from the measurable quantities $x$, which should carry sufficient information about the key cellular differences between the different population types.
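A convenient starting point to estimate the effects of the internal quantities in a cell population is the entropy of its attribute distribution,
$$S_i = -\int\!\!\int P_i(x, q)\,\ln P_i(x, q)\,dx\,dq .$$
This quantity can be expressed as a function of just $x$ by rewriting the joint probability in terms of the conditional probability of $q$ with respect to $x$, $P_i(q|x)$, and the probability of $x$, $P_i(x)$, as $P_i(x, q) = P_i(q|x)\,P_i(x)$. Integration over $q$ leads to
$$S_i = -\int P_i(x)\,\ln P_i(x)\,dx + \int P_i(x)\,f_i(x)\,dx ,$$
which explicitly encapsulates the contributions from internal quantities into the function
$$f_i(x) = -\int P_i(q|x)\,\ln P_i(q|x)\,dq .$$
The main hypothesis to proceed further is that the internal contribution for each population, $f_i(x)$, can accurately be approximated by the same function $f_T(x)$ for all the members of a given type $T$. In the case of AML and Normal types, the result would be
$$S_i^{T} = -\int P_i(x)\,\ln P_i(x)\,dx + \int P_i(x)\,f_T(x)\,dx , \qquad T = \mathrm{AML}\ \text{or}\ \mathrm{Normal},$$
depending on whether the population $i$ is AML or Normal.
Expressing the internal contribution as $f_T(x) = \ln P_T(x) + S_T$, with $P_T(x)$ a normalized reference distribution and $S_T$ a constant, leads to
$$S_i^{T} = -\int P_i(x)\,\ln\frac{P_i(x)}{P_T(x)}\,dx + S_T , \qquad (2)$$
which can be interpreted as the entropy of the population $i$ on the multidimensional morphological and molecular landscape defined by the maximum entropy distribution for the cell population type $T$.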
The explicit reference distribution $P_T(x)$ is chosen as the distribution that maximizes the total entropy of all the members of the type $T$. The variation of the total entropy with respect to $P_T(x)$, subject to normalization, leads to
$$P_T(x) = \frac{1}{N_T}\sum_{i \in T} P_i(x) ,$$
where $N_T$ is the number of members of the type $T$.
The explicit form of Eq. (2) shares many similarities with other expressions used in physical sciences and in information theory. It has exactly the same form as the Gibbs entropy formula [11,18,19]. The main difference is that, in that case, the state $T$ refers to an equilibrium state instead of a cell population type. It is also very similar, except for the constant $S_T$, to the Kullback-Leibler divergence, which uses relative entropies to compare distributions [20]. It is important to emphasize that the entropy defined by Eq. (2) does not quantify variability. Instead, it quantifies how similar the distribution of measured values for a given individual is to the expected average distribution for a given type. High entropy means that the distribution of the individual is similar to the expected distribution and low entropy means that it is different. Both high and low variability would lead to low relative entropy values if the underlying distributions are not close to the expected distribution.
A key feature of the method is that, since it is based on a maximum principle with respect to the distributions of the attributes, the effects of small perturbations in the measured distributions are second order and therefore have only a minimal impact on the results.
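The entropy of a population relative to a type can be illustrated with a minimal numerical sketch. It assumes that each distribution is estimated as a binned histogram, that the reference distribution of a type is the average of the normalized histograms of its members, and that the constant $S_T$ is set to zero; the function names `reference_distribution` and `relative_entropy`, the binning, and the synthetic data are illustrative choices and are not part of the original analysis.

```python
import numpy as np

def reference_distribution(member_hists):
    """Average the normalized histograms of all members of one type."""
    probs = [h / h.sum() for h in member_hists]
    return np.mean(probs, axis=0)

def relative_entropy(p_i, p_ref, s_ref=0.0, eps=1e-12):
    """Discrete estimate of S_i^T = -sum P_i ln(P_i / P_T) + S_T."""
    p_i = p_i / p_i.sum()
    p_ref = p_ref / p_ref.sum()
    mask = p_i > 0  # bins with P_i = 0 contribute nothing to the sum
    # eps floors empty reference bins so the logarithm stays finite
    ratio = p_i[mask] / np.maximum(p_ref[mask], eps)
    return s_ref - float(np.sum(p_i[mask] * np.log(ratio)))

def toy_histogram(rng, center, edges, n_cells=5000):
    """Synthetic 1-D 'attribute' histogram (hypothetical data)."""
    counts, _ = np.histogram(rng.normal(center, 1.0, n_cells), bins=edges)
    return counts.astype(float)

rng = np.random.default_rng(0)
edges = np.linspace(-5.0, 5.0, 21)

# Four synthetic members define the reference distribution of the type.
members = [toy_histogram(rng, 0.0, edges) for _ in range(4)]
p_ref = reference_distribution(members)

# One population resembling the type and one shifted away from it.
s_similar = relative_entropy(toy_histogram(rng, 0.0, edges), p_ref)
s_shifted = relative_entropy(toy_histogram(rng, 2.0, edges), p_ref)
print(s_similar, s_shifted)
```

In this sketch the population drawn from the same distribution as the type members has an entropy close to the maximum value of zero, whereas the shifted population has a clearly lower entropy, in line with the interpretation of high entropy as similarity to the expected distribution.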
The analogy with statistical thermodynamics can be extended further to estimate the likelihood that a population $i$ is of type $T$ as
$$\mathcal{L}_i^{T} = \frac{e^{S_i^{T}}}{\sum_{T'} e^{S_i^{T'}}} ,$$
where the sum in the denominator is performed over all the cell population types. This expression assigns a high likelihood to a cell population as being of type $T$ if the distribution of its attributes has a form similar to the maximum entropy distribution for that type. This assignment parallels in many ways the approach first used by A. Einstein to compute the probability of observing a fluctuation moving a system away from its equilibrium state based on its entropy change [22].
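As an illustration of this assignment, the following sketch assumes that the likelihood takes the Boltzmann-like form $e^{S_i^{T}} / \sum_{T'} e^{S_i^{T'}}$; the function name `type_likelihoods` and the entropy values are hypothetical.

```python
import numpy as np

def type_likelihoods(entropies):
    """Map {type: S_i^T} to {type: exp(S_i^T) / sum over types of exp(S_i^T')}."""
    names = list(entropies)
    s = np.array([entropies[name] for name in names], dtype=float)
    w = np.exp(s - s.max())  # subtracting the maximum avoids overflow
    return dict(zip(names, w / w.sum()))

# Hypothetical entropies of one individual with respect to each type.
lik = type_likelihoods({"AML": -0.2, "Normal": -1.5})
print(lik)
```

Only the entropy differences between types enter the result, so a constant added to all entropies of an individual leaves the likelihoods unchanged.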

III. APPLICATION TO DIAGNOSING LEUKEMIA
The explicit application to diagnosing AML relies on evaluating whether the distribution of the values of flow cytometry data for a given individual to be diagnosed (Fig. 1 and Supplementary Fig. 1) is closer to the AML or Normal maximum entropy distribution (Fig. 1). Using just a few variables, such as side scatter and a fluorescent marker for the receptor protein CD45, is informative in many cases but does not offer a full characterization (compare for instance AML patient #7 with Normal patient #96 in Supplementary Fig. 1). In general, there are several sets of simultaneously measured variables for a given individual and therefore there are several different landscapes. For each population $i$ and each set $k$ of simultaneously measured variables $x_k$, the corresponding entropy is given by
$$S_{i,k}^{T} = -\int P_i(x_k)\,\ln\frac{P_i(x_k)}{P_T(x_k)}\,dx_k + S_{T,k} .$$
The entropy difference
$$\Delta S_{i,k} = S_{i,k}^{\mathrm{Normal}} - S_{i,k}^{\mathrm{AML}}$$
indicates that the individual looks like an AML patient for negative values and like a Normal subject for positive values. These entropies can be used to compute the weighted entropy
$$\tilde{S}_i^{T} = \sum_k w_k\,S_{i,k}^{T} , \qquad (5)$$
where $w_k$ is the weight of the set $k$. Similarly, the weighted entropy difference and likelihood are defined by
$$\Delta\tilde{S}_i = \tilde{S}_i^{\mathrm{Normal}} - \tilde{S}_i^{\mathrm{AML}} , \qquad \tilde{\mathcal{L}}_i^{T} = \frac{e^{\tilde{S}_i^{T}}}{\sum_{T'} e^{\tilde{S}_i^{T'}}} . \qquad (6)$$
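The combination of several landscapes can be sketched as follows, assuming that the weighted entropy difference is the plain weighted sum over the sets $k$ of the Normal-minus-AML entropy differences; the function name `weighted_entropy_difference` and all numerical values are hypothetical.

```python
import numpy as np

def weighted_entropy_difference(s_normal, s_aml, weights=None):
    """Weighted sum over sets k of (S_k^Normal - S_k^AML) for one individual."""
    s_normal = np.asarray(s_normal, dtype=float)
    s_aml = np.asarray(s_aml, dtype=float)
    if weights is None:
        weights = np.ones_like(s_normal)  # unit weights, w_k = 1
    delta = s_normal - s_aml  # per-set entropy difference
    return float(np.dot(weights, delta))

# Hypothetical per-set entropies of one individual for three sets.
s_normal = [-0.1, -0.3, -0.2]
s_aml = [-0.9, -0.4, -1.0]

unit = weighted_entropy_difference(s_normal, s_aml)
reweighted = weighted_entropy_difference(s_normal, s_aml, weights=[2.0, 0.0, 1.0])
print(unit, reweighted)
```

With the sign convention of the text, the positive values obtained here correspond to an individual that looks like a Normal subject in the combined landscapes.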

IV. IMPLEMENTATION
The DREAM6/FlowCAP2 data [9,12] [...] The specific fluorescent markers used as FL1, FL2, FL4, and FL5 in each group of measurements are listed in Table I. Therefore, there are seven different attribute spaces, one for each group of measurements.

An avenue to estimate the suitability of a space or subspace for AML and Normal individual discrimination is to use leave-one-out cross-validation with the training data. In this case, it consists of testing each individual $i$ of the training set without its contribution to $P_{\mathrm{AML}}$ or $P_{\mathrm{Normal}}$.

The results indicate that the approach accurately diagnoses AML, except for just a few individuals (Fig. 2 and Supplementary Fig. 2), and that the performance increases with the dimensionality of the space in which the entropy is computed. The results also suggest that the test data was easier to classify than the training data, which contains a few difficult patients.
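The leave-one-out procedure can be sketched as follows under simplifying assumptions: each individual is represented by a normalized histogram, the reference distribution of a type is the average over its members, the constants $S_T$ are taken to be equal for both types so that they cancel in the entropy difference, and the data are synthetic. The names `kl_entropy`, `leave_one_out`, and `toy_individual` are illustrative.

```python
import numpy as np

def kl_entropy(p, q, eps=1e-12):
    """-sum p ln(p / q): zero when p equals q and negative otherwise."""
    mask = p > 0
    return -float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

def leave_one_out(dists, labels):
    """Classify each individual with references built without its contribution."""
    dists = np.asarray(dists, dtype=float)
    labels = np.asarray(labels)
    predictions = []
    for i in range(len(dists)):
        keep = np.arange(len(dists)) != i  # exclude individual i
        refs = {t: dists[keep & (labels == t)].mean(axis=0)
                for t in ("AML", "Normal")}
        # Entropy difference: positive looks Normal, negative looks AML.
        delta = kl_entropy(dists[i], refs["Normal"]) - kl_entropy(dists[i], refs["AML"])
        predictions.append("Normal" if delta > 0 else "AML")
    return predictions

rng = np.random.default_rng(1)
edges = np.linspace(-5.0, 6.0, 23)

def toy_individual(center):
    """Synthetic normalized 1-D histogram standing in for one individual."""
    counts, _ = np.histogram(rng.normal(center, 1.0, 2000), bins=edges)
    return counts / counts.sum()

labels = ["Normal"] * 6 + ["AML"] * 6
dists = [toy_individual(0.0) for _ in range(6)] + [toy_individual(1.5) for _ in range(6)]
predictions = leave_one_out(dists, labels)
print(predictions)
```

Excluding the tested individual from its own reference distribution prevents the individual from appearing artificially close to its own type.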
The discriminative capabilities can be improved by selecting, among the many possibilities available, not only the dimensionality of the space but also the subspaces.

A systematic procedure for adjusting the weights $w_k$ in Eq. (5) to better segregate between AML and Normal individuals is to increase the weights that benefit the segregation and decrease the weights that worsen it. Segregation is quantified by the distance between the Normal individual with the minimum entropy difference, $N_{\min}$, and the AML individual with the maximum entropy difference, $A_{\max}$. This distance is explicitly defined by
$$\Delta\Delta\tilde{S} = \Delta\tilde{S}_{N_{\min}} - \Delta\tilde{S}_{A_{\max}} ,$$
where $\Delta\tilde{S}_i$ is the weighted entropy difference of individual $i$. The procedure considers updates proportional to a small quantity $\Delta t$ so that the weights at the step index $t$ are updated iteratively to the step index $t + \Delta t$ through the update rule of Eq. (7). This procedure guarantees that $\Delta\Delta\tilde{S}$ increases in each iteration if $\Delta t$ is sufficiently small and the identities of the individuals $N_{\min}$ and $A_{\max}$ do not change. Under these assumptions, expanding Eq. (7) in $\Delta t$ and using [...]

[...] is a significant overlap (Fig. 3a). By updating the weights $w_k$ according to Eq. (7), the quantity $\Delta\Delta\tilde{S}$, which is negative, is increased by decreasing its absolute value. This quantity never changes to positive values and there is not an actual segregation of the populations (Fig. 3b).
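The idea of iteratively adjusting the weights can be illustrated with the sketch below. It does not reproduce Eq. (7); it assumes, purely for illustration, an additive update in which each weight changes by $\Delta t$ times the per-set entropy-difference gap between the current $N_{\min}$ and $A_{\max}$ individuals, which increases the weights of the sets that help the segregation and decreases those that hinder it. The function names and the entropy differences are hypothetical.

```python
import numpy as np

def segregation(weights, delta_s, is_normal):
    """Return (gap, n_min, a_max); a positive gap means complete segregation."""
    is_normal = np.asarray(is_normal, dtype=bool)
    weighted = delta_s @ weights  # weighted entropy difference per individual
    idx_n = np.flatnonzero(is_normal)
    idx_a = np.flatnonzero(~is_normal)
    n_min = idx_n[np.argmin(weighted[idx_n])]  # Normal with minimum difference
    a_max = idx_a[np.argmax(weighted[idx_a])]  # AML with maximum difference
    return float(weighted[n_min] - weighted[a_max]), n_min, a_max

def update_weights(delta_s, is_normal, dt=0.01, steps=50):
    """Assumed additive rule: w_k += dt * (dS_k[n_min] - dS_k[a_max])."""
    weights = np.ones(delta_s.shape[1])  # start from w_k = 1
    history = []
    for _ in range(steps):
        gap, n_min, a_max = segregation(weights, delta_s, is_normal)
        history.append(gap)
        weights = weights + dt * (delta_s[n_min] - delta_s[a_max])
    return weights, history

# Hypothetical per-set entropy differences (rows: individuals, columns: sets).
delta_s = np.array([
    [1.0, -1.5],   # Normal
    [0.8, 0.5],    # Normal
    [-1.0, 1.2],   # AML
    [-0.7, -0.4],  # AML
])
is_normal = [True, True, False, False]

weights, history = update_weights(delta_s, is_normal)
final_gap, _, _ = segregation(weights, delta_s, is_normal)
print(history[0], final_gap, weights)
```

In this toy case the second set is uninformative and initially causes an overlap between the two groups; the assumed update lowers its weight relative to the first set until the gap becomes positive. Under this additive rule, and while the identities of $N_{\min}$ and $A_{\max}$ do not change, the gap grows by $\Delta t$ times a sum of squares in each step, so it cannot decrease.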
Closer inspection indicates that segregation is prevented by a patient in the AML training set (patient #116 in Fig. 1). By removing this patient from the calculation of $A_{\max}$, there is a clear segregation of AML and Normal individuals according to the entropies of their distributions (Fig. 3c). Perfect segregation occurs in both cross-validation with the training set, which was used to obtain the different parameters, and predictions with the test set, which is completely independent of the training set. As shown in Fig. 3c, the reason patient #116 prevented segregation is that this patient has the Normal instead of the typical AML signatures.
The segregation measure $\Delta\Delta\tilde{S}(t)$ starting with $w_k = 1$ increases with the number of iterations, as shown mathematically, until it reaches a plateau. In all the cases except for 7-D, it changes its sign from negative to positive values (Supplementary Fig. 3), implying that complete segregation has been achieved in the training set. This segregation, except for 7-D, is also present in the test set (Fig. 4 and Supplementary Fig. 4).

V. DISCUSSION
The use of physical approaches has been remarkably successful in the characterization of heterogeneous cell populations and their evolution in many complex biological scenarios [23][24][25][26] to the extent of making inroads in the mainstream biomedical research [27,28]. One of the main problems posed by hematological cancers, such as AML, is the underlying heterogeneity resulting from diverse molecular changes in the cellular state, including several recurrent mutations and chromosome translocations [10,17]. This heterogeneity is responsible to a large [...]

TABLES

TABLE I. List of the five specific fluorescent markers, denoted by FL1, FL2, FL3, FL4, and FL5, used in each of the 7 groups of measurements.

[...] individual of the training and test sets from one- to seven-dimensional landscapes. The likelihood that an individual of the test set is AML positive, given by equation (6), is also shown from one- to seven-dimensional landscapes. Circles and squares represent, respectively, the actual AML and Normal state of a patient as clinically assessed by a physician. The insets are magnifications around the AML-Normal boundary.

[...], $w_k(t)$ from equation (7), and $w_k(0) = 1$, is shown for the training set from one- to seven-dimensional landscapes.