Chromosome contact maps are bifractal

Biological Complexity Unit, Okinawa Institute of Science and Technology Graduate University, Onna, Okinawa 904-0495, Japan. The Niels Bohr Institute, Blegdamsvej 16, 2100 Copenhagen Ø, Denmark. Friedrich Miescher Institute for Biomedical Research, Maulbeerstrasse 66, 4058 Basel, Switzerland. University of Basel, Petersplatz 1, 4001 Basel, Switzerland Department of Physics and Center for Complexity and Biosystems, Università degli Studi di Milano and INFN, via Celoria 16, 20133 Milano, Italy.

as a hierarchical set of domains nested in each other and we solve it exactly. The predicted multifractal spectrum is in excellent quantitative agreement with experimental data. Our theory yields to a more robust estimation of the scaling exponent of the contact probability than existing methods. By applying this method to experimental data, we detect subtle conformational changes among chromosomes during differentiation of human stem cells.
During cellular interphase, mammalian chromosomes assume a globular structure in the nucleus 1, 2 . Their conformational properties can be studied in vivo with a set of techniques called chromosome conformation capture (3C) 3 , most notably their genome-wide version called Hi-C 4 .
Results of Hi-C experiments can be represented as matrices whose elements are proportional to the probability that two chromosome regions are in contact in space 5 . Hi-C measurements have paved the way for a mechanistic understanding of chromosome folding. In particular, they have revealed that mammalian chromosomes are characterized by a hierarchy of nested, tightly connected structures 9 . Structures below the megabase scale are usually called topological associating domains (TADs). They are particularly relevant for gene regulation 6,8 , although they do not seem to be privileged over other levels in the hierarchy from a structural point of view 9 . It has been suggested that the hierarchical structure of Hi-C matrices can be studied by means of an iterative coarse-graining procedure 10 .
Information contained in Hi-C matrices is often characterized by the decay of the contact probability p( ) of pairs of chromosomal regions i and j with respect to their genomic distance ≡ |i − j|. Such decay appears to follow a power law (1) The contact probability exponent β is often estimated to be slightly smaller than one 4,11,12,26 . Such low values are incompatible with simple equilibrium homopolymeric models 13 . In contrast, nonequilibrium models such as the crumpled globule 4, 13 are able to account for such low exponents.
Other proposed mechanisms include mediation of polymer interaction by other molecules 14 , active loop-extrusion 15 , and finite-size effects in heteropolymers 16 . In any case, the apparent power-law range of the contact probability is usually limited to one decade or less, see e.g. 4,11,12 . Therefore, estimates of the exponent β are rather sensitive to the choice of the fitting range.
In this paper, we robustly characterize scale-invariant properties of chromosomes using the theory of multifractals. This theory has been developed to characterize heterogeneous systems characterized by scale invariance [18][19][20] . This analysis reveals that chromosome contact maps are bifractal, i.e. multifractal objects characterized by two distinct fractal dimensions, a behavior previously reported in studies of surface roughness 21 , distribution of matter in the universe 22 and turbulence 23 .
We introduce our theory using a Hi-C map of chromosome 1 in mouse embryonic stem cells, see Fig. 1a. The contact probability seems to decay as a power law, at least for relatively short genomic distances, see Fig. 1b. However, the local logarithmic slope of the contact probability does not present the clear plateau characteristic of a power law behavior, see Fig. 1c.
To characterize scaling properties of chromosomes in a more rigorous way, we study Hi-C maps as multifractals. A multifractal is a system described in terms of a density, that in our case is the density of number of counts in the Hi-C map. We study the scaling property of this density by constructing two-dimensional histograms of the Hi-C map on grids of different linear resolution , see Fig. 1d. Geometrical structures of linear size smaller than are not resolved in these maps. The smallest possible value of is the resolution of the original Hi-C map, in our case = 4 · 10 4 bp.
We define the probability p ij ( ) in bin at coordinates i, j at resolution . We always work with normalized maps, so that ij p ij ( ) = 1 for all choices of .
We assume that the system is scale invariant, at least for small enough. This means that the probability in a bin scales as a power law of the resolution: where ∼ denotes the leading behavior and α is the scaling exponent associated to the density. Since the map is not homogeneous, different bins can be in principle characterized by different values of α. The number of bins N (α) associated to a given value of α must also scales as a power of The exponent D(α) characterizes the scaling of the number of bins of linear size necessary to cover the set with density exponent α and therefore can be interpreted as the fractal dimension associated to this set. The quantity ρ(α) is a prefactor independent of .
Computing D(α) directly is often unpractical. A convenient related quantity is the partition function Z(q, ), defined by 19,20 The name "partition function" originates from an analogy with equilibrium statistical mechanics, where the exponent q plays the role of an inverse temperature. Indeed, in an inhomogeneous system, for q → 0 (high temperature), all bins give similar contributions to the sum in Eq. (4), whereas for large q (low temperature) the sum is dominated by relatively few terms characterized by largest values of the measure p ij . We link the partition function with the fractal dimensions by collecting in Eq. (4) all terms with the same value of α: In the limit of small , we evaluate this sum with the saddle point method. This calculation shows that, for a scale-invariant system, one also expects a power-law scaling of the partition function as a function of the resolution The function K(q) is the multifractal spectrum and characterizes the scaling behavior with respect to the resolution of the partition function 20 . The saddle point evaluation of Eq. (5) leads to conclude that the dimensions D(α) and the multifractal spectrum K(q) are related by a Legendre transform: A linear multifractal spectrum indicates that the system is homogeneous, i.e. all of its parts are characterized by the same fractal dimension α = D and thus by the same scaling behavior N ∼ −D and p ij ∼ D . Conversely, a non-linear multifractal spectrum signals a diversity of scaling exponents and associated dimensions.
The partition function Z(q, ) computed from Hi-C data presents a clear power law behavior as a function of , as predicted by Eq. (6), see Fig. 1e. In this case, for slightly larger than its minimum value, the local logarithmic slope of the partition function is essentially flat, see Fig. 1f.
This means that the power-law behaviour of Z(q, ) is much more robust than that of p( ).
The multifractal spectrum K(q) of chromosome 1 presents two different linear regimes, one at low moments with the properties of a two-dimensional object and one at high moments characterized by a fractal dimension between one and two, see Fig. 1g. A system with such properties is called a bifractal 21,22,22 . Since the exponent q is analogous to an inverse temperature, the sharp change of slope in the spectrum of a bifractal system is akin to a phase transition in equilibrium statistical physics 24 .
To rationalize these observations, we introduce a hierarchical domain model of Hi-C maps.
We define the model in terms of an iterative transformation of a measure on the unit square [0, 1] × [0, 1]. At the first iteration, the measure is given by a 2 × 2 matrix with diagonal element a and off-diagonal elements b, with a > b > 0. At each following iteration n we generate a 2 n × 2 n matrix. The off-diagonal element of the matrix at the previous iteration becomes a 2 × 2 block of identical values in the matrix at the nth iteration. The diagonal blocks are further multiplied by the original matrix (see Fig. S1 in the Supp. Mat). We impose b = 1/2 − a, so that the measure remains normalized at each iteration. This means that, effectively, the model is defined in terms of a single free parameter a. Because of the normalization and the requirement that the measure should be concentrated along the diagonal, such parameter falls in the range 1/4 ≤ a ≤ 1/2.
The parameter a represents the weight of domains compared to the rest of the Hi-C map.
Matrices with larger a have a measure more concentrated along the diagonal, whereas matrices with lower a are characterized by a more uniform measure, see Fig. 2a. In the limiting case a = 1/4 the matrix is uniform at all iterations, whereas for a = 1/2 the matrix is characterized by a uniform measure on the diagonal, whereas all the off-diagonal elements are zero.
The partition function of the hierarchical domain model can be expressed as where we defined the exponent A saddle-point estimation of the partition function gives which is positive for q > q c = (ln 2)/(ln 4a). This means that, for q ≥ q c , the leading term is either , the maximum of the exponent is attained at k = n. Thus, for large n, the partition function scales as Z(q, n) ∼ K(q) with the length scale = 2 −n and the multifractal spectrum For a = 1/2, the matrix p ij ( ) is diagonal at each iteration. In this case, Eq. (11) predicts a linear spectrum with slope D = 1, consistent with the fact that the geometric set is equivalent to a one dimensional line. For a = 1/4 the distribution is uniform on the square, and Eq. (11) correctly returns D = 2. Between these two limiting cases, Eq. (11) predicts a fractal distribution with a dimension D = − ln a/ ln 2 between 1 and 2.
We now focus on the "high temperature phase" where − ln 2 + q ln 4a < 0. In this case, the term dominating the scaling is k = 0, so that Since in the high temperature phase the scaling is determined by the terms far from the diagonal, the spectrum is that of a regular two-dimensional set.
Summarizing, our calculation predicts a multifractal spectrum characterized by two linear regimes: one with slope − log(a)/ log(2) for q > q c = (ln 2)/(ln 4a), Eq. (11), and one with slope These observations support that the bifractal spectrum predicted by the hierarchical domain model is compatible with that observed in a broad class of higher organisms From the model, we also predict the scaling of the contact probability with the genomic distance. In our framework, the contact probability P ( ; n) at the n-th iteration can be expressed by summing the probabilities of blocks at a genomic distance from the diagonal: From this expression, we derive that the decay exponent of the contact probability with the genomic distance is β = ln(4a) ln (2) .
Derivation of Eq. (14) is presented in SI. For 1/2 ≤ a ≤ 1, Eq. (14) predicts contact probability exponents in the range β ∈ [0, 1]. This prediction is supported by numerical simulations, see Fig.   S3 in the Supp. Mat. This means that the hierarchical structure of contact matrices automatically leads to exponents β < 1, as observed in chromosomes.
We test more extensively our method using datasets from different Hi-C experiments using mouse embryonic stem cells (mESC). We fit the multifractal spectrum of each dataset and obtain the corresponding exponent β via Eq. (14). We first test the robustness of the multifractal analysis across two Hi-C experiments in mESC from two different labs 17,30 , both following the Hi-C protocol in solution, see Fig. 3a. Both the multifractal and the power law methods predict that the variability of the exponent β across chromosomes is significantly larger than the variability of the exponent across replicates. To figure out which method performs best, we use bootstrapping to test against the null hypothesis that the values of β of different chromosomes are paired at random.
The multifractal method permits to exclude random pairing of chromosomes (p-value 5.2 · 10 −4 ) much better than the direct fit (p-value 0.11, see Supplementary Table S1). Moreover, the mean squared error of β between the two replicates is smaller in the multi-fractal case compared to direct fit (0.009 vs 0.04). The χ 2 associated with the direct fit is affected by a strong systematic error, although remaining quite correlated (see Supplementary Fig. S6). This effect is much milder in the multifractal approach. The multifractal and direct fit methods are similarly robust with respect to varying the resolution of Hi-C maps (see Supplementary Fig. S7).
We apply the two methods to attempt to distinguish among experiments on the same cell lines, but following different experimental protocols. These different protocols mainly differ in the ligation step of digested DNA fragments. In the original "in solution" protocol, the ligation is performed in a diluted solution 36 . In other protocols, the ligation step is either carried out in intact nuclei ("in situ" protocol 36 ) or on solid substrates ("TCC" protocol 32 ). The two latter protocols are able to produce Hi-C maps with better signal-to-noise ratio compared to the in solution protocol 32,36 . Both the multifractal and the direct fit methods show that the values of β obtained from in situ and TCC experiments are markedly different from those obtained in solution (see Supplementary   Fig. S6). The multifractal method estimates compatible scaling exponents for in situ and TCC protocols (the p-value associated with the null hypothesis that chromosomes are paired at random is 0.021, see Supplementary Table S2), while a direct fit does not permit to draw this conclusion (pvalue 0.14). This result suggests that the in situ and TCC protocols are able to produce statistically compatible Hi-C maps due to their high signal-to-noise ratio.
We compare the two methods in detecting differences between wild-type and mutant cell lines, in which either the Nipbl gene 31 or the CTCF gene 30 is knocked down (see Fig. 3b).
Knock-down of these genes has been shown to disrupt chromosome folding. In particular, Nipbl knock-down leads to loss of TAD structures and global changes in scaling properties 31 . We quantify average differences in exponents between the wild-type and the mutant in terms of χ 2 of pairs of chromosomes, weighted by their mean squared errors (see Supplementary Table S2 and Supplementary Fig. S7). Multifractal analysis detects more marked differences (χ 2 = 214) compared to the direct fit (χ 2 = 4.7), although both methods highlight a statistically significant difference between the two sets (p-values 1.3 · 10 −5 for the direct fit and < 10 −6 for the multifractal). Knockdown of CTCF also causes a loss of TAD structure, but without a clear effect on genome-wide scaling 30 . Nonetheless, the values of β of different chromosomes obtained by multifractal analysis reveal a large (χ 2 = 86) and significant (p-value < 10 −6 ) difference between the scaling properties of wild-type and CTCF-deficient cells. Also in this case, the direct fit detects less marked differences (χ 2 = 25, see Supplementary Table S2).
We apply the multifractal analysis to elucidate how chromosome structure changes during cellular differentiation. To this aim, we analyze Hi-C data obtained from different human cell lines at different stages of early differentiation 35  to the idea that more differentiated cell types require more insulated domains to achieve more specialised function.
Multifractal analysis of Hi-C data is a powerful statistical tool to characterize scaling properties of chromosomes. We have shown that contact maps of chromosomes are bifractal, i.e. they are characterized by two distinct fractal dimensions. To understand this observation, we proposed a hierarchical domain model, whose analytical solution is in striking quantitative agreement with observed multifractal spectra. Our theory implicates a power-law scaling of the contact probability with the genomic distance with exponents β lower than one, also in agreement with experimental observations. The multifractal method is sensitive enough to discard a null model with powerlaw decay of the contact probability, but domains on a single length scale only. The predicted form of the multifractal spectrum provides a more stringent prediction than the contact probability exponent alone. The analysis proposed here constitutes an excellent benchmark to select among different polymer models that provide similar values of β 13-16 .
Our results indicate that scaling properties of chromosomes are a direct consequence of the hierarchical structure of chromosome domains 9,27 . Recent work has suggested that such domain structure is generated by a "hierarchical folding" mechanism, mediated by different proteins such as cohesin and CTCF 28 . The precise mechanism driving the folding of chromosomes has been subject of debate 29 . It will be interesting to test whether the activity of these factors can produce self-similar structures compatible with our observations. The shape of the multifractal spectrum is controlled by a single parameter, that also controls the contact probability exponent β via Eq. (14). The determination of β from the fit of the multifractal spectrum is a much more robust way of characterizing the Hi-C map than the direct fit of β. Multifractal analysis is importantly also more sensitive in highlighting subtle structural differences as shown in the case of Nipbl and CTCF knock-down. The analysis of the multifractal spectrum is simple and robust enough to become a routinely tool to analyze contact maps, both from polymer models and experiments, and capture subtle differences among them. at different values of a, the solid curve is the prediction of Eqs. (11) and (12). c, The multifractal spectra of the first three chromosomes of mESC (colored symbols) are barely distinguishable from each other. The solid curve is a fit of Eqs. (11) and (12), giving a ≈ 0.413.  Figure 3: a Exponents β calculated from the multifractal spectrum using Eq. (14) and from a direct fit of the data (as in Fig. 1B) with the alternative Hi-C methods. One replicate of in situ 33 , TCC 31 and two independent replicates of in solution 17,30 Hi-C are shown in situ 32, 33 and in solution 17,30 . The black bar indicates the median over the chromosomes, the colored box indicates the first and third quartiles and black dots are outliers. b Value of β obtained for the wild-type (wt) system and for a mutant (mut) in which either CTCF 30 or Nipbl 31 is depleted in experiments. c Exponents β obtained with the multifractal method for cell lines at different stages of human stem cell development 35 . d Exponents obtained by direct fit for the same data.