Topological spectra and entropy of chromatin loop networks

The 3D folding of a mammalian gene can be studied by a polymer model, where the chromatin fibre is represented by a semiflexible polymer which interacts with multivalent proteins, representing complexes of DNA-binding transcription factors and RNA polymerases. This physical model leads to the natural emergence of clusters of proteins and binding sites, accompanied by the folding of chromatin into a set of topologies, each associated with a different network of loops. Here we combine numerics and analytics to first classify these networks and then find their relative importance or statistical weight, when the properties of the underlying polymer are those relevant to chromatin. Unlike polymer networks previously studied, our chromatin networks have finite average distances between successive binding sites, and this leads to giant differences between the weights of topologies with the same number of edges and nodes but different wiring. These weights strongly favour rosette-like structures with a local cloud of loops with respect to more complicated non-local topologies. Our results suggest that genes should overwhelmingly fold into a small fraction of all possible 3D topologies, which can be robustly characterised by the framework we propose here.

Within mammalian cells, DNA interacts with proteins called histones to form a composite polymeric material known as chromatin, which is the building block of chromosomes and provides the genomic substrate for cellular processes, such as transcription -the copying of DNA into RNA [1,2].Understanding the mechanisms underlying and regulating chromatin transcription is important, as these determine the pattern of active and inactive genes in a cell [3].An important factor linked to transcription is 3D chromatin structure -as active genes tend to be associated with open chromatin -and chromatin looping -as DNA elements such as promoters and enhancers often need to come together forming a loop to trigger transcription [2].Within this context, polymer models have provided key insights into chromatin structure and loop formation, and into their link to transcription, concomitantly showing that physical principles may have far-reaching consequences in biology [4][5][6][7].
As an example, a simple model for chromatin organisation is shown in Figure 1A.Here, chromatin is viewed as a semiflexible polymer which interacts with chromatin-binding proteins associated with transcription -such as RNA polymerases and transcription factors.There is a set of binding sites on the chromatin fibre, which has a high affinity for proteins, or protein complexes; the rest of the fibre has a weaker attraction for them, for instance, due to nonspecific or electrostatic interactions.When proteins can bind multivalently, which is generally the case for protein complexes, microphase separation of proteins and binding sites spontaneously emerges through a thermodynamic positive feedback loop, known as the "bridging-induced attraction" [5,6], which works as follows (Fig. 1B).First, possibly through a fluctuation, the local density of binding sites in 3D may locally increase.This recruits chromatinbinding proteins which, if multivalent, stabilise chromatin loops further increasing the concentration of both specific and non-specific binding sites, in turn increasing protein concentration, and ultimately triggering a positive feedback loop resulting in the self-assembly of clusters of protein complexes and binding sites.Such clusters are accompanied by the formation of chromatin loops, which incurs an entropic cost growing non-linearly with the number of loops [8] so that clusters do not coarsen past a typical size, given by the competition between gain in binding energy and loss in entropy.This type of microphase separation leads to structures very much like the transcription factories -clusters of RNA polymerases and gene promoters or enhancers -observed experimentally in living cells [3,9].
Microphase separation into transcription factories is associated with the spontaneous emergence of a network of chromatin loops joining the binding sites in 3D (Fig. 1).These networks are important as they determine the 3D structure of genes, which as anticipated underlie the transcriptional state of a given cell [7,10].Networks such as these depicted in Fig. 1 were previously considered in polymer physics, for instance, in [11,12], where it was found that their entropy mainly depends on the number of loops and on the local number of legs associated with each cluster (Fig. 1).The relevant thermodynamic ensemble is however fundamentally different for the chromatin networks we consider, as in these cases the typical distance between binding sites, l, cannot be taken to be arbitrarily large as implicitly done in [11,12], but instead is a model parameter which remains finite, and corresponds to a typical ∼ 50 − 100 kbp chromatin loop [13,14].
In this work, we show that the theory of partitions [15] provides a powerful framework to enumerate the topologies of the emerging chromatin loop networks.By performing numerical simulations of the folding of a typical gene locus with the model sketched in Fig. 1, we find that different topologies, with the same entropy in the limit of l → ∞, appear with vastly different frequencies.A striking example of this is provided by the 'rosette' and 'watermelon' topologies in Fig. 1: despite having the same l → ∞ behaviour for their entropic weight, we show that the former is observed orders of magnitude more often than the latter.We resolve this apparent paradox by computing the amplitudes of the statistical weights associated with these diagrams: the ratio between amplitudes of different diagrams follows simple patterns which reflect the biases seen in simulations.We suggest that the topological weights we compute -i.e., the statistical weights of a diagram corresponding to a given topology -are important factors to understand the principles of loop network formation in chromatin, as well as in all polymer systems where binding sites have a typical finite separation between them.Our results can be applied in the future to describe actual 3D chromatin loop topologies observed in experiments or computer simulations.The local polymer structure at the loop base is the same in (i) and (ii): this is what determines the entropy of the topology in the limit in which the distance between binding sites goes to infinity.In the case of chromatin, though, this limit is not relevant and the weights of the diagrams (i) and (ii) are in practice very different.
Here, for concreteness, we focus on a stretch of chromatin with n = 8 binding sites, or transcription units (TUs, see Fig. 2A).This case is relevant biologically, as when looking at the genome-wide distribution of gene loci with n TUs, the most frequent cases are those with n ∼ 5 − 10 [7].Note we also expect similar results for generic values of n.
To enumerate all possible configurations of a gene locus with n = 8, we start by observing that what is important is the relative position of the TUs, as this is what determines the transcriptional activity of the promoter and hence of the gene [7,14] -the detailed position of the chromatin in the loop is instead irrelevant.Equivalently, we simply need to count all possible partitions of n (initially distinguishable, or labelled) binding sites into different clusters, which equals the Bell numbers B n [16].B n is a large number: for n = 8, B 8 = 4140, whereas B n ∼ n n for large n.
For simplicity, here we further focus on the possible gene locus topologies where the TUs are grouped into k = 2 clusters (each of size ≥ 1), without any singleton); different cases (with k = 3, 4) are discussed in the companion paper [17] and lead to qualitatively similar results.We call the number of possible configurations with n labelled (or distinguishable) TUs and k clusters N (n, k).For k = 2, such number can be found via the theory of partitions [15], and is given by N (8, 2) = S(8, 2) − 8 = 119.Here, S(8, 2) denotes the Stirling number of the second kind [18] which counts the number of ways in which 8 points can be partitioned into 2 non-empty clusters -the subtraction of 8 is needed to remove singletons.This reasoning and formula can be generalised, such that the number of configurations of a gene locus with n distinguishable TUs and 2 clusters with ≥ 2 TUs in each is given by Additional properties of N (n, k) are discussed in [17].
It is also useful to classify the distinct types of topologies which can be created, where we omit the labelling of the TUs, or equivalently consider them indistinguishable.We call the number of such inequivalent "unlabelled" topologies with n TUs and k clusters N u (n, k).This classification is relevant if we are interested in calculating the relative importance, or statistical weights, of different loop network topologies, without worrying about the detailed labelling.For two-cluster networks, we find that there are 20 such inequivalent unlabelled topologies (Figs.2B and  Table I).In general, N u (n, 2) can be found analytically and it is given by [17] where ⌊x⌋ denotes the largest integral which is smaller than x (floor function of x).Clearly, N u (n, 1) = 1, whereas it is very challenging to find N u (n, k) explicitly for k > 2. Asymptotically, N u (n, 2) ∼ n 2 , therefore the growth rate is much smaller than that of the number of labelled networks, N (n, 2), which grows as ∼ 2 n [Eq.( 1)].
Each inequivalent topology i is associated with a combinatorial weight, or multiplicity, Ω i , which counts the number of different labelled networks corresponding to it (see Fig. 2 in [17] for an example of two different labelled configurations corresponding to the same unlabelled topology).These multiplicities are listed in Table I; note that they satisfy the constraint Nu(n,2) i=1 Ω i = N (n, 2).One way to characterise the different topologies is by counting the number of legs that loop directly back to their vertex (or cluster) of origin, and the number of legs that begin at one vertex and end at a distinct vertex (or cluster).We call these two quantities n l and n t , for the number of loops and the number of ties respectively; we note that n t + n l is an invariant (which equals n − 1, or 7 in our chosen example).
To quantify the statistical weights, or relative importance, of the different topologies in Fig. 2B, we simulate, by using coarse-grained molecular dynamics run within the LAMMPS package [19], the behaviour of a chromatin fibre with contour length L, monomer size σ and persistence length 3σ [6,14,20], with 8 equally separated TUs whose mutual distance is l = 30σ, interacting with 10 multivalent spherical complexes of TFs and polymerases (see Fig. 2a, and see Appendix for more details).Focussing on configurations with two clusters as in the theoretical analysis above, we then compute the topological spectrum of our model chromatin fibre, by computing the frequency of each of these, which is an estimate of its statistical weight.Note that for simplicity in these simulations we do not include weak interactions between proteins and non-TU beads, although we do not expect this simplification to change the qualitative trends we observe.I. (ii) Normalised frequency of occurrence of topologies in Bi, as a function of nt.The frequency of the i-th topology, is normalised by its combinatorial multiplicity Ωi (see Table I).
Our simulations show that, remarkably, only a handful of the topologies contribute significantly to the population of possible 3D structures of our model gene locus: accordingly, the top 3 topologies account for over 40% of the total structures, and the top 6 for just under 80% (Fig. 2Ci).Overall, we find that non-local topologies with a higher number of intercluster ties, n t , are much less likely than rosettes, characterised by clouds of local loops and with n t = 1.Fig. 2Cii shows the frequency of the symmetric diagrams in Fig. 2Bi as a function of n t , normalised with where R F (which depends on l) is the Flory radius of a self-avoiding walk, γ ≃ 7/6 and ν ≃ 3/5 its entropic and metric exponents [21].In our case, therefore, g ≃ 5/18 and δ ≃ 5/2.By repeating the calculation for the ratio between the weights of the topologies in Fig. 2Bi, but now including self-avoidance by using the propagator in Eq. 8, after some algebra we obtain where Γ denotes the gamma function, and the label SA stands for self-avoidance.Importantly, Eq. ( 9) predicts a different functional dependence on n t with respect to that of the Gaussian network, as √ nt for large n t .By setting n t = 7, we obtain that Z SA R /Z SA W ≃ 42.4, as opposed to 7 3/2 ≃ 18.5 for the Gaussian network.In other words, excluded volume interactions within the chromatin fibre sharpen the difference between the topological weights, rendering predictions closer to values observed in simulations.The remaining discrepancy may be due to the fact that excluded volume is only partially accounted for in the calculation leading to Eq. ( 9), as this still disregards mutual avoidance between the n t chains connecting the two clusters.Additionally, the δ functions in Eq. ( 4) should be regularised, which may impact loops and tie segments differently: for simplicity we do not include here such a more sophisticated treatment.
In conclusion, we have studied the topological spectrum of chromatin loop networks emerging upon the 3D folding of a gene.We have shown that the theory of partitions, via the Bells and Stirling numbers, provides a useful general way to enumerate the possible topologies of labelled chromatin fibres, whereas counting unlabelled inequivalent topologies is much more challenging.Our main finding is that, out of all such possible topologies, only a small fraction is in practice observed in polymer simulations of the 3D folding of a gene locus.Specifically, we predict that gene loci should be overwhelmingly organised in rosette-like structures with a predominant number of local loops between neighbouring transcription units [6].Non-local structures with multiple ties between clusters of transcription units are orders of magnitude less likely statistically.This is at first sight surprising because in many cases the corresponding networks have the same entropic exponent, as previously calculated [11,12].This apparent paradox can be resolved by noting that in the chromatin organisation problem, the relevant scenario is one where the distance between transcription units is finite, and does not diverge as in previous statistical physics theories of polymeric networks.
Our results provide a generic framework to classify and rationalise chromatin loop topologies in mammalian chromosomes.This is an important task in view of the link between 3D chromatin topology and transcription [14,22].More specifically, our results can be used to classify chromatin organisation genome-wide in simulations, for instance by analysing numerical results of more detailed models for gene loci folding [7].Experimentally, it would be of interest to combine our methodologies with methods such as SPRITE [23], to quantitatively classify characterise the networks of transcription factories and chromatin loops formed in different types of cells.In the future, it would also be interesting to see how the topological spectrum of a gene locus depends on the local positioning of TUs, and on the relative balance between non-specific and specific interactions between proteins and chromatin.Finally, the graphs associated with the chromatin loop networks we discuss can be mapped into words [24], and it would be fascinating to explore this mapping more fully, which would allow us to view our 3D loop networks as a topological alphabet of chromatin folding.
This work was supported by the Wellcome Trust (223097/Z/21/Z).For the purpose of open access, the authors have applied a Creative Commons Attribu-tion (CC BY) licence to any Author Accepted Manuscript version arising from this submission.

APPENDIX A
Here, we give details of the potentials used in the simulations described in the main text and show in Table I the multiplicity and frequency of formation of each of the 20 inequivalent topologies formed by unlabelled networks.
We model a chromatin fibre as a bead-and-spring polymer, interacting with multivalent proteins, also modelled as (spherical) beads.The dynamics of each bead (labelled by i) is governed by the following Langevin equation, where m is the bead mass, x i is the i-th bead position, γ is the drag, and η(t) is uncorrelated white noise defined by ⟨η(t)⟩ = 0 and ⟨η α (t)η β (t ′ )⟩ = δ αβ δ(t − t ′ ).
To simulate the behaviour of the system, the following potentials, U , enter into this equation according to the particular beads under consideration.
First, a simple phenomenological Lennard-Jones potential, truncated to include only the repulsive regime (Weeks-Chandler-Andersen potential) acts between any two chromatin beads, and between any two protein beads, enforcing excluded volume, or self-avoidance.This is given by where σ is the bead diameter.Note that we used the same diameter for chromatin and protein beads for simplicity.Second, for multivalent proteins to cluster and create chromatin loop networks, we included another LJ potential, acting between transcription units and proteins.In this case, we truncated the interaction range to 1.8σ to allow the attraction that drives clustering.The protein-TU interaction affinity was set to 11.5 k B T , as this is the upper 95% bound on the energy of formation of the most expensive network topology in our study (see Table I of the companion paper [17]).This was done to ensure all topologies had a chance to form in the simulations.
Third, to model chain connectivity, finitely extensible non-linear elastic (FENE) bonds are considered, acting only between consecutive beads along the chromatin fibre: where K = 30 k B T σ 2 is the spring constant and R 0 = 1.6σ is the maximum extent of the bond.Finally, and again only for the chromatin beads, we included a bending or Kratky-Porod potential, which acts on the angle θ between three consecutive beads along the chain and enforces a non-zero persistence length, l p : where K b = k B T l p /σ = 3k B T .Note that the persistence length is artificially raised during equilibration to assist the system in reaching a self-avoiding configuration.It is then lowered from 10σ to 3σ during the main simulation.This is an appropriate value for flexible chromatin [5,6,14].
Chromatin fibres were initialised as random walks and protein position was initialised randomly within the box (size 30 σ, within periodic boundary conditions).A soft potential and springs between connected chromatin beads were used to remove overlaps between beads.After that, the potential was set to the set of potentials described above.Simulations were run by using the LAMMPS software [19].
To map between simulation and physical units of length, we note that σ = 1 can be viewed as modelling a 30 nm chromatin bead, or to 3 kilo-base pairs (kbp) of DNA, as in standard coarse-grained models for chromatin such as the one we use [5,25].For timescales, we can map the Brownian time, τ B = σ 2 /D, with D the diffusion coefficient, from simulation units to physical units, as in [5].For chromatin τ B ∼ 0.1 − 1 is typically appropriate [5], and our simulations consist of ∼ 10 5 τ B or more; note though that the timescales are not too important for the current work as we focus on steady-state probabilities, averaged over multiple runs (in our case 11000).

FIG. 1 .
FIG. 1. A. (i) Schematics of the problem we consider.A chromatin fibre is modelled by a polymer, on which transcription units (pink spheres) are bound by multivalent proteins (red spheres), representing complexes of transcription factors and polymerases.(ii) Emerging states, with clusters arising through the bridging-induced attraction.B. (i) Chromatin loop network associated with the state in (Aii).This is a chain of two rosettes, each containing a cluster of 4 binding sites in the fibre.(ii) Alternative chromatin loop network involving the same amount of binding sites in clusters.This is known as the watermelon topology.(iii)The local polymer structure at the loop base is the same in (i) and (ii): this is what determines the entropy of the topology in the limit in which the distance between binding sites goes to infinity.In the case of chromatin, though, this limit is not relevant and the weights of the diagrams (i) and (ii) are in practice very different.

D
Diagram n t n l L 1 L 2 B oe D oe I -

25 DTable 4 . 4 : 25 DTable 4 . 4 : 25 D
Diagram nt nl B oeD oeI Ác/ kBT CI/ kBT f Results summary table.All topologies with ns = 8 binding sites and nv = 2 vertices are listed, with their Duplantier equivalence class (D), number of ties (nt), number of loops (nl), and multiplicities (oeD, oeI).Results are tabulated for the transition a nity (Ác) and the relative frequency as a percentage (f ), each with their respective bootstrapped 95 % confidence intervals.Diagram nt nl B oeD oeI Ác/ kBT CI/ kBT f Results summary table.All topologies with ns = 8 binding sites and nv = 2 vertices are listed, with their Duplantier equivalence class (D), number of ties (nt), number of loops (nl), and multiplicities (oeD, oeI).Results are tabulated for the transition a nity (Ác) and the relative frequency as a percentage (f ), each with their respective bootstrapped 95 % confidence intervals.Diagram nt nl B oeD oeI Ác/ kBT CI/ kBT f

25 DTable 4 . 4 : 25 DTable 4 . 4 :Table 4 . 4 :
= 2 vertices are listed, with their Duplantier equivalence class (D), number of ties (nt), number of loops (nl), and multiplicities (oeD, oeI).Results are tabulated for the transition a nity (Ác) and the relative frequency as a percentage (f ), each with their respective bootstrapped 95 % confidence intervals.Diagram nt nl B oeD oeI Ác/ kBT CI/ kBT f Results summary table.All topologies with ns = 8 binding sites and nv = 2 vertices are listed, with their Duplantier equivalence class (D), number of ties (nt), number of loops (nl), and multiplicities (oeD, oeI).Results are tabulated for the transition a nity (Ác) and the relative frequency as a percentage (f ), each with their respective bootstrapped 95 % confidence intervals.Diagram nt nl B oeD oeI Ác/ kBT CI/ kBT f Results summary table.All topologies with ns = 8 binding sites and nv = 2 vertices are listed, with their Duplantier equivalence class (D), number of ties (nt), number of loops (nl), and multiplicities (oeD, oeI).Results are tabulated for the transition a nity (Ác) and the relative frequency as a percentage (f ), each with their respective bootstrapped 95 % confidence intervals.Results summary table.All topologies with n s = 8 binding sites and n v = 2 vertices are listed, with their Duplantier equivalence class (D), number of ties (n t ), number of loops (n l ), and multiplicities (oe D , oe I ).Results are tabulated for the transition a nity (Á c ) and the relative frequency as a percentage (f ), each with their respective bootstrapped D Diagram n t n l B oe D oe I Á c / k B T CI/ k B T f

D
Diagram n t n l B oe D oe I Á c / k B T CI/ k B T f

Table 4 . 4 : 25 D
Results summary table.All topologies with n s = 8 binding sites and n v = 2 vertices are listed, with their Duplantier equivalence class (D), number of ties (n t ), number of loops (n l ), and multiplicities (oe D , oe I ).Results are tabulated for the transition a nity (Á c ) and the relative frequency as a percentage (f ), each with their respective bootstrapped 95 % confidence intervals.Diagram n t n l B oe D oe I Á c / k B T CI/ k B T f

1 FIG. 2 .
FIG. 2. A. Setup of our explicit numerical calculation.A gene locus is modelled by a chromatin fibre with n = 8 equispaced TUs, with mutual distance l = 30σ = 90 kilo-base pairs (see Appendix for details of the mapping to biophysical units).B. (i-iii) Sketch of all possible emerging topologies.There are 20 inequivalent topologies, which are here grouped into three classes (i-iii).Network topologies in each of the three classes have the same node degree distributions (Fig. 1c), and hence the same value of γG [12].C. Results of simulations.(i) Pie chart showing the relative frequencies with which the 20 topologies in B are observed.The top 6 are shown; note the numbers labelling each topology correspond to their combinatorial multiplicity given in TableI.(ii) Normalised frequency of occurrence of topologies in Bi, as a function of nt.The frequency of the i-th topology, is normalised by its combinatorial multiplicity Ωi (see TableI).

D
Diagram n t n l B oe D oe I Á c / k B

Table 3 .
3: Topology summary table.All topologies with n s = 8 binding sites and n v = 2 vertices are listed, with their Duplantier equivalence class (D), number of ties (n t ), number of loops (n l ), nontrivial vertex orders (L i ), symmetry breaking (B), and multiplicities (oe D , oe I ).11 D Diagram n t n l L 1 L 2 B oe D oe I

Table 3 .
3: Topology summary table.All topologies with n s = 8 binding sites and n v = 2 vertices are listed, with their Duplantier equivalence class (D), number of ties (n t ), number of loops (n l ), nontrivial vertex orders (L i ), symmetry breaking (B), and multiplicities (oe D , oe I ).

Table 3 .3: Topology
Diagram n t n l L 1 L 2 B oe D oe I summary table.All topologies with ns = 8 binding sites and nv = 2 vertices are listed, with their Duplantier equivalence class (D), number of ties (nt), number of loops (nl), nontrivial vertex orders (Li), symmetry breaking (B), and multiplicities (oeD, oeI).11D

Table 3 . 3
: Topology summary table.All topologies with n s = 8 binding sites and n v = 2 vertices are listed, with their Duplantier equivalence class (D), number of ties (n t ), number of loops (n l ), nontrivial vertex orders (L i ), symmetry breaking (B), and multiplicities (oe D , oe I ).

Table 4 .
4: Results summary table.All topologies with ns = 8 binding sites and nv = 2 vertices are listed, with their Duplantier equivalence class (D), number of ties (nt), number of loops (nl), and multiplicities (oeD, oeI).Results are tabulated for the transition a nity (Ác) and the relative frequency as a percentage (f ), each with their respective bootstrapped 95 % confidence intervals.

Table 4 .
4: Results summary table.All topologies with ns = 8 binding sites and nv = 2 vertices are listed, with their Duplantier equivalence class (D), number of ties (nt), number of loops (nl), and multiplicities (oeD, oeI).Results are tabulated for the transition a nity (Ác) and the relative frequency as a percentage (f ), each with their respective bootstrapped 95 % confidence intervals.

Table 4 . 4 :
Results summary table.All topologies with ns = 8 binding sites and nv

Table 4 . 4
: Results summary table.All topologies with n s = 8 binding sites and n v = 2 vertices are listed, with their Duplantier equivalence class (D), number of ties (n t ), number