Modelling Protein Target-Search in Human Chromosomes

Several processes in the cell, such as gene regulation, start when key proteins recognise and bind to short DNA sequences. However, as these sequences can be hundreds of million times shorter than the genome, they are hard to find by simple diffusion: diffusion-limited association rates may underestimate $in~vitro$ measurements up to several orders of magnitude. Moreover, the rates increase if the DNA is coiled rather than straight. Here we model how this works $in~vivo$ in mammalian cells. We use chromatin-chromatin contact data from state-of-the-art Hi-C experiments to map the protein target-search onto a network problem. The nodes represent a DNA segment and the weight of the links is proportional to measured contact probabilities. We then put forward a master equation for the density of searching protein that allows us to calculate the association rates across the genome analytically. For segments where the rates are high, we find that they are enriched with active genes and have high RNA expression levels. This paper suggests that the DNA's 3D conformation is important for protein search times $in~vivo$ and offers a method to interpret protein-binding profiles in eukaryotes that cannot be explained by the DNA sequence itself.

Several processes in the cell, such as gene regulation, start when key proteins recognise and bind to short DNA sequences. However, as these sequences can be hundreds of million times shorter than the genome, they are hard to find by simple diffusion: diffusion-limited association rates may underestimate in vitro measurements up to several orders of magnitude. Moreover, the rates increase if the DNA is coiled rather than straight. Here we model how this works in vivo in mammalian cells. We use chromatin-chromatin contact data from state-of-the-art Hi-C experiments to map the protein target-search onto a network problem. The nodes represent a DNA segment and the weight of the links is proportional to measured contact probabilities. We then put forward a master equation for the density of searching protein that allows us to calculate the association rates across the genome analytically. For segments where the rates are high, we find that they are enriched with active genes and have high RNA expression levels. This paper suggests that the DNA's 3D conformation is important for protein search times in vivo and offers a method to interpret protein-binding profiles in eukaryotes that cannot be explained by the DNA sequence itself.
Several processes in the cell nucleus start when proteins bind to specific DNA sequences.
For example, transcription factors that regulate genes and the CRISPR/CAS9 complex that edits DNA [1,2]. Because target sequences are much shorter than the genome -a few base pairs compared to billions in humans -these proteins face a needle-in-a-haystack problem.
Despite the large number of potential targets, measured search times are shorter than theoretical estimates. The Lac repressor in E. coli needs 1-5 min to find its designated site [3] which is twice as fast as a threedimensional (3D) search by diffusion inside the bacterium's volume (≈2-11 min) [33]. Also, diffusion-limited association rates -Smoluchowski's rate -may underestimate in vitro measurements by one to two orders of magnitude [4]. These examples suggest that some proteins search by other mechanisms than simple diffusion.
One mechanism that speeds up diffusive search is offered by the Facilitated-diffusion model [5]. In this model, the proteins alternate between 3D diffusion and 1D diffusion along the DNA. This decreases the search time because the proteins may take shortcuts to a linearly distant DNA segment through the surrounding bulk. Although criticized [6,7], the model is widely accepted after experiments in bacteria [3,8] and in vitro [9].
Another important aspect in target finding is rebinding. This is because proteins likely bind to a DNA segment that is close by in 3D rather than far away. Several modelling studies examined this aspect and found that search times change with DNA conformation [7,[9][10][11][12][13][14]. However, because these studies treat the DNA as a simple polymer the results cannot be generalized beyond bacteria to eukaryotes that have longer DNA and more complex 3D organization.
The most widely used experimental method to study 3D genome organization is Hi-C [15,16]. The Hi-C method cross-links close by DNA fragments inside the nucleus and gives a genome-wide map of the number of contacts between fragments pairs (Fig. 1a) [17]. Mamalian Hi-C maps have several interesting features -some that are evolutionary conserved [18]. For example, the blocklike structure along the diagonal represents densely connected 3D domains. The locations of these domains correlate with protein binding sites, active genes, and chromatin states [19][20][21]. And, beyond the domains, the average contact probability decays as a power-law [34].
Hi-C is the state-of-the-art Chromosome Conformation Capture method that estimates the chromatin contact probabilities across the genome. But it does not provide chromatin's 3D structure. Going from the contact map to a computer-generated 3D structure is difficult [22,23].
Because the chromatin's spatial organization is so complex, there are few attempts to model protein search in eukaryotes. One exception [24], represents chromatin as a crumpled polymer globule that reproduces the average looping probabilities in the human genome. However, the crumpled globule lacks the Hi-C maps' domain structure.
Here we model target search in eukaryotes without relying on chromatin's explicit 3D structure. Instead, we represent the DNA as a network in which the nodes are DNA segments and the link weights are proportional to the contact probabilities measured in Hi-C. Then we put forward a master equation for the density of proteins on the network over time that allows us to calculate the association rate -the inverse mean-first passage time -to all nodes analytically. Correlating these rates with RNA expression data in humans, we find that easy-to-find genomic regions are enriched with active genes and have high RNA expression.

THE MODEL
We model the proteins' search on chromatin as noninteracting particles that move between nodes in a weighted network that represent physically connected chromatin segments (Fig. 1).
The model has three important parameters. First, the jump rate ω ij between nodes i and j (i, j = 1, . . . , N ). We assume that ω ij is proportional to the number of contacts between segments i and j, v ij , measured in Hi-C: ω ij = f coll v ij ; f coll -a free parameter in the modelis the collision frequency that leads to a successful jump. Second, the unbinding rate k off to the surrounding bulk. And third, the rebinding ratek on to a randomly chosen node. We assume that the protein bulk concentration n bulk is constant and therefore use k on =k on n bulk ; k on and k off therefore have the same unit: time −1 .
In terms of these parameters, we formulate a master equation for the protein number in node i at time t, n i (t): The first term represents diffusion on the network -we put ω jj = − i =j ω ij -and the two remaining terms describe the exchange with the surrounding bulk. As in [14], we assume that the protein density is initially uniform ρ 0 = k on /k off except for the target i = a which always is zero n a (t) = 0, that is n i (0) = ρ 0 (1−δ ia ).
In terms of the eigenvalues λ j and eigenvectors V ij of ω ij , the solution to Eq. (1) is where k i on = k on N j=1 V −1 ij . This equation is key to calculate the association rate to the target node i = a.

PROTEIN ASSOCIATION RATES
To calculate the association rate K a , we use that it is one over the mean first arrival time: K a = τ −1 a . To obtain τ a , we proceed as in [14]. First we calculate the total number of particles that arrived to the target up to is the particle flux. Second, if N p is the initial protein number on the network, and if k on = k off = 0, then J a (t)/N p is the probability that a single protein have reached the target up to time t. The probability that the target has not been reached -the target's survival probability -is therefore S a (t) = (1 − J a (t)/N p ) Np and τ a = Ka vary by several orders of magnitude along the chromosomes but the 95% confidence interval is a few percent of the mean Ka = 1 (Ka = 1.0 ± 0.0027). We calculated Ka from Eq. (6) (k off 1) using the parameters k off = 0.002, kon = 0.001 and ρ0 = 0.5.
the number of proteins N p → ∞ and therefore S a (t) exp (−J a (t)) [25]. Note that if k on = 0 and k off > 0 there is a chance that all proteins end up in the bulk before ever reaching the target causing τ a to diverge. Finally, using n i (t) from Eq.
(2), we may calculate K a exactly as:

Limiting cases for Ka
Depending on the unbinding rate k off , K a has three regimes.
(i) Small k off . In this regime most particles find the target before they unbind. This leaves the initial density ρ 0 approximately unchanged, n i (t) ρ 0 . Using this approximation in Eq. (5) leads to J a (t) J a t, wherē (ii) Large k off . Here, the particles unbind and rebind many times before finding the target. The protein density is therefore approximately in steady-statē [26]. Using n i (t) ρ a and proceeding as in (i) leads to K a k on +ρ a W a .
These regimes simplify to k off 1 and k off 1 if we choose f coll so that the genome-wide averaged K a , K a , is unity [26]; We define the average as X N is the number of nodes for all chromosomes After rescaling, we have in which V a = N i =a v ai is the node strength and Equation (6) covers a broad range of k off [26], it is easy to implement, and computationally cheaper than Eqs. (4) and (5).
(iii) Intermediate k off . When k off ∼ 1, we cannot use Eq. (6). Instead we must use the exact expressions (4) and (5). In [26] we also treat the case k on = k off = 0.

Protein association rates depend on chromatin's 3D organization
Equation (6) suggests that the association rates change with chromatin's 3D structure because K a depends on the node strength V a . To quantify by how much, we used Hi-C data (40 kb resolution) and calculated K a for chromosomes 1-21 (Fig. 2). We found that K a varies by several orders of magnitude relative to the genome-wide average K a = 1. Most K a values, however, only deviate by a few percent from the mean: K a = 1 ± 0.0027 (95% confidence interval). Figure 2 shows the case when k off 1; k off 1 has qualitatively the same behavior [26]. Equation (6) also suggests that chromatin's 3D structure becomes less important as the unbinding rate k on grows -for example by increasing the bulk concentration of particles as k on ∝ n bulk . We see this in the data for small V a where K a k on [26]. We interpret this as if the particles reach the target mostly from the bulk. In the other limit, where k on is small, we see that K a ∝ V a . This means that most particles find the target via jumps on the network and that the 3D structure is important.
Chromatin regions with high association rates are enriched with active genes Figure 2 shows that the association rate varies across the genome. This is important for regulatory proteins, such as transcription factors, that look for promoters to control transcription. We therefore ask: are promoter regions easier to find than non-promoter regions?
FIG. 3: Nodes with many gene starts have higher predicted association rates -and thus easier to find -than nodes with few gene starts (in humans). We define the gene starts as the Transcription Start Sites (TSSs). The curves represent predicted association rates to nodes with active TSSs (gray), inactive TSSs (green), and any TSS type (pink). The active TSSs have higher association rates than the genome wide average ( Ka = 1, dashed), whereas nodes with inactive TSSs (green) are below (apart from one data point). The symbols represent the average association rate [Eq. (6)] and the colored areas show the 95% confidence interval. Parameters (dimensionless, see [26]): k off = 0.002, kon = 0.001 and ρ0 = 0.5. We omitted data points with less than 7 TSSs per node because the sample size is too small (≤ 50 nodes).
To answer this, we downloaded gene annotation data for human cells [27] to extract the gene starts -defined as the Transcription Start Sites (TSSs) -and correlated them with the predicted association rates from Fig. 2. We found that the rates grow with the number of gene starts per node (Fig. 3, pink). The data points represent the average association rate to all nodes with the same number of gene starts and the shaded area shows the 95% confidence interval. In other words: gene-dense regions are easy to find.
Then we asked: because these regions harbour both active and inactive genes, are active gene-dense regions easier to find than inactive ones? To see this, we grouped the gene starts into 'active' and 'inactive' based on the RNA expression level surrounding each TSS -1kb upstream and downstream -and calculated the association rates to the nodes with these TSSs. We found that nodes with many active TSSs have even higher association rates than if we do not separate active and inactive TSSs: gray is above pink in Fig. 3.
For nodes with inactive gene starts we find the inverse relationship: green is below pink in Fig. 3. This is underscored when comparing the green area to the genomewide average K a = 1 represented by dashed line: most inactive TSSs are in regions with association rates that are smaller than unity. This suggests that inactive gene starts are hard to find. Figure 3 also shows that the association rate grows slowly beyond one or two TSSs per node: adding a few extra TSSs does not make the node easier to find. However, there is still a positive correlation between the number of gene starts per node and high association rates.
Chromatin regions with high RNA expression levels have high association rates Figure 3 suggests that transcription factors quickly find highly transcribed genes because the association rate is larger for active than inactive TSSs. But what about any transcribed region? Are the association rates high also for them?
To study this, we summed the RNA expression in all nodes across the genome and ranked them based on their RNA expression level. Then we partitioned the nodes into 20 equally-sized groups and calculated the association rate in each group. Shown as a violin plot (Fig. 4a), we find that our predicted rates vary widely but that the median (white circles) increases with high RNA expression levels (Spearman's correlation coefficient = 0.5449 [28]). This suggests that nodes with high RNA expression levels -with or without active gene starts -are relatively easy to find.
To see by how much this correlation is caused by active gene starts, we made two new groups: nodes with at least one active TSS and the rest -nodes with inactive or no TSSs. Then, as before, we ranked the nodes in these large groups based on the RNA expression levels, divided them into 20 equally-sized subgroups, and calculated the average association rate for each subgroup. Plotting the predicted average association rate for the two large groups versus the average RNA expression level as well as the average for all nodes (Fig. 4a), we cannot see any significant difference: all curves nearly lie on top of each other (Fig. 4b). This is a more general result than before. It is not only the highly transcribed gene starts that are relatively easy to find, it is any region with high RNA expression.
In addition, as a simple measure of DNA accessibility, we checked how the association rate change with the fraction of base pairs that are transcribed per node [26]. Just like for the RNA expression level, we find a positive correlation with high association rates (Spearman's correlation coefficient = 0.5632).

DISCUSSION AND SUMMARY
Protein-binding experiments show that association rates change if the binding sites are embedded in a short or a long piece of DNA [4]. This is partly explained by Predicted association rate as function of RNA expression level for nodes with at least one active TSS (grey) and no active TSSs (purple). The RNA expression is divided into 20 bins with equally many points in each bin (same as in (a)). The nodes with active TSSs tend to be above the genome wide average (18 points is above Ka = 1), while most nodes with no active TSSs are below (6 points above Ka = 1). The shaded areas show the 95% confidence interval. Parameters used in both plots: k off = 0.002, kon = 0.001 and ρ0 = 0.5.
the facilitated-diffusion model in which proteins switch between 1D search along the DNA, and 3D search -by diffusion -in the surrounding volume. The association rates also change if the DNA is straight or coiled [9]. This can be understood if the facilitated diffusion model includes inter-segmental transfer between loop anchors [13]. However, current studies use standard polymer models that do not capture the chromatin's complex 3D orga-nization in eukaryotes. To remedy this, we used Hi-C data as proxy for the 3D proximity between chromatin segments in vivo. This allowed us to to map the protein search problem onto a network problem with nodes and links representing chromatin segments and how they are physically connected to each other. Then we formulated a master equation for the number of searching proteins per node, from which we calculated analytically the genome-wide association rates in terms of the eigenvalues and eigenvectors of the Hi-C matrix. Using human Hi-C data, we compared the predicted association rates with RNA expression data and positions of gene starts. We found that regions which are easy to find -measured by high association rates -are enriched with active genes and have a generally high level of RNA expression.
We assume that the protein finds the target, for example a promoter site, as soon as it arrives at the target node -here, 40 kb. This means that we model protein binding as diffusion-limited. However, some transcription factors, such as TetR in mammals [29], are suggested to be reaction-limited. To include imperfect protein-DNA binding our model we may follow [30]. Denoting the protein-DNA binding rate as k DNA , and reinterpreting the on rate k on as an effective on rate k eff. on , we may write 1/k eff. on = 1/k on + 1/k DNA where K a = k eff. on + γ a V a . We calculated the association rates in each chromosome without considering inter-chromosome connections. This is an assumption as chromosomes do come in physical contact. From Hi-C data, however, it seems like these contacts are less frequent than within chromosomes. This is a limitation of the data rather than our model that can handle any genome-wide Hi-C map.
Overall, this study provides a framework to predict protein-binding positions dictated by chromatin contact maps in the cell nucleus. As such, it opens new ways to interpret binding profiles of transcription factors that cannot be explained by the DNA sequence [1,31]. Mechanistic understanding of these profiles is important to reach a molecular understanding of gene regulation.

Particle flux through the target
The number of proteins that reached the target up to time t is J a (t). For non-zero k on and k off it reads (9) where λ j and V ij are the eigenvalues and of eigenvectors ω ij . Because λ 1 = 0 is the largest eigenvalue, we can approximate Eq. (9) at times t k −1 off with terms proportional to t where the steady-state distribution is The relation J a (t) (k on + T −1 a )t coincides with the continuum approach in [14] for proteins that combines bulk excursions with 1D sliding (jumping to nearest neighbours in our model) and Lévy relocation's with jump lengths x distributed like |x| −1−α (0 < α < 2). Since ω ij |i − j| −1−α with 0 < α < 1 -on averagewe see that our model is a network analogue of t [14].

Particle flux through the target without bulk exchange
Here we investigate the case when proteins do not unbind from the DNA. As k on , k off → 0, Eq. (9) becomes with N p = ρ 0 (N − 1). For large times we know that J a (t → ∞) = N p since by then all proteins have arrived to the target. This leads to the simplification in the 2nd row. For large times t |λ N | −1 -λ N is the largest eigenvalue (in magnitude) -where J a (t) N p , we find the same behaviour as before, J a (t) ∝ t. This is seen by expanding Eq. (12) around t = 0. When the unbinding rate k off is small compared to the association rate K a , the number of proteins per node is close to its initial value ρ 0 by the time of the first arrival to the target, and we have the approximation where W a = i =a ω ai . We may find this approximation by expanding Eq. (9) around t = 0 and using the inverse Furthermore, by demanding that the genome wide average of K a in Eq. (13) is unity, K a = 1, and using that ω ai = f coll v ai we find f coll to be Using this in Eq. (13) leads to With this definition of f coll , the binding rate k on is bound by [0, 1], where k on = 1 corresponds to target-finding directly from the bulk.
Target finding in steady-state (k off 1) When the unbinding rate k off is large compared to the association rate K a , few proteins will find the target before leaving on a bulk excursion. In this limit, the system reaches its steady-state -withρ a the number of proteins per node -before the first arrival to the target. This leads to the approximation To arrive at this equation we identify in Eq. (10) that J(t) K a t. Then we replace then j by the approximate densityρ a that we findρ a by the following argument. In steady state, proteins bind to the DNA with rate k on . Except for the absorbing target, there are N − 1 nodes available to bind. Similarly, there areρ a (N − 1) number of proteins that unbind from the DNA with rate k off . Last, proteins are absorbed at the target with rate T −1 a = ρ a W a . These three terms sum to zero, and thereforē ρ a = k on k off + W a /(N − 1) .
Using that W a = f coll V a with f coll from Eq. (14) gives

Validation of approximations
To better understand the validity of Eqs. (15) and (18), we compare them to the exact association rate (19) Figure 5a shows how the association rate changes for a specific target node -we choose a = N/2 in human chromosome 21 -as we change k off while keeping the on-rate fixed, k on = 0.001, and adjusting the density ρ 0 = k on /k off . The solid grey line shows K exact a and the horizontal lines represent the approximations for small and large k off -Eqs. (15) and (18).
The blue area in Fig. 5a shows the large-K a regime (K a > 400k off ). Here, Eq. (15) deviates only a few percent from K exact a : the deviation is 2.7%(≈ 1−K a /K exact a ) at the encircled green dot. To get this number, we used k off = 0.002, k on = 0.001 and ρ 0 = 0.5 -the same values that we used to create all plots in the main text.
Th pink area represents the opposite limit: small K a (K a < k off /10). In this region the approximation Eq. (18) is a good match to K exact a . At the red dot (k off = 2) the relative error is 5.7%.
In the intermediate region (white area), we cannot use the simple expressions because the flux J(t) has a complicated time-dependence. To get the association rate in this regime, we have to evaluate Eq. (19) directly.
In Figs. 5b and 5c, we calculate the association rate for all nodes in chromosome 21 using Eqs. (19), (15) and (18) with fixed parameters (shown in the figures). The figure shows the limiting k off cases. In 5b the unbinding rate is small (k off = 0.002), and we see that the approximation (15) match well with K exact a whereas Eq. (18) does not. Equation (18) that matches better in 5c where the unbinding rate is large (k off = 2).
Genome-wide association rates when k off 1 Similar to Fig. 2 in the main text, we show the association rate for all targets in every chromosome at large k off . We calculate K a from Eq. (18), see Fig. 6. These curves for k off 1 and k off 1 (main text) are almost identical, except by an offset on the y-axis.
Genome-wide association rate as a function of node strength In Fig. 7a we show K a -calculated from Eq. 1 the target finding is fast, while in the red area where k off 1, the system is close to its steady state and target finding is slower. Target at N/2, in the middle of the system. b Association rates to all targets when k off 1. c Association rates to all targets when k off 1. Note the green and red circled dots in b and c, they correspond to the same parameter values as in a, respectively.
FIG. 6: Genome-wide association rates Ka at 40 kb resolution evaluated using Equation (18). The association rates vary by several orders of magnitude along the chromosomes but the 95% confidence interval of Ka varies only by a few percent (0.7531 ± 0.0018). We used these parameters: k off = 2, kon = 0.001 and ρ0 = 0.0005. plot the analytical prediction Eq. (15). We find that the search times are dominated by k on for weakly connected nodes. For strongly connected ndoes, we find the universal behavior K a ∝ V a .
In Fig. 7b we show K a for all nodes in the other limit k off 1. Here K a depends on the number of nodes N -via ρ a in Eq. (18) -and therefore we do not expect a universal large-V a behaviour

Fraction of transcribed DNA
We use RNA expression data (downloaded from EN-CODE) to calculate the fraction of transcribed DNA. This is calculated in the following way. For every 40 kb region i across the genome, we count the number of base pairs n i that has an RNA expression level above zero. The fraction of transcribed DNA in region i is thus n i /40000. In Fig. 8 we plot the association rate as a function of the fraction of transcribed DNA, where the all nodes are divided into 20 equally sized groups. The correlation between K a and fraction of transcribed DNA is slightly stronger than for RNA expression level (Spearman correlation coefficient = 0.5632). We point out that the fraction of transcribed DNA and the RNA expression correlate strongly: Spearman's correlation coefficient is 0.9693.

Contact probability
The average contact probability decays with distance from the diagonal. To see this, we calculated p ij = N −1 j=1 ω j,j+1 /(N − j) , where ... denotes genome-wide average. We find that p ij ∼ |i − j| −α where there are three regimes with different α (Fig. 9). Since the chromosomes has different sizes, the regimes appear at different length scales, but the cross-overs are roughly at 500 kb and 5,000 kb. We used Hi-C matrices ω ij = f coll V ai , where f coll is given in equation (14) with k on = 0.001 and ρ 0 = 0.5. FIG. 9: The average contact probability (genome wide) decays as d −α with distance d from the diagonal. At d = 500 kb and d = 5000 kb, α changes value. The black line is is the averaged data from ωij, and green, pink and purple are fits.

Transcription start sites (TSSs)
To distinguish between active and inactive TSSs we use RNA expression data (downloaded from ENCODE) and calculated the average number of RNA reads per base pair,n RNA , 1kb upstream and 1kb downstream of each TSS. We defined an active TSS as whenn RNA ≥ 1. Given this threshold we found 32712 active and 20795 inactive TSSs.
For each chromosome at 40 kb resolution, we show howK a changes with the number of TSSs per node (Fig.  10). The plots show three TSS groups: active, inactive , and all. The genome-wide average of all these plots is Fig. 3 in the main text.