Statistics of shared components in complex component systems

Many complex systems are modular. Such systems can be represented as"component systems", i.e., sets of elementary components, such as LEGO bricks in LEGO sets. The bricks found in a LEGO set reflect a target architecture, which can be built following a set-specific list of instructions. In other component systems, instead, the underlying functional design and constraints are not obvious a priori, and their detection is often a challenge of both scientific and practical importance, requiring a clear understanding of component statistics. Importantly, some quantitative invariants appear to be common to many component systems, most notably a common broad distribution of component abundances, which often resembles the well-known Zipf's law. Such"laws"affect in a general and non-trivial way the component statistics, potentially hindering the identification of system-specific functional constraints or generative processes. Here, we specifically focus on the statistics of shared components, i.e., the distribution of the number of components shared by different system-realizations, such as the common bricks found in different LEGO sets. To account for the effects of component heterogeneity, we consider a simple null model, which builds system-realizations by random draws from a universe of possible components. Under general assumptions on abundance heterogeneity, we provide analytical estimates of component occurrence, which quantify exhaustively the statistics of shared components. Surprisingly, this simple null model can positively explain important features of empirical component-occurrence distributions obtained from data on bacterial genomes, LEGO sets, and book chapters. Specific architectural features and functional constraints can be detected from occurrence patterns as deviations from these null predictions, as we show for the illustrative case of the"core"genome in bacteria.


I. INTRODUCTION
A large number of complex systems in very different contexts -ranging from biology to linguistics, social sciences and technology -can be broken down to clearly defined basic building blocks or components. For example, books are composed of words, genomes of genes, and many technological systems are assemblies of simple modules. Once components are identified, a specific realization of a system (e.g., a specific book, a LEGO set, a genome) can be represented by its parts list, which is the subset of the possible elementary components (e.g. words, bricks, genes), with their abundances, present in the realization. We use the term "component systems" for empirical systems to which this general representation can be applied.
Occurrence patterns of components across realizations are expected to reveal relevant architectural constraints. For example, the bricks present in each LEGO set clearly reflect a target architecture that can be built with them following the instruction booklet. While for LEGO sets the assembly instructions are provided by the seller, in most component systems the architectural constraints are not obvious. Inferring such constraints from the statistics of components may answer important questions about the nature of a system. For example, it could reveal new clues about the complex combination of selective pressure and random events that shaped the functional composition of extant genomes. Even in those cases where the architecture is partially or even fully known and the instruction manual is available, the statistics of components may help us distill some general principles characterizing a given class of component systems, in some cases revealing basic features of the underlying generative processes.
In order to perform detection of system-dependent features from patterns of shared components, we need to have a clear idea of the general behavior of component systems even in absence of functional constraints on the presence/absence of specific classes of components. This is by itself a challenging task, as such systems show a large degree of non-trivial universal properties [1][2][3] that could in principle affect the occurrence statistics. Indeed, several notable quantitative laws can be identified in the composition of component systems of very different nature. This is well known, e.g., in linguistics, where the notorious "Zipf's law" [4] describing the word frequency distribution (or its equivalent rank plot) in a linguistic corpus has been the subject of extensive investigations [5][6][7][8][9]. In this context, the existence of quantitative "universal" laws may in principle provide insights on the cognitive mechanisms of text production, and can have practical applications in data mining and data search techniques [1]. Analogously, for genomes across the whole tree of life, the number of genes in different evolutionary families is power-law distributed, a discovery that represents one of the first examples of "laws" of the genome sequencing era [2,10]. Such heterogeneous usage of the different basic components, often resulting in an approximately power-law distribution of their frequencies, can be seen as a hallmark of the complexity of component systems [6].
A large body of theoretical work addresses the origins of this heterogeneity. Several models have emerged in different areas of science, with context-specific ingredients. For example, stochastic processes based on gene duplication, deletion and innovation have been proposed as simple evolutionary models of genome evolution at the basis of the observed heterogeneous component usage [3,[11][12][13]. On the other hand, specific communication optimization principles [14,15] and stochastic models for text generation [5,16,17] have been invoked to explain the emergence of Zipf's law in natural language. In many, but not all, of these models a preferential attachment principle is at the origin of the emergence of the powerlaw distribution of component frequencies. More importantly, the ubiquity of this emergent behavior raises the question of whether (and to what extent) empirical laws like Zipf's law are pervasive statistical patterns that transcend system-specific mechanisms [2,18]. In this spirit, the analysis of radically different systems can help the discovery of patterns that descend from pure statistical effects or general principles [18,19].
Here, we analyze empirical data from three very different component systems from linguistics (book chapters), genomics (protein domain families in sequenced genomes) and technology (LEGO toys) and we look for general statistical consequences of their heterogeneous frequency distributions. The different data sources considered here reasonably do not share any generative mechanisms, nor are they expected to share the same type of constraints, selection criteria or optimization principles. However, the frequency of their components is heterogeneous and they all obey laws that are similar to Zipf's.
The marginal statistics that we concentrate on is the fraction of components that are shared among a certain number of realizations. For example, the fraction of LEGO bricks with the same shape found in a given fraction of unequal LEGO boxes. In genomics, this is the so-called "gene-frequency distribution", which was shown to follow a U-shape at several taxonomic levels [20][21][22]. A U-shape of this distribution of shared components indicates that there is a set of "core" components that are common to most realizations, as well as an enriched set of realization-specific components. This histogram also decays approximately as a power law for rare components, both in genomic data and in technological systems [19]. In evolutionary genomics, the origins of this pattern are the focus of a lively debate. The pattern has been rationalized theoretically by neutral or selective population dynamics models [22][23][24][25], or as a consequence of functional dependencies among different components [19]. For component systems outside of genomics, the distribu-tion of shared components remains under-explored, and is typically neglected by the current debate, for example in linguistics [1].
Using theoretical calculations based on random sampling of components (with replacement) from their overall frequencies (estimated by their total abundance across empirical realizations), we show that a distribution of shared components with a power-law behavior is a general feature of component systems not only with Zipf-like component frequency distributions, but also for general power-laws and exponential decay of the overall component frequencies. In other words, a U-shaped distribution of shared components can naturally emerge in component systems with a heterogeneous component usage (which is often the case empirically). Importantly, we quantitatively identify the general features of the system leading to a U-shaped distribution of shared components, a given core size, and a specific decay of the realization-specific bulk of this distribution.

II. DATA
A. Data sources a. Genomes. We used the superfamily classification of protein domains from the SUPERFAMILY database [26] considering a set of R = 1061 prokaryotic genomes ("realizations") and a total number of different families N = 1531 ("components"). Protein domain families are the basic modular topologies of folded proteins [27]. Different domains of the same family can be found in each genome in the same or different proteins. As a functional annotation of protein domains in SUPERFAMILY, we considered the SCOP annotations mapped into 7 general function categories, as developed by C. Vogel [28].
b. LEGO sets. The composition in bricks of several LEGO sets (R = 2820) can be freely downloaded from "http://rebrickable.com". We excluded from the analysis LEGO sets belonging to the category of "LEGO Technic" since, by construction, they share a very small number of bricks with the classic LEGO toys. Similarly, we did not consider LEGO sets with less than 80 components or belonging to the categories "Educational and Dacta" and "Supplemental" in order to exclude sets that are actually collections of spare parts or additional bricks for other sets.
c. Texts. The analyzed linguistic corpus is composed by R = 1721 book chapters (realizations) of several English books randomly chosen from the most popular ones in the database "http://www.gutenberg.org". We defined chapters as realizations, instead of entire books, to obtain a corpus with a range of sizes (total number of components per realization) comparable to the one of genomes and LEGO toys ( Figure S1). The complete list of books considered is reported in Table S1 of the Supplemental Material (SM). The elementary components are defined as the words regardless of capitalization (e.g. "We" and "we" are considered as the same component). A set of empirical realizations of a component system can be naturally described as a matrix {n ij } defined such that the entry n ij represents the abundance of the component i (i = 1, . . . N ) in the realization j (j = 1, . . . , R). Thus, each realization (a literary text, a LEGO set or a prokaryotic genome), is represented as a matrix column ( Figure S1). Some key observables can be easily defined using this representation. First, the total abundance a i of the component i in the whole ensemble is defined by summing over all realizations a i = j n ij . The normalized abundance represents the component frequency f i = ai i ai . The "component occurrence" o i is instead defined as the fraction of realizations in which the component is found, Two other crucial quantities are: the total number N of different components in the system, which is essentially the number of bricks of different shape or the vocabulary, and the size of a realization j, defined as the total number of its components M j = i n ij .

A. Component frequency distribution and distribution of shared components show general features across systems
This section illustrates two empirical laws in the analyzed datasets (LEGO toys, bacterial genomes, and literary texts). We first consider the component frequencies in the whole universe of available realizations of a given system, which is essentially the generalized Zipf's law [6] for the three systems. Figure S2 shows the rank plots of these component frequencies. The three data sets share a power-law behavior for components with high frequencies (low rank), with an exponent close to 1 as in the classic Zipf's law [4], and a faster decay at higher ranks (components with low frequency). This double-scaling behavior has been recently observed in the context of linguistics [17]. In evolutionary genomics, the gene frequency was previously analysed over single genomes and shown to be approximately power-law distributed with an exponent dependent on genome size [3,10]. Figure S2 shows that the same distribution calculated over thousands of prokaryotic genomes has a double-scaling, with an exponential-like decay for low ranks in its rank plot. We tested that the shape of these component frequency distributions do not strongly depend on the specific size or number of realizations analyzed. The rank plots in Figure S2 do not vary when evaluated on different subsamples of the whole data sets (Fig. S2 of the SM). This suggests that the frequency distributions evaluated using the available finite empirical data sets estimate reliably the global heterogeneity of the component usage in the systems.  Figure S1). The three curves follow similar behavior, which can be described qualitatively as a powerlaw-like decay with exponent close to 1 for low ranks (high frequency), and a faster dataset-specific decay for higher ranks.
We aim to evaluate also the distribution of shared components, {o i }, and how much of its features can be explained from other measurable quantities, namely, the component frequencies, the realization sizes {M j } and the number of different components in the universe N . Figure S4 shows this distribution for the three data sets considered here. For small occurrences, the plots are compatible with a power-law decay, with a datasetspecific exponent. Only for genomes this curve is clearly U-shaped (see also Fig. S3), and shows a "core" of shared components, i.e., protein domains shared by almost all the genomes, together with a rich group of rare components. Book chapters do not show this marked behavior, due to the fact that the ubiquitous words (e.g. articles, pronouns, prepositions) are much less than the chapterspecific words. Finally, LEGO sets display no core of shared components, and this is probably due to the wide range of themes using poorly overlapping brick types. In order to identify the statistical consequences of a heterogeneous usage of components on the statistics of shared components, a suitable model is needed. In particular, we would like to generate system realizations starting from a fixed component frequency distribution without any additional functional information or constraint. To this end, we employ a random-sampling procedure [8,17,29,30] that builds artificial realizations through an iterative random extraction (with replacement) of components from their frequencies {f i } in the whole system. Each realization size M is specified by the number of random extractions.
More precisely, the following prescriptions ( Fig. S4e) define the random-sampling model that will be used in the following. (i) The component abundance rank distribution is assumed to be a universal property of the component system and well represented by the empirical overall abundances (see Fig. S2 and the SM for a discussion of this assumption). (ii) The extraction probability of a component is proportional to its overall abundance. (iii) A realization of size M is generated by M independent extractions from the pool of components. Statements (ii) and (iii) define a multinomial process. Given a normalized list of component frequencies {f i }, i = 1, . . . N (where N is the size of the available "vocabulary"), and the size M of the realization, the probability of a specific configuration {n 1 , n 2 , . . . , n N }, where n i is the number of the components with frequency f i is P (n 1 , n 2 , . . . , n N ; under the constraint that  . The log-log scale highlights the power-law like decay. The black dashed lines represent the prediction of the random-sampling model assuming the empirical component frequencies and realization sizes. The model reproduces very well the power-law decay, but may differ quantitatively from the empirical laws in the high-occurrence region. Panel (d) plots the same quantities in log-lin scale, to highlight the quantitative differences between systems and the presence/absence of a peak of core components. Note that the different range of the y-axis values with respect to previous panels is due to the different binning procedures, logarithmic vs linear. (e) Scheme of the random-sampling process: samples of size M are generated from independent draws from the "universe" of all possible components with their specific abundances. Therefore, the probability of a component extraction is proportional to its global abundance, i.e., the sum of its abundances over all realizations of the systems.
global abundance distribution is conserved in each realization. In other words, the component composition in each realization is a sampled copy of the universe, without any of the possible complex correlations which may follow from architectural and functional properties of an empirical system.
For example, in the context of bacterial genome evolution, the random-sampling model translates into a scenario in which there is continuous and completely random horizontal gene transfer (exchange of genetic material) between species [31]. Thus, genome composition would simply reflect the pan-genome abundances of pro-tein domains. While horizontal gene transfer is indeed a major force in bacterial evolution [21,32,33], several additional genome-specific functional constraints are clearly in place in evolution [32,[34][35][36][37] and these are neglected by the model. Therefore, the random sampling can be considered as a null model useful to disentangle the consequences of the observed global heterogeneity in the component usage from actual hallmarks of more complex functional constraints.
C. The distribution of shared components is mainly a consequence of component frequencies, number of available components and realization sizes The fact that the distribution of shared components is qualitatively very similar in systems that are so different triggers the question of whether it may be an emergent statistical consequence of other system properties. In particular, we asked to what extent the statistics of shared components could be a direct consequence of component frequencies. As explained above, this question can be addressed quantitatively using a randomsampling model that generates an artificial copy of the empirical system by drawing realizations (whose sizes are fixed by the empirical ones) from the component frequency distribution. Figure S4 compares the empirical occurrence distributions with simulations of a random sampling. The null-model curves (dashed lines) provide very good approximations of the empirical laws, particularly for low component occurrences. Additionally, the model matches well the power law decay with the system-specific exponent. Finally, the model predicts also the qualitative behavior of core components, and specifically that only genomes show a clear U-shaped distribution of shared components. The relative core sizes of the three systems are also well approximated, although there are some quantitative deviations from the empirical values that will be addressed in detail in Section III F. These results suggest that the shape of the distribution of shared components in the three widely different empirical systems considered here is well described by a randomsampling model that only conserves the empirical component frequencies, the vocabulary (i.e., the set of possible components) and the realization sizes. The next section provides an analytical understanding of this observation. Thus far, we have used the model only to address the specific statistics of component sharing of the empirical systems under consideration. To this end, we have simulated the random-sampling model fixing the component frequencies and realization sizes as in the empirical cases. More in general, one can ask whether a power-law decay-ing and/or U-shaped distribution of component occurrences are expected for a given distribution of component frequencies. To address this question, we have computed analytically the distribution of shared components under general prescriptions for the component frequency distributions within the random-sampling model.
For the sampling procedure explained in Section III B, the probability q i that a component of rank i is present in where f i is the component probability of extraction. Therefore, the expectation value for the occurrence of component i over a set of R realizations is In order to obtain the probability distribution associated to this rank representation, one can use the fact that the rank of a component with occurrence o is the number of components with occurrence higher than o. In fact, these naturally correspond to components with higher frequency and thus lower rank. Therefore, we can write the rank i(o) as where o 1 is the highest possible occurrence, which corresponds to the component of rank 1. The function i(o) is simply the inverse function of Eq. 2. From the approximate integral representation of i(o), the occurrence probability distribution p(o) is defined by the simple re- Eq. 3 provides a general relation between the representation of the frequency distribution as a rank plot and the representation as a probability distribution. Indeed, the arguments presented here to introduce Eqs. 2 and 3 have been previously used to establish the connection between Zipf's law as a rank plot and Zipf's law as a frequency distribution [38].

Observed versus possible vocabulary of components and Heaps' law
When a set of R realizations of size M is generated through a random-sampling procedure from a pool ofÑ possible different components with their probabilities of extraction {f i }, the expected size N of the vocabulary that is actually sampled can be expressed as [29] Thus, in general, N ≤Ñ .
If the system size, defined by the total number of extractions M R, is large enough, essentially all possible components are expected to be sampled at least once, thus leading to the simplification N Ñ that we implicitely assumed in Eq. 1. However, in general, the observed vocabulary in an ensemble of realizations is an increasing function of the system size, i.e., N (M R). This functional dependence is the analogous of Heaps' law, which is the empirical power-law growth of the number distinct components with the system size observed in linguistics [1,17], and in genomics [3]. This distinction between the observed and the possible vocabulary of components is discussed in more detail in the SM and will be relevant in the following sections.

Analytical distribution of shared components for component frequencies with a power-law or an exponential distribution
Explicit expressions for the occurrence distribution can be derived assuming a simple scenario, in which all realizations have the same size M , and the component frequency statistics follows a prescribed function. We first consider the empirically relevant case of a power-law frequency rank plot (Fig. 4, left panel) defined by Under these assumptions and using Eqs. 2 and 3, the exact expression of the occurrence distribution can be calculated: The distribution is defined in the interval of occurrences where o i is computed by Eq. 2 and N is the effective or observed component vocabulary, which can be a function of the system size, i.e., N (M R), as described by Eq. 4. Considering the limit of small occurrences and large sizes, i.e., o 1 and M 1, one finds precisely the empirically observed power-law decay. Specifically, in this limit the occurrence distribution takes the form where the power-law exponent depends only on the exponent γ of the frequency rank-plot . Analogous calculations (details in the SM) can be performed assuming a frequency distribution described by an exponential rank plot f i ∼ e −λi (right panel of Figure 4). In this case, the distribution of shared components, for large enough realizations M 1, has the expression Interestingly, for rare families the above expression further simplifies to a power-law decay with a "universal" exponent −1. This indicates that also systems with a heterogeneous but more compact frequency distribution are expected to show a power-law decay in the occurrence distribution. Figure 4 shows the agreement between these predictions and simulations of the random-sampling model for the two illustrative examples of a power law and of an exponential distribution of component frequencies. These analytical predictions have a dependence on the sampled vocabulary N and are expected to hold even if this is actually smaller than the total number of possible componentsÑ (Fig. S5). The effects of a dependence of the observed dictionary on system size (i.e., Heaps' law N (M R)) become relevant and has to be taken into account when comparing statistical features of ensembles of realizations with different sizes M R.

Shape of the distribution of shared components and rescaling properties
We now turn our attention to the conditions for a Ushaped distribution of shared components in the randomsampling model. Figure 4ac already show that the decay of the occurrence of rare components is only set by the exponent γ as described by Eq. 7, but for different values of M and N the distribution may or may not display a significant fraction of core components. Additionally, Figure 4bd proves that equations (6) and (8) can capture quantitatively the occurrence distributions and thus can well describe the relative proportion of core and specific components. In order to understand under what conditions this distribution becomes clearly U-shaped for an underlying power-law frequency distribution, it is useful to note a rescaling property of Eq. 6. Taking the limit of large realizations M 1, Eq. 6 becomes which depends only on two parameters, γ and the rescaling parameter This rescaling property shows that the statistics of component sharing is actually a function of a specific combination of realization sizes (e.g., text lengths) and of the range of possible components (e.g., the observed vocabulary). Specifically, the functional form of the distribution is purely defined by the exponent γ, while the rescaling parameter k sets the normalization factor and the range of possible occurrences. In fact, the analytical expression of the occurence corresponding to the distribution minimum, i.e., o min = 1−e −1− 1 γ , is only a function of γ, while the minimum possible occurrence value o N 1 − e −k γ scales with k. Therefore, a U-shaped occurrence distribution should be generally expected for component systems with highly heterogeneous component frequencies since the power-law decay and the presence of a mini-mum before the core are robust features with respect to system parameters. This is confirmed by the analysis of component systems with different values of k and γ (illustrative examples in Figure S7): the system specificities set the power-law decay of the left part of the distribution, its support, and the relative proportion of core and rare components, but the U-shape is conserved. However, this shape can be more or less symmetric and more or less clearly evident depending on the actual size of the core fraction. The following section discusses in detail the non-trivial dependences of the core size on system parameters.
For the case of component frequency distributions with an exponential rank plot, the statistics of shared components (Eq. 8) is a function of a single effective parameter λN , and does not depend on the realization sizes M . In other words, the shape of the distribution, and whether it is clearly U-shaped, only depend on the decay of component frequencies and on the total number of components. In fact, occurrence distributions corresponding to different exponential frequency rank plots collapse if λN is constant, even if the realizations have widely different size. This is shown in Figure S4 of the SM.

The core size
We can estimate the "core size" by computing the fraction of components with occurrence greater than a given arbitrary occurrence threshold θ c as a function of the only two effective parameters γ and k. Integrating Eq. 6 between θ c and the maximum occurrence o 1 , and then taking the limit M 1, this quantity reads where o N is the left boundary of the occurrence distribution, corresponding to the component with lowest frequency.
Starting from this estimate of the core size, Figure 5ab show how the scaling property is verified in simulations. Fig. 5c compares the analytical predictions for the core size with simulations for different values of γ, showing perfect agreement. Equally, one can obtain analytical estimates for the fraction of rare components (occurrence below a fixed threshold), which are tested in Fig. 5d. Thus, with increasing k, core families increase linearly with a γ-dependent slope until all components are shared, and concurrently rare components decrease linearly until they hit zero (when the lower cutoff of occurrence exceeds the chosen threshold value). Component number and realization size only enter through the combination defined by the rescaling parameter k. This phenomenology fully characterizes the distribution of shared components with varying parameters.
The general relation (Eq. 12) between the core size and the rescaling parameter k translates into different dependences of the core size on the typical realization size M , depending on the relation between the system size M R and the total number of accessible components N . While this issue is discussed in more detail in the SM, it is easy to intuitively understand the different regimes. For large enough systems, all possible componentsÑ are expected to be sampled at least once, thus making the observed vocabulary N Ñ a constant parameter. This is the regime considered in Fig. 5a. In this regime, Eq. 12 simplifies to the simple scaling c ∼ M 1 γ N . On the other hand, in several empirical systems the observed vocabulary is a function of the system size, and typically with the power-law dependence N (M R) ∼ (M R) β (with β < 1) called Heaps' law. Thus, in general, the core fraction is expected to show the more complex dependences c ∼ M 1 γ −β R −β . However, a random-sampling procedure starting from a Zipf's law described by Eq. 5 leads to the approximate relation β 1/γ between the exponents of Zipf's and Heaps' law [7,8,29]. Therefore, in this regime the core fraction becomes only a function of the number of realizations as c ∼ R 1 γ . These different scaling relations in different regimes are tested in Figure  S6. Note that the absolute number of core components cN , as estimated from Eqs. 11 and 12, is instead always independent from the number of realizations, even in the regime where Heaps' law is expected to hold ( Figure S6).
For component frequency distributions with an exponential rank plot, the sampling procedure leads to an occurrence distribution that is independent from the realization size M (Eq. 8). However, the exact analytical prediction for the core size (the analogous of Eq. 12) still has a dependence on M . But this is due to the residual dependence of the maximum occurrence values (o 1 ) on M and does not affect the shape of the distribution. This last technical point is discussed in more detail in the SM.
E. Empirical distributions of shared components satisfy the relations predicted by the random sampling.
One can ask whether the general analytical predictions discussed in the previous section can be applied to empirical data. In particular, we first asked how the power-law decay exponent of the distribution of shared components relates to the component frequency rank plot in empirical systems, and if this relation follows our analytical prediction. An analytical mapping would give a more synthetic and powerful description than the direct simulations discussed in Fig. S4. Importantly, the analytical formulas for the distribution of shared components are derived under the hypothesis of a pure power-law or exponential component frequency rank plot. However, the three empirical datasets (as previously discussed), show a double-scaling frequency distribution. To override this issue, we restricted the frequency rank plot range in which the predictions are applicable. The procedure to perform this comparison is described in Fig. 6.
First, we chose an arbitrary threshold θ r defining the rare components and we mapped it to the frequency rank plot (assuming the model), by using the inverse function of Eq. 2. The frequency rank associated to the occurrence threshold θ r , i(θ r ) in the figure, is the rank above which the model prediction for the decay of the distribution of shared components should apply as long as i(θ r ) does not cross the position of the change in scaling. In other words, since in the model there is a monotonic relation between occurrence and frequency (Eq. 2), all components with rank greater than i(θ r ) (and frequency smaller that f i(θr) ) are assumed to be the components with occurrence lower than θ r . We then estimated the behavior of the frequency rank plot in the high-rank region (after i(θ r )) as the best fit with a power-law function or an exponential. This leads to a prediction for the decay exponent of the distribution of shared components (using Eq. 7 or Eq. 9 for the exponential case) in the range [o N , θ r ]. Fig. 6 shows that the predicted decay exponents correspond well with the data.
The random-sampling model also gives qualitative analytical predictions for the expected fraction of core components, and thus for the expected shape of the distribution of shared components for a given empirical system. While the analytical relations between exponents applied in Figure 6 do not depend on the realization sizes, the analytical formulas for the fraction of core components (see e.g. Eq. 12) were derived assuming realizations of fixed size M . The actual size distributions for the three empirical systems are quite broad ( Figure S1), but we can still use the analytical framework to get an estimate of the core fraction considering the average realization size of each empirical system. Following the same line of reasoning as for the low-occurrence tail of the distribution of shared components, we can use a restricted region of the frequency rank plot. In this case, the low-rank region (with exponent around 1 for all the datasets, see Fig. S2) is expected to contain the core components. Therefore, the parameter γ can be fixed to 1, implying that the fraction of core components, given by Eq. 12, should be simply proportional to the rescaling parameter k (Eq. 11).
However, the normalization factor α, which is present in the definition of k and defined in Eq. 5, takes an approximately constant value with respect toÑ for large values ofÑ , as it is the case for the empirical examples considered. As a consequence, the core fraction should be simply proportional to M N . This estimate can be used to explain why the core fraction is much larger in genomes than in the other two empirical systems (see Figure S2d). In fact, genome sizes are typically of the same order as the total number of families (M 3000, N = 1531, see Figure S1) leading to a large expected core. By comparison, book chapters have similar realization sizes but a much larger vocabulary (N 50000), and LEGO sets have very small sizes (M 100) compared to vocabulary size (N 13000).
More in general, Eqs. (11) and (12) lead to a scaling estimate (dependent on the decay of the frequency rank plot) as a function of the system parameters M and N , which can be applied to data, in order to generate expectations for the core components. For example, for Zipflike (exponent -1) frequency distributions, we expect the absolute number of core components to be linearly dependent on the average size of realizations M , and essentially insensitive to the vocabulary size N and the total number of realizations R. In genomics language, this would imply that the number of core protein domains does not directly depend on the number of sequenced genomes but only on their sizes and on the total number of different protein domains discovered. Note that adding new genomes to the data set is not expected to alter the power-law exponent γ 1 of the global frequency distribution for high-frequency components, since it does not change if the distribution is evaluated on sub-samples of the empirical dataset ( Figure S2). As previously discussed, the core fraction, instead of the absolute number of core components, is expected to have a more complex dependence on the typical realization size M and on the number of realizations R. Moreover, in empirical systems these relations are further complicated by the fact that the frequency distributions cannot be described by simple power-laws (Fig. S2). Nevertheless, the relation between the core fraction and the average realization size predicted by a random-sampling model can be tested numerically, as Figure 7a shows for prokaryotic genomes, and seems accurately verified and roughly linear in the tested range of sizes. However, the predicted fraction of core components is actually much smaller than the empirical one. This highlights the presence of additional functional constraints and/or specific correlations in the empirical system that the model can not capture. The next section addresses this point more in detail.
F. Deviations from the random-sampling predictions can highlight system-specific properties Beyond the striking agreement with null predictions for shared components, the deviations from sampling can be used to quantify specific functional and architectural features of a component system. While the scope of this work is to highlight the common trends and their origins, we discuss a specific example, in order to show the feasibility of this procedure. Of the three data sets considered here, the case where the clearest deviations emerge are genomes. For example, Figure 7a illustrates how the random sampling underestimates the empirical core size by a constant offset, for genomes of increasing size. Generally speaking, this larger core of components is due to the components that tend to occur in most realizations, but in few copies. The natural explanation is that there are specific basic functions that are essential for all (or most) genomes, but the domains involved in these functions are not necessarily needed in many copies per genome, and thus their presence in all realizations does not simply correlate with high global abundances as the random sampling would entail [39].
To test this hypothesis, we divided the domain families in functional categories (see Section II for the functional annotation), and tested if most of the deviations from the random-sampling prediction can be ascribed to the statistics of domains belonging to specific categories. The result of this analysis is reported in Figure 7b. Different parts of the distribution of shared components are indeed enriched in components of different biological functions with respect to the random-sampling expectation.
In particular, protein domains that play a functional role in information processes -such as DNA translation, DNA transcription, and DNA replication-are clearly enriched in the core. At the same time, they seem statistically under-represented at occurrences around 0.6. These two deviations can be explained as two sides of the same coin if this category contains domain families that empirically occur in all genomes but in a single copy per genome. Indeed, the global frequency (i.e, across all genomes) of families that are both single-copy and ubiquitous is f = R RM = 1/M . Therefore, their occurrence predicted by the random-sampling model is 1 − e −1 0.6 (where the rough approximation holds for large enough M ), thus naturally leading to an excess of those families in the core and to a depletion around o 0.6.
The observation of a strong presence of protein domains related to basic cellular function in the core genome is not new [21,39]. However, the randomsampling model allows in principle to distinguish families whose presence in the core could be simply explained by their high abundance in the pan-genome and thus it would be expected also in a simple scenario of random gene exchange. Finally, the observed correlation between biological functions and deviations from random sampling predictions seems coherent with a picture, recently proposed [23], in which natural selection and functional constraints have played an important role in defining the empirical U-shaped distribution of gene occurrences.

IV. DISCUSSION AND CONCLUSIONS
This work employs a simple statistical model based on random sampling to describe the distribution of shared components in complex component systems. A similar approach was employed in quantitative linguistics to explain how the dictionary used in a text scales with text size as measured in number of words (the so-called "Heaps' law") while assuming Zipf's law for component frequencies [7-9, 29, 40]. We extended the model to show that there is a general link between the heterogeneity in component frequency and the statistics of shared components, regardless of the mechanisms that generate heterogeneity. Consequently, models or generative processes able to explain the heterogeneity in component frequency implicitly carry predictions for the statistics of shared components.
The striking similarities of laws governing both component abundance and occurrence found in empirical systems of very different origins (LEGO sets, genomes, book chapters) support the idea that the concept of "component system" defined in this work can capture in a unified framework a large class of complex systems with some common global properties. Different component systems, besides having specific architectural constraints, may show convergent phenomena in terms of global statistics. Such "universal" phenomena may be regarded as emergent properties due to system heterogeneity, which transcend the specific design, generative process or selection criteria at the origin of a system. Analogous phenomena occur, for example, in ecosystems, where emergent species-abundance distributions appear for forests, birds or insects [41].
Beyond the examples considered here, modular systems in a wide range of disciplines can be represented as component systems. Developing a common theoretical language for such systems can help the exchange of ideas, models and data-analysis techniques between distant communities of researchers [42]. For example, the statistics of component sharing considered here plays a central role in genomics [2,23,43] but is relatively unexplored in the context of natural languages [1]. Conversely the random-sampling approach used here was developed in quantitative linguistics [8], and this work shows that it is applicable to other systems, including the detection of functional constraints in prokaryotic genome evolution.
An important result of this work is a proof of the clear link between the heterogeneity of component abundance in a system and the statistics of shared components. This link is consistent with data from three very different empirical systems and well captured by the random-sampling model. The fact that emergent patterns can be explained by largely null models resembles again the case of biodiversity, where neutral theories ignoring species interactions and competitive exclusion appear to capture many of the emerging trends of species abundance [41,44].
If the trends of component sharing of generic component systems are to be regarded as largely null and due to components heterogeneity, system-specific investigations should be informed of this general trend. Quantitative null models, such as the one provided here, may be crucial for identifying dataset-specific deviations that are related to functional reasons or constraints. In the data considered in this work, the patterns of shared components show differences between empirical data and the null model in some cases. This is particularly true in the genomic context, where the differences can indeed be traced back to functional constraints in genome composition. Therefore, the framework can be useful to pinpoint hallmarks of functional design and distinguish them from statistical effects, particularly for the detection of causality, dependency and correlation structures between components from occurrence patterns. Once a null model is defined, these features can emerge as significant deviations from the null behavior, for example as violations of the constraints linking different global statistics such as the abundance rank plot, the distribution of shared components and Heaps' law. We have considered here a specific example for the case of shared protein domain families in genomes (Fig. 7), but this question still needs to be approached systematically. In this specific case, core components are particularly enriched by specific functional classes of components with respect to the random-sampling prediction. In evolutionary terms, the random sampling defines a scenario in which the pan-genome fully determines the overall abundance of the gene families in each genome, while in empirical bacterial genomes genome-specific functional constraints are clearly in place [35,36,45]. Deviations from the null scenario can thus highlight the role of selection for specific functions, supporting from a different perspective the idea that the empirical U-shaped gene occurrence distribution is affected by selective rather than neutral processes [22][23][24][25].   Figure S2 tests the component frequency rank distribution conservation when it is evaluated on different sub-sets of the total empirical data set. These sub-sets are composed by realizations in a fixed range of sizes, showing that the global frequency statistics does not depend on the realization sizes or on the number of realizations considered. Note that this test is necessary to safely compare the analytical null predictions with different sub-samples of the empirical data set, as for example in Figure 7 of the main text. Moreover, the fact that the Zipf's laws of the under-sampled datasets is essentially identical to the global one (especially in the high-frequency regime) suggests that the observed Zipf's law is not under-sampled and can thus be considered a good estimation of the "universal" one.  The mathematical calculation described in the section IIIC of the main text can be applied to an exponential rank distribution of the form

SUPPLEMENTAL MATERIAL
Considering a random sampling of R realizations with fixed size M , one finds: Imposing the condition M 1 this equation takes the form which provides a good approximation for the overall distribution shape as a function of one single effective parameter k = N λ.
In the M 1 limit, the occurrence extreme values are o 1 1 and o N 0. This implies that the distribution is well defined over all possible values of occurrence. Figure S4 shows the rescaling properties of Eq. S3 by testing its independence on M (panel a) and by varying N and λ while keeping their product constant (panel b).
For rare families, one can further approximate the expression for p(o) finding the expected power-law decay with exponent −1: We now analyze the properties of the fraction of core components, i.e., those with occurrence greater than the threshold θ c . In order to derive the core size one has to integrate the distribution described by Eq. S2 from o = θ c to the maximum occurrence value o 1 (whose formula can be obtained from Eq. 2 of the main text).
The result reads: In the limit of large M this expression becomes which further simplifies only when the logarithm of M becomes dominant over the other terms. It is worth mentioning that the expression above does not show rescaling properties, even in the regime M 1, and this may seem to be in contradiction with Eq. S3. Nevertheless, this apparent inconsistency is basically due to the singular behavior of the occurrence distribution in o 1. In the large M limit, the right boundary can be expressed as o 1 = 1 − , where is an infinitesimal term depending on M and λ, whose effect on the overall distribution shape is negligible (Eq. S3). However, the core size is defined as the integral of the distribution. Therefore, the variation of p(o 1 ) due to a change in M or λ provides a sufficiently large contribution (because of the function singular behavior) which compensates the infinitesimal variation of o 1 . Finally, this leads to a finite contribution to the integral and thus to the core size as it is defined in the main text. In general, this finite contribution has a non-trivial dependency on the parameters, explaining why Eq. S6 does not show the rescaling property. For large system sizes M R (left panel), essentially all the possible different componentsÑ have been sampled. Therefore, in the analytical prediction (Eqs. 6 or 7 of the main text) for the occurrence distribution either the observed N or the possibleÑ vocabulary can be indiscriminately used as parameters. On the other hand, for small systems (rigth panel) the theoretical expectation is in good agreement with numerical simulations only if the size of the actually sampled dictionary N is used as a parameter. The parameter η = M R N γ α , whose value gives an estimate of the system distance from the saturation regime, is η 5, 1, 0.1 in the three panels respectively.
As discussed in the main text, the difference between the possible different componentsÑ and the ones that are actually sampled in an ensemble of R realizations of size M is definded by: This equation shows the general dependence of the sampled dictionary on the system size N (M R), which is essentially a generalization of Heaps' law. Our analytical predictions (Eqs. 6-9 of the main text) for the distribution of shared components have an explicit dependence on N rather than onÑ . As Figure S5 shows, this distinction becomes negligible in the limit of large systems, but it is in general relevant. A residual dependence onÑ is in principle present