Generative network model of transcriptome patterns in disease cohorts with tunable signal strength

Algorithmic methods for interpreting the collective transcriptome (gene expression) patterns of disease cohorts in the context of biological networks are a cornerstone of systems medicine. The calibration of these algorithms using synthetic data with predefined statistical properties can be a relevant benchmarking procedure, facilitating the choice of the appropriate algorithm and the detailed mechanistic interpretation of the results. Here we present a generative model producing patterns of significantly up- and down-regulated genes for synthetic disease cohorts, in which the statistical agreement between the given biological network and the transcriptome patterns can be tuned. Parameters of this generative model are, among others, the size of the cohort, the number of disease-associated genes, the clustering of differentially expressed genes in the network and the network size. Several properties of the model can be analyzed analytically. In a first application of this generative model to produce test instances, we show that considering the subset of significant expression changes occurring in more than one patient of the cohort as an additional filtering step serves as an efficient noise suppression mechanism to enhance the recall of the signal contained in the data by the network connectivity.


I. INTRODUCTION
Over the last two decades, a range of algorithmic methods has been developed for interpreting the collective gene expression (transcriptome) patterns of disease cohorts in the context of biological networks (e.g., Refs. [1-7]). These methods fall within the broader context of interpreting clinical data using biological networks [5,8-12] as a path towards a systems-level understanding of medical data, i.e., the emerging discipline of systems medicine. In this way, medical research undergoes a transformation similar to the one we have seen in biology with the advent of systems biology [13-15].
One component that research in systems medicine currently lacks is generative models capable of producing data sets with statistical properties similar to real clinical (cohort) data, where the signal type and signal strength can be tuned via specific model parameters, thus allowing medical researchers to test, compare (and hence better select) and calibrate their data analysis methods.
Creating such generative models for medical data sets can be a novel and highly transformative contribution of statistical physics to medical research. Over decades, physics has contributed to systems biology in a multitude of ways (see Refs. [16-19] for overviews; see Refs. [20-23] for specific examples). The design of generative models has long been an important component of statistical physics. Examples include suitable random graph models to benchmark the analysis of real networks [24-26], as well as the multitude of interdisciplinary applications of coupled phase oscillators [27] and of the Ising model and its variants [28-30].
The goal of "network medicine" [31], the statistical analysis of high-throughput ("omics") data from a network perspective, is to identify systematic differences between healthy and disease states or between different diseases [9]. The algorithms differ in the processing of the given biological network and the transcriptome data characterizing the disease cohort and the controls, as well as in the quantification of the network patterns obtained from the individual transcriptome profiles and the various types of filtering and binning steps applied to the experimental data (see, e.g., Refs. [4,32-34]). Due to the often small cohort sizes, the unknown numbers of disease-associated genes, the noise in the transcriptome data and uncertainties in the biological networks, it is in general unclear what "signal strength" can be expected in such an analysis.
Here we introduce a simple model for generating dichotomized gene expression (transcriptome) profiles for disease cohorts. By "dichotomized" we mean that for each gene and patient we generate only the presence or absence of this gene in the set of differentially up- or down-regulated (see Ref. [4]) genes. Furthermore, we study the properties of this stochastic model and the statistical features of the generated disease-specific transcriptome profiles. For some key features of the model we have been able to derive semianalytical expressions as well. Such disease cohort transcriptome data will differ substantially in signal strength: often it will not be possible to extract any information on the set of disease genes (no discernible signal).
Sometimes, statistical methods applied at the individual patient level will reveal features of the set of disease genes without the need to include any network information (strong signal). Often only the combination of network information and cohort information will allow access to some properties of the set of disease genes (weak signal). This weak-signal scenario is the one addressed here. Our generative model of transcriptome patterns in disease cohorts has been designed to create data sets with very small differences between patients and controls. For this case of weak signals, we show how the interplay of statistical properties and network information allows extracting information on the set of disease-associated genes.

FIG. 1. [(a) and (b)] Depiction of the expression densities ρ_G, ρ_D, and ρ_H and their dependence on the parameters G, D, H, and P. Here, G is the number of genes in the network, D is the number of disease-related genes, H = G − D is the number of disease-unrelated genes, P is the number of patients and C is the number of controls. For the densities, ρ_G is the expression density in the control group, ρ_D is the expression density of disease-related genes in the patient group and ρ_H is the expression density of the other genes in the patient group. The quantity A is the parameter scaling the statistical over-representation of disease genes among patients. (c) Schematic representation of our data generation and analysis pipeline, including disease pattern creation, generation of cohorts and recalling the disease pattern via filtering.
If the biological network is meaningful for the situation at hand (i.e., for characterizing the gene expression data), we can expect that the differentially expressed genes display a certain amount of clustering in the network. By now there is ample evidence linking neighborhood in biological networks with disease-gene associations [10,35] and gene coexpression [1,3].
In our generative model, we mimic this expected clustering by creating a set V_D of disease-associated genes via a version of a random walk algorithm on the given network. With this minimal model we thus provide a conceptual tool for exploring key ideas in systems medicine.
In the following, we will describe the model (Sec. II); next, we will illustrate the capabilities of the generative model using a simple case study (Sec. III). Then we provide a broad description of the results, discussing disease cohorts and filtering, together with some analytical insights (Sec. IV). Lastly, we put our results in the context of systems medicine (Sec. V).

II. THE GENERATIVE MODEL
The foundation of our generative model of stylized transcriptome profiles for disease cohorts is the existence of a small systematic difference between patients and controls. The control group can be expected to be phenotypically highly diverse. Patients, however, even though similarly diverse, at least agree in one phenotype (the disease). Thus, if the systematic gene expression differences between patients and controls can be functionally linked to the disease, we should see a clustering of the corresponding genes in a biological network relevant to this functional interpretation.
The generative part of the model consists of generating a disease, then a cohort and a control group. These steps are described in the present section. The resulting sets then enter the analysis pipeline, which is described and further studied in Sec. IV. All these steps are illustrated schematically in Fig. 1; details are depicted in Appendix B.

A. Disease generation
Let Γ = (V_G, E) be a graph representing a biological network, where vertices V_G denote genes linked (via some biological relationship) by edges E.
Examples of such networks are gene-centric representations of metabolic networks [4,36], transcriptional regulatory networks [37] and protein-interaction networks translated into genes via gene-to-protein mappings [38,39]. Such relationships between gene expression patterns and network properties have long been a topic of interest in physics. In Ref. [40], random matrix theory has been applied to gene expression patterns. Reconstructing gene expression levels pertaining to individual cell types in samples containing mixtures of cell types has been studied in Ref. [41]. In an application of the theory of stochastic processes to gene expression, the impact of noise correlation time on regime shifts (sudden changes in systemic behavior) has been analyzed [42]. The broad question of how connectivity can be inferred from the response dynamics of a system has been addressed in Ref. [43]. In Ref. [44], constraints on gene expression have been derived from fluctuation theorems. In an application of Boolean networks to real gene regulatory systems, the link of network states to an epigenetic landscape has been outlined [20].
As discussed above, we assume that the signal discriminating between patients and controls in a disease cohort is associated with a set of disease-associated genes.If the network is relevant for the disease mechanisms, we can expect that the disease genes display some type of (occasionally weak) clustering on the network.
We generate suitable gene sets via a random walk with occasional jumps (sometimes also called a "teleporting random walk" [45]) with a teleportation probability 1 − p. In this way, the amount of clustering displayed by the set of genes (nodes) generated by the random walk is regulated by the parameter p. For the purpose of our investigation, this set of genes represents the disease. The full procedure can be summarized as follows. (1) Randomly draw one gene from the network and mark (color) it as disease-related. (2) With probability p, color one randomly chosen neighbor of the last colored gene and repeat step 2; otherwise, with probability 1 − p, return to step 1.
Vertices colored in the course of this procedure indicate disease-related genes and form the set V_D. The parameter p regulates statistical properties of the set V_D, such as the average cluster size. The greater p, the larger the average cluster, with the cluster sizes following a Poissonian distribution. For p = 0, there is no clustering at all and disease genes are distributed randomly, while for p = 1 only one large cluster forming a connected subgraph of Γ = (V_G, E) is created.
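The two-step coloring procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the adjacency-dict graph representation and all function names are our own.

```python
import random

def generate_disease_genes(adj, D, p, rng=None):
    """Color D distinct disease-related genes via a teleporting random walk.

    adj : dict mapping each gene to the set of its network neighbors.
    With probability p the walk moves to a random neighbor of the last
    colored gene (step 2); with probability 1 - p it restarts at a
    uniformly drawn gene (back to step 1). Revisited genes simply stay
    colored, so the loop runs until D distinct genes are colored.
    """
    rng = rng or random.Random()
    genes = list(adj)
    current = rng.choice(genes)             # step 1: random start gene
    disease = {current}
    while len(disease) < D:
        neighbors = list(adj[current])
        if neighbors and rng.random() < p:  # step 2: neighbor step
            current = rng.choice(neighbors)
        else:                               # teleport with probability 1 - p
            current = rng.choice(genes)
        disease.add(current)
    return disease
```

For p close to 1 the colored genes form few large connected clusters; for p = 0 they are scattered uniformly over the network.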

B. Cohort generation
We now turn to the generation of the disease cohort. A typical step in the analysis of real (experimental) transcriptome data is to identify significantly up- and down-regulated genes via some suitable statistical test, leading to a binarized version of the data (a gene is either part of the differentially expressed set or not). In our model, the transcriptome data are already generated in this binarized form.
We deliberately decided not to distinguish between up- and down-regulated genes. There is a multitude of methods for extracting gene sets from expression patterns, with differential gene expression being only one possibility (see the diverse discussions in Refs. [46-49]). An alternative is to consider the highest percentile of expression levels of a gene, as was done in Refs. [4] and [7].
In the following, we describe how the corresponding sets of differentially expressed genes for an individual patient or member of the control group (sets V*_{G_i}) are generated. Here and in the following, the asterisk indicates that these gene sets are the marked or selected sets (as opposed to the "whole" static sets like V_G).
We assume that for some diseases, the disease-related genes V_D ⊆ V_G form clusters in the given biological network Γ. Furthermore, we assume that patients have a slightly higher probability for genes from the set V_D to be differentially expressed than the controls. Across the whole cohort this results in a set V*_D ⊆ V_D, which is the data representation of the (static) set V_D. Additionally, patients can also differentially express genes unrelated to the disease (the rest of the set V_G). The whole set of differentially expressed genes of the patient j is V*_{G_j} = V*_{D_j} ∪ V*_{H_j}. Usually, for clinical cohorts focusing on a specific disease, a group of patients S_P = {P_j} and a control group S_C = {C_i} exist, where |S_P| = P and |S_C| = C. An advantage of large control sets, C ≫ P, is that subsets of S_C of the same size as the patient set can be created; hence, the comparison of S_C and S_P can be done on a broader statistical level (for several subsets of S_C), which leads to a more reliable estimate of V_D. As long as C ≫ P, the size of the control group will not show up as a separate parameter of our analysis.
The process of generation of artificial expression profiles adheres to the following algorithm.
(1) Control group. For each member of the control group, take a random subset of the set of all genes, V*_{C_i} ≡ V*_{G_i} ⊆ V_G, with the probability for each gene to be chosen being ρ_G.
(2) Patients. For each patient, randomly select two subsets of genes with different probabilities: V*_{D_j} ⊆ V_D with probability ρ_D and V*_{H_j} ⊆ V_H with probability ρ_H. Then take the union of these sets, V*_{P_j} ≡ V*_{G_j} = V*_{D_j} ∪ V*_{H_j}, as an individual expression profile.
When filtering (see below) is applied to these sets, we obtain the patient and control gene sets V*_P and V*_C discussed in the illustrative example below. We use the following shorthand notations for the sizes of the sets: G = |V_G|, D = |V_D|, H = |V_H|, and analogously G*, D*, and H* for the selected sets. There are two important assumptions with regard to the probabilities ρ_x. (1) There is no difference in the average number of expressed genes between individual patients from S_P and controls from S_C. (2) Patients S_P tend to express disease-related genes V_D slightly more often than the rest V_H and than the healthy group S_C. Thus ρ_H ≤ ρ_G ≤ ρ_D.
Within our generative model and under these two assumptions, the densities can be parameterized in the following way: ρ_D = ρ_G (1 + A H/D) and ρ_H = ρ_G (1 − A), where A ∈ [0, 1] is the parameter scaling the over-representation of disease-related genes in the group of patients; hence ρ_H ≤ ρ_G ≤ ρ_D, and the average number of expressed genes is the same in both groups, D ρ_D + H ρ_H = G ρ_G. For reference, see Fig. 1 (bottom left), where the relation between the densities is depicted. Artificial expression profiles created in this way are ready for further analysis. Note that we distinguish two steps: generating the disease (where V_D appears) and generating the cohort (where affected genes are randomly sampled from V_D and, with a lesser probability, from V_G \ V_D). The set V*_P is then the set of genes extracted (e.g., via filtering) from the patient cohort. The corresponding set for the controls is V*_C.
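The cohort-generation steps can be sketched as follows. This is a hypothetical illustration assuming the density parameterization ρ_D = ρ_G (1 + A H/D), ρ_H = ρ_G (1 − A) stated above (which keeps the average number of expressed genes equal between groups); function names are ours.

```python
import random

def densities(rho_G, A, D, G):
    """Patient-group densities from the control density rho_G, assuming
    rho_D = rho_G * (1 + A*H/D) and rho_H = rho_G * (1 - A), H = G - D.
    This choice satisfies D*rho_D + H*rho_H = G*rho_G (assumption 1)."""
    H = G - D
    return rho_G * (1 + A * H / D), rho_G * (1 - A)

def control_profile(genes, rho_G, rng):
    """Step 1: each gene enters a control's profile with probability rho_G."""
    return {g for g in genes if rng.random() < rho_G}

def patient_profile(disease_genes, other_genes, rho_D, rho_H, rng):
    """Step 2: sample V*_D with rho_D and V*_H with rho_H; return the union."""
    v_d = {g for g in disease_genes if rng.random() < rho_D}
    v_h = {g for g in other_genes if rng.random() < rho_H}
    return v_d | v_h
```

For example, with G = 1000, D = 100, ρ_G = 0.05 and A = 0.5, one obtains ρ_D = 0.275 and ρ_H = 0.025, so a patient differentially expresses disease genes eleven times more often than the other genes, while the expected profile size stays at G ρ_G = 50.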

III. A FIRST ILLUSTRATION OF THE GENERATIVE MODEL
Before describing results generated by the model and the network-based data analysis method, it is worth analyzing a specific example illustrating how a signal embedded in the data generated with our model can be extracted via the tuning of the data analysis ("filtering") parameters.
Let us assume a disease with a certain set V_D of disease-related genes, D = |V_D|. In the "disease generation" part of the model, this set is generated by a random walk on a given biological network Γ = (V_G, E). The nodes of the network are genes (a set V_G, with V_D ⊂ V_G) and a link describes some biological interaction (forming the edge set E).
In the "cohort generation" part of the model, two subsets of data are generated. (1) Patients. These are represented by sets of genes with a slight over-representation of disease-associated genes; for a patient, the probability of selecting a gene from the set V_D, rather than from the remaining set V_H = V_G \ V_D, is regulated by the parameter A. (2) Controls. The gene sets for the healthy controls are randomly drawn from the whole set V_G.
Cohorts generated in such a way vary in cohort sizes (number of patients P, number of controls C), clustering of disease genes in the network (clustering parameter p), number of genes G = |V_G|, and number of disease-associated genes D.
Another parameter is the number of differentially expressed genes (regulated via the density parameter ρ_G). This parameter can be varied during the statistical analysis via a (fractional) threshold for determining differentially expressed genes (see, e.g., Ref. [4]).
The sets of differentially expressed genes, together with their multiplicities, in the patient group and the control group are then subjected to further statistical analysis. In principle, any data analysis technique from the literature can be applied to this stylized representation of transcriptome profiles of a disease cohort. Here we illustrate this procedure using an extension of the "network coherence" method discussed in Refs. [3,4], where the connectivity of subgraphs spanned by gene sets is evaluated. The two ingredients of our statistical analysis are multiplicity filtering and the computation of subgraph connectivity.
Multiplicity filtering of strength f means that a set of candidate disease genes is constructed from the individual patient gene sets, keeping only genes that occur more than f times across these sets. The same is done for the controls.
Subgraph connectivity essentially assesses the agreement of a set of genes with the network. Here we use the following definition: Given a set of genes V*_G, we consider the induced subgraph Γ|_{V*_G}, consisting only of the nodes from V*_G and the subset of edges from E among nodes from V*_G. The connectivity c_{V*_G} is the number of nodes in V*_G with nonzero degree in this induced subgraph divided by the total number of nodes in the subgraph (i.e., by |V*_G|). Variation of f (selecting genes from the individual gene sets into a common set according to their multiplicities) and ρ_G (varying the number of differentially expressed genes and, hence, the average size of the individual gene sets) now allows us to estimate the set of disease-associated genes V_D. At a fixed choice of f and ρ_G, we obtain one common set V*_P for the disease group and one common set V*_C for the controls. Figure 2 shows the comparison of the two sets V*_P and V*_C with the disease gene set V_D for different values of ρ_G and f. Subsets of nodes, together with the links among them, are highlighted in the following way: true positives (genes in the intersection of V*_P and V_D), red; false negatives (genes in V_D, but not in V*_P), yellow; false positives (genes in V*_P, but not in V_D), green. Furthermore, the connectivities c*_P and c*_C, as well as their difference Δc, are indicated. One can see the variation of the reconstruction quality (many red, few green and yellow subgraphs) with the data analysis parameters f and ρ_G. Also, the figure provides a first indication that Δc can serve as a measure for this reconstruction quality.
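The two ingredients above, multiplicity filtering and subgraph connectivity, can be sketched as follows (a minimal illustration with an adjacency-dict graph; function names are ours, not from the paper):

```python
from collections import Counter

def multiplicity_filter(profiles, f):
    """Keep genes occurring more than f times across the individual
    gene sets (profiles) of a group."""
    counts = Counter(g for profile in profiles for g in profile)
    return {g for g, m in counts.items() if m > f}

def connectivity(adj, gene_set):
    """Fraction of genes in gene_set that have at least one neighbor
    (according to adj) that is also in gene_set, i.e. nonzero degree
    in the induced subgraph."""
    if not gene_set:
        return 0.0
    nonisolated = sum(1 for g in gene_set if adj.get(g, set()) & gene_set)
    return nonisolated / len(gene_set)
```

Applying `multiplicity_filter` to the patient profiles and to the control profiles at the same (ρ_G, f) and comparing the two connectivities then mirrors the comparison shown in Fig. 2.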
The generative model described so far is also made available via a web application; see Ref. [50]. Note that the default settings of this app allow one to reproduce Fig. 2. In the following, we will compute this difference of connectivities, Δc, between the gene sets derived from patients and controls as a quality measure. The analysis strategy will then be to vary the data analysis parameters such that Δc is maximized. We expect that the resulting set V*_P is the best predictor of V_D.

A. Individual connectivity
The first and simplest analysis strategy is to study how the distributions of the connectivity c (defined as the fraction of nonisolated nodes in the subnetwork spanned by the differentially expressed genes) vary between patients and controls. Within the framework of our model of disease cohorts, the averages of these connectivities can be obtained analytically (see Appendix A for details), where c_{C*_i} is the connectivity for the healthy individual i, and c_{P*_j} is the connectivity of the patient j. This approach does not always prove successful, as the difference in c between these two groups is rarely detectable; see Fig. 3 (where three of four panels present no discernible difference between patients and controls). The reason is that usually D, p, and A are rather low (small set of disease genes, low clustering in the network, small enhancement of differential expression for disease genes). Comparison of the bottom row of Fig. 3 with Fig. 2 shows that, even in the cases of weak signal and no noticeable difference on the individual level, the filtering procedure can allow extracting information on the disease gene set.

B. Collective connectivity
In the following approach, there is no attempt to interpret the data on an individual level. Instead, the group of patients is analyzed as a whole. While individual information is lost in this analysis, the set of disease-associated genes, as well as functional clusters in the given biological network, can be extracted with higher quality. Extraction of this information consists roughly of three steps.
(1) Filtering. This step combines information on differential gene expression from single patients into one group, and then filters genes by their number of occurrences. This is a noise reduction step meant to unveil general underlying biological mechanisms driving the disease (for more details, see Fig. 8 in Appendix B).
(2) Projection. This step is the process of putting the filtered data onto a given network.
(3) Maximizing. The first two steps are performed as a function of ρ_G and f, and the parameter pair maximizing the difference between the connectivity of the patient group, c*f_P, and that of the control group, c*f_C, is searched for. This maximization yields the best possible recall of the set of disease-related genes V_D (for more details, see Fig. 9 in Appendix B).
The exact implementation of these steps is rather straightforward and intuitive. Filtering means summing up the sets of expressed genes to create a multiset of all differentially expressed genes in the control group and in the patient group: M*_C = ⋃_i V*_{C_i} and M*_P = ⋃_j V*_{P_j}, with the unions taken with multiplicities. Notice that, while the V*_{C_i} and V*_{P_j} are sets of differentially expressed genes, M*_C and M*_P are multisets, which means that the same element can be present multiple times; for a single gene v, m(v) denotes its multiplicity in the combined multiset. The noise suppression step now consists of accepting only genes repeating more than f times, where f is the filtering threshold: V*f = {v ∈ M* : m(v) > f}, where V*f is not a multiset anymore but becomes an ordinary set again. In the following, we will show that this filtering procedure performs well in the task of identifying disease-related genes. We expect this to be true not only for the synthetic data studied here, but also for real disease cohorts, particularly when the signal is weak. On this collective level, the calculation of connectivity is slightly more complicated than in the individual case, but still remains feasible. At first, the probability R_X ≡ R_X(P, ρ_X, f) of a given gene to remain in the set V*f after filtering is calculated as the binomial tail R_X = Σ_{n=f+1}^{P} C(P, n) ρ_X^n (1 − ρ_X)^{P−n}, where X ∈ {G, D, H} and f is the filtering level. While typical filtering procedures will use positive integer values for f, the procedure can also be extended to nonpositive integer values.
(1) For f > 0, all the genes appearing more than f times in the multiset M* are taken: V*f = {v : m(v) > f}. (2) For f = 0, all the genes from the multiset M* are taken: V*0 = {v : m(v) > 0}. (3) For f < 0, all the genes appearing at most −f times in the multiset M* are taken: V*f = {v : 0 < m(v) ≤ −f}. Hence it is obvious that V*f_P ∪ V*(−f)_P = V*0_P. We expect that, while positive values of f will allow an identification of the disease mechanisms, negative values of f will address the diverse disease coping mechanisms of the individuals, as well as other individual traits, in the cohort.
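The survival probability R_X and the sign-dependent filtering rule can be sketched as follows (a minimal illustration; the binomial-tail form of R_X matches the definition above, and the function names are ours):

```python
from math import comb

def survival_probability(P, rho, f):
    """Probability that a gene expressed with per-patient probability rho
    appears more than f times among P patients (binomial tail), i.e.
    that it survives filtering at threshold f >= 0."""
    return sum(comb(P, n) * rho**n * (1 - rho) ** (P - n)
               for n in range(f + 1, P + 1))

def filter_multiset(counts, f):
    """Apply the sign-dependent filtering rule to gene multiplicities.

    f > 0 : genes appearing more than f times
    f = 0 : all genes in the multiset
    f < 0 : genes appearing at most -f times
    """
    if f > 0:
        return {g for g, m in counts.items() if m > f}
    if f == 0:
        return set(counts)
    return {g for g, m in counts.items() if m <= -f}
```

Note that `survival_probability(P, rho, 0)` reduces to 1 − (1 − ρ)^P, the probability of appearing at all, and that the filtered sets satisfy the identity V*f ∪ V*(−f) = V*0 stated above.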

By analogy with the previous expressions [Eqs. (3)], we obtain the collective connectivities after filtering, where we use X* = R_X X, with X ∈ {G, D, H}, which holds on average over the realizations of cohorts.
Here, c*f_P is the collective connectivity calculated from the group of patients, while c*f_C is the collective connectivity obtained from the control group, where P individuals from the control group are taken randomly. If the size of the control group allows, this step is repeated many times, such that c*f_C becomes the average connectivity of the control group, providing us with a null model to properly assess the patient-derived collective connectivity, c*f_P. In particular, it can be assessed how the difference between c*f_P and c*f_C changes with the filtering threshold f, the densities ρ_x and the number of patients P. In case of a statistically significant difference between the two connectivities, one can claim that the disease-related cluster of genes has been detected. Figure 4 depicts how signal (red) and noise (green) change as a function of the disease signal parameter A and the filtering strength f. Also, the connectivity of the full set V_D (orange) is provided as a reference; the orange curve thus indicates the maximal signal strength that can be obtained by data analysis. Note that at low clustering (low p), due to differences in the sizes of the gene sets, the connectivity of the full set can be systematically smaller than the connectivities derived from the other two sets. Furthermore, the numerical results (solid lines, together with their standard deviations) are compared with the analytical predictions (dashed lines); Fig. 4 shows a clear agreement between the numerical and analytical results. At A = 0.5, we can see that an increase of f reduces the background signal c*f_C and enhances the disease signal c*f_P. This is not always the case, because for higher f the signal can degrade again; for more details see Figs. S1-S4 in Ref. [51]. In Fig. 4, with increasing p and a suitable filtering, the red curve, representing signal, should become more and more different from the green curve, representing noise. The orange curve is the connectivity of the subgraph spanned by V_D and hence the maximally achievable signal strength.
Figures S1-S4 [51] show these results for a wider range of A and f, confirming the general observations from Fig. 4. Furthermore, they show the interplay, in the generated data, between multiplicities, patient cohort size, and the signal strengths A and p, resulting in a nontrivial dependence of the accuracy on filtering: moving along a row (i.e., at fixed A, varying f), we see in many cases the red curve (signal) moving away from the green curve (noise) towards the orange curve (maximally achievable signal strength) and then back again. A more detailed view on this nontrivial dependence is offered by Figs. S9-S14.
At lower values of A the disease signal c * f P is only visible at high clustering p of disease genes.This points to the "multiplicative" nature of the two disease-related parameters (disease signal parameter A and network signal parameter p); if A is not very high, a strong clustering of disease genes in the network is needed to detect the set of disease genes.
The difference between the red and the green curves is an indicator of how well the network-based analysis is capable of discriminating between the patient group and the control group. Hence, it is helpful to study this difference Δc as a function of the parameters involved. This will be addressed in the following sections.

C. Optimal filtering
The aim of the method described above is to extract systematic differences between the gene expression patterns of patients and controls with respect to a given network and, furthermore, to obtain information on the set of disease-associated genes, V_D. As a result of the data analysis, one obtains filtered sets of genes V*f. These sets then allow us to calculate Δc values: for the group of patients S_P, V*f_P → c*f_P, and for the control group S_C, V*f_C → c*f_C. It has been stated before that, in the case of the control group, the average over an ensemble of many group compositions with P individuals can be taken.
When confronted with real data, only two parameters can be set: the threshold for differential gene expression, leading to the gene expression density ρ_G, and the filtering threshold f. The rest of the parameters of the model come from the experimental setup or can be obtained (or estimated) from biological knowledge, and cannot be changed in the process of the data analysis. We therefore need to understand how the reconstruction of the set V_D depends on the values of ρ_G and f, in a balance of maximizing the information extracted from the data while minimizing noise.
It is helpful to summarize the parameters and observables of our generative model again, as the optimal values of ρ G and f will depend on these parameters, as well as on our capability to extract the connectivity differences from the observables.
In general, the parameters of the model can be divided into three categories: unknown parameters (D, A, and p) come from the biological reality; they can be estimated from the data after detailed analysis (Sec. IV D), but are unknown in the beginning. Known parameters (d, P, and G) are properties of the experimental settings and are fixed in the analysis process. Variable parameters (ρ_G and f) can be set during the process of data analysis. The observables are classified in two categories: direct observables (G*, c*f_P, and c*f_C), which are obtained directly from the cohort data, and hidden observables (D*, H*, c*f_D, and c*f_H), which need to be estimated indirectly from the data and, hence, are procedure-dependent (see Sec. IV D).

Calculation of optimal filtering from model parameters
If all parameters are available, the optimal or near-optimal values of f and ρ_G can be calculated directly. These optimal values will be denoted as f̃ and ρ̃_G and are obtained as follows. From Fig. 1, we see that the fraction ρ_D/ρ_H is maximal for ρ_G ∈ (0, D/G], while the difference ρ_D − ρ_H is greatest for ρ_G = D/G. The latter is the value we seek, because at this specific point both the fraction and the difference of ρ_D and ρ_H attain their maximal values. Hence, ρ̃_G = D/G. Having ρ̃_G, we can now compute f̃. We start from the number of occurrences of a given gene in the patient multiset M*_P. In the patient cohort there are two types of genes, V_D and V_H. The optimization strategy consists of extracting the greatest possible number of V_D genes, while keeping the number of extracted V_H genes as low as possible. The probability of a given gene to be present in the patient cohort more than f times is R_X(P, ρ_X, f) = Σ_{n=f+1}^{P} C(P, n) ρ_X^n (1 − ρ_X)^{P−n}, and the probability of exactly n appearances is C(P, n) ρ_X^n (1 − ρ_X)^{P−n}. Furthermore, the average number of appearances is just n̄_X = P ρ_X. Without detailed knowledge about the shape of the distributions involved, the most reasonable choice is to put the filtering threshold f̃ between n̄_H and n̄_D, at equal distance from both: f̃ = (n̄_H + n̄_D)/2 = P (ρ_H + ρ_D)/2. This choice provides a near-optimal recall.
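The direct calculation above can be sketched as follows. This is a hypothetical helper assuming the density parameterization ρ_D = ρ_G (1 + A H/D), ρ_H = ρ_G (1 − A) used earlier; the function name is ours.

```python
def optimal_parameters(D, G, P, A):
    """Near-optimal analysis parameters from known model parameters.

    rho_G_opt = D / G : the point where both the ratio and the
                        difference of rho_D and rho_H are maximal.
    f_opt : midpoint between the mean multiplicities
            n_H = P * rho_H and n_D = P * rho_D.
    Assumes rho_D = rho_G*(1 + A*H/D), rho_H = rho_G*(1 - A).
    """
    H = G - D
    rho_G_opt = D / G
    rho_D = rho_G_opt * (1 + A * H / D)
    rho_H = rho_G_opt * (1 - A)
    f_opt = P * (rho_H + rho_D) / 2
    return rho_G_opt, f_opt
```

For example, with D = 100 disease genes in a G = 1000 gene network, P = 40 patients and A = 0.5, this gives ρ̃_G = 0.1 and a filtering threshold midway between n̄_H = 2 and n̄_D = 22, i.e., f̃ = 12.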

Extraction of optimal filtering via data analysis
The situation of identifying the optimal choices of the parameters ρ_G and f is substantially different when the key parameters characterizing the disease, D, A, and p, are unknown. In particular, the network now becomes a crucial part of the data analysis procedure. As discussed above, the main observable of our analysis is the connectivity signal strength, defined as the difference between the connectivity obtained from the group of patients, c*f_P, and from the control group, c*f_C: Δc = c*f_P − c*f_C. As shown in Appendix C, the quality Q (also defined in Appendix C) of the retrieved set of genes is positively correlated with Δc, which is easily obtained from the data analysis. Moreover, this correlation persists at relatively high levels for a broad range of (known and unknown) parameters. On these grounds, the full method can now be established. In the model, Δc depends on all the parameters, but in the data analysis part only two are changed; then Δc → Δc(ρ_G, f), and the same goes for Q → Q(ρ_G, f). A sweep of the parameter space ρ_G × f is required in order to find Δc_max, which is very likely to also give Q_max (or a quality value close to it) and, hence, a gene set of the best possible quality. Q_max = 1 means that the whole set of disease-related genes is extracted and none of the unrelated ones.
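The parameter sweep can be sketched generically. Here `delta_c` is assumed to be a hypothetical callable wrapping the full pipeline (cohort filtering, projection onto the network, and the connectivity difference); the function name is ours.

```python
def sweep_delta_c(delta_c, rho_grid, f_grid):
    """Scan the rho_G x f parameter plane and return the pair
    (rho_G, f) maximizing the connectivity signal
    delta_c = c*f_P - c*f_C.

    delta_c : callable (rho_G, f) -> float, e.g. wrapping the
              filter-project-measure steps described above.
    """
    return max(((rho, f) for rho in rho_grid for f in f_grid),
               key=lambda pair: delta_c(*pair))
```

The pair returned by the sweep then serves as the estimate (ρ̃_G, f̃) from which the disease parameters are inferred in the next section.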

D. Estimation of disease parameters
From the data analysis and the analytical equations discussed so far, we can estimate the disease parameters. After the scan of the ρ_G × f space, the point where c(ρ_G, f) is maximal has been obtained. For the sake of simplicity, we assume that this point is unique. In practice (in particular in small empirical data sets), noise can lead to multiple coexisting optimal or near-optimal points, which would then need to be evaluated separately. Furthermore, we assume that this point coincides with the highest quality Q. In this case, the maximization of c yields the optimal parameters ρ̃_G and f̃. This enables us to subsequently estimate the values of the disease-related parameters D, A, and p as D̃, Ã, and p̃. This estimation procedure is performed in a hierarchical order: first ρ̃_G yields D̃, then f̃ and ρ̃_G yield Ã, and lastly D̃, Ã, and c_max yield p̃. This order acknowledges the underlying dependence scheme of the quantities involved: ρ̃_G → ρ̃_G(D), f̃ → f̃(D, A), and c_max → c_max(D, A, p).

Estimation of D
At first, the estimation of D from ρ̃_G is based on Eq. (8) and yields D̃ = ρ̃_G G, where ρ̃_G is the value of ρ_G for which c is maximal. This is easily done with the data-analysis routine and gives an almost perfect estimation of D, with a precision that depends on the resolution of the ρ_G parameter grid.
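Since ρ̃_G is found at D/G, the first estimation step reduces to a single multiplication. The sketch below assumes this relation; the rounding reflects that D is an integer count and that the ρ_G grid resolution bounds the error.

```python
# Sketch of the first estimation step: assuming rho_G_opt = D / G at the
# connectivity maximum, the disease-set size is recovered by rescaling.
# The grid resolution of rho_G limits the achievable precision.

def estimate_D(rho_G_opt: float, G: int) -> int:
    """Estimate the number of disease genes from the optimal threshold."""
    return round(rho_G_opt * G)

# Example: G = 1000 genes, optimum found at rho_G = 0.05.
D_est = estimate_D(0.05, 1000)
```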

Estimation of A
The next step is the estimation of Ã, which is obtained from f̃. The resulting expression is similar to Eq. (12), but now with ρ_G instead of ρ_D. This approximation gives a better estimation and a simpler analytical form. The reason is that in Eq. (12) the quality Q of the signal was maximized, while in Eq. (15) the actual network signal c is maximized. Even though the two are close to each other, they are not always perfectly aligned. Inserting ρ̃_G into Eqs. (2) gives the required quantities. Plugging Eq. (16) into Eq. (15) and combining it with the restriction that A cannot exceed 1 yields the estimate Ã. It should be noted that this is an approximation. However, it works reasonably well, as can be seen in Figs. 5 and 6, where the estimate for p is also depicted.

Estimation of p
Having obtained D̃ and Ã, the last step is the estimation of p as p̃ from c_max. Solving Eq. (7) for p, with D̃ and Ã inserted for D and A, yields p̃. In this way, all the hidden parameters can finally be estimated from the data. Having these values, the quality Q can also be evaluated. See Figs. S5-S8 in Ref. [51] for more details regarding the above estimation.

V. DISCUSSION
The challenge of systems medicine is to employ modeling and data-analysis tools from systems biology for the interpretation of medical data. Statistical physics has routinely contributed data-analysis techniques for gene expression data (see, e.g., Refs. [40,42,44]). In contrast to the many successful applications of mathematical and computational methods in systems biology, data sets in systems medicine are often very small and highly heterogeneous, with few well-documented cases of evidence converging towards universal principles. We can think of the "signal strength" in such data as the maximal amount of information an ideal analysis method would be capable of extracting from a given data set (e.g., for discriminating the disease state from healthy controls). In typical medical data sets, we can expect this signal strength to be rather small. At the same time, we see a tremendous diversity in computational methods, often to the point where data-analysis tools are tailor-made for a specific data set and applied to this data set alone [52]. In particular, there are few comparative studies that could allow a medical researcher an informed choice within this diversity of computational methods. Notable exceptions are Refs. [53,54].
We employ well-established tools, such as random walks and networks, to formulate a simple yet flexible model of gene expression profiles for clinical cohorts typically encountered in medical research. This abstract, generative model can be used to create test instances of data to try out and calibrate existing analysis tools. It can also facilitate the design of new data-analysis methods. With the filtering-based connectivity assessment described here, we give an example of such a new method. Here we ask what a cohort-level, collective analysis (in contrast to the sequential analysis of individual patients within a disease cohort) can offer. We exploit the fact that the disease subgroup can be expected to be more uniform than the controls, because its members share a phenotypic feature. Qualitatively, in our generative model, this leads to an enrichment of disease-associated genes in the filtered set.
Here we show that, surprisingly, key parameters of the data, like the number of disease-associated genes, can be reconstructed from the data by using a comparatively small number of quantitative parameters: (1) the "recall" of disease-associated genes in the expression data (i.e., the offset in the likelihood of being differentially expressed in the disease state) and (2) the amount of clustering of differentially expressed genes in the given biological network. We can furthermore clearly delineate the region of signal strength where the network facilitates the analysis, e.g., the identification of the disease-associated genes.
The filtering-based data-analysis method presented here can be seen as competing with a multitude of related methods from the corresponding literature in bioinformatics, systems biology, and systems medicine [55]. An example is the "key pathway miner" method [56-58], which shares similarities with the filtering approach described above.
The data-generation part of our study can be used to test and compare the performance of these different analysis methods in a quantitative fashion, under variation of the main parameters (such as cohort size, network clustering, size of the disease gene set, etc.).
The generative model is flexible enough to simulate many different types of diseases, to draw statistical conclusions and, as a consequence, to gain some insight into the disease. Furthermore, analytical solutions for the main quantities of the model are provided, which facilitates our understanding of the parameter interdependencies. The most surprising finding is that the results are mostly independent of the network structure and depend essentially only on the network density. This holds especially for relatively sparse networks, which is the case for most biological networks.
In practice, the model can be used to create cohort data and to explore the various parameter dependencies via a web-based PYTHON application; see Ref. [59]. The method of analysis of the simulated data presented here works remarkably well and yields estimates of the disease gene set with high accuracy, even for small cohorts and weak signals. We believe that the method is capable of contributing to new discoveries in biology and medicine.
An important resource for disease-associated genes currently is genome-wide association studies (GWAS), as discussed in Refs. [60,61]. Ultimately, disease associations of genes derived from such approaches based on population genetics need to converge with the more functional associations obtained from transcriptome profiles of patients (which is the type of data set discussed here). Biological networks can be expected to play an important role in uniting these two categories of disease associations of genes [11]. Our generative model is intended as a computational resource along this way.
The "teleportation random walk" (TRW) algorithm is well conceptualized as a linear chain of events, with only two possibilities for each step in this chain. Step A: with probability P(A) = p, move to a neighboring vertex and color it. Step B: with probability P(B) = 1 − p, jump (teleport) to a randomly selected vertex and color it. Hence a chain of events may look like AAABABBAABB, with a graphical representation in which a link to a neighbor (−) corresponds to a step A, a random jump, i.e., the lack of a link (· · ·), corresponds to a step B, and each • is a vertex. The probability of such a chain of events is the product of the individual step probabilities, p^{n_A} (1 − p)^{n_B}, where n_A and n_B count the steps of each type. The calculation of c in a 1D connected chain is straightforward. It depends only on the jump probability p: c = 1 − (1 − p)^2, where (1 − p)^2 is the probability for a given colored vertex to have no colored neighbors in the 1D chain; therefore 1 − (1 − p)^2 is the probability of having at least one colored neighbor, which is sufficient to regard the vertex as connected.
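The two-step process can be sketched directly. The snippet below is a minimal illustration, not the authors' implementation: it assumes the graph is given as adjacency lists, takes a random neighbor in step A, and also provides the connectivity measure (fraction of colored vertices with a colored neighbor) and the analytic 1D-chain value 1 − (1 − p)^2.

```python
# Sketch of the teleportation random walk (TRW): with probability p move
# to a random neighbor of the current vertex and color it (step A); with
# probability 1 - p teleport to a uniformly random vertex and color it
# (step B). The graph is given as a dict of adjacency lists.

import random

def trw_coloring(adj, n_colored, p, rng=random):
    """Color n_colored distinct vertices of the graph via a TRW."""
    vertices = list(adj)
    current = rng.choice(vertices)
    colored = {current}
    while len(colored) < n_colored:
        if adj[current] and rng.random() < p:
            current = rng.choice(adj[current])   # step A: walk to a neighbor
        else:
            current = rng.choice(vertices)       # step B: teleport
        colored.add(current)
    return colored

def connectivity(adj, colored):
    """Fraction of colored vertices with at least one colored neighbor."""
    linked = sum(1 for v in colored if any(u in colored for u in adj[v]))
    return linked / len(colored)

def c_chain(p):
    """Analytic connectivity on a long 1D chain: 1 - (1 - p)^2."""
    return 1.0 - (1.0 - p) ** 2
```

For p = 1 the walk never teleports, so a single connected cluster forms and c_chain(1) = 1, matching the limiting cases discussed for Fig. 7.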

Assumptions
The above chain-of-events picture is exact on a 1D chain of vertices and is approximately valid for relatively sparse networks with low clustering coefficients. In such a case, the colored cluster is very well approximated by a 1D chain of colored vertices, forming a pathlike structure embedded in the structure of the network.
The connectivity c created by the TRW depends on the density of links d in the network, but is rather independent of other aspects of the network structure. This is due to the fact that the TRW algorithm follows the structure of the network itself, effectively canceling the structural influence on the connectivity c, which quantifies the number of breaks in the path.
A second assumption is related to the network size. For p < 1, the teleportation steps B cause the formation of multiple clusters. In small networks it is very likely that some of them overlap. This overlap is difficult to incorporate in analytical calculations. In order to reduce the effects of cluster overlap, the network size should be much larger than the number of colored nodes, N ≫ n.

Background contribution
In the case of a real graph, c_1 is the contribution to c from the 1D chainlike arrangement of colored vertices. In a graph different from the 1D chain, some colored vertices can be connected via additional edges lying outside of the 1D arrangement. This creates the possibility for a vertex isolated within the 1D chain to be connected to some other colored vertex via the "network background." This affects the connectivity outcome c and is described by c_2 = 1 − (1 − d)^{n−3}, where n is the number of colored vertices, d is the density of edges, and the 3 appears because two potential edges were already used by the 1D cluster chain. The final expression, obtained from the combination of the factors c_1 and c_2, has the form c_c = 1 − (1 − p)^2 (1 − d)^{n−3}. It describes the average c_c, which is also depicted in Fig. 7, where three different colorings are plotted. The network and the number of colored nodes are the same, but the values of p differ. For p = 0, the colored nodes are spread completely randomly; p = 0.5 displays moderate clustering of the colored nodes; and for p = 1.0 there is just one connected cluster of colored nodes. Furthermore, the connectivity c as a function of p is shown.
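The combination of the chain factor and the background factor can be written as one expression, assuming the two isolation mechanisms act independently: a colored vertex stays isolated only if it has no colored chain neighbor, probability (1 − p)^2, and no background edge to the remaining n − 3 colored vertices, probability (1 − d)^{n−3}. A minimal numeric sketch under that independence assumption:

```python
# Sketch of the combined average connectivity under the independence
# assumption stated above: isolation requires both no colored 1D-chain
# neighbor, prob. (1 - p)^2, and no background edge to the other n - 3
# colored vertices, prob. (1 - d)^(n - 3).

def c_combined(p: float, d: float, n: int) -> float:
    """Average connectivity of n TRW-colored vertices, edge density d."""
    return 1.0 - (1.0 - p) ** 2 * (1.0 - d) ** (n - 3)

# Limiting cases: p = 1 gives a single unbroken chain (c = 1), while
# p = 0 on an edgeless background (d = 0) gives fully isolated nodes.
```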

Subgraphs
Previously, c_c was defined as the ratio of the number of colored vertices having at least one colored neighbor to the number of all colored vertices in the graph. Via the concept of subgraphs, we can look at the connectivity from another perspective. In the subgraph consisting of all colored vertices extracted from the bigger original graph, (V_c, E_c) ⊆ (V, E), the connectivity c_c is the percentage of nonisolated vertices. The connectivity of a random sample taken from the original graph is c* = 1 − (1 − d)^{n−1}, where n is the number of vertices in the subgraph, d is the density of edges, and * denotes random sampling over a given set.
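The random-sample expression follows from treating the edges as independent with density d (an Erdős–Rényi-like assumption): a sampled vertex is isolated only if all n − 1 possible edges to the other sampled vertices are absent. A minimal sketch:

```python
# Sketch of the expected connectivity of a random n-vertex sample from a
# graph of edge density d, assuming independent edge placement: a vertex
# is non-isolated unless all n - 1 potential edges to the other sampled
# vertices are missing, prob. (1 - d)^(n - 1).

def c_random_sample(n: int, d: float) -> float:
    """Expected fraction of non-isolated vertices in a random sample."""
    return 1.0 - (1.0 - d) ** (n - 1)

# A complete graph (d = 1) always yields c = 1; an empty one yields 0.
```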
We can now look at the connectivity obtained from randomly selected vertices of the colored subgraph (V_c, E_c) and from another subgraph consisting of nodes randomly selected from the noncolored ones. Let n*_c be the number of randomly selected colored vertices and n*_c̄ the number of vertices randomly selected from the uncolored rest of the graph; hence n*_c + n*_c̄ is the total number of vertices in the combined subgraph. For the randomly selected subset of colored vertices V*_c, the connectivity depends on n_c, the number of all colored vertices, and on R = n*_c / n_c, the probability of a colored node being randomly selected. The final expression for c_u comes from the weighted average of c*_c and c*_c̄. This expression is an approximation for the average c over the ensemble of graphs and subgraphs. The approximation works very well for most of the graphs used in systems biology, as they tend to be sparse and to have low clustering coefficients. The value of c is an observable coming directly from the data analysis. It has been shown here that it can also be easily calculated for the TRW model presented in the main part of our investigation.

APPENDIX B: FILTERING AND MAXIMIZATION
This section depicts the details of the data-analysis procedure and refers to its two crucial parts. Expression filtering and connectivity computation are illustrated in Fig. 8. The tuning of the analysis parameters to maximize c, in order to extract the set of genes with the highest match to the disease gene set, is depicted in Fig. 9.

FIG. 8. Illustration of the filtering procedure. First, the dichotomized individual transcriptome profiles (obtained by evaluating the threshold ρ_G) are summed up to form a combined multiset on the network (combined network). Then several filtered subnetworks are created based on the number of occurrences of nodes in the multiset (employing the filtering threshold f). Next, for each subnetwork, the fraction of nonisolated nodes (connectivity c_f) is calculated. This ensemble of connectivity values is the subject of the further analysis depicted in Fig. 9. For positive values of f, all vertices appearing more than f times are taken, while for negative values all vertices appearing |f| times or fewer are selected. f = 0 is treated as a positive value, as all vertices appearing more than 0 times survive the filtering, which is equivalent to no filtering at all.
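The filtering step of Fig. 8 can be sketched compactly. The snippet below is illustrative (names are not from the text) and assumes the reading of the negative-f rule given above: keep genes appearing |f| times or fewer.

```python
# Sketch of the filtering step of Fig. 8: per-patient dichotomized gene
# sets are summed into a multiset of occurrence counts; for f >= 0 keep
# genes appearing more than f times, for f < 0 keep genes appearing
# |f| times or fewer (an assumed reading of the negative-f rule).

from collections import Counter

def filter_genes(patient_gene_sets, f):
    """Apply the occurrence-count filter to a list of per-patient sets."""
    counts = Counter(g for genes in patient_gene_sets for g in genes)
    if f >= 0:
        return {g for g, n in counts.items() if n > f}
    return {g for g, n in counts.items() if n <= -f}

# Three toy patients; gene "A" appears 3 times, "B" twice, "C"/"D" once.
profiles = [{"A", "B"}, {"A", "C"}, {"A", "B", "D"}]
kept = filter_genes(profiles, 1)  # genes seen more than once
```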

APPENDIX C: QUALITY ASSESSMENT 1. Quality measures
As discussed above, maximizing c is a good strategy for finding the optimal values of the parameters ρ_G and f. These optimal values, ρ̃_G and f̃, allow us to extract the set of genes possibly closest to the disease gene set V_D. In this Appendix, the assessment of the reconstruction quality of the disease gene set is discussed. The aim of our data-analysis method is the extraction of a candidate set of disease genes, V*f_P, possibly closest to the true disease gene set V_D. Ideally, V*f_P = V_D; in reality this is rarely achieved. What can be achieved is a maximization of the ratio D*/G*, which by definition is D*/G* = D*/(H* + D*). Hence, for D*/(H* + D*) = 1, only true disease genes have been extracted, such that V*f_P ⊆ V_D. Filtering has the goal of retaining only highly reliable candidate genes. It can happen that most of the set is filtered out, so that only a small part of V_D remains. This contradicts the aim of retrieving the largest possible number of disease genes. Therefore, D*/D should also be as high as possible.
Summarizing, the task is to find a reasonable compromise between two contradictory factors: maximization of the precision PPV = D*/G* (positive predictive value) and maximization of the sensitivity TPR = D*/D (true positive rate). The method described in the main text provides a heuristic for obtaining a good solution. Using the quantities introduced above, we can now evaluate the quality of the candidate set of disease genes and its proximity to the optimal set. For this task it is convenient to define a single parameter measuring quality. One of the most natural choices is the product of sensitivity and precision, Q = PPV · TPR. A property of the PPV is that PPV ∈ [0, 1]. For PPV = 0, no disease genes are retrieved, V*f_P ∩ V_D = ∅, while for PPV = 1 only disease genes are retrieved, V*_P ⊆ V_D and V*_P ∩ V_H = ∅.
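The two measures and their product can be evaluated directly from the retrieved and true gene sets. A minimal sketch (set names are illustrative):

```python
# Sketch of the quality measures discussed above: precision (PPV),
# sensitivity (TPR), and their product Q, computed for a retrieved
# candidate set V_ret against the true disease gene set V_D.

def quality(V_ret: set, V_D: set):
    """Return (PPV, TPR, Q) for a retrieved set vs. the true set."""
    true_pos = len(V_ret & V_D)
    ppv = true_pos / len(V_ret) if V_ret else 0.0
    tpr = true_pos / len(V_D) if V_D else 0.0
    return ppv, tpr, ppv * tpr

# Perfect retrieval gives Q = 1; a disjoint candidate set gives Q = 0.
```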
A property of the TPR is that TPR ∈ [0, 1], where for TPR = 0 again no disease genes are retrieved, V*f_P ∩ V_D = ∅, while for TPR = 1 all disease genes are retrieved, V_D ⊆ V*f_P. One may assume that the best values of the parameters ρ_G and f can be extracted solely on the basis of the ROC curve (Fig. 13). This is false, because the optimal values of the parameters change with other parameters of the model, about which prior knowledge is inaccessible. Therefore, the network is essential in identifying the optimal prediction.

FIG. 2 .
FIG. 2. Eight different combinations of the parameters ρ_G and f, resulting in eight different sets, i.e., groups of selected ("colored") nodes. The color coding is as follows: red, true positives V_D ∩ V*_P; yellow, false negatives V_D \ V*_P; and green, false positives V*_P \ V_D. A perfect set would have only red nodes and none of the green or yellow ones. The closest to it is (d), which also has the highest c (where c indicates how strongly the ratio of nonisolated nodes (connectivity) of the patient group exceeds the one derived from the control group). Other parameter values: G = 1000, d = 0.006, D = 50, P = 20, p = 0.3, and A = 0.25. Different columns stand for subsequent values of ρ_G: [(a), (c), (e), and (g)] 0.01 and [(b), (d), (f), and (h)] 0.05. Rows denote different values of f: [(a) and (b)] 5, [(c) and (d)] 3, [(e) and (f)] 1, and [(g) and (h)] 0.

FIG. 3 .
FIG. 3. Comparison of individual connectivity distributions for P = 20 patients (red) and C = 20 controls (green) for otherwise the same parameters as in Fig. 2: G = 1000, d = 0.006, D = 50, and p = 0.3. The dark red color indicates the overlap of the two histograms (red + green). From left to right: more differentially expressed genes; from bottom to top: stronger disease-control differences. Different columns stand for subsequent values of ρ_G: [(a) and (c)] 0.01 and [(b) and (d)] 0.05. Rows denote different values of A: [(a) and (b)] 0.5 and [(c) and (d)] 0.25.

FIG. 4 .
FIG. 4. Connectivity c for collective expression for different A, p, and f (while the parameters D, P, and ρ_G are fixed). Simulations are made on an Erdős-Rényi graph with an edge density of d = 0.006. Red: c*f_P; green: c*f_C; and orange: connectivity of the complete pure disease-related gene set c_f(V_D). Solid lines are from the numerical experiment, with filled areas denoting the span of ±σ, while dashed lines are the average values obtained analytically. Different columns stand for subsequent values of f: [(a) and (c)] 0 and [(b) and (d)] 1. Rows denote different values of A: [(a) and (b)] 0.5 and [(c) and (d)] 0.25.

FIG. 7 .
FIG. 7. Example of the clustering (and connectivity) of selected ("colored") nodes for different values of the TRW parameter p: (a) 0, (b) 0.5, and (c) 1.0. Panel (d) depicts the connectivity c(p) as a function of p.