Subgraph covers: An information theoretic approach to motif analysis in networks

Many real world networks contain a statistically surprising number of certain subgraphs, called network motifs. In the prevalent approach to motif analysis, network motifs are detected by comparing subgraph frequencies in the original network with a statistical null model. In this paper we propose an alternative approach to motif analysis where network motifs are defined to be connectivity patterns that occur in a subgraph cover that represents the network using minimal total information. A subgraph cover is defined to be a set of subgraphs such that every edge of the graph is contained in at least one of the subgraphs in the cover. Some recently introduced random graph models that can incorporate significant densities of motifs have natural formulations in terms of subgraph covers and the presented approach can be used to match networks with such models. To prove the practical value of our approach we also present a heuristic for the resulting NP-hard optimization problem and give results for several real world networks.


INTRODUCTION
Many complex systems can be modeled as networks where vertices represent interacting elements and edges represent interactions between them. A large number of real world networks have been found to contain a statistically surprising number of certain small connectivity patterns called network motifs [1]. Network motifs, which are also commonly referred to as basic building blocks of networks, are thought to play an important role in the structural and functional organization of complex networks. For instance, in biological and technological networks motifs are thought to contribute to the overall functioning of networks by performing modular tasks such as information processing [2]. Hence, methods for identifying such characteristic connectivity patterns are of great importance for a better understanding of complex networks.
The prevalent approach to motif analysis is due to Milo et al. [1] and is based on comparing the subgraph frequencies of the original network with a statistical null model that preserves some features of the original network. Part of the analysis consists of generating a representative sample of the null model which is used to determine empirical values for the mean and variance of motif counts in the null model. Motifs for which the frequencies significantly deviate from the null model are then classified as network motifs. In their original paper, Milo et al. suggest that when detecting motifs of size n the null model should conserve the degree distribution of the original network as well as the motif counts of size n-1. For generating such networks they propose a simulated annealing approach. However, it is not clear whether the simulated annealing approach samples such null models uniformly. Moreover, in most applications it is computationally not feasible to preserve lower order motif counts for motifs larger than 4 vertices. Consequently, in most practical applications [1,3,4] lower order motif counts are not conserved and the configuration model [5] with the same degree distribution as the original network is used as a null model. This has the unwanted consequence that most subgraphs that contain an over-represented sub-motif are classified as network motifs.
In this paper, we introduce an alternative approach to motif analysis that is based on using subgraph covers as representations of graphs. A subgraph cover can be seen as a decomposition of the network into smaller building blocks. Given any network there are many ways of decomposing it into a subgraph cover. Consequently, one needs a way of comparing subgraph covers. For this, following the total information approach by Gell-Mann and Lloyd [6], we look at motifs as regularities of a network which can be used to obtain a more concise representation of the network. In our approach network motifs are defined as subgraph patterns that appear in a subgraph cover that represents the network using minimal total information. Note that this definition of network motifs is fundamentally different from the definition of Milo et al. [1].
Another aim of this paper is to establish a connection between motif analysis and random graph models. In contrast to most real world networks, commonly used network models are locally tree-like. Developing random graph models that can incorporate high densities of triangles and other motifs has been a long standing problem. Recently, two random graph models that can incorporate significant densities of motifs have been proposed [7,8]. However, it remains unclear how one should select the set of motifs to be used in such models given a specific network. As we shall see later, these models can be formulated as ensembles of subgraph covers and total information optimal subgraph covers can be used to match networks with specific instances of these models.
The article is organized as follows: in Sec. 2 we present the theory underlying our approach. Then in Sec. 3, we examine the resulting optimization problem and propose a heuristic for it. In Sec. 4 we present empirical results for several real world networks and also test the heuristic on some synthetic networks with predefined motif structure. Finally, in Sec. 5 we summarize our results and discuss directions for future research.

THEORY
In this section we first introduce necessary graph and information theoretical concepts. We then define the total information for subgraph covers and, following the approach by Gell-Mann and Lloyd [6], we use the smallness of the total information as a criterion for selecting a subgraph cover that is an optimal representation of a given network. Finally, we discuss the relation between total information optimal subgraph covers and model selection for random graphs.

Subgraph Covers
A graph G=(V(G),E(G)) is an ordered pair of sets where V(G) (|V(G)| = N) is the set of points called vertices and E(G) is a set of links called edges that connect pairs of vertices. Depending on the kind of network, edges might be directed or undirected. In this article, however, we will not make an explicit distinction between directed and undirected graphs since the arguments and definitions apply to both equally well. In general, we will assume that G is sparse, i.e. |E(G)| = O(N). Most real world networks are sparse [9].
In graph theory, motifs correspond to isomorphism classes. Two graphs G and H are said to be isomorphic (G ≅ H) whenever there exists a bijection φ : V(G) → V(H) such that (x, y) ∈ E(G) ⇔ (φ(x), φ(y)) ∈ E(H) for all x, y ∈ V(G). Such a map φ is called an isomorphism. Being isomorphic is an equivalence relation and the corresponding equivalence classes are called isomorphism classes.
A graph H is called a subgraph of G whenever V(H) ⊆ V(G) and E(H) ⊆ E(G). A set of subgraphs C is said to be a subgraph cover of G whenever ⋃_{H∈C} E(H) = E(G). Subgraph covers are representations of graphs, meaning that given a subgraph cover the corresponding graph can be recovered fully from the cover. Trivial examples of subgraph covers are the set of all edges of G and G itself. Other examples are the maximal clique and star covers, which are the sets of all cliques/stars that are not subcliques/substars. An n-clique consists of n vertices that are all mutually connected, an n-star consists of a single central vertex that is connected to n peripheral vertices, and a subclique/substar is a clique/star that is a subgraph of some larger clique/star. While the maximal star cover is closely related to the adjacency list representation of the graph, clique covers are closely related to bipartite representations [5,10].
Given a cover C, its motif set M(C) is the set of the isomorphism classes of the subgraphs in C. Similarly, given a set of isomorphism classes M, an M-cover C_M is a subgraph cover of which every element belongs to some class in M.
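The cover condition above (every subgraph uses only edges of G, and the union of their edge sets is E(G)) can be checked directly. The following is a minimal sketch for undirected graphs; the function name `is_subgraph_cover` and the representation of a subgraph by its edge list are assumptions made for illustration, not part of the original text.

```python
def is_subgraph_cover(graph_edges, subgraphs):
    """Check that the edge sets of `subgraphs` cover `graph_edges` exactly.

    Edges are treated as undirected pairs and each subgraph is given by
    its edge list (a simplifying assumption of this sketch).
    """
    target = {frozenset(e) for e in graph_edges}
    covered = set()
    for sub_edges in subgraphs:
        sub = {frozenset(e) for e in sub_edges}
        if not sub <= target:      # a subgraph may only use edges of G
            return False
        covered |= sub
    return covered == target       # every edge lies in at least one subgraph


# A triangle with one pendant edge.
G = [(0, 1), (1, 2), (0, 2), (2, 3)]
triangle_and_edge = [[(0, 1), (1, 2), (2, 0)], [(2, 3)]]
all_single_edges = [[e] for e in G]   # the trivial edge cover
whole_graph = [G]                     # the trivial cover consisting of G itself
```

All three example covers represent the same graph; they differ only in how the edge set is grouped into building blocks.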

The Total Information Approach
The total information framework [6] is based on the idea that, given an entity, one can use Shannon information or entropy to describe its random/non-regular aspects and algorithmic information content to describe its regularities or rule based features. In this approach, identifying certain regularities of an entity is equivalent to embedding it into an ensemble of objects that share these regularities while they differ in other aspects.
The first information measure of interest for the total information approach is the entropy, also known as the Shannon information. For an ensemble E(R, p_r), where R is the set of possible outcomes and p_r is the respective probability of an element r ∈ R, the entropy measures the uncertainty regarding the outcome of E and is given by:
$$S(E) = -K \sum_{r \in R} p_r \log p_r,$$
where K is a constant. When K=1 and the logarithm is base 2, the entropy is measured in bits.
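As a small numerical illustration, the entropy of a discrete ensemble in its standard Shannon form, S = -K Σ_r p_r log p_r, can be computed as follows. The function name and its parameters are hypothetical and chosen for this sketch only.

```python
import math


def shannon_entropy(probs, K=1.0, base=2.0):
    """Entropy -K * sum(p * log_base(p)); outcomes with p = 0 contribute nothing."""
    return -K * sum(p * math.log(p, base) for p in probs if p > 0)


uniform_four = shannon_entropy([0.25] * 4)      # four equally likely outcomes
certain = shannon_entropy([1.0])                # a single certain outcome
```

With K = 1 and base 2 the result is in bits, so a uniform ensemble over four outcomes needs two bits and a certain outcome needs none.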
Another information measure that is needed in order to define the total information is the effective complexity. The effective complexity ε(E) of an entity that is embedded into an ensemble E as a typical member is given by the algorithmic information content (AIC) of the ensemble. The algorithmic information content of such an ensemble E with respect to a universal computer U is the length of the shortest program that instructs U to output a description of E and then halt (i.e. ε(E) = AIC_U(E)). In general the effective complexity is not computable and is computer dependent; therefore in practice one is restricted to working with approximations in the form of upper bounds. The issue of how to define a practical effective complexity for subgraph covers is dealt with in Sec. 3.
The sum of the effective complexity and the entropy is the total information required to describe both the random features and regularities of an entity using a certain model:
$$\Sigma(E) = \varepsilon(E) + S(E).$$
For a given entity, there might be a multitude of ensembles into which the entity can be embedded as a typical member and it may not always be clear which set of regularities/model provides the best description of the entity. The total information provides a basis for comparing models that describe the same entity. When comparing models, the better model is the one that minimizes the total information and then, subject to this constraint, minimizes the effective complexity. Together with additional constraints on computation time, the framework provides a method for identifying regularities/models that 'most' effectively describe a given entity, which in many regards is independent of the observer [6]. The minimization of the total information is closely related to the minimum description length [11] and minimum message length approaches [12].
Following the above definitions, we define the total information of subgraph covers by embedding them into uniform subgraph covers. These are the ensembles of all subgraph covers with fixed motif counts. For this we need to compute the number of different ways a motif m can appear on N vertices, which depends on the automorphism group of the motif. An automorphism of a graph is a permutation of its vertex labels that preserves the edges of the graph. The number of all such permutations gives us the number of equivalent vertex labellings of the graph. To specify an instance of m on N vertices, one needs to specify the set of vertices m appears on and how it is embedded into this set. From the definition of the automorphism group it follows that there are
$$\frac{N!}{(N-|m|)!\,|\mathrm{Aut}(m)|}$$
different ways in which m can appear on N vertices. The entropy of the uniform ensemble of n_m copies of m is therefore
$$S(m, n_m) = \log \binom{\frac{N!}{(N-|m|)!\,|\mathrm{Aut}(m)|}}{n_m},$$
which, given m, is the information required to specify n_m instances of m on N vertices. Generalizing the above expression, the entropy of a cover C with motif set M(C) and motif counts n_m (m ∈ M(C)) is defined as the entropy of the uniform ensemble of all covers with motif counts n_m:
$$S(C) = \sum_{m \in M(C)} S(m, n_m).$$
When needed, the entropy terms can easily be approximated using Stirling's formula. For instance, when n_m and N are large enough and |m| > 2:
$$S(m, n_m) \approx n_m \left[\, |m| \log N - \log |\mathrm{Aut}(m)| - \log n_m + \log e \,\right].$$
As in the case of the entropy, we define the effective complexity of a cover using uniform covers with the same motif counts: ε(C) = AIC_U(E(M(C), n_m)). Consequently, the total information of a cover can be defined as:
$$\Sigma(C) = \varepsilon(C) + S(C).$$
Following the total information approach, a cover is an optimal representation of the network if it minimizes the total information. As a result we can define the network motifs of G to be the motif set of a Σ-optimal subgraph cover of G: M(C_Σ(G)). The Σ-optimal subgraph cover also gives a decomposition of the network in terms of these motifs. In general there might be more than one subgraph cover that minimizes Σ. If this is the case, additional criteria such as the minimization of the effective complexity [6] have to be considered in order to pick one of the solutions over the others.
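The counting argument can be made concrete with a short sketch. It assumes that the number of distinct copies of a motif m on N labelled vertices is N!/((N-|m|)! |Aut(m)|) and that the entropy of n_m copies is the base-2 logarithm of the corresponding binomial coefficient, summed over motifs for a whole cover. All function names are hypothetical, and the brute-force automorphism count is only practical for small motifs.

```python
import math
from itertools import permutations


def automorphism_count(num_vertices, edges, directed=False):
    """|Aut(m)| by brute force over all vertex permutations (small motifs only)."""
    def key(u, v):
        return (u, v) if directed else frozenset((u, v))

    edge_set = {key(u, v) for u, v in edges}
    return sum(
        1
        for perm in permutations(range(num_vertices))
        if {key(perm[u], perm[v]) for u, v in edges} == edge_set
    )


def placements(N, size, aut):
    """Distinct copies of a motif on N labelled vertices: N! / ((N - size)! |Aut|)."""
    return math.perm(N, size) // aut


def motif_entropy_bits(N, size, aut, n_m):
    """S(m, n_m): bits needed to pick n_m of the possible placements."""
    return math.log2(math.comb(placements(N, size, aut), n_m))


def cover_entropy_bits(N, motif_counts):
    """S(C) as a sum over motifs; motif_counts holds (size, aut, n_m) triples."""
    return sum(motif_entropy_bits(N, s, a, n) for s, a, n in motif_counts)


triangle = [(0, 1), (1, 2), (0, 2)]
path3 = [(0, 1), (1, 2)]
aut_triangle = automorphism_count(3, triangle)   # all 3! relabellings preserve it
aut_path3 = automorphism_count(3, path3)         # identity and end-point swap
```

For example, a triangle can be placed on 10 labelled vertices in 10·9·8/6 = 120 ways, while a 3-vertex path, having fewer symmetries, can be placed in 10·9·8/2 = 360 ways.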
The quantity ε(C_Σ(G)) can be interpreted as a measure of the complexity of G's subgraph structure, which corresponds to other measures that are frequently used as indicators of a network's complexity such as the broadness of the degree distribution and/or clustering. While the broadness of the degree distribution gives the variety of the star shaped subgraphs that occur in the network, high clustering indicates that the network has a local structure that involves subgraphs other than trees.

Subgraph Covers and Model Selection
In this section we will consider two models that are closely related to subgraph covers: the model introduced by Bollobás, Janson and Riordan [7] and the model introduced by Karrer and Newman [8]. Although these models can account for large densities of nontrivial subgraphs, it is not clear how one should select the set of motifs to be used in such models when matching these with a given network. In their article Karrer and Newman [8] mention this as an important open problem. In the following, we formulate these models in terms of subgraph covers and discuss how Σ-optimal subgraph covers can be used to associate networks with such models.
Random graphs with clustering: In [7] Bollobás, Janson and Riordan introduced a very general class of random graph models that is based on adding copies of certain motifs onto the vertices of a graph. For the sake of simplicity, we will only consider the homogeneous models, i.e. the case where all vertices have the same type.
For the non-homogeneous version of the model as well as various analytical results concerning the properties of the model we refer the reader to the original paper [7]. In the homogeneous case the model can be defined as follows: Let M be a set of motifs, each given by a labeled representative, and for every m ∈ M let k_m be a positive constant that corresponds to the density of the motif in the model. Then for each m ∈ M and |m|-tuple (v_1, v_2, ..., v_{|m|}) of vertices one adds a copy of m to G, such that the i-th vertex of m is mapped onto v_i, with probability:
$$\frac{k_m}{N^{|m|-1}}.$$
Since we are mainly interested in simple graphs, we will assume that any parallel edges that are formed in this process are replaced with single edges in the network. The normalization factor 1/N^{|m|−1} ensures that the model has O(N) edges. Depending on the symmetry of m, the same subgraph might be added to the graph more than once, although the probability of this is very small. However, the model can be slightly modified in such a way that every m-subgraph is only considered once. This can be done by considering |m|-subsets of vertices instead of |m|-tuples. Then for each such subset, every distinct m-subgraph is added with probability:
$$\frac{|\mathrm{Aut}(m)|\, k_m}{N^{|m|-1}},$$
where the factor |Aut(m)| ensures that both models contain the same number of copies of m on average. With the above modification the model defines a multinomial distribution P_{(M,k)}(·) over the space of M-covers. This is then projected onto the set of edges in order to obtain a distribution over graphs. Thus, the probability of a graph G in this model is given by:
$$P(G) = \sum_{C \in C_M(G)} P_{(M,k)}(C),$$
where C_M(G) is the set of all M-covers of G. Uniform subgraph covers are essentially microcanonical versions of these models. Consequently, the presented approach can be seen as a way of inferring the subgraph cover state of such models. The Σ-optimal cover can be further used as a basis for associating the network with non-homogeneous models of this type that also include correlations between subgraphs.
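The subset-based variant of the homogeneous model can be sketched as a sampler for small undirected graphs. The sketch assumes that each distinct copy of a motif on a vertex subset is added independently with probability |Aut(m)| k_m / N^{|m|-1} (capped at 1), and that the graph is obtained by taking the union of the sampled edge sets. The function names and the motif representation are hypothetical, and the exhaustive enumeration is only meant for toy sizes.

```python
import math
import random
from itertools import combinations, permutations


def distinct_copies(motif_edges, vertices):
    """All distinct copies of an undirected motif on the given vertex subset."""
    copies = set()
    for perm in permutations(vertices):
        copies.add(frozenset(frozenset((perm[u], perm[v])) for u, v in motif_edges))
    # Sort for a reproducible iteration order.
    return sorted(copies, key=lambda c: sorted(tuple(sorted(e)) for e in c))


def sample_cover(N, motifs, rng):
    """Sample an M-cover; motifs is a list of (size, edges, k_m) triples."""
    cover = []
    for size, motif_edges, k_m in motifs:
        for subset in combinations(range(N), size):
            copies = distinct_copies(motif_edges, subset)
            aut = math.factorial(size) // len(copies)   # |Aut(m)| via orbit counting
            p = min(1.0, k_m * aut / N ** (size - 1))
            cover.extend(c for c in copies if rng.random() < p)
    return cover


def project_to_edges(cover):
    """Project a cover onto its edge set; parallel edges collapse automatically."""
    return set().union(*cover) if cover else set()


triangle = [(0, 1), (1, 2), (0, 2)]
sampled = sample_cover(30, [(3, triangle, 0.5)], random.Random(0))
graph_edges = project_to_edges(sampled)
```

Different covers can project onto the same edge set, which is why the probability of a graph is a sum over all of its M-covers.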
Generalized configuration models: Another random graph model that is closely related to subgraph covers is the generalized configuration model proposed by Karrer and Newman [8]. This model is defined on the basis of a motif set M and a corresponding role sequence r.
Here, the role sequence specifies the number of different motifs attached to each vertex and how these motifs are attached to the vertex. The way in which a certain motif is attached to the vertex is given by the orbit of the automorphism group of the motif the vertex belongs to. The orbit of a vertex is the set of vertices it can be mapped onto by the automorphism group. In order to generate a graph corresponding to a role sequence r, every vertex is assigned a number of subgraph-stubs corresponding to its role index. A graph is then generated by matching stubs corresponding to the same type of subgraph in appropriate combinations at random and connecting them in order to form the corresponding motif-subgraph. However, in this form the model allows for two or more stubs of the same vertex to be matched together, which results in a subgraph that is a vertex contraction of the original motif. When such problematic cases are excluded from the model, every matching of the stubs actually corresponds to an M-cover. Consequently, the generalized configuration models can be formulated in terms of subgraph covers: the model corresponding to a role sequence r is the uniform ensemble of all subgraph covers with role sequence r.
Determining a role sequence for a network is essentially equivalent to choosing a subgraph cover for the network since every subgraph cover produces a specific role sequence for the network. The Σ-optimal cover can be considered as a viable candidate for assigning a role sequence to a network. On the other hand, an important property of the generalized configuration models is that biconnected subgraph counts are essentially determined by the motif set while singly connected subgraphs can be mostly accounted for by the role sequence. Consequently, one can also consider restricting the analysis to biconnected subgraphs when determining a role sequence for the network. This also significantly reduces the number of subgraphs that have to be considered in the analysis since the majority of connected subgraphs of sparse networks are only singly connected.
The models described above suggest that in principle one could also consider more general/non-uniform ensembles of subgraph covers to define the total information. For instance, in the case of the generalized configuration model one could use the ensemble of all subgraph covers that result in the same role sequence. However, there is no known simple way of calculating the entropy of such ensembles even if only single-edge subgraphs are considered, which would be equivalent to the classical configuration model. In addition, such ensembles have high effective complexity.

THE Σ-OPTIMAL SUBGRAPH COVER PROBLEM
In general when finding optimal subgraph covers, one would like to consider the most general set of potential motifs. However, in practice there are several technical limitations, the first being the graph isomorphism problem. That is, there exists no known polynomial time algorithm for resolving the problem of whether two finite graphs are isomorphic. The same holds for finding the automorphism group of a graph. Fortunately, there are several software packages that can efficiently compute the automorphism group of small graphs [13]. Second, the problem of finding whether a graph G contains a certain motif as a subgraph is NP-complete. Thus, finding subgraph instances can be computationally expensive, especially for large motifs. Third, the number of connected motifs grows faster than exponentially with size. For instance, there are over a million different directed motifs of size 6. Therefore, the set of candidate motifs of which the subgraph instances are to be included in the analysis has to be restricted so that the analysis can be completed in reasonable time. Restricting the set of candidate motifs to all connected motifs up to a certain size seems to be an obvious choice. On the other hand, one can also include special classes of motifs of arbitrary size in the set of candidate motifs.
If one wants to include special classes of motifs in the set of candidate motifs, any prior knowledge of the structure of the network can be used to make an educated guess about which motifs are more likely to produce covers with small total information. For instance, when examining the network representing an electronic circuit, the motifs corresponding to various known subcomponents of the circuit should be included in the set of candidate motifs. Also, if the network at hand is known to have a broad degree distribution, star shaped motifs can be included. Similarly, if some motifs are known to favor a certain type of dynamical behavior that is thought to be relevant to the network performing certain tasks, these patterns and their generalizations can be included in the analysis. As previously mentioned, if one intends to use the subgraph cover in order to determine a role degree sequence for the network, the set of motifs can be restricted to biconnected motifs. Disconnected motifs can be excluded from the analysis since the cover that independently contains the connected components of such subgraphs always has lower total information.
Another issue that has to be addressed in practice is that the algorithmic information content is not computable and in addition it is computer dependent. This can be resolved by substituting the algorithmic information content of the ensemble with the code length of a reasonable encoding of it. Another simplification we make is to assume that motifs are independently encoded, which results in an effective complexity term that is additive in the motifs. One obvious way of encoding motifs is to use edge lists. In this case the effective complexity ε(m) of a motif m is expressed in terms of S(|V(m)|, |E(m)|), the entropy of the ensemble of all graphs with the same vertex and edge counts as m, and log*, the iterated logarithm. On the other hand, one can also use a predefined/fixed encoding or catalog of the candidate motifs to define their effective complexities.
After the simplifications above, the total information reduces to:
$$\Sigma(C) = \sum_{m \in M(C)} \left[\, S(m, n_m) + \varepsilon(m) + \log^{*}(n_m) \,\right].$$
The choice of encoding used to define the effective complexity depends on the set of candidate motifs. The edge list encoding has the advantage of being independent of the set of candidate motifs and therefore is a natural choice when considering all motifs up to a certain size. On the other hand, given a specific set of candidate motifs, the catalog approach in general results in shorter code lengths compared to the edge list encoding. This makes the catalog approach more suitable when the set of candidate motifs contains special classes such as cliques, stars, cycles etc., since these have obviously shorter encodings than their edge lists.
Even with the candidate motifs restricted, finding a Σ-optimal subgraph cover is a non-trivial optimization problem. As formulated above, the problem of finding a Σ-optimal subgraph cover is a nonlinear set covering problem where the set to be covered is the edge set of the graph and the subsets are the edge sets of the subgraph instances of the candidate motifs. Set covering problems are known to be NP-hard even in the linear case [14]. Consequently, in most practical applications exact solutions are elusive and a heuristic has to be used.

The greedy algorithm
The greedy algorithm we propose is based on the stepwise construction of a subgraph cover. At each step the algorithm finds the motif that covers not yet covered edges of G most efficiently in terms of total information per edge. Given a partial cover C, the efficiency of a set S_m of m-subgraphs is defined as:
$$\sigma(S_m) = \frac{\Sigma(S_m)}{|E(S_m) \setminus E(C)|},$$
where E(C) and E(S_m) are the sets of edges covered by C and S_m respectively and Σ(S_m) is the total information corresponding to S_m. More precisely, Σ(S_m) = S(m, |S_m|) + ε(m) + log*(|S_m|). Following this definition, an optimal instance set of m is defined as a set of m-subgraphs that minimizes σ. At each step, the algorithm determines the efficiency of all motifs in the candidate motif set by determining an optimal instance set for each of them. In the next step, the algorithm checks for each motif whether including its optimal instance set in the cover increases the overall total information of the cover. Then, from the motifs of which the optimal instance set does not increase the total information, the most efficient one is selected. Having found the most efficient motif, the corresponding optimal instance set is added to the cover and the set of covered edges is updated. The process is repeated until all edges of the graph are covered. To ensure that the algorithm terminates, we require that the single edge motif is always included in the set of candidate motifs. The total information of partial covers is calculated by adding to them the single edge subgraphs corresponding to the uncovered edges. Here, one should note that motifs cannot be selected based solely on their efficiency because in general adding the optimal instance set of a motif to the cover decreases the efficiency of other motifs which, in certain cases, might lead to an increase of the overall total information.
Algorithm 1: GreedyOptimalCover(G(E,V), MS). Here, OptimalInstanceSet is a function that computes an optimal instance set given a motif and a set of covered edges, Σ is the total information and MS is the set of candidate motifs.
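The greedy procedure described in this section can be sketched in Python as follows. This is an illustrative reading of the prose description, not the authors' listing, and it rests on several labeled assumptions: only the single-edge and triangle motifs are candidates, the effective complexities `eps` are hypothetical fixed values, `log_star` is taken to be a Rissanen-style integer code length, and `optimal_instance_set` is replaced by a simple maximal edge-disjoint selection.

```python
import math
from itertools import combinations


def log_star(n):
    """Assumed meaning of log*: log2(n) + log2(log2(n)) + ... over positive terms."""
    total, x = 0.0, math.log2(n)
    while x > 0:
        total += x
        x = math.log2(x)
    return total


def set_information(N, motif, n):
    """Sigma(S_m) = S(m, n) + eps(m) + log*(n) in bits; an absent motif costs nothing."""
    if n == 0:
        return 0.0
    places = math.perm(N, motif["size"]) // motif["aut"]
    return math.log2(math.comb(places, n)) + motif["eps"] + log_star(n)


def total_information(N, motifs, counts, n_uncovered):
    """Total information of a partial cover, completed with single edges."""
    counts = dict(counts)
    counts["edge"] = counts.get("edge", 0) + n_uncovered
    return sum(set_information(N, motifs[name], n) for name, n in counts.items())


def optimal_instance_set(instances, covered):
    """Stand-in heuristic: a maximal set of edge-disjoint, fully uncovered instances."""
    chosen, used = [], set(covered)
    for inst in instances:
        if not inst & used:
            chosen.append(inst)
            used |= inst
    return chosen


def greedy_optimal_cover(N, edges, motifs):
    """Greedy cover construction; `motifs` must contain the single-edge motif 'edge'."""
    all_edges = {frozenset(e) for e in edges}
    covered, cover, counts = set(), [], {}
    while covered != all_edges:
        current = total_information(N, motifs, counts, len(all_edges - covered))
        best = None
        for name, motif in motifs.items():
            chosen = optimal_instance_set(motif["instances"], covered)
            if not chosen:
                continue
            new_edges = set().union(*chosen) - covered
            trial = dict(counts)
            trial[name] = trial.get(name, 0) + len(chosen)
            left = len(all_edges - covered - new_edges)
            if total_information(N, motifs, trial, left) > current + 1e-9:
                continue  # adding this set would increase the total information
            sigma = set_information(N, motif, len(chosen)) / len(new_edges)
            if best is None or sigma < best[0]:
                best = (sigma, name, chosen, new_edges)
        _, name, chosen, new_edges = best
        cover += [(name, inst) for inst in chosen]
        counts[name] = counts.get(name, 0) + len(chosen)
        covered |= new_edges
    return cover, total_information(N, motifs, counts, 0)


def find_triangles(N, edge_set):
    """Brute-force triangle instances, each as a frozenset of its three edges."""
    found = []
    for a, b, c in combinations(range(N), 3):
        tri = frozenset({frozenset((a, b)), frozenset((b, c)), frozenset((a, c))})
        if tri <= edge_set:
            found.append(tri)
    return found


# Toy graph: four disjoint triangles plus one pendant edge.
N = 13
edges = [p for t in [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11)]
         for p in combinations(t, 2)] + [(0, 12)]
edge_set = {frozenset(e) for e in edges}
motifs = {
    "edge": {"size": 2, "aut": 2, "eps": 2.0,       # eps values are hypothetical
             "instances": [frozenset([frozenset(e)]) for e in edges]},
    "triangle": {"size": 3, "aut": 6, "eps": 6.0,
                 "instances": find_triangles(N, edge_set)},
}
cover, sigma_total = greedy_optimal_cover(N, edges, motifs)
edge_only_total = total_information(N, motifs, {}, len(edges))
```

On this toy graph the sketch selects the four triangles first and then covers the remaining pendant edge with a single-edge subgraph. The single-edge motif is always eligible because adding all uncovered edges leaves the total information of the completed partial cover unchanged, which is what guarantees termination.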
Given a motif m and a set of covered edges, finding an optimal instance set is a nontrivial optimization problem on its own. When subgraphs in the cover are not allowed to share edges, finding an optimal instance set is equivalent to finding a maximum independent set of m-subgraphs, that is, a set of m-subgraphs of maximum cardinality such that no two of the subgraphs in the set have an edge in common. This problem is equivalent to the maximum independent vertex set problem and is NP-complete [14]. As a result some type of heuristic has to be employed. The descriptions of two such heuristics can be found in the supplemental material. Depending on the heuristic, finding an optimal instance set requires some or all of the subgraph instances of m to be computed. There exist several well known algorithms that can be used for this purpose [15,16].
construct examples where the subgraph cover consisting of only single edges has lower total information or even entropy than the uniform subgraph cover that generates the graph.

Network
Table 2 shows the motifs found for the network representing the Western State Power Grid of the United States [17].The obtained motif structure indicates that, among other motifs, cycles and cliques play and important part in the organization of this network.In tables 3 and 4 the motifs found for the gene transcription networks of E.coli [18] and C.cerevisiae [1] are shown.For table 3 only biconnected motifs up to size 5 were considered while table 4 shows the results obtained using all connected subgraphs up to size 5. Including singly connected motifs in the candidate motif set has almost no effect on the biconnected motifs found and mostly results in star shaped motifs or motifs that consist of one node intersections of previously found biconnected motifs.The two networks share 3 out of 4 motifs in the case of biconnected motifs.In table 5 we show the results obtained for two networks representing electronic circuits that are digital fractional multipliers [1].In this case the algorithm not only finds the same motifs for both networks but the counts of the motifs also scale almost exactly with network size.In table 6 the network motifs found for the metabolic networks [19] of several species from different domains of life are shown.We find almost the same motifs in all of these networks and most motif counts also scale approximately with network size.
The analysis of various networks shows that networks that have similar functions also have the same motif structure.This further supports that motifs play an important role in the structural organization of complex networks.Furthermore, the motif counts of the obtained covers also scale approximately with the node and edge counts of the networks in the same class.The results also show that subgraph covers can be used to obtain representations that are up to 20% shorter when compared to edge list representations.As previously mentioned Σ-optimal subgraph covers can be used as a basis for associating networks with generalized configuration models which can be used to make various predictions 10 TABLE I: The motifs of the network representing the Western States Power Grid of the United States found using connected subgraphs up to size 6.The ranges of the motif counts obtained are also shown in parenthesis.

EMPIRICAL RESULTS
In the following, we apply the above algorithm to several real world networks from different fields.We also consider some synthetic networks that are realizations of uniform subgraph covers with predetermined motif frequencies in order to test the heuristic.
Due to computational resources available, the size of the subgraphs used in the analysis is limited to 5 in the directed and 6 in the undirected case.We also consider biconnected subgraph covers in relation with generalized configuration models.All results were obtained using the maximal independent set heuristic (for details see the supplemental material) for finding optimal instance sets and edge lists for encoding motifs.In the following tables, N and E stand for the number of vertices and edges respectively.In addition to the total information of the obtained cover Σ, the tables also show the total information of the corresponding edge cover, ERI, as a benchmark.Both these quantities are rounded to the closest integer and are given in bits.
Because of the random choices involved in finding optimal instance sets, the algorithm might find different covers for the same network on different runs.The covers shown in the figures are the best solutions obtained over 10 runs.A more detailed discussion on the variability of the heuristic can be found in the supplemental material.In Table I, the ranges of motif counts obtained over 10 runs are also shown.Here, we should stress that the proposed heuristics are primarily aimed at demonstrating the feasibility of using Σ-optimal subgraph covers as a basis of motif analysis and other heuristics might be devised for the resulting covering problem.
Table I shows the motifs found for the network representing the Western State Power Grid of the United States [17].The motif structure indicates that, among other motifs, cycles and cliques play an important role in the organization of this network.
In Tables II and III the motifs found for the transcription networks of E. coli [18] and S. cerevisiae [1] are shown. For Table II, only biconnected motifs up to size 5 were considered, while Table III shows the results obtained using all connected subgraphs up to size 5. Including singly connected motifs in the candidate motif set has almost no effect on the biconnected motifs and mostly results in star-shaped motifs or motifs that consist of one-vertex intersections of previously found biconnected motifs. The networks share 3 out of 4 motifs in the case of biconnected motifs.

TABLE II: The motifs of the transcription networks of E. coli and S. cerevisiae obtained using all biconnected motifs up to size 5.

TABLE III: The motifs of the transcription networks of E. coli and S. cerevisiae obtained using all connected motifs up to size 5.
In Table IV we see the results for two networks representing electronic circuits that are digital fractional multipliers [1]. For these networks the algorithm not only finds the same motifs for both networks but the motif counts also scale almost exactly with network size.
In Table V the network motifs found for the metabolic networks [19] of several species from different domains of life are shown. We found almost the same motifs in all of these networks and most motif counts also scale approximately with network size.
TABLE IV: The motifs of electronic circuits (digital fractional multipliers) obtained using all connected motifs up to size 5.

The analysis of various networks shows that networks having similar functions also have similar motif structure. This further supports the view that motifs play an important role in the structural organization of complex networks. Furthermore, the motif counts also scale approximately with the vertex and edge counts of the networks in the same class. The results also show that subgraph covers can be used to obtain representations that are up to 20% shorter than edge list representations. As previously mentioned, Σ-optimal subgraph covers can be used as a basis for associating networks with generalized configuration models, which can be used to make various predictions about the properties of these networks. The method can be further tested by comparing properties of the analyzed networks with these models. However, such comparisons are beyond the scope of this article and will be treated separately in later articles.
Finally, we also consider some synthetic networks that are realizations of uniform subgraph covers with predetermined motif counts in order to test whether the heuristic is able to recover the underlying motif set/subgraph cover in such cases. As shown in Table VI, for all random networks the algorithm is able to recover the motif set. For Network 1 the algorithm recovers the underlying subgraph cover exactly. Network 2 is generated to mimic the motif structure found for an electronic circuit (s838, Table IV) and the algorithm is able to recover the original subgraph cover with only one extra subgraph. On the other hand, for Networks 3 and 4 the motif counts differ significantly from the counts of the uniform subgraph covers used to generate the networks, especially with respect to the 5-star counts. This is probably caused by the fact that these networks contain a large number of 5-stars of which only some are explicitly contained in the underlying cover. Consequently, finding an optimal instance set of 5-stars becomes more difficult. This effect is more pronounced in Network 3 because it is denser than Network 4.
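The recovery test described above amounts to comparing the motif counts of the planted cover with those of the cover returned by the heuristic. A minimal sketch of such a comparison is given below; the function name and all counts are hypothetical and chosen only for illustration, they are not taken from Table VI.

```python
from collections import Counter

def compare_motif_counts(planted, recovered):
    """Compare motif counts of a planted cover with those of a recovered cover.

    Returns (same_motif_set, differences), where differences maps every motif
    whose count changed to recovered_count - planted_count.
    """
    planted, recovered = Counter(planted), Counter(recovered)
    motifs = set(planted) | set(recovered)
    planted_set = {m for m in motifs if planted[m] > 0}
    recovered_set = {m for m in motifs if recovered[m] > 0}
    differences = {
        m: recovered[m] - planted[m]
        for m in motifs
        if recovered[m] != planted[m]
    }
    return planted_set == recovered_set, differences

# Hypothetical counts: the motif set is recovered, with one extra 5-star.
planted = {"triangle": 40, "4-cycle": 12, "5-star": 25}
recovered = {"triangle": 40, "4-cycle": 12, "5-star": 26}
same_set, diffs = compare_motif_counts(planted, recovered)
print(same_set, diffs)
```

An empty `differences` dictionary corresponds to exact recovery of the motif counts, while a matching motif set with non-empty differences corresponds to the partial recovery reported for the denser networks.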
TABLE V: The motifs of various metabolic networks obtained using biconnected subgraphs up to size 5. AA = Aquifex aeolicus (bacteria), AB = Actinobacillus actinomycetemcomitans (bacteria), EC = Escherichia coli (bacteria), CE = Caenorhabditis elegans (eukaryote), AG = Archaeoglobus fulgidus (archaea), AP = Aeropyrum pernix (archaea). The table shows all motifs that occur at least 4 times in any one of the obtained covers. For each network, at most 2 motifs are not shown in this table.

As is the case with any heuristic, the quality of the solution depends on the structure of the network. One can construct examples where the greedy heuristic fails to recover all the motifs used to generate the network. In general, the greedy heuristic favors patterns that are dense, symmetric and occur in large numbers in the network. Thus, if the graph contains only a few copies of a motif that is not very dense, the algorithm might not be able to recover that motif. Also, if a motif contains a sub-motif that is denser and more symmetric than the entire motif, the greedy algorithm might pick the sub-motif over the motif itself, since the sub-motif covers edges more efficiently. One should also note that the cover used to generate a network is not necessarily the one with minimal total information.
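Why dense and symmetric patterns cover edges more efficiently can be illustrated with a crude proxy. A motif m with automorphism group Aut(m) can appear on N vertices in N!/((N-|m|)!|Aut(m)|) ways, so specifying one instance takes roughly the base-2 logarithm of that number in bits. Dividing by the number of edges the instance covers gives a cost per edge. The sketch below assumes this proxy; it ignores the other terms of the total information Σ, and the function names and the network size N = 1000 are hypothetical.

```python
import math

def placements(n_vertices, motif_size, aut_size):
    """Number of ways a motif can appear on n_vertices labeled vertices:
    N! / ((N - |m|)! * |Aut(m)|)."""
    return math.perm(n_vertices, motif_size) // aut_size

def bits_per_edge(n_vertices, motif_size, aut_size, n_edges):
    """Crude proxy: bits to specify one instance, divided by the edges it covers."""
    return math.log2(placements(n_vertices, motif_size, aut_size)) / n_edges

# (|m|, |Aut(m)|, number of edges) for a few undirected motifs
MOTIFS = {
    "single edge": (2, 2, 1),
    "star, 3 leaves": (4, 6, 3),
    "triangle": (3, 6, 3),
    "4-cycle": (4, 8, 4),
    "4-clique": (4, 24, 6),
}

N = 1000  # hypothetical network size
for name, (size, aut, edges) in MOTIFS.items():
    cost = bits_per_edge(N, size, aut, edges)
    print(f"{name:15s} {cost:6.2f} bits per edge")
```

Under this proxy the 4-clique is cheaper per edge than the triangle, which in turn is cheaper than the star with the same number of edges, in line with the preference for dense, symmetric patterns described above.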

DISCUSSION
In this article, we introduced an alternative approach to motif analysis in networks that is based on finding a subgraph cover of the network that represents it using minimal total information. We proposed a heuristic for the resulting NP-hard optimization problem. The subgraph covers obtained for various networks show that the algorithm finds nearly identical motifs for networks with similar functions. Moreover, by considering subgraphs of various sizes simultaneously, the method is able to detect even large motifs consistently.
Another advantage of the method is that it provides an explicit decomposition of the network into motif subgraphs. This allows motifs to be studied within the context of the rest of the network rather than in isolation.
We also showed that total information optimal subgraph covers can be used to match networks with random graph models that incorporate the obtained motif structure. This allows more accurate modeling of networks in general.
Subgraph covers can readily be generalized to graphs with labeled/colored vertices and edges as well as graphs with parallel and self edges. Such labels might be chosen so that they correspond to known functional roles of vertices or the community structure of the network. On the other hand, the obtained subgraph covers could also be used as a starting point for detecting communities in networks or for inferring functional roles of vertices. Communities that differ with respect to their internal organization can also be expected to differ with respect to their motif structure. Similarly, one would expect the functional role of a vertex to be strongly correlated with the motifs it is a part of.
The total information approach can also be extended to ensembles more general than uniform subgraph covers. Moreover, model selection approaches other than the total information approach can also be used. Such alternative formulations essentially correspond to using a different cost function in the covering problem.
The presented analysis strongly suggests that subgraph covers can be used to compress network data. In such applications, the total information might be replaced by the expected code length of the subgraph cover.
Finally, there is also room for improvement on the side of the heuristics. We consider this to be an important topic for further research. While the greedy algorithm can be improved, other widely used approximation schemes such as simulated annealing or genetic algorithms can also be applied to the problem.
There are $|m|!/|\mathrm{Aut}(m)|$ different ways a motif $m$ can appear on a set of $|m|$ vertices. Thus a motif $m$ with automorphism group $\mathrm{Aut}(m)$ can appear on $N$ vertices in $N!/\left((N-|m|)!\,|\mathrm{Aut}(m)|\right)$ different ways. Consequently, the entropy of a set of $n_m$ distinct instances of $m$ is given by:
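As a hedged sketch of this counting argument: if the entropy is read as the base-2 logarithm of the number of equally likely configurations (an assumption about the intended expression, with $K$ and $S$ introduced here only for illustration), then a set of $n_m$ distinct instances chosen among $K$ possible placements would have the entropy shown in the first line below. The second line works this out for a triangle ($|m| = 3$, $|\mathrm{Aut}(m)| = 6$) on $N = 5$ vertices with $n_m = 2$.

```latex
\[
K = \frac{N!}{(N-|m|)!\,|\mathrm{Aut}(m)|}, \qquad
S = \log_2 \binom{K}{n_m}
\]
\[
N = 5,\ |m| = 3,\ |\mathrm{Aut}(m)| = 6:\quad
K = \frac{5!}{2!\cdot 6} = 10, \qquad
S = \log_2 \binom{10}{2} = \log_2 45 \approx 5.49\ \text{bits}
\]
```

In the triangle case $K$ reduces to $\binom{5}{3}$, as expected, since every set of three vertices carries exactly one triangle.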

TABLE VI: The motifs obtained for networks corresponding to realizations of uniform subgraph covers. The quantities corresponding to the ensembles used to generate the networks are given in parentheses.