Community Detection in SparseTime-Varying NetworksbyJillian Rae SlindB.Sc., The University of Saskatchewan, 2006A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCEinThe Faculty of Graduate and Postdoctoral Studies(Computer Science)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)August 2014c Jillian Rae Slind 2014AbstractCommunity detection is an important aspect of network analysis that hasfar-reaching consequences, in particular for biological research. In the studyof systems biology, it is important to detect communities in biological net-works to identify areas that have a heavy correlation between one another orare significant for biological functions. If one were to model networks thatevolved over time, a di↵erential network would be a vital part or product ofthat analysis. One such network could have an edge between two vertices ifthere is a significant change in the correlation of expression levels betweenthe two genes that the vertices are designed to model.For this particular network, there are no community detection algorithmsthat suce. An analysis of the current community detection algorithmsshows that most heuristic-based methods are too simple or have too higha cost for detecting communities on such sparse networks. A prototypi-cal algorithm is presented that is preferential to high weight edges whendetermining community membership. This algorithm, Weighted SparseCommunity Finder or WSCF, is an incremental algorithm that develops com-munity structure from highly-weighted community seeds, which are 3-vertexsubstructures in the network with a high local modularity.A preliminary analysis of this detection algorithm shows that it is func-tional on data sets consisting of up to 600 genes, with more on a morepowerful machine. The communities detected are di↵erent than the onesprovided by the benchmark algorithms because of the high precedence placedon higher-weight edges. This prototypical algorithm has the potential forrefinement and expansion to provide the ability to find significant results forapplications in the field of Systems Biology.iiPrefaceThis dissertation is ultimately based on the original, unpublished, indepen-dent work by the author, J. Slind.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . viiDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Community Detection in Networks . . . . . . . . . . . . . . . 31.2 Time-Varying Networks . . . . . . . . . . . . . . . . . . . . . 31.3 Di↵erential Protein-Protein Interaction Networks . . . . . . 41.4 Finding Community Structure in Sparse Di↵erentialNetworks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.1 Modularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.2 Other Community Detection Strategies . . . . . . . . . . . . 122.3 Modelling Dynamic Networks . . . . . . . . . . . . . . . . . . 133 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163.1 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 283.3 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284 Preliminary Experimental Results . . . . . . . . . . . . . . . 294.1 A Test Case . . . . . . . . . . . . . . . . . . . . . . . . . . . 294.2 Scalability Analysis . . . . . . . . . . . . . . . . . . . . . . . 314.3 Comparison to Benchmarks . . . . . . . . . . . . . . . . . . . 334.3.1 Unweighted Benchmarks . . . . . . . . . . . . . . . . 334.3.2 Weighted Benchmarks . . . . . . . . . . . . . . . . . . 34ivTable of Contents4.4 Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40vList of Figures1.1 Protein Interaction Network Example. . . . . . . . . . . . . . 21.2 Time Di↵erential Network Example. . . . . . . . . . . . . . . 51.3 Example Sparse Di↵erential Network . . . . . . . . . . . . . . 72.1 Community Detection Suboptimal In Greedy Algorithm . . . 102.2 Multi-Dimensional Regulatory Module . . . . . . . . . . . . . 133.1 Weighted Sparse Community Finder Example Graph . . . . . 173.2 BuildCommunities Example . . . . . . . . . . . . . . . . . . . 203.3 MergeCommunities Example 1 . . . . . . . . . . . . . . . . . 223.4 MergeCommunities Example 2 . . . . . . . . . . . . . . . . . 233.5 GrowCommunities Example 1 . . . . . . . . . . . . . . . . . . 263.6 GrowCommunities Example 2 . . . . . . . . . . . . . . . . . . 274.1 Progression of WSCF on Example Graph . . . . . . . . . . . 304.2 Scalability Analysis Results . . . . . . . . . . . . . . . . . . . 324.3 Lanchichinetti and Fortunato Benchmark Graph . . . . . . . 344.4 Lanchichinetti and Fortunato Benchmark after WSCF . . . . 354.5 Lanchichinetti and Fortunato Benchmark with Marked Edges 36viList of Algorithms1 Weighted Sparse Community Finder . . . . . . . . . . . . . . . 182 BuildCommunities . . . . . . . . . . . . . . . . . . . . . . . . . 193 MergeCommunities . . . . . . . . . . . . . . . . . . . . . . . . 214 GrowCommunities . . . . . . . . . . . . . . . . . . . . . . . . . 25viiAcknowledgementsI would like to extend my everlasting gratitude my supervisor, Dr. RaymondNg, for guiding me through this process and providing invaluable insightinto this project. I would like to commend him further for being incrediblyunderstanding during the challenges I faced in completing this thesis, andfor his everlasting support. This would not have been possible without him.I would also like to thank Dr. Laks Lakshmanan for being the secondreader for this thesis and providing useful feedback for increasing the qualityof my report.viiiDedicationI would first like to dedicate this to my family in Saskatoon, Saskatchewan,namely my parents, Ronald and Valerie Slind. You have always been therefor me and supported me both mentally and financially through this process.You provided me with the encouragement and confidence to keep going eventhough this process did come with its challenges. I am eternally grateful.I would also like to dedicate this to my boyfriend, Dennis Quach. Thankyou for your everlasting support, especially when the times have gottenrough. You have always been there for me, thank you.Lastly, I would like to dedicate this to my everlasting friend, Erin Mould-ing, without whom I would be much worse o↵. Thank you for providing theoccasional ride when necessary, food and companionship, and taking care ofme during the times when I was unwell. I do not know what I would havedone without you.ixChapter 1IntroductionOver the recent decades, advancements in technology have allowed for theanalysis of large amounts of data simultaneously. Because of this, biomed-ical and biological research fields have moved away from a “reductionist”experimental approach to a more holistic one, called Systems Biology. Inparticular, Systems Biology studies the dynamical changes in biological sys-tems in search of new and emergent properties. These biological systems in-clude Protein-Protein Interaction (PPI) networks, metabolic networks, cellsignalling networks, gene regulatory networks (focusing on the interactionbetween proteins and DNA), and others. Some other systems involve in-teraction networks between species in an ecosystem and other macro-scalesystems, but that is beyond the scope of the paper.A popular Systems Biology study is the analysis of PPI networks. Thesenetworks are important because they characterize fundamental processeswithin the cell, especially those related to cancer pathologies. Becauseof their importance to physiological and pathological processes, there is astrong desire to build as many interactomes (networks comprising wholesets of interactions within a cell) as possible. An example of one of theseundertakings is the Human Interactome Project at the Center for CancerSystems Biology [22].We can analyze biologically significant networks, such as PPI networks,by simply modelling these networks as graphs, having each vertex representa protein and every edge be present if there is an interaction between theproteins. Figure 1.1 shows an example PPI network, with di↵erent groups ofvertices coloured with a label that shows their commonality. These group-ings can be used to elucidate protein function by characterizing an unknownusing the function of the group it belongs to, and to identify important pro-teins within the network structure. Such important proteins might be onesthat control the entire function of the group by interacting with (and regu-lating the function of) every other member of the group, or be responsiblefor communicating between one group and another.In the realm of graph theory, these groupings are called communities,i.e., a partition of a graph that contains more interconnectivity between its1Chapter 1. IntroductionProteasomeEndo/exonucleaseEGF-like domainsMatrix metalloproteaseSerpinsNuclear hormonereceptorsHypoxia inducablefactorTubulinMitotic spindlecheckpointATP transporterproteinsPeroxisomalproteinsCell cycle / cytokinesisTGF-βCell cycleregulationMyosinLamininActininCasein kinaseTranscription regulationVascular endothelialgrowth factors VEGFIntracellular signallingcascadeBreast cancer anti-estrogen resistance JAK/STATcascadeKaryopherin docking complexNF-kappaBregulationNucleocytoplasm transportFigure 1.1: An example of community structure in a protein-proteininteraction network. Each community of proteins is represented in adi↵erent colour, and labeled according to their function. Figure from [14].21.1. Community Detection in Networksvertices than it does with the rest of the graph. In this work, as well asother works that focus on community detection in weighted graphs, thisdefinition to place more importance on the weighted edges of the graph.That is, communities in weighted networks are areas of the graph with alarger density of “heaver” (high-weight) edges between the vertices in thecommunities than between them.1.1 Community Detection in NetworksCommunities in graphs are defined as subgraphs that have a higher den-sity of edges between its members than between the members and otherregions of the graph. An example of this is again seen in Figure 1.1. Thesecommunities can be seen in a variety of networks, including collaborationnetworks between authors [19], social networks [2], and biological networks.These communities can help identify important aspect of network topologyby identifying substructures that serve a particular function or role withinthe network.Detecting these community structures is still a dicult problem of inter-est to many scientists spanning physics, computer science, biology, mathe-matics, engineers and the social sciences [15]. There are a lot of algorithmictechniques that have been developed over the years to detect these struc-tures [9] [20], such as hierarchical clustering and the maximization ofnetwork modularity. These techniques are explained in detail in Chapter 2.The current techniques are powerful, but there is no conclusive definition ofwhat constitutes the “best” community structure for every given network.In addition, they each have their own shortcomings when applied to di↵erenttypes of weighted networks.1.2 Time-Varying NetworksTo add a further complication to the problem of community detection, areality of biological networks is that they are not static over time. Thatis, the network structure is subject to change given predictable fluctuationssuch as those depending on circadian rhythms, to networks that start todestabilize under disease progression. Thus, it is significant to the biolog-ical community to model how these networks change over time. It is thenreasonable to conclude that the ability to model how community structurechanges over time becomes very important in the analysis of these biologicalnetworks.31.3. Di↵erential Protein-Protein Interaction NetworksOne highly used technique of temporal network analysis is the creationof a di↵erential network, a network that highlights the changes between onestate and the next [13]. An example of a di↵erential network is shown inFigure 1.2. This figure represents the transition from one state of the net-work to another after stimulation by Epidermal Growth Factor (EGF) [4].The provided network is specific to interactions with GRB2 domains, whichare areas on the protein that function as adaptors for receiving signals fromother proteins. The red-shaded nodes are those that interact with GRB2complexes after stimulation by EGF, green are those with less interactions,and blue are those that interact with the GRB2 complex in non stimulatedcells (the control group). The border thickness indicates the intensity of thechange, and the boxes in each node indicate the time point in which thechanges occur. The nodes that are greyed out are ones that do not respondto EGF and thus no significant change was reported.1.3 Di↵erential Protein-Protein InteractionNetworksNetworks like those above have the capacity to hold a lot of informationwith respect to how a network changes over time. A simple extension to anetwork like this would be to expand this to look at PPIs in general. Eachnode would represent a protein. Each edge could be present between twonodes if there is a significant change in the correlation of expression thatthe two proteins share. Expression, in this case, is the amount of protein(or its precursor nucleotide sequence) present in the cell at a time x. If thecorrelation changes significantly, then it is either the case that two proteinsthat once had correlated expression levels are no longer correlated (or havechanged their correlation significantly), or the two proteins that did not haveany sort of correlation are now expressing in correlation with one another.Such a network would be expected to be sparse, given that there arestrict requirements for what is determined to be a significant change incorrelation. An example of such a network is given in Figure 1.3. The edgesthat are blue represent a significant increase in correlated expression levels,whereas the edges in red represent a significant decrease. The edge labelrepresents the magnitude of the change, with 1.00 being the highest and0.00 representing no change. Note that the edge label does not representthe direction of the change. The edges in grey are left out of the graph,because they have fold changes lower than significant levels. Although thereis not a defined method to generate these di↵erential networks, this type of41.3. Di↵erential Protein-Protein Interaction NetworksFigure 1.2: An example of a time di↵erential network. This represents thedi↵erential in protein interactions with GRB2 domains after stimulation byEGF. The coloured bars within each node represent the relative foldchange of that protein relative to the time elapsed in the EGF stimulationexperiment. Figure from [4].51.4. Finding Community Structure in Sparse Di↵erentialNetworksnetwork serves as the motivation for the algorithmic prototype presented inthis paper.Finding the community structure in these would indicate a series of nodesthat have relationships with one another that have been destabilized. Thiscould be indicative of a particular function in the cell that is rendered over-or under-operational due to the change, or it could be used as a diagnostictool to identify network destabilizations that are outside of those generatedby a lack or overproduction of a protein vital to the network.1.4 Finding Community Structure in SparseDi↵erential NetworksThere can be possible expansions to the graph if to describe more featuresof the network to the analyst, although these are not incorporated into thecommunity detection strategy detailed below. For example, even if the edgesare labeled to represent the amount of change in correlation, the visual cueof edge thickness can also be used to represent the edge weight, allowingfor more intuitive interpretation of the relative edge weights. Addition-ally, the vertices can carry an intrinsic weight value and colour representinga molecule’s change in expression level, with a larger thickness indicatinga greater change, and red/blue colouring indicating the direction of thechange.There are several methods to find communities in biological networks,but none that are tailored for the sparse weighted networks as described inSection 1.3. In this paper, we examine a large number of these methods indetail and propose a prototype for an incremental method to generate com-munities from high-scoring network seeds. The algorithm begins with de-tecting high-scoring three-vertex community seeds, then incrementally triesto grow them by adding additional vertices and corresponding edges untilthe complete communities are formed. In Chapter 2, the current works incommunity detection and their applications to our sparse di↵erential net-work are discussed. In Chapter 3, the alternative incremental algorithm ispresented and discussed. In addition with the presentation of the algorithm,some preliminary experimental results are presented in Chapter 4 and dis-cussed in terms of the algorithm’s capabilities. The results are concludedand discussed in Chapter 5.61.4. Finding Community Structure in Sparse Di↵erentialNetworksa b cd e f g0.990.880.83 0.750.86 0.930.73 0.89Figure 1.3: Example sparse di↵erential network. This graph representscorrelated expression level changes in a protein-protein network. The edgelabels represent the magnitude of these changes. The edges in red or bluerepresent statistically significant increases or decreases in correlatedexpression levels, and the greyed-out edges represent a lack of significantchange.7Chapter 2Related Works2.1 ModularityA major advance in the detection of community structure in biological net-works was the introduction of network modularity, presented as an eval-uation criterion for a greedy community-detection algorithm, by Clausetet. al. in 2004 [7]. Modularity is a centrality-based measure that rewardshighly-connected communities, and punishes sparsely-connected communi-ties. Modularity Q is calculated by equation 2.1, in which m represents thetotal number of edges in the graph. Aij is 1 or 0, depending on whether theedge exists between vertices i and j. The degree of the vertex x is repre-sented by kx; and (ci, cj) is 1 if the vertices are in the same community, 0otherwise. Q = 12mXij (Aij kikj2m )(ci, cj) (2.1)Modularity is preferential to communities that have an interconnectedstructure similar to a clique. This metric is used as a qualitative measure-ment of an algorithm’s performance in community detection, as well as aheuristic in other community detection algorithms [17].Blondel et al. introduced a greedy algorithm that iteratively builds com-munities by joining vertices together that would net the highest increase inmodularity, then collapsing those communities into a singular node and re-peating the process to produce a hierarchical community structure [5]. Thisalgorithm, however, depends on a sequential sweep of all pairs of verticesand is highly dependent on the sequence ordering. Additionally, the questionarises as to what becomes the most e↵ective stopping criterion for an itera-tive growth algorithm to halt at the“best” community assignment. Becauseof Blondel’s hierarchical community structure, the question becomes whichlevel of the hierarchical structure denotes the most accurate assignment ofvertices to communities. For determining novel communities, however, thedecision of which hierarchical level is the most appropriate might becomearbitrary.82.1. ModularityWith further development into modularity optimization techniques andthe applications of modularity to a large scale networks, the modularitymetric has been adapted to further suit the needs of the researchers. Sincethe di↵erential network discussed in the previous chapter has weighted edges,the modularity used should reflect that property. Expanding to weightednetworks is fairly simple, as discussed by Newman in 2004 [18]. As shownin equation 2.2, kx is replaced by sx, where sx is the sum of all weights ofthe edges adjacent to vertex x, and m is replaced with W , which is the totalsum of the weights of all edges in the network.Qw = 12W Xij (Wij sisj2W )(ci, cj) (2.2)Other extensions of modularity include the inclusion of directed graphs [1],and the evaluation of overlapping community structure [23]. These expan-sions have been applied as evaluation metrics for a large number of com-munity detection algorithms [9], notably the greedy ones mentioned above,as well as other types of algorithms, including simulated annealing and ex-tremal optimization.A major problem in the use of greedy algorithms for community detec-tion, especially those that use modularity as a heuristic, is that they havethe chance of being caught in local maxima. This is because, like all greedyalgorithms, these algorithms will always step in the direction of the great-est increase of modularity. However, as exemplified in figure 2.1, it is notalways the case that this is the overall most optimal step. From the com-munity found in subfigure a, a greedy algorithm will not elect to expand thecommunity to the community in subfigure b because the overall modularitywould be less than the initial modularity. However, the most optimal so-lution would be to include all of the nodes in the community, as shown insubfigure c. This shows that the community in subfigure a can be considereda simple example of a local maxima for community detection.Simulated annealing is an optimization technique that avoids gettingtrapped in local maxima by introducing a computational temperature vari-able, T . While T is high, the algorithm is allowed to explore di↵erentconfigurations at a higher cost. The idea is to begin at a high T and slowlydecrease it to allow the system to ascend into a large maxima while over-coming cost barriers. In Guimera` and Amaral (2005), simulated annealingwas used to maximize modularity in a module detection algorithm, by set-ting the overall cost (C) to be negative modularity, or M [12]. At eachvalue of T , a random number of updates are performed and accepted with92.1. Modularityab c de 0.90.90.90.90.9 0.90.9(a) Initial community.Modularity 0.13.ab c de 0.90.90.9 0.90.9 0.9 0.9(b) Potential new community.Modularity 0.10.ab c de0.9 0.90.9 0.9 0.9 0.90.9(c) Optimal community.Modularity 0.16.Figure 2.1: A simple example of a greedy algorithm failing to find the mostoptimal community. The modularity of the community in figure a isgreater than the modularity of the combined community in figure b, so thenode d and its edges (in blue) will not be an acceptable addition to thecommunity. However, the overall modularity of all the nodes in onecommunity (figure c) is greater than the initial community, and thus is themost optimal goal. Thus, figure a represents a local maximum in theoverall community detection for this graph.102.1. Modularitythe probability given in equation 2.3, where Cf is the cost after the update,and Ci is the cost before the update.p = ( 1 Cf Ciexp ⇣CfCiT ⌘ Cf > Ci (2.3)The Guimera` and Amaral algorithm performed this community detec-tion using two di↵erent forms of events as updates. Movement events in-volve individual node movements from one community to another, and merg-ing/splitting events involve the merging or splitting of a community. In gen-eral, they used ni = fN2 movement events and nc = fN merging/splittingevents at each temperature point T . N is the number of nodes in the net-work, and typically, f = 1. They cooled the system down by a factor of0.005Ti, where Ti is the current temperature for a given step.As can be suggested by the parameters chosen by Guimera` and Amaral,simulated annealing requires a large amount of calculation at each step, anda large number of steps. While it has a high probability of finding the “best”modularity configuration, it is really only feasible on smaller networks, as thecomplexity of this algorithm is still staggeringly high. As per the parametersdefined above, the complexity must be O(N) = Pft=0(Tt(modul(N) ⇤ (nc +ni))), is still staggeringly high, where modul(n) is the cost of calculating themodularity in a graph with n nodes, which must be at least O(n2).To achieve the accuracy of simulated annealing without as high of acomplexity, Duch and Arenas proposed the use of Extremal Optimizationto detect communities [8]. In the extremal optimization algorithm, nodesare initially partitioned into two random partitions, and then at each stepthe node with the lowest fitness is moved from one partition to another,until an optimal modularity state is reached. Fitness, in this case, refersto a node’s individual contribution to modularity, given by the equationi = ar(i), which is the fraction of edges that have one or more nodesincident to community r, one of which must be i. Then, all links betweeneach component are severed and one proceeds recursively with each resultantcomponent, until the global modularity can no longer be improved. Thisdi↵ers from a graph bipartitioning in that the algorithm does not finish atnecessarily the same time in each recursive component, and the number ofnodes and subcomponents in each partition is dependent on the evolutionprocess of the algorithm. Local fitness of each node is computed by takingits local modularity (corresponding to value in the modularity summation),and dividing it by its degree.Extremal Optimization has a total complexity of O(n2 log n). However,112.2. Other Community Detection Strategiesits eciency comes with an accuracy tradeo↵, in that the recursive bisec-tioning of the algorithm can become error-prone in larger networks with alarge number of communities, because a single “incorrect” bisectioning hasthe potential to throw o↵ the results completely. Additionally, by usinglowest-fitness as the selection criterion for which node to move, the algo-rithm can get trapped in local optima, although this can be fixed by usinga probabilistic selection method [6].2.2 Other Community Detection StrategiesThere are several other community detection strategies that are not modularity-dependent. This subsection will detail a couple of the more recent modularity-free developments in community detection.Ball et al. have developed an expanded community detection to detectoverlapping communities in networks, as community overlap is not uncom-mon in real-world networks [3]. Their algorithm focuses on the developmentof generative network models and using an expectation maximization algo-rithm to solve the likelihood of those models. It is fast and performs wellwith large networks and is competitive with other algorithms, however, thenumber of communities K must be specified beforehand.Another method involves incorporating multiple dimensions of data toinfer regulatory modules, with dimensions such as copy number variationand DNA methylation alongside RNA expression and gene expression pro-files [16]. With all these di↵erent types of data, one can apply a regressionmethod such as a variant of Partial Least Squares to infer network modules.This work in particular introduces a new regression method called sparseMulti-Block Partial Least Squares (sMBPLS).Partial Least Squares (PLS) methods find the relations between two ma-trices (an input matrix and a response matrix) by projecting both variablesto a new space and determining the linear regression of these variables.Multi-Block PLS methods break data down into conceptually meaningfulblocks before performing the PLS regression, thus allowing for a greater in-terpretation of the results than a standard PLS because it is then possibleto examine each block specifically and report information on the relationof the specific block with the response matrix in the presence of the otherblocks. In Li et al.’s sMBPLS method, sparse constraints are imposed ontheir multidimensional data to generate multidimensional modules to use inan MBPLS algorithm [16].Instead of analyzing whole blocks, the sMBPLS method further decom-122.3. Modelling Dynamic NetworksFigure 2.2: Example Multi-Dimensional Regulatory Module (MDRM).Subsets of the three data sets (data blocks) on the left have similar profilesto the gene expression profile over the same subset of samples. Thus, thedata from the selected samples and columns across the three data sets arean MRDM. Figure from [16].poses those data blocks into a smaller set of blocks called Multi-DimensionalRegulatory Modules (MDRMs). MDRMs are defined as the profiles ex-tracted from the input matrices along k samples (across some subset ofcolumns i) that have a strong association or show a similar coherent patternwith the same k samples in the response matrix. An example of an MDRMcan be seen in figure 2.2.Because of the diversity of the data used, this method allows for mod-ules to be discovered that would otherwise be more dicult to find whilerestricted to one type of data. If one was specifically looking to function-ally characterize these modules, this would be an appropriate method touse. However, these modules are not necessarily indicative of network com-munities, and vice versa. However, if would still be interesting to applythis technique to analyze time-dependent matrices to generate interestingdynamical network properties.2.3 Modelling Dynamic NetworksA fascinating extension of the detection of communities in biological net-works is their application to temporal data. In vivo, biological networks donot stay static over time. Therefore, it is interesting to model the changes inthese networks over time, especially their community or modular structure.Keeping track of these changes can lead to interesting results, including thediscovery of novel genes in cancer systems [24]. This subsection will detail132.3. Modelling Dynamic Networkssome of the methods in literature that deal with network modelling withrespect to dynamic systems.Park and Bader’s method of modelling dynamic systems is to use ahierarchical agglomerative clustering method to generate network clusters,and a matching algorithm to match clusters across multiple time points[21]. This involves the formation of stochastic block models for the networkat each time point, then merging these models together into a dynamicblock model. Although these block models are separate for each given timepoint, the models are coupled in between time points by reweighing theneighbouring likelihood scores between the generative models. This also hasthe added benefit of noise reduction and smoothing between points.The clusters are then linked together using an expectation-maximizationalgorithm that extends the Jaccard correlation of shared neighbours to amulti-partite matching based on a probabilistic model. They generated aMarkov random field based on the joint probabilities of assignment matricesacross multiple time points, and simplified the otherwise huge state spaceby considering each cluster’s assignment matrix independently. Once thegroups were stabilized, the resulting structure was close enough to a treestructure to perform a belief-propagation algorithm by reformulating theresultant field to a factor graph.Although their approach is very strong and more applicable to networksof 2000 vertices (as opposed to other machine learning approaches, whichcan only handle very small networks), it does not necessarily converge on asolution. Additionally, the bipartite matching has a worst-case runtime ofO(K3), where K is the number of clusters. This can prove to be problem-atic if there are networks with a large number of communities. However,this matching would prove to be an interesting extension of the communitydetection algorithms used currently.Another method to track biological networks as they undergo temporalchanges is by use of dynamic bayesian networks (DBNs). Zhu et al. con-structed a DBN combining multiple short time series into a long time seriesand applying the Granger causality test to the model to infer causal rela-tionships between individuals across time points [6]. However, this methodassumes no causal structure within a single time ”slice”, and only looks atindividuals as opposed to modules or communities.One significant problem with inferring a temporal network in real-worlddata is that in most solutions, the time points are treated as the only pointsbetween which the network changes. Thorne and Stumpf presented a solu-tion that generates a temporal network that models hidden states betweeneach time point [25]. They model these hidden states by using a non-142.3. Modelling Dynamic Networksparametric extension of a Hidden Markov Model. This method works wellfor a large number of networks, as it does not require the number of hiddenstates to be specified ahead of time, and could be applied to time seriesthat have irregular time points. While their paper also does not considerchanges in community structure, it would be interesting to see its applicationto networks.15Chapter 3Methods3.1 AlgorithmsThe prototypical method presented in this thesis is a top-down approachto evaluating weighted community structure. It makes use of the networktopology to only examine the heaviest-weighted edges and then incremen-tally examines edges of lower and lower weight, until a threshold, t, isreached. It takes a network of size n vertices and e edges with edge weightswij where i and j are the endpoints of an edge.Algorithm 1, the Weighted Sparse Community Finder, encapsulates thelogic behind the prototype, which calls methods to build community seedsbefore incrementally growing and merging them together. The incremen-tal nature of the algorithm means that edges with weights greater than orclosest to the largest threshold tl will first be considered as seeds for newcommunities, and then for candidates to join other communities. Then edgeswith weights closer to the smallest threshold ts will be considered as seedsand then candidates for growth, respectively.The threshold of consideration will be decremented by some amount d ateach iteration, which is set to 0.05 for the purposes of algorithm evaluation.The thresholds will always be a number 0 t 1. This threshold providesan e↵ective stopping criterion for the algorithm – since it is important tofocus on edges of higher weights, the low-weight edges will not be consideredif they are below the threshold t. If the threshold is too high and the resultsare not meaningful, the threshold can be relaxed and the algorithm can berun again considering edges of even lower weight.At each iteration, the methods BuildCommunities, GrowCommunities,and MergeCommunities are called, respectively. The intended logic of eachmethod is given in algorithms 2, 3, and 4, and is discussed in the paragraphsbelow. GrowCommunities is called a second time after MergeCommunities toallow for an edge that has been rejected by the first iteration of GrowCommunitiesto potentially join a newly-merged community. To illustrate the e↵ects ofeach method, the example graph in Figure 3.1 will be modified after eachexplanation.163.1. Algorithmsa b c de f ihgj0.70.70.70.70.70.850.850.850.99 0.990.990.990.990.99Figure 3.1: An example graph for the illustration of Weighted SparseCommunity Finder.173.1. AlgorithmsAlgorithm 1: Weighted Sparse Community FinderData: The graph, G(V,E),decrement d, andterminal (lowest) threshold T < 1.Result: A set of communities, SS ;t 1while t > T doe Et /* edges with weights above threshold T */S BuildCommunities(S, e)S GrowCommunities(S, e)S MergeCommunities(S)S GrowCommunities(S, e)decrement t by dThe BuildCommunities algorithm takes a set of edges E and tries toform three-vertex community seeds using purely those edges. Each com-munity “seed” is defined as a 3-clique (or a “triangle”) in the graph, or a“stem” of two edges connected by a single common node. According to themodularity calculation given in equation 2.2, each triangle will always havea positive modularity. A stem is a viable community seed if and only if themodularity of that stem is positive. The method traverses through each edgeand all edges adjacent to that edge to try to form communities. Duplicatecommunity seeds are handled during the MergeCommunities method.The results of the BuildCommunities algorithm at a high threshold t isshown in Figure 3.2. In subfigure a, the nodes the triangular community(d, g, i) is highlighted in red. In subfigure b, three more “stem” communitiesare shown in separate colours – they are present as stems because they donot have an edge that forms a triangle between them. Additional “stem”communities can be imagined using the same criteria – two high weightededges without a third high weighted edge forming a triangle between them.This generation of mini-communities as “seeds” for larger communitiesis especially important when handling sparse graphs, since the communi-ties found may be very small, but still have some meaning. The currentmethods described in Section 2 use non-incremental community generationwith very large modularity calculations, or do not have a stopping criterionto determine what level of a hierarchical community structure defines the183.1. AlgorithmsAlgorithm 2: BuildCommunitiesData: Set of edges E, set of communities, SResult: An updated set S containing new community seeds.i 0for edge e 2 E do(n1, n2) e.nodesfor node n3 connected to n1 do/* ignore if it is the same edge found */if n2 == n3 thennext/* See if there is at least one triangle connectingn1, n2, and n3 */elseconnected falsefor node n4 connected to n3 doif n4 == n2 thensi {n1, n2, n3}S.add(si)i+ +connected true/* If there are no triangles found, see if stemn1, n2, n3 has good modularity */if connected false thensi {n1, n2, n3}if getModularity(si) > 0 thenS.add(si)i+ +193.1. Algorithmsa b c de f ihg j0.70.7 0.7 0.70.70.85 0.850.850.99 0.990.990.990.990.99(a)a b c def ihg j0.70.7 0.7 0.70.70.85 0.850.850.99 0.990.99 0.990.99 0.99(b)Figure 3.2: An example of BuildCommunities finding initial three-vertexcommunity seeds. Since the threshold t is high, no edges with scores lessthan 0.9 (in this case) are considered. Thus, one of the potential initialcommunities that is found by BuildCommunities in subfigure a ishighlighted in red. Another potential community set is highlighted in red,blue, and violet in subfigure b.203.1. Algorithmsactual communities. Since this method uses incremental community gener-ation and has a determined stopping criterion, it is well-suited to the sparsedi↵erential graphs as motivated in Section 1. It is important to note thatwhile it is designed for sparse communities, it will still find communities ina less sparse graph as well.The complexity of BuildCommunities isO(E⇤d⇤(d+getModularity(s))),where E is the number of edges in the graph. d is the maximum degreeamong vertices in the graph, which is expected to be low given these net-works are sparse. getModularity(s) is the function that calculates themodularity of a subset of vertices s, which an overall runtime of ⇥(2Es),the sum of the degrees of the vertices in s. Since this is a value that canbe stored, this operation only needs to be completed once for every uniquesubset s of S, as they are encountered.The MergeCommunities method is designed to take each pair of commu-nities and determine if the modularity of them as one community is greaterthan the average of both community modularities. To do this, all of the ver-tices and corresponding edges are added into a single community, includingedges that were never part of either community but connect the two com-munities together. If this modularity is greater, then the merged communityreplaces the two sub communities. An example of this is illustrated in Figure3.3, where two triangular communities are merged to form a 4-clique withone missing edge. Edges that have both vertices contained in the mergedcommunity that would not otherwise be present in either individual com-munity are also added to the merged community, as exemplified in Figure3.4.Algorithm 3: MergeCommunitiesData: S, the set of current communities, with size nResult: S, set of communities, size nfor x 1...S.size dofor y x+ 1...S.size doif getModularity(sx + sy) > getModularity(sx) thensx sx + syremove sy from SThe complexity of MergeCommunities is O(S2 ⇤ getModularity(si)),where S is the current number of communities and si is the larger of thetwo communities that are attempted to be merged. The complexity for213.1. Algorithmsa b c de f ihg j(a)a b c def ihg j(b)Figure 3.3: An example of MergeCommunities taking two communitiesand merging them together. The two communities in subfigure a have acommon edge between vertices g and i, represented as two edges forclarity. MergeCommunities compares the combined modularity of bothcommunities versus the modularity keeping them separate. If thedi↵erence in overall modularity is positive, then the communities aremerged together, producing subfigure b.223.1. Algorithmsa b c de f ihg j(a)a b c def ihg j(b)Figure 3.4: Another example of MergeCommunities taking twocommunities and merging them together. The edge (g, i) is not present ineither community in subfigure a, but since both vertices it is adjacent toare present in the community in subfigure b, it is added to the mergedcommunity.233.1. AlgorithmsgetModularity(si) is as above. While this method is initially na¨ıve andslow, there are several speedups that could be implemented in further itera-tions. One such improvement would only consider merging two communitiestogether if there was a common edge between the two communities, or therewere several edges that connect the two communities together.Between each Build and Merge step, the method GrowCommunities isinvoked. It takes a set of edges E and a set of communities S and tries to addeach edge to one or more communities from the set. If the edge has a nodethat belongs to a community already, then the method will try to add thatedge to that given community. If both nodes belong to communities already,then the edge is discarded. This can happen because if the nodes belong tothe same community, then that edge already belongs to that community. Ifthe nodes belong to disparate communities, then the edge has a low enoughweight to be not considered in the Build step for each of those communities’seeds.The remaining edges either have a single node with community mem-bership, or neither nodes have community membership. Let us define eij asan edge such that i has community membership to a set of communities Si,and j does not have any membership to any communities. Then for eachcommunity in Si, the edge is added if the modularity of the communitywhen node j (and all corresponding edges connecting j to other members ofthe community) is added is greater than the modularity of the communitywithout j, as shown in Figure 3.5.If both i and j belong to no communities, then the algorithm will tryto add them both (and corresponding edges connecting them) into everycommunity in S, again only adding them if the modularity of the communitywith the added nodes is greater than the modularity of the communitywithout them. This is exemplified in Figure 3.6. Because the modularity ofa community is highly dependent on having edges that connect its members,the modularity change will only be positive if there are in fact edges thatexist between an endpoint of the edge to be added and a member of thecommunity.The time complexity of GrowCommunities is O(E⇤C⇤getModularity(s))where E is the number of the edges in the graph, C is the current number ofcommunities, and getModularity(s) is as defined as above, with s being thelargest possible community in C. This again is a slow, prototypical versionof this method that will na´ıvely try to add a node pair that is not alreadyconnected to a community to every community in the current solution. Animportant speedup in the next iteration of this method would be to only tryto add both nodes to a community if there is an edge already connecting243.1. AlgorithmsAlgorithm 4: GrowCommunitiesData: Set of edges E, set of communities, SResult: A set of hopefully larger communities, Sforeach edge e in E do(n1, n2) e.nodes/* If the edge has at least one node connected to acommunity... */if n1.communities != null OR n2.communities != null then/* If BOTH nodes are connected to a community, justskip */if n1.communities != null AND n2.communities != null thennextelse/* The communities are going to be either n1’scommunities or n2’s communities, depending onwhich set isn’t null */List communities n1.communities.exists ?n1.communities : n2.communities/* And the node to add will be the other one. */Node d n1.communities.exists ? n2 : n1foreach community c 2 communities doif getModularity(c+ d) > getModularity(c) thenc c+ d/* Otherwise, neither node is connected to a community,so let’s try to add the edge to all the communities.*/elseforeach community c 2 S doif getModularity(c+ n1 + n2) > getModularity(c) thenc c+ n1 + n2253.1. Algorithmsa b c de f ihg j(a)a b c def ihg j(b)Figure 3.5: An example of GrowCommunities extending the community{d, i, g} with a single edge. The edge (g, h) has a single node connected toa community so GrowCommunities tries to add it to its incidentcommunity, producing the community in subfigure b.263.1. Algorithmsa b c de f ihg j(a)a b c def ihg j(b)Figure 3.6: Another example of GrowCommunities extending thecommunity {d, i, g} with a single edge. The edge (e, j) is completelydisconnected from any communities, so GrowCommunities tries to addboth of the nodes e and j and all relevant edges to the community,producing the community in subfigure b.273.2. Implementationeach of them to some member of the community.3.2 ImplementationThe program was implemented in C++ and compiled using the Apple LLVMversion 5.0. The adjacency matrices were represented as a vector of bitvectors to minimize the space required by the program. In order to maximizethe eciency of these bit vectors, the dynamic bitset class from the BoostC++ Library was used.3.3 Data SetsTo initially prototype this program, test cases were developed of equal-weight triangular disconnected graphs, equal-weight square disconnectedgraphs, and equal-weight 4-clique disconnected graphs. These were gener-ated in varying sizes of n (the number of nodes) to determine the scalabilityof the program.To analyze the algorithm’s e↵ectiveness on data sets closer to what maybe found in real applications of the program, a series of benchmarks wereutilized. The commonly-used benchmarks featured in Girvan and Newman[11] o↵er a way to test the algorithm on a series of undirected, unweightedgraphs. They featured graphs that separated easily into four equally-sizedcommunities, containing edges with with no weight values assigned to them.In addition to these benchmarks, weighted benchmarks (with correctsolutions) were proposed by Lancichinetti and Fortunato [15], o↵ering amethod to generate networks with a power law distribution of degrees andand community sizes, similar to those found in nature. These networksare the hardest to make community detection predictions o↵ of, due to thelarge variance in degree distribution and community size. Lancichinetti andFortunato also made it possible to have weighted and directional edges intheir benchmark graphs, which allows for the examination of this algorithmon networks with di↵erent weight distributions.28Chapter 4Preliminary ExperimentalResults4.1 A Test CaseTo analyze the performance of this algorithm e↵ectively, first a simple ex-ample was generated for the purposes of tracing the algorithm through thecommunity detection process. The example graph is the one featured inFigure 3.1. From that graph, four communities are determined as a resultof WSCF, as shown in subfigure f in Figure 4.1, with each community’s edgesrepresented in a di↵erent colour. Figure 3.1 is duplicated in subfigure a inFigure 4.1 for comparison.The results feature two larger communities (in purple and blue) and twosmaller communities (in green and red). The smaller communities featurea triangle-shaped community {d, g, i}, as well as a line segment community{d, g, h}. Even though there is an edge (i, h), it is not included in any ofthe communities and the two communities are not combined to a largercommunity. This is because the edge (i, h) is of relatively low weight, andthe merged community {d, g, h, i} would have a lower modularity than eachof the separate communities. This is also why the community {g, h, i} doesnot form or contribute to a community, even though there is a triangleinvolving these nodes.This decision process is exemplified in Figure 4.1, which shows the se-quential progression of WSCF on the graph. Each individual sub-figure repre-sents when the algorithm makes a change to the current set of communities,except for the removal of duplicates that happens in MergeCommunities.Subfigure a is the initial graph, a duplicate of Figure 3.1. Subfigure b showsthe communities found after the initial FindCommunities step, identify-ing four unique 3-vertex communities. As per the implementation of theprogram, the 3-vertex communities exist as either triangles or two-edge seg-ments. To show multiple communities that e↵ectively contain the same edge,multiple edges are drawn between those pairs of vertices. Although multi-294.1. A Test Casea b c de f ihg j0.70.7 0.7 0.70.70.85 0.850.850.99 0.990.99 0.99 0.990.99(a)a b c de f ihg j(b)a b c de f ihg j(c)a b c de f ihg j(d)a b c de f ihg j(e)a b c de f ihg j(f)Figure 4.1: Progression of WSCF on the example graph from Figure 3.1.Each subfigure represents a change in the community set found by WSCF.The communities are represented by coloured edges connecting nodestogether, each colour belonging to its own community. Additional edgesare drawn to represent membership of an edge to many communities.304.2. Scalability Analysisples of the same community are found, they are not shown, and they arecollapsed into unique communities during every MergeCommunities step.The next time the community set is modified is shown in subfigure c,with two new communities discovered, presented in cyan and orange. Inthe subfigure d, the newly found communities are merged with the blueone, creating one large community. Once the threshold drops to 0.7, anew community in is found in subfigure e, in orange. This community ismerged with the community that is adjacent to it in the following Mergestep, forming subfigure f . Thus, the four communities that are shown inthat subfigure are the ones discovered by WSCF.4.2 Scalability AnalysisTo test the scalability of WSCF, a series of test graphs were generated, fea-turing disjoint 3-clique, 4-clique, and 4 “square” subgraphs. The “square”subgraphs consist of four vertices with a degree of two, thus joined by onlyfour edges, forming four-sided polygon if drawn on paper. An example of a4-“square” subgraph would be a graph consisting of only the vertices e, d, jand i from Figure 3.1 or Figure 4.1a. The 3-clique subgraph can be repre-sented using vertices a, b and c, and the 4-clique subgraph can be representedusing vertices d, g, h and i if there was another edge connecting vertices dand h. The number of vertices, n, was varied so that the time elapsed untilprogram completion could be measured. The number of edges in each graphE is equal to n in the 3-clique and 4-“square” testcases, and equal to 1.5nin the case of 4-clique graphs.The program was timed using the C/C++ timer.h class, reporting theseconds elapsed from the beginning of the program to the end. Due torounding errors, it is expected that the reported elapsed times are withintwo seconds of the actual elapsed times. WSCF was tested on a 2013 MacbookPro with a 2.6 GHz Intel Core i5 processor and 8GB memory. It was testedstarting at n = 50, and subsequently larger intervals in increments of 50until the testing machine ran out of application memory.WSCF was successfully tested within a reasonable time on the 3-cliquegraphs for sizes of n up to 600. This number is reasonable because of theexperimental design required to gather the data, but will need to be im-proved for much larger graphs. A chart that compares the elapsed times ofeach of the graph sets at the di↵erent levels of n is shown in Figure 4.2.Communities were found in the 3-clique graphs as expected — one foreach 3-clique. However, in the 4-“square” graph, the communities found only314.2. Scalability AnalysisFigure 4.2: Scalability analysis results of the WSCF algorithm againstgraphs consisting of disjoint subgraphs that are 3-cliques, 4-cliques, and4-“squares”, respectively.324.3. Comparison to Benchmarksconsisted of two edge segments. The combined modularity of the “square”communities were not larger than the individual modularities of the segmentcommunities. This results in a stark di↵erence in runtime between 3-cliqueand 4-“square” test cases, due to the sheer volume of communities in the4-“square” cases, since the runtime of the program is highly dependent onthe number of communities.Although it is expected that that the 4-clique graphs would form 4-member communities, the 4-clique graphs do not form communities — theyform 3-clique communities instead. This may be due to a number of reasons,the truth of which can be explored in a future project. Because there arean unexpected number of communities in the 4-clique test case, the elapsedtime to run the program is noticeably higher, as shown in Figure 4.2.4.3 Comparison to Benchmarks4.3.1 Unweighted BenchmarksBecause this program is designed for sparse graphs, the initial unweightedbenchmarks as proposed by Girvan and Newman [11] are not representativeof the target graphs. The original benchmark set was for 128 vertices, eachwith a degree of 16, belonging to 4 communities by having a significantportion of their edges existing only within their communities, and the restexisting outside of their communities. To appropriate the same type ofmodel for sparse networks, a degree of six was imposed for each vertex,as opposed to 16. Instead of expecting the algorithm to find four largecommunities, the expected community size was set to 8, forming 16 distinctcommunities. Because these benchmarks were all unweighted, a weight of0.99 was assigned to each edge in the graph.Even though the graph generating program was modified to producegraphs that are sparse enough for the algorithm to handle, the same erroras the 4-clique community issue was encountered — the program found alarge amount of three-vertex communities without merging them togetherto form the larger communities. Perhaps this is due to the lack of iterationand direction that the program had - without high-weighted seeds (that havea higher weight than other potential seeds) to spearhead the communities,the program will just find random seeds everywhere and try to expand andmerge those to no avail.334.3. Comparison to Benchmarks259262927418781417201016231324111283131521301925612220Figure 4.3: Benchmark graph generated from [15], with the communitiesas determined by the benchmark program highlighted in di↵erent colours.Visualized using GraphViz [10].4.3.2 Weighted BenchmarksTo generate weighted benchmark graphs for analysis, the code from Lanci-chinetti and Fortunato [15] was utilized, using a relatively small base-caseof 32 nodes and an average degree of 4. The smaller graph size was chosenso that the community detection could be visually compared for “correct-ness”, that is, the correct communities of vertices are found. A weightedgraph with its appropriate community determination is shown in Figure4.3. The graph generation program grouped the vertices into four distinctcommunities, shown in di↵erent colours in the figure.To determine the communities using WSCF, the starting weight cut o↵was set to 0.95. The cuto↵s were decremented in steps of 0.05 until theweights considered reached 0.5, at which point the program terminated.The communities found by WSCF, however, are vastly di↵erent from thecommunities found by the benchmark generation program. As shown inFigure 4.4, the communities detected are much smaller and more plentifulthan the communities from Figure 4.3. Most notably, a lot of communitiesstayed at the size of n = 3, with a few small exceptions. One major issue344.3. Comparison to Benchmarks18 11281017252692927612302431211323 16315250714204182219Figure 4.4: Benchmark graph generated from [15], with the communitiesas determined by WSCF highlighted in di↵erent colours. Visualized usingGraphViz [10].with the communities detected in this figure is that one community consistsof several completely disjoint communities, which are highlighted by thered edges in the Figure 4.4. The community highlighted by the orangeedges, however, is larger than three and appears as it if could possibly bea community of interest. However, the large inclusive communities fromFigure 4.3 are not present.It was assumed that the communities determined in Figure 4.3 consistedof edges that had greater weight than the others, thus WSCF would findsimilar communities due to its preference for highly-weighted edges. It wasfound, however, that this is not the case. In fact, there are several high-weight edges that are considered outside of communities, and low-weightedges that are considered within communities, or communities consistingof mostly low-weight edges at all. In Figure 4.5, edges with weights abovethe seventy-fifth percentile are marked in red, edges with weights below thefiftieth percentile are marked in blue. Edges that have weights below thefiftieth percentile are two low to be considered originally for communitygeneration in WSCF. It is apparent that the aforementioned benchmark does354.4. Findings223126251230216242023315929272519714178160281111310418Figure 4.5: Benchmark graph generated from [15], with the edge weightsmarked by coloration. If the edge exists above the seventy-fifth percentile,it is coloured red. If the edge weight is below the fiftieth percentile, that is,too low for WSCF to consider when determining community seeds, it ismarked blue. Important to note is that there are a large number of blueedges within the communities shown in Figure 4.3, as well as a few redinter-commiunity edges. Visualized using GraphViz [10].not construct communities with edge weight as a consideration as much asnode degree and interconnectivity, which is not the goal of WSCF.4.4 FindingsWhile WSCF produced results that are not similar to the intended communi-ties imposed by the benchmarks, it can be noted that the benchmark graphshad communities that were outside of the intents and purposes of the al-gorithm, thus making them impossible to detect using WSCF. The bench-mark graph generating software tends to prefer communities that have a lotof interconnectivity regardless of the weight of the edges that connect thecommunities, which is a vital component of WSCF. Thus, the di↵ering results364.4. Findingsactually o↵er a unique perspective the weighted graphs, making sure thatedges with large weights are involved in a community if at all possible.One of the challenges for the future development of this algorithm willbe to correct the MergeCommunity metrics towards more interconnectivity,especially between high-weighted edges. Once the algorithm becomes morepreferential to clique structures, there is a good chance that it will be apowerful community detector, especially when the graph is sparse. Someother future considerations include better structures for memory manage-ment to allow for larger data sets to be included, as well as the investigationmentioned at the end of section 4.2.37Chapter 5ConclusionA significant problem in systems biology remains the expression and analysisof human biological networks over time, and the analysis of the di↵erencesbetween these networks as they change at various time points. A popularapproach to examining network structures is the detection of network com-munities. In this paper, a large amount of the current community detectionpractices were presented and discussed, and prototype for a new method,WSCF, was presented. A preliminary analysis of WSCF was given, includingscalability analysis as well as its applications on benchmark graphs as seenin literature. While its performance on the literature benchmark graphswas unexpected, this is due primarily to the actual benchmark communitygeneration algorithms that seem to ignore edge weights when generating thecommunity structure.The current available algorithms mostly consist of modularity-basedmethods, which employ a wide variety of algorithms that maximize themodularity metric to determine community structure. Simpler algorithms,like greedy algorithms, are not going to be ideal to solve such a nuancedproblem. Simulated annealing and extremal optimization both o↵er theirown solutions, at high complexity and accuracy tradeo↵s. Other methodsthat do not involve modularity are reliant on numerous types of data — sincethe problem of finding community structure in sparse di↵erential networkscan only be assumed to have a couple of types of data (correlation changebetween two genes and the expression levels of the given gene). However, ap-plying multiple dimensions to sparse di↵erential community detection mightelucidate interesting results.Because there is no single perfect algorithm for the detection of com-munity structure in this specific network type, a prototypic algorithm wasdeveloped and tested. The algorithm was heavily weighted towards involvingedges with larger weights in the community structures because this wouldgenerate communities of genes whose mutual expression levels have changedsignificantly over time. A suggested area that this may be the case is in pa-tient data, before and after drug treatment, or during disease progression.In particular, this type of analysis has the potential to pinpoint areas in38Chapter 5. Conclusionwhich mutual expression levels have been destabilized.The algorithm itself is incremental — starting with high-weight commu-nity “seeds” and expanding the community, vertex by vertex and mergingcommunities together until a larger community structure (with a high mod-ularity value) is discovered. Due to this incremental method, the priorityfor community detection is placed on the high-weight edges, with inter-connectivity as a secondary metric. That is, the higher-weight edges willbe preferentially added to a community structure than a lower-weight edgewhose adjacent vertices have a lot of connection in the community struc-ture. This is to ensure that the communities found primarily consist ofthe aforementioned disparate changes in mutual expression correlations andthus find relevant results. Additionally, the small community seed gener-ation and expansion ensures that this algorithm will find communities insparse graphs.While this algorithm is able to perform on data sets in the sizes ofhundreds, some interesting challenges for the future are to employ memorymanagement and work distribution over a cluster of processors to achievemore ecient detection. An improvement of the modularity heuristic toallow for more flexibility of preferences between higher-weighted edges andhigher connectivity edges would be beneficial to further tailor the results tothe needs of specific experimental design. Algorithms based on this heuristichave the potential to make an impact on systems biological research in thefuture.39Bibliography[1] Alex Arenas, Jordi Duch, Alberto Ferna´ndez, and Sergio Go´mez. Sizereduction of complex networks preserving modularity. New Journal ofPhysics, 9(6):176, 2007.[2] Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan.Group formation in large social networks: membership, growth, andevolution. In Proceedings of the 12th ACM SIGKDD international con-ference on Knowledge discovery and data mining, pages 44–54. ACM,2006.[3] Brian Ball, Brian Karrer, and MEJ Newman. Ecient and principledmethod for detecting communities in networks. Physical Review E,84(3):036103, 2011.[4] Nicolas Bisson, D Andrew James, Gordana Ivosev, Stephen A Tate, RonBonner, Lorne Taylor, and Tony Pawson. Selected reaction monitoringmass spectrometry reveals the dynamics of signaling through the grb2adaptor. Nature biotechnology, 29(7):653–658, 2011.[5] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Eti-enne Lefebvre. Fast unfolding of communities in large networks. J. Stat.Mech., October 2008.[6] Stefan Boettcher and Allon G Percus. Optimization with extremaldynamics. Complexity, 8(2):57–62, 2002.[7] Aaron Clauset, M. E. J. Newman, and Christopher Moore. Findingcommunity structure in very large networks. Phys. Rev. E., 70:066111,2004.[8] Jordi Duch and Alex Arenas. Community detection in complex net-works using extremal optimization. Physical review E, 72(2):027104,2005.[9] Santo Fortunato. Community detection in graphs. Physics Reports,486(3):75–174, 2010.40Bibliography[10] Emden R. Gansner and Stephen C. North. An open graph visualizationsystem and its applications to software engineering. SOFTWARE -PRACTICE AND EXPERIENCE, 30(11):1203–1233, 2000.[11] Michelle Girvan and Mark EJ Newman. Community structure in so-cial and biological networks. Proceedings of the National Academy ofSciences, 99(12):7821–7826, 2002.[12] Roger Guimera and Luis A Nunes Amaral. Functional cartography ofcomplex metabolic networks. Nature, 433(7028):895–900, 2005.[13] Trey Ideker and Nevan J Krogan. Di↵erential network biology. Molec-ular systems biology, 8(1), 2012.[14] Pall F Jonsson, Tamara Cavanna, Daniel Zicha, and Paul A Bates.Cluster analysis of networks generated through homology: automaticidentification of important protein communities involved in cancermetastasis. BMC bioinformatics, 7(1):2, 2006.[15] Andrea Lancichinetti and Santo Fortunato. Benchmarks for testingcommunity detection algorithms on directed and weighted graphs withoverlapping communities. Physical Review E, 80(1):016118, 2009.[16] Wenyuan Li, Shihua Zhang, Chun-Chi Liu, and Xianghong JasmineZhou. Identifying multi-layer gene regulatory modules from multi-dimensional genomic data. Bioinformatics, 28(19):2458–2466, 2012.[17] M. E. J. Newman. Modularity and community structure in networks.PNAS, 103(23):8577–8582, 2006.[18] Mark EJ Newman. Analysis of weighted networks. Physical Review E,70(5):056131, 2004.[19] Mark EJ Newman. Finding community structure in networks using theeigenvectors of matrices. Physical review E, 74(3):036104, 2006.[20] MEJ Newman. Communities, modules and large-scale structure in net-works. Nature Physics, 8(1):25–31, 2011.[21] Yongjin Park and Joel S Bader. How networks change with time. Bioin-formatics, 28(12):i40–i48, 2012.41Bibliography[22] Jean-Franc¸ois Rual, Kavitha Venkatesan, Tong Hao, Tomoko Hirozane-Kishikawa, Ame´lie Dricot, Ning Li, Gabriel F Berriz, Francis D Gib-bons, Matija Dreze, Nono Ayivi-Guedehoussou, et al. Towards aproteome-scale map of the human protein–protein interaction network.Nature, 437(7062):1173–1178, 2005.[23] Huawei Shen, Xueqi Cheng, Kai Cai, and Mao-Bin Hu. Detect over-lapping and hierarchical community structure in networks. Physica A:Statistical Mechanics and its Applications, 388(8):1706–1712, 2009.[24] Sriganesh Srihari and Mark A Ragan. Systematic tracking of dys-regulated modules identifies novel genes in cancer. Bioinformatics,29(12):1553–1561, 2013.[25] Thomas Thorne and Michael PH Stumpf. Inference of temporally vary-ing bayesian networks. Bioinformatics, 28(24):3298–3305, 2012.42
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Community detection in sparse time-varying networks
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Community detection in sparse time-varying networks Slind, Jillian Rae 2014
pdf
Page Metadata
Item Metadata
Title | Community detection in sparse time-varying networks |
Creator |
Slind, Jillian Rae |
Publisher | University of British Columbia |
Date Issued | 2014 |
Description | Community detection is an important aspect of network analysis that has far-reaching consequences, in particular for biological research. In the study of systems biology, it is important to detect communities in biological networks to identify areas that have a heavy correlation between one another or are significant for biological functions. If one were to model networks that evolved over time, a differential network would be a vital part or product of that analysis. One such network could have an edge between two vertices if there is a significant change in the correlation of expression levels between the two genes that the vertices are designed to model. For this particular network, there are no community detection algorithms that suffice. An analysis of the current community detection algorithms shows that most heuristic-based methods are too simple or have too high a cost for detecting communities on such sparse networks. A prototypical algorithm is presented that is preferential to high weight edges when determining community membership. This algorithm, Weighted Sparse Community Finder or WSCF, is an incremental algorithm that develops community structure from highly-weighted community seeds, which are 3-vertex substructures in the network with a high local modularity. A preliminary analysis of this detection algorithm shows that it is functional on data sets consisting of up to 600 genes, with more on a more powerful machine. The communities detected are different than the ones provided by the benchmark algorithms because of the high precedence placed on higher-weight edges. This prototypical algorithm has the potential for refinement and expansion to provide the ability to find significant results for applications in the field of Systems Biology. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2014-08-19 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivs 2.5 Canada |
DOI | 10.14288/1.0165964 |
URI | http://hdl.handle.net/2429/50043 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2014-09 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/2.5/ca/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2014_september_slind_jillian.pdf [ 1.34MB ]
- Metadata
- JSON: 24-1.0165964.json
- JSON-LD: 24-1.0165964-ld.json
- RDF/XML (Pretty): 24-1.0165964-rdf.xml
- RDF/JSON: 24-1.0165964-rdf.json
- Turtle: 24-1.0165964-turtle.txt
- N-Triples: 24-1.0165964-rdf-ntriples.txt
- Original Record: 24-1.0165964-source.json
- Full Text
- 24-1.0165964-fulltext.txt
- Citation
- 24-1.0165964.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0165964/manifest