Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Random models and heuristic algorithms for correlation clustering problems on signed social networks Wahid, Dewan Ferdous 2017

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
24-ubc_2017_may_Wahid_Dewan Ferdous.pdf [ 969.47kB ]
Metadata
JSON: 24-1.0347257.json
JSON-LD: 24-1.0347257-ld.json
RDF/XML (Pretty): 24-1.0347257-rdf.xml
RDF/JSON: 24-1.0347257-rdf.json
Turtle: 24-1.0347257-turtle.txt
N-Triples: 24-1.0347257-rdf-ntriples.txt
Original Record: 24-1.0347257-source.json
Full Text
24-1.0347257-fulltext.txt
Citation
24-1.0347257.ris

Full Text

Random Models and HeuristicAlgorithms for Correlation ClusteringProblems on Signed Social NetworksbyDewan Ferdous WahidB.Sc. Hons., University of Chittagong, 2008M.Sc., University of Chittagong, 2009A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCEinTHE COLLEGE OF GRADUATE STUDIES(Computer Science)THE UNIVERSITY OF BRITISH COLUMBIA(Okanagan)April 2017c© Dewan Ferdous Wahid, 2017The undersigned certify that they have read, and recommend to the Col-lege of Graduate Studies for acceptance, a thesis entitled: Random Modelsand Heuristic Algorithms for Correlation Clustering Problemson Signed Social Networks submitted by Dewan Ferdous Wahid inpartial fulfilment of the requirements of the degree of Master of ScienceSupervisor, Dr. Yong Gao, Professor, Computer Science, UBC, OkanaganCo-supervisor, Dr. Paramjit Gill, Associate Professor, Statistics, UBC, OkanaganSupervisory Committee Member, Dr. Heinz Bauschke, Professor, Mathematics, UBC,OkanaganUniversity Examiner, Dr. Zheng Liu, Associate Professor, School of Engineering, UBC,OkanaganExternal Examiner, Professor (please print name and faculty/school above the line)(Date Submitted to Grad Studies)Additional Committee Members include:(please print name and faculty/school above the line)(please print name and faculty/school above the line)iiAbstractIn social sciences, the signed directed networks are used to represent themutual friendship and foe attitudes among the members of a social group.Recent studies show that different real-world properties (e.g. preferential at-tachment, copying etc.) can be observed in the web-based social networks.In this thesis, we study the positive/negative - in/out - degree distributionsin three online signed directed social networks. We observe that all signed-directed degree distributions in the web-based social networks with multipleedges possibilities (in both directions) follow a power law with exponents inthe range 2.0 ≤ γ ≤ 3.5. We present three random models, which capturethe preferential attachment and copying properties, for web-based signeddirected social networks. The signed-directed degree distributions in thenetworks simulated by the proposed random models also indicate a power-law trait with an exponent in the range 2.0 ≤ γ ≤ 3.5.We also present a heuristic algorithm for the Correlation Cluster-ing (CC) which is a class of community detection problem in the signednetwork. The CC problem can be defined as follow: for a given signednetwork, finding an optimal partition in the vertices such that the edgesinside a group are positives and the edges between two groups are negative.We present the algorithm based on the relaxing integer linear programmingformulation of the minimum disagreement CC problem and rounding theapproximate ultrametric distance matrix by using a given threshold. Theexperimental results show that, in the random signed G(n, e, p) network,the runtime of this algorithm is nearly independent for the cases e ≥ 0.4and p ≤ 0.6 , where e and p are the probabilities of connecting two ver-tices by an edge and an edge to be positive respectively. But this algorithmdoes not give any convincing argument in the variation of the minimum dis-agreements due to the changing of the given threshold. We also apply thisalgorithm to the International National Bilateral Tread Growth Networkderived from the bilateral trading data in 2011-2015 from the InternationalTrade Center (ITC) to identify the groups of countries with average positivetrade growth.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . ixChapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . 1Chapter 2: Preliminaries . . . . . . . . . . . . . . . . . . . . . . 42.1 Graphs and Networks . . . . . . . . . . . . . . . . . . . . . . 42.2 Random Network Models . . . . . . . . . . . . . . . . . . . . 62.2.1 Erdo˝s-Re´nyi Model . . . . . . . . . . . . . . . . . . . . 62.2.2 Watts-Strogatz’s Small-World Model . . . . . . . . . . 62.2.3 Baraba´si-Albert Preferential Attachment Model . . . . 72.2.4 Cooper-Frieze Model . . . . . . . . . . . . . . . . . . . 82.2.5 Copying Model . . . . . . . . . . . . . . . . . . . . . . 92.2.6 k-Tree Random Model . . . . . . . . . . . . . . . . . . 102.2.7 Signed Random Network Models . . . . . . . . . . . . 112.3 Algorithmic Problems on Graphs . . . . . . . . . . . . . . . . 122.3.1 Shortest Path . . . . . . . . . . . . . . . . . . . . . . . 122.3.2 Minimum Spanning Tree . . . . . . . . . . . . . . . . . 132.3.3 Maximal Cliques Problem . . . . . . . . . . . . . . . . 142.3.4 Maximum Clique Problem . . . . . . . . . . . . . . . . 152.4 Network Communities and Clustering . . . . . . . . . . . . . 162.4.1 Cluster Graphs . . . . . . . . . . . . . . . . . . . . . . 162.4.2 Quasi-Threshold Graphs . . . . . . . . . . . . . . . . . 172.4.3 Cographs . . . . . . . . . . . . . . . . . . . . . . . . . 17ivTABLE OF CONTENTS2.4.4 Threshold Graphs . . . . . . . . . . . . . . . . . . . . 172.4.5 Community Editing Problems . . . . . . . . . . . . . . 172.5 Balance Theory and Clustering . . . . . . . . . . . . . . . . . 18Chapter 3: Random Models for Signed Directed Social Net-works . . . . . . . . . . . . . . . . . . . . . . . . . . . 203.1 Signed Directed Social Network . . . . . . . . . . . . . . . . . 203.2 Motivation for Modeling Signed Directed Social Networks . . 203.2.1 Empirical Networks . . . . . . . . . . . . . . . . . . . 213.2.2 Analyzing Real-World Networks . . . . . . . . . . . . 213.3 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . 273.4 Model A: Preferential Attachment Model . . . . . . . . . . . 283.4.1 Model Definition . . . . . . . . . . . . . . . . . . . . . 283.4.2 Degree Dynamics . . . . . . . . . . . . . . . . . . . . . 293.5 Model B: Edge Copying Model . . . . . . . . . . . . . . . . . 343.5.1 Model Definition . . . . . . . . . . . . . . . . . . . . . 343.5.2 Comparison with Preferential Attachment Model . . . 353.5.3 Notations . . . . . . . . . . . . . . . . . . . . . . . . . 363.5.4 Degree Dynamics . . . . . . . . . . . . . . . . . . . . . 363.6 Model C: Clique Copying Model . . . . . . . . . . . . . . . . 473.6.1 Model Definition . . . . . . . . . . . . . . . . . . . . . 473.6.2 Structural Balanced . . . . . . . . . . . . . . . . . . . 483.7 Simulation and Results Discussion . . . . . . . . . . . . . . . 50Chapter 4: Heuristic Algorithm for Correlation ClusteringProblems . . . . . . . . . . . . . . . . . . . . . . . . . 544.1 Correlation Clustering Problem . . . . . . . . . . . . . . . . . 544.2 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . 564.3 Heuristic Algorithm for Correlation Clustering Problems . . . 574.3.1 Integer Linear Programming Formulation . . . . . . . 574.3.2 Relaxed-ILP . . . . . . . . . . . . . . . . . . . . . . . 594.3.3 Ultrametric Distance Matrix . . . . . . . . . . . . . . 604.3.4 Closest Ultrametric . . . . . . . . . . . . . . . . . . . . 614.3.5 Rounding . . . . . . . . . . . . . . . . . . . . . . . . . 634.3.6 The Algorithm and Implementation: Summary . . . . 634.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 644.4.1 Random G(n, e, p) Signed Networks: . . . . . . . . . . 654.4.2 International Bilateral Trade Growth Rate Network . 67Chapter 5: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . 69vTABLE OF CONTENTS5.1 Random Models for Signed Directed Social Networks . . . . . 695.2 Heuristic Algorithm for Correlation Clustering Problems . . . 70Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71viList of TablesTable 3.1 Power-law exponents γ and the corresponding p-valuesfor different signed-directed-degree distributions for aboveempirical data sets. . . . . . . . . . . . . . . . . . . . 25Table 3.2 The list of observed attributes in the real-world net-works. We denote these attributes by A1, A2, and A3respectively. . . . . . . . . . . . . . . . . . . . . . . . . 26Table 3.3 Power-law exponents γ and the corresponding p-valuesfor different signed-directed-degree distributions in thesynthetic networks generated by preferential attach-ment model. . . . . . . . . . . . . . . . . . . . . . . . 50Table 3.4 Power-law exponents γ and the corresponding p-valuesfor different signed-directed-degree distributions in thenetwork instances generated by edge copying model. . 51Table 3.5 Power-law exponents γ and the corresponding p-valuesfor different signed-directed-degree distributions in thenetwork instances generated by clique copying model. 52Table 3.6 Summary of capturing observed attributes by the pro-posed random models. . . . . . . . . . . . . . . . . . . 53Table 4.1 Summary of the International Bilateral Trade GrowthRate Network 2011-2015. . . . . . . . . . . . . . . . . . 68viiList of FiguresFigure 2.1 Triads with odd number or no positive edges or arebalanced (T3, T1) and with even number edges areimbalanced (T2). . . . . . . . . . . . . . . . . . . . . 19Figure 3.1 Cumulative signed-directed-degree distribution Pcum(d)of Wiki-RfA social network. . . . . . . . . . . . . . . 22Figure 3.2 Cumulative singed-directed-degree distribution Pcum(d)of Slashdot social network. . . . . . . . . . . . . . . . 23Figure 3.3 Cumulative signed-directed-degree distribution Pcum(d)of Epinions social network. . . . . . . . . . . . . . . . 24Figure 4.1 (a) Runtime (in sec.) for changing n when e and pare fixed. (b) Runtime (in sec.) for changing e whenn = 50 and p are fixed. (c) Runtime (in sec.) forchanging p when n = 50 and e are fixed. . . . . . . . 66Figure 4.2 Minimum disagreement due the changing of thresholdin ten random signed network instances G1, ..., G10with fixed n = 100, e = 0.5, p = 0.5s. . . . . . . . . . . 67Figure 4.3 Clusters of countries when threshold = 0.45. . . . . . 68viiiAcknowledgmentsFirst, I would like to thank to my supervisor Professor Dr.Yong Gaofor his mentorship and support throughout this program and research. Hispatience, motivation, enthusiasm and immense knowledge have been thedriving force behind all the work in this thesis.A very special thank goes to my co-supervisor Dr. Paramjit Gill for hisencouragements and insightful comments.A special gratitude goes out to the university examiner Dr. Zheng Liufor his profound insights to my dissertation.I would also like to thank Dr.James Nastos for always being there when-ever I needed help in object oriented programming.Finally, I am incredibly grateful to my family, friends and roommatesfor always being the solid anchor in my life.ixChapter 1IntroductionScientists are using networks to explain different real-world complex phe-nomena for a long times. For example, in social sciences, people are usingthe concept of social interaction by the words ‘web’, ‘social fabric’ and ‘net-work.’ In 1934, Moreno [Mor34] first used a structure, called ‘sociogram’ torepresent the formal properties of social configuration. In sociogram, he rep-resented individuals by ‘points’ and their social attitudes to one another by‘lines.’ Moreno proposed to use the sociometric ‘star’ to identify leaders andisolated individuals based on the popularity of the social group. In 1946, Hei-der [Hei46] first used the signed version on the network, in which the edgesare labeled by positive and negative signs, to represent the mutual attitudesof friendship and foe behaviors in a social group respectively. Heider [Hei46]used signed directed networks to introduce the notion of balance theory. In1965 Cartwright and Harary [CH56] formalized the definition of balancestate in graph-theoretic language. According to their definition, a signednetwork is in balance state if there exists two or more subgroups/partitionsin the network such that the mutual interactions among the members in thesame subgroup are supportive (i.e. connected with positive edge) and theattitudes between two subgroups are hostile (i.e. connected with negativeedges). Later, the both directed and undirected versions of signed networkshave been growing significantly in the different scientific disciplines such ascomputer science, biology, and physics, etc.. The analysis of these networksis evolving in both data-centric and problem-centric perspectives.Before the beginning of the world-wide-web era, the signed networksinvolved a small number of vertices and edges in general and usually derivedfrom studying the physical world [TCAL16]. With the development of theonline social networks the number of the web-based signed networks, such asEpinions Trust Network [LHK10], Slashdot [LHK10], have been increasingsignificantly in the recent times. In a web-based signed social network,the mutual positive and negative interactions are often determined by thelike/dislike, or trust/distrust between two users of a common platform. Thevertex sets in the web-based signed social networks are most often enormousin size, but the networks are sometimes very sparse and noisy.1Chapter 1. IntroductionTo examine new ideas or to find solutions for the real-world complexnetwork problems, it is always desirable to test those ideas/solutions on anartificial network with tractable structural properties that can precisely sim-ulate the real-world networks phenomena. Over the years, several attemptshave been proposed to design random models for the web and social net-works that can capture different real-world properties. The first attemptto design such model can be seen in Watts-Strogatz’s [WS98] small-worldmodel proposed in 1998. This model successfully captures the ’small-world’property: having high density and small diameter, which can be found inmany real-world networks such as neural, power grid and film actors col-laboration networks. This property was studied by some social scientistsincluding Milgram [Mil67] in the early 1960s. In 1999, Faloutsos, Falout-sos, and Faloutsos [FFF99] observed that the degree distributions in manyreal-world networks such as the Internet network follow a certain power-law also known as ’scale-free’ property. The preferential-attachment modelproposed by Baraba´si and Albert [BA99] in 1999 successfully captured thescale-free property in the random networks. The copying model capturesthe process of creating a new web page by copying and then modifying linksfrom existing web-pages [KRR+00].Due to the evolution of web-based signed networks, there is now a con-siderable necessity of designing random models to capture different aspectsof these networks. Recently, Ciotti et al. [CBC+15] studied the signed-degree distributions in the signed social networks such as Epinions-trust(www.epinions.com) and Slashdot (www.slashdot.org) networks. Theirstudy suggests that the signed-degree distributions of those networks fol-low a power law with exponent 2.2 ≤ γ ≤ 4.5. Ciotti et al. [CBC+15]also proposed the power-law degree distribution model for signed undirectednetwork to capture this property. To the best of our knowledge, therehas been no study on the random modeling and the degree distributionsin the signed directed networks. In this thesis, we have studied singed-directed degree distributions in three real-world signed directed social net-works: Wikipedia Request for Adminship (www.wikipedia.org), Epinions-trust (www.epinions.com) and Slashdot (www.slashdot.org). Our studysuggests that the signed-directed degree distributions in those networks obeythe power-law with an exponent in the range 2.0 ≤ γ ≤ 3.5. Then, we pro-pose three random models for signed directed social network to capture theproperties observing in the real-world networks.According to Cartwright and Harary [CH56]’s study in 1956, a balancedsigned network can be partitioned into one or more mutually hostile (i.e.negatively connected) balanced communities. A balanced community is a2Chapter 1. Introductiongroup of vertices in a social network that is “positively connected”, i.e., themutual interaction among the member inside the community are support-ive/friendly. In 2004, Bansal et al. [BBC04] formulated the CorrelationClustering problem to find a optimal partition in the signed networks.Later, this problem is becoming a very natural way of identifying communi-ties in network analysis [MMP12] as well as other scientific areas such as ma-chine learning and data mining [CDK14, GMT07], portfolio analysis in riskmanagement [FF14, HLW02], biological system networks [HBN07, DESZ07]etc..The Correlation Clustering problem is NP-hard [BBC04]. In re-cent years, several approximate algorithms have been proposed by Bansalet.al. [BBC04], Ailon, Charikar, and Newman [AAELvZ12], Charikar et al.[CGW03], Demaine et al. [DEFI06] etc.. In this thesis we propose a heuris-tic algorithm for the Correlation Clustering problem and then applyit to International Bilateral Trade Growth network.The rest of the chapters of this thesis are organized as following.Chapter 2 introduces definitions, notation and background material forthe rest of the thesis.Chapter 3 proposes three models to generate signed directed randomnetwork in which the signed-directed-degrees follow power-law distributions.These models are preferential attachment model, edge copying model andclique copying model.Chapter 4 presents a heuristic algorithm for the Correlation Clus-tering (CC) problems by solving the relaxed integer linear program ofthe CC problem and then finding the closest ultrametric distance matrixproblem from the solution matrix of the relaxed problem.Summary and concluding remarks are presented in Chapter 5.3Chapter 2PreliminariesIn this chapter, we present graph-theoretic background material and no-tions used in the following chapters. All graph-theoretic definitions, termi-nologies, models, and algorithms are used in this thesis follow the books Net-work Analysis [BE05], Algorithm Design [KT06], and Handbook of Graphsand Networks [BS06].2.1 Graphs and NetworksA network is an abstract structure, which represents the mutual interac-tions among different objects/members called vertices. For example, a socialgroup is a network composed of vertices (members/persons) and the mutualfriendship/foe attitudes between the members as the connection betweenthese vertices. Mathematically, a network can be represented by a graph.A graph G = (V,E) is a structure formed by a set V of vertices and aset E of edges that connect pair of vertices. In this thesis, we use networkand graph interchangeably.An edge e ∈ E that connects two vertices can be written as e = (u, v), or{u, v} or simply as uv, where u, v ∈ V . Different attributes can be assignedto an edge e (e.g. sign, direction etc.), based on the nature of the relationbetween the vertices u and v.Based on the edge direction attribute, we have two types of networks:undirected and directed.An undirected network is defined by G = (V,E), where each edge inE is undirected. In an undirected network, (u, v) and (v, u) represent thesame edge. The vertices u and v are called endpoints of the edge e = (u, v).The number of distinct edges having a vertex v as an endpoint is called thedegree of v. The degree of v ∈ V in G is denoted by dG(v).A directed network is also defined byG = (V,E), where each edge (u, v) ∈E is directed. In the edge (u, v), the vertex u is called source vertex and v iscalled target vertex. That is, in directed network, (u, v) and (v, u) representtwo distinct edges between u and v in two opposite directions. The number42.1. Graphs and Networksof distinct edges having a vertex v as the source vertex is called the out-degree of v and is denoted by doutG (v). Similarly, the number of distinctedges having a vertex v as the target vertex is called the in-degree of v andis denoted by dinG (v).If we consider both edge attributes, sign and direction together, then wecan categorize two types of networks: signed undirected network, and signeddirected network.A signed undirected network or simply signed network is defined by G =(V,E, s), where V is the vertex set and E is the edge set. Also, the functions : E → {+,−} assigns a sign to each edge in E. Based on the sign, the edgeset E can be written as E = E+ ∪ E−, where E+ is the set of all positiveedges, E− is the set of all negative edges, and E+ ∩ E− = ∅. That is, theedges (u, v) = (v, u), if s(u, v) = s(v, u), where u ∈ V and v ∈ V are theendpoints. The number of distinct positive edges having a vertex v as anendpoint is called positive-degree of v and is denoted by d+G(v). Similarly,the number of distinct positive edges having a vertex v as an endpoint iscalled the negative-degree of v and is denoted by d−G(v).A signed directed network is also defined by G = (V,E, s), where Vis the set of all vertices and E is the set of all directed edges in G ands : E → {+,−}. Also, E = E+ ∪E−, E+ ∩E− = ∅, where E+ is the set ofall positive directed edges and E− is the set of all negative directed edges.The number of distinct positive edges having a vertex v as the source vertexis called the positive-out-degree of v and is denoted by d+outG (v). Also, thenumber of distinct positive edges having a vertex v as the target vertex iscalled the positive-in-degree of v and is denoted by d+inG (v). Similarly, thedistinct number negative edges having a vertex v as the source and targetvertex are defined by the negative-out-degree d−outG (v) and negative-in-degreed−inG (v) respectively.A dynamic network that evolves with the time t is denoted by Gt =(Vt, Et), where Vt and Et are the vertex and edge sets in Gt at time t. Here,the sets Vt and Et depend on time t, i.e. Vt = f(t) and Et = g(t), where fand g are some functions of t.A subgraph or subnetwork H = (V ′, E′) of a network G = (V,E) is alsoa network such that V ′ ⊆ V and E′ ⊆ E. A subgraph H = (V ′, E′) of agraph G = (V,E) is said to be an induce subgraph if V ′ ⊆ V , E′ ⊆ E, wherefor every pair u, v ∈ V ′ , (u, v) ∈ E′ only if (u, v) ∈ E.A path P in G = (V,E) is a sequence of distinct vertices v1, ..., vk suchthat (vi, vi+1) ∈ E where 1 ≤ i ≤ k. We can denote a path between twovertices u, v ∈ V by P (u, v). If there exist a path between vertices u and v,they are called connected vertices. A path v1, ..., vk is said to be a cycle if52.2. Random Network Models(v1, vk) ∈ E. A cycle of three vertices is called a triangle.A clique C is a set of vertices in G = (V,E) such that for all u, v ∈C, u 6= v implies (u, v) ∈ E. In other words, a clique is a set of verticeswhich are pairwise adjacent. A clique is said to be a maximal clique if it isnot a subgraph in any other clique. A maximum clique in G is the cliquewith the maximum number of vertices. A maximum clique is a maximalclique, but the converse is not always true.A tree is an undirected, acyclic connected graph. A graph in which everydisjoint connected component is called forest. A spanning tree T = (V,ET )of an undirected graph G = (V,E) is a tree that includes all vertices of Gand ET ⊆ E.2.2 Random Network ModelsThe proposed network models can be divided into four groups: classicalrandom network model, small-world property model, scale-free models, andsigned random network models. A brief discussion and definitions are givenin the following.2.2.1 Erdo˝s-Re´nyi ModelErdo˝s-Re´nyi [ER59] model is first classical model for generating randomnetwork.Definition 2.1 (Erdo˝s-Re´nyi Model). The model generates a networkG(n, p),where n is the fixed number of vertices and p is the probability of joiningany two vertices by an edge. Each edge is created independently of otheredges, therefore, the probability distribution P (d) of the degree d of a vertexv in G(n, p) isP (d) =(n− 1d)pd(1− p)n−1−d. (2.1)When, np the average degree of a vertex is a fixed constant c, then thisprobability approaches to the Poisson probability cde−cd! as n→∞ [BE05].2.2.2 Watts-Strogatz’s Small-World ModelMany real-world networks (e.g. neural networks, the power-grid net-work of the western US and the collaboration network of film actors) exhibit62.2. Random Network Models‘small-world’ properties, which are having a small diameter and highly clus-tered networks. Watts-Strogatz[WS98] models simulate ‘small-world’ prop-erties of real-world networks. Where the model parameters are, the numberof vertices N , the average degree of a vertex d such that ln(N) < d < N andprobability of edge rewiring β ∈ [0, 1]. The Watts-Strogatz’s Small-Worldmodel is defined as follows:Definition 2.2 (Small-World Model). Initially, the network G starts witha set of vertices V of size n placed in a cyclic order and with no edge. Then,it follows:(1) Connect each vertex v ∈ V with next d2 vertices on both sides of vfrom the cycle.(2) For each vertex u ∈ V ,(a) select a vertex v ∈ V such that u 6= v and (u, v) /∈ E,(b) rewire edge (u, v) with the probability β.When the rewiring parameter β → 1 the generated model network workapproaches to the Erdo˝s-Re´nyi network [WS98].Beside the ‘small-world’ properties, many real-world networks (e.g. Inter-net, telephone call networks etc.) show ‘scale-free’ property, which is theexistence of hubs. That is, these networks show a power-law (heavy-tailed)distribution in vertex degrees. The following network models are proposedto capture the ‘scale-free’ property of real-world networks.2.2.3 Baraba´si-Albert Preferential Attachment ModelBaraba´si-Albert [BA99] proposed this simple but elegant random modelto grasp two important phenomena of real-work networks such as world-wide-web (www) etc. The first phenomenon is growth, i.e. the networkgrows with the time and there is no restriction on the number of verticesthat can be added to the network. The second phenomenon is preferentialattachment, which often is referred as the ‘rich-getting-richer ’ property. TheBaraba´si-Albert model is defined as follows:Definition 2.3 (Baraba´si-Albert Model). At t = 0, the random processstarts with an initial connected network Gk0 of size |V0| = m0 and m0 ≥k. Then, a sequence of vertices, v1, v2, ..., vN , are entered to the existingnetwork inductively, one vertex at a time, to produce a sequence of networks72.2. Random Network Models{Gkt } by connecting with k number of existing vertices. At time t, the newvertex vt+1 enters to the network Gkt (Vt, Et) and connects with an existingvertex v ∈ Vt with the probabilityP[vt+1 connects with v] =dGkt(v)∑Nv∈Vt dGkt (v), (2.2)where, dGkt(v) is the degree of the vertex v in Gkt and∑Nv∈Vt dGkt (v) is thetotal degree of all vertices in Gkt .Baraba´si-Albert [BA99] proved the following theorem to find the powerlaw bound of the degree distribution.Theorem 2.4. The probability distribution P (d) of the degree d of a vertexis reduced to d−3 for large d when t→ 0.2.2.4 Cooper-Frieze ModelThe Cooper-Frieze model, proposed in [CF03], is a a mixture of preferen-tial attachment (by degree) and uniformly at random (u.a.r) selection. Thismodel needs to fix a set of parameters in advance. The fixed parameters aredefined as follows ([CF03]):Procedure selection at each step t:α : Probability to follow OLD procedure,1− α : Probability to follow NEW procedure,Procedure NEW:p = (pi : i ≥ 1) : Probability that new vertex generates i new edges,β : Probability that the target vertices are selected uniformly,1− β : Probability that the target vertices are selected accordant to degree,Procedure OLD:q = (qi : i ≥ 1) : Probability that existing vertex generates i new edges,δ : Probability that the source vertex is selected uniformly,1− δ : Probability that the source vertex is selected accordant to degree,γ : Probability that the target vertices are selected uniformly,1− γ : Probability that the target vertices are selected accordant to degree,This model also needs two fixed integer parameters j0 and j1 such thatpj = 0, j > j0 and qj = 0, j > j1. The model definition is given in following.82.2. Random Network ModelsDefinition 2.5 (Cooper-Frieze Model). The random process starts with aninitial network G0 with single vertex v0 and no edge. Then a sequence ofrandom networks {Gt} evolves according to the following procedure.At time t, the edges are added by choosing either NEW or OLD methodwith probability 1 − α or α respectively. In NEW method, a new vertexvt+1 is added to the network Gkt (Vt, Et), and connects vt+1 with one or moreexisting vertices. In OLD method, a number of new edges are added to aselected existing vertex v.This model follows a mixture of preferential attachment (by degree) anduniformly at random (u.a.r) rules to select the endpoints, target vertex inthe NEW method and source & target vertices in the OLD method, for thenewly added edges.Let, at time t, let E[Xd(t)] be the expected numbers of vertices withdegree d in Gt and {βd} is a sequence of positive integers. Cooper et al.[CF03] proved that the following theorem for t→∞ and small k.Theorem 2.6 ([CF03]). There exist a constant M > 0 such that almostsurely for all t, k ≥ 1|E[Xd(t)]− tβd| ≤Mt1/2 log tTherefore, for t → ∞ and small k, E[Xd(t)] can be approximated by tβd,i.e. E[Xd(t)] ≈ tβd, for t → ∞, where βd is a sequence of positive integers,which obeys power-law bounds [CF03].2.2.5 Copying ModelKumar et al. [KRR+00] proposed this model to capture the copyingproperty which is observed in the web-base networks. The core idea behindthis copying property is that, when a new web-page (vertex) is created, mostoften it copies all out-links (directed edges) from an existing web-page andthen modifies some links. Based on this observation the copying model canbe defined as follows:Definition 2.7 (Copying Model). At time t, a new vertex vt+1 enters tothe network Gkt (Vt, Et) and creates k directed edges (out-links) as follows:(1) Selects a ‘prototype’ vertex v ∈ Vt uniformly at random.(2) For all (v, w) ∈ Et, such that w ∈ Vt, adds (vt+1, w) to the networkGkt+1.92.2. Random Network Models(3) For each edge (vt+1, w) ∈ Et+1, such that w ∈ Vt+1,(i) with the probability 1 − α re-wires the edge with a randomlyselected vertex u ∈ Vt.(ii) with probability α, keeps the edge unchanged.Let E[Xd(t)] be the expected number of vertices of degree d in the net-work generated by the copying model at time d. Then the following resultswas proved by Kumar et al. [KRR+00].Theorem 2.8. For d > 0, the limit P (d) = limt→∞E[Xd(t)]t exist, and satisfiesP (d) = P (0)r∏i=11 + αi(1−α)1 + 2i(1−α)andP (d) = Θ(d2−αa−α ).The motivation of this model is that it creates a lot of induced bipartitesubgraphs that are common phenomena in real world web networks. But thenetworks generated by copying model do not show high clustering, which isanother common phenomenon of web networks.2.2.6 k-Tree Random ModelGao [Gao09] proposed the k-Tree random model which can generate ran-dom networks with a well-defined graph structure. The degree distributionin the simulated networks obeys a power-law. The model definition is givenin following ([Gao09]):Definition 2.9 (k-Tree Random Model). The random process starts with aninitial clique Gk0 of size |Vt| = k+1. A sequence of vertices, {v1, v2, ..., vN}, isadded to the existing network inductively to generate a sequence of randomnetworks {Gkt }. At time t, a new vertex vt+1 enters the existing networkGkt (Vt, Et) and generate Gkt+1 as follows.(1) Selects k-clique, Ct, uniformly at random from {Gkt }.(2) Connects vt+1 with all k vertices in CtLet, Xd be the random variable for the total number of vertices of degreed in Gkt and {βd} be a sequence of positive integers. Gao [Gao09] provedthe following theorem to approximate the expected number of vertices withdegree d.102.2. Random Network ModelsTheorem 2.10 ([Gao09]). Let, E[Xd(t)] be the expected number of be ver-tices with degree d in the random k-tree Gkt . There exists a constant N =N(k) (independent of d) such that for any n > N ,|E[Xd(t)]−−tβd| ≤ Cwhere C = C(k) is a constant that is independent of d and n and βd obeysa power law boundd−(1+ kk−1).In 2011, Sridharan et al. [SGWN11] showed that the edge embeddednessdGkt(e) = D of k-tree random network also follows a power-law D−(1+ kk−2).2.2.7 Signed Random Network ModelsRecently, Ciotti et al. [CBC+15] has proposed two models for signedsocial networks: Binomial degree distribution model and Power-law degreedistribution model.Definition 2.11 (Binomial Degree Distribution Model). This model con-structs a signed random network by applying the following procedure:(1) Generating Unsigned Network: Generate an unsigned networkG(V,E)of size |V | = N , by connecting any pair of vertices through an edgewith probability p.(2) Attributing Signs: Attribute sign to each edge in G as follows:(i) Divide all vertices in V into two groupsA andB with probabilitiesm and 1−m respectively.(ii) An edges is attributed by the positive sign if the end vertices arein the same group, otherwise attributed by the negative sign.Definition 2.12 (Power-law Degree Distribution Model). According to thismodel, we can construct a signed random network with power-law positiveand negative degree distributions by applying following procedures:(1) Generating Unsigned Network: Generate an unsigned networkG(V,E)of size |V | = N by a power-law degree distribution network model, e.g.Baraba´si-Albert model, copying model, etc..(2) Attributing signs: Attribute sign to each edge in G by following thesimilar procedure from the above Binomial Distribution Model.112.3. Algorithmic Problems on Graphs2.3 Algorithmic Problems on GraphsMany algorithmic problems have been studied on graphs over the years.Here, we only give a short discussion on the algorithmic problems relevantto this thesis.2.3.1 Shortest PathThe shortest path problem on a weighted, directed, and connected graphG = (V,E,w) in which V is the vertices set, E is the edge set, and w : E →R+0 is the distance (weight) function for each edge in E, can be defined asfollows:Problem 2.1. Shortest Path.Instance: Given a weighted, directed, and connected graph G = (V,E,w)and a fixed source vertex s.Task: Find the shortest path from s to every other vertex in v ∈ G.Different algorithms have been proposed for solving the shortest path prob-lems such as Dijkstra’s algorithm [Dij59], Bellman-Ford algorithm [Bel58],Floyed-Warshall algorithm [Flo62], etc.. Here, we discuss the Dijkstra’s al-gorithm for solving the single-source shortest path problem.Dijkstra’s Algorithm: Let S ⊆ V such that the final shortest-path ofthe vertices in S from a fixed source s have already been determined. Let S¯be the complement of S in V , i.e. S¯ = V \S and d(s, S¯) be the shortest-pathdistance from s to any vertex in S¯.Consider a path P = (s, ..., u, v) from the source vertex s to a vertexv ∈ S¯ and u ∈ S. Therefore, the path distance d(s, u) must be the shortest-path distance from s to u. Thus the shortest-path from s to v can be writtenasd(s, v) = d(s, u) + wuv,where wuv is the distance (weight) of the edge (u, v). Therefore, we can findthe shortest-path distance from s to any vertex in S¯ byd(s, S¯) = minu∈S; v∈S¯{d(s, u) + wuv}.Initially, Dijkstra’s algorithm starts with vertex set S0 = {s}. Then, asequence of vertices sets S1, S2, ... ⊆ V is constructed which satisfies thefollowing conditions.122.3. Algorithmic Problems on Graphs1. In S0, the shortest-path distance d(s, s) = 0.2. If S = {s, u1, ..., ui}, where s, u1, ..., ui ∈ V , then d(s, u1) ≤ ... ≤d(s, ui).3. In the step of constructing the set Si, the shortest-path distances fromthe source s to all of the vertices u1, ..., ui are already known.The generic Dijkstra’s algorithm can be represented by the pseudo-codegiven in Algorithm 1 .Algorithm 1: Dijkstra’s AlgorithmData: A directed, weighted graph G(V,E,w) and source vertex s.Result: S ← the set of explored vertices.d(s, u)← the shortest-path distance from s, ∀u ∈ S.Initialization: S ← s; d(s, s)← 0;while S 6= V doSelect a vertex v ∈ S¯ such thatdmin(u, v) = minu∈S;(uv)∈E{d(s, u) + wuv};d(u, v)← dmin(u, v);S ← S ∪ {v};endThe runtime of a straightforward implementation of Dijkstra’s algorithmneeds O(mn) whereas a min-priority queue implemented by a Fibonacci-heap based implementation takes O(m log n) [FT87].2.3.2 Minimum Spanning TreeLet G = (V,E,w) be an undirected, weighted connected graph in whichw : E → R+0 is the weight function that defines the weight of an edge(u, v) ∈ E by wuv. A minimum spanning tree of G is a spanning tree withminimum total edges weight. Consider the weight for an edge (u, v) ∈ E iswuv, which is defined by the cost to connect the vertices u, v ∈ V . Then, wecan define the minimum spanning tree problem as follows:Problem 2.2. Minimum Spanning Tree;Instance: An undirected, weighted connected graph G = (V,E,w).Question: Find a spanning tree T = (V,ET ) such that, w(T ) =∑(u,v)∈ET wuvis minimized. Here w(T ) is the total weight of the edges in the spanningtree T .132.3. Algorithmic Problems on GraphsThere are several algorithms for finding the minimum spanning tree: Kruskal’salgorithm [Kru56], Prim’s algorithm [Pri57], etc. In the following section,we briefly discuss the Kruskal’s algorithm.Algorithm 2: Kruskal’s Algorithm[Cor09]Input: An undirected, weighted graph G(V,E,w).Output: Minimum spanning tree T (V,ET ).Initialization: ET = φfor each vertex v ∈ V doMake-Set(v)endEsort ← sorted edges in E into nondecreasing order by weight wfor each edge (u, v) ∈ Esort doif Find-Set(u) 6=Find-Set(v) thenET ← ET ∪ (u, v)endendreturn ETKruskal’s Algorithm: In the Kruskal’s algorithm, proposed in [Kru56],the edges set ET is a forest on the vertex set V of G. At each step, Kruskal’salgorithm finds a safe edge (u, v) ∈ E, with least-edge-weight that connectstwo disjoint connected components, to add to ET . The implementation ofthis algorithm needs to use Union-Find data structure to maintain indi-vidual disjoint sets of elements. The operation Find-Set(u) is used to seewhether or not a vertex u belongs to a set by returning the representing ele-ment from the set. The operation Union(u, v) is used to merge two disjointsets that contains the vertices u and v. The pseudo-code of the Kruskal’salgorithm is given in Algorithm 2 .2.3.3 Maximal Cliques ProblemThe decision problem to identify all maximal cliques in a network canbe formulated as follows:Problem 2.3. Maximal Cliques.Instance: A graph G = (V,E).Question: Find all maximal cliques in G.142.3. Algorithmic Problems on GraphsMoon and Moser [MM65] showed that if |V | = n then G has at most 3n/3maximal cliques. Bron and Kerbosch [BK73] proposed a recursive back-tracking algorithm for identifying maximal cliques in an undirected graph.This algorithm is known as Bron-Kerbosch algorithm.Consider an undirected graph G = (V,E) and three vertices sets R,P ,and X. The Bron-Kerbosch algorithm finds the set R that belongs themaximal clique with all vertices of V , the set P that belongs the maximalcliques with some vertices of V and the null set X. The pseudo-code of theclassical implementation of the Bron-Kerbosch algorithm is given bellow.Algorithm 3: Bron-Krebosch AlgorithmBron-Krebosch(R,P,X)if P and X are both empty thenreport R as a maximal clique;endfor each vertex v in P doBronKrebosch(R ∪ v, P ∩N(v), X ∩N(v);P ← P \ v;X ← X ∪ v;end2.3.4 Maximum Clique ProblemThe problem to identify a maximum clique in a network can be formu-lated as follows:Problem 2.4. Maximum Clique.Instance: A graph G = (V,E).Task: Find a set of vertices S ⊆ V , if there exist, of size at least k suchthat for every u, v ∈ S, (u, v) ∈ E, i.e. S is a maximum clique in G.The maximum clique problem is NP hard [GJ79] and it is computationallyequivalent to some other algorithm problems, e.g., minimum vertex coverproblem, maximum independent set problem.152.4. Network Communities and Clustering2.4 Network Communities and ClusteringLet G = (V,E) be an unweighted and undirected graph in which V isthe vertex set, and E is the edge set. A community in G is a set C ⊆ V inwhich any vertex v ∈ C has comparatively more connections compared to theconnection with any other vertex in V \ C. The attribute of the high edgedensity inside a community can be categorized by different graph classes.Gao [Gao14] proposed the following general graph-theoretic definition ofcommunity.Definition 2.13 (Π-Community). A Π-community (or a Π-graph) in a net-work is a maximal, connected, and induced subgraph that belongs to theΠ-graph class.Therefore, identifying communities in a network can be interpreted asidentifying subgraphs induced by a particular Π-graph class. A Π-graph classis a subgraph defined with a particular structural property. A Π-graph classis called hereditary if it satisfies the following property: for G ∈ Π, everyinduced subgraph H of G is also in Π.In the following subsections, we briefly discuss on some of the graphclasses.2.4.1 Cluster GraphsAn undirected graph G = (V,E) is said to be a cluster graph if everyconnected component in G is a maximal clique. That is, a cluster graph isa collection of disjoint maximal cliques.For any three vertices i, j, k ∈ V , the graph G is a cluster graph if andonly if (i, j), (j, k) ∈ E =⇒ (i, k) ∈ E, where (i, j) = (j, i);∀i, j ∈ V .This above transitive property of the cluster graph can also be inter-preted as P3-free. Note that, an induced subgraph of three ordered verticesis called P3 if any those three vertices are connected as a simple path. There-fore, we can say that a graph G is a cluster graph if and only if G is P3-free.The cluster graphs have numerous applications in different fields suchas in data mining to find a group of an object with maximum inter-classsimilarity and minimum intra-class similarity [HPK11], in computationalbiology to cluster and visualize the gene expression data [SMKS03], in imagesegmentation [WL93], etc..162.4. Network Communities and Clustering2.4.2 Quasi-Threshold GraphsAn induced subgraph of four ordered vertices is said to be a P4 if the allfour of those vertices are connected as a simple path. A C4 is an inducedsubgraph of four ordered vertices in which the first and the last vertices areconnected to form a close circuit.A graph G is said to be (P4, C4)-free if and only if P4 and C4 are theforbidden induced subgraph classes in G. That is, in G there exist no pathand close circuit of size four.In literature, the (P4, C4)-free graphs are also known as quasi-thresholdgraphs [MWW89, YCC+96], comparability graphs of tree [Wol65] and triv-ially perfect graphs [Gol78]. A quasi-threshold graph can be recognized inlinear time [YCC+96].2.4.3 CographsA graph G is said to be a cograph, if there exist no P4 induced subgraphin G. That is, cograph is a P4-free graph. In a cograph every connectedinduced subgraph has a disconnected complement.In literature, different authors studied cographs independently with dif-ferent names: decay graphs [Sum74], D∗-graphs [Jun78], and 2-party graphs.2.4.4 Threshold GraphsA connected graph is 2K2-free, first studied by El-Zahar and Erdo˝s[EZE85], if it does not contain a pair of independent edges as an inducedsubgraph.A graph G is said to be threshold (or (P4, C4, 2K2)-free) if and only ifthere exist no P4, C4, 2K2 induced subgraphs in G. It was first studied byChva´tal and Hammer [CH73].2.4.5 Community Editing ProblemsBased on Gao’s [Gao14] Π-community definition, we can generalize thegraph editing problems as the Π-Community Editing, which is defined asfollows.Problem 2.5. Π-Community Editing(G)Instance: An unweighted graph G = (V,E).Task: Determine a modified graph G′ = (V,E′) with minimum number ofedge editions (insertions or deletions) in which each connected component172.5. Balance Theory and Clusteringis a Π-community.Based on the desired Π-community we can redefine Problem 2.5 to dif-ferent community editing problem. A brief on the well known communityediting problems is given next.Clustering Editing Problem: The Clustering Editing problem is aclass of community editing problems. It can be defined from Problem 2.5if the desired Π-community network is a cluster graph. The ClusteringEditing problem is a NP-hard [KM86]. In this thesis, we devote Chapter 4to design a heuristic algorithm for the Correlation Clustering Editingproblem which is a class of community editing problem on signed networks.Quasi-Threshold Editing Problem: The Quasi-Threshold Editingproblem can be formulated from Problem 2.5 if the desired Π-communitynetwork is a quasi-threshold graph. Nastos and Gao [NG13] first studied theQuasi-Threshold Editing problem to define the communities in socialnetworks. They also showed that this problem is NP-hard.Cograph Editing Problem: The Cograph Editing problem can bealso be formulated from Problem 2.5 if the desired Π-community network isa cograph. Liu et al. [LWGC12] showed that Cograph Editing problemis NP-complete. An Edge P4 centrality-based divisive algorithm for identifycograph communities in graph is proposed by Jia et al. [JGG+15].2.5 Balance Theory and ClusteringIn 1946, Heider [Hei46] first introduced the balance theory to explainthe cognitive balance by resolving the sentimental inconsistency in a socialsystem. Heider also first used the signed network to represent the mutualsentimental interaction among the members of the social system. In thepositive (negative) network each vertex represents a member/person, anda sign edge represents the mutual friend (hostile) interaction between twomembers. According to the balance theory, the balanced state in a socialsystem is based on the following principles:“friend of my friend is my friend”“enemy of my friend is my enemy”182.5. Balance Theory and ClusteringIn signed network, Heider’s balance states can be represented by a triad(sub-network with three vertices) and then the number of positive edges inthe triad. A balanced state in a triad can be achieved by only having an oddnumber of positive edges. On the other hand if a triad has an even numberof positive edges then it is imbalanced. In Fig.2.1, the triads T3 and T1 arein balanced state, and the triad T2 is in imbalanced.Davis [Dav77] extended this idea by considering “enemy of my enemyis my enemy”, which can be represented by the triad T0, is also a balancedstate. Cartwright and Harary [CH56, Har59] formalized the definition ofthe balance theory in graph-theoretic language. They also showed that asigned network is said to be in balance state or structural balanced if thevertex set of the network can be partitioned into mutually hostile subgroupsin which the internal attitudes (edges) among the members of a subgroupare friendly to each other.Figure 2.1: Triads with odd number or no positive edges or are balanced(T3, T1) and with even number edges are imbalanced (T2).A signed network is called k-clusterable or k-correlation-clusterable ifits vertex set can be partitioned into k subgroups in such way that thesigned edges inside a subgroup are positive and the signed edges betweentwo subgroups are negative.19Chapter 3Random Models for SignedDirected Social Networks3.1 Signed Directed Social NetworkIn a social group, the mutual attitude among the members (e.g. persons)can be represented by a signed directed network G = (V,E, s), where V thevertex set, E the directed-edge set and the function s : E → {+,−} assignsa sign for each directed-edge in the network [Hei46, Dav77].In G, each vertex v ∈ V represents an individual member, and eachsigned-directed-edge e ∈ E represents the attitude from a source vertex(member) to a target vertex. Based on the sign, E can be partitioned asE = E+ ∪ E−, where E+ ∩ E− = ∅, and E+ and E− are the set of allpositive and negative edges respectively. Therefore, each edge e ∈ E hastwo attributes: sign and direction. A positive edge e ∈ E+ directed from amember A to another member B indicates the friendly attitude from A toB. Similarly, a negative edge e ∈ E− directed from a member A to anothermember B indicates the hostile attitude from A to B.In the rest of this chapter, we will simply denote signed directed socialnetwork as G = (V,E).3.2 Motivation for Modeling Signed DirectedSocial NetworksOur motivation to design random models for signed directed networksfollows the observations of the signed-directed degree distributions in threereal-world signed directed social networks: Wikipedia Request for Admin-ship (WikiRfA) [WPLP14], Epinions Social Netwrok [LHK10], and SlashdotSocial Network [LHK10]. A brief description of the studied networks is givenin the following.203.2. Motivation for Modeling Signed Directed Social Networks3.2.1 Empirical NetworksWiki-RfA: To be an admin from an editor, in Wikipedia (www.wikipedia.org), an application has to be submitted either by a candidate or by anymember on behalf of a candidate. Then any member of the Wikipedia com-munity can vote to either support (+1) or oppose (−1) or neutral (0) to theadminship request. In this network, each vertex represents a communitymember (either an editor or an applicant or both) and each signed-directed-edges represents a vote from a community member to an applicant. Thedata were collected from all the votes for RfA process between 2003 to May2013. Since many candidates applied for adminship for several times dur-ing that period, there exist multiple signed-directed-edges between the samepair of members.Slashdot: The Slashdot (www.slashdot.org) is a technology-related newswebsite where users can tag each other as friends or foes. In this network,each tag is represented by a signed-directed-edge from a tag-given-user toa tag-receiving-user. If a user A tags another user B as a friend (foe) thenthere is a positive (negative) edge from A to B. This network contains81,867 vertices (users) and 545,671 edges (links).Epinions: This network data obtained from a general consumer reviewsite called Epinions (www.epinions.com), where users can post their opin-ions on various products. Users can also rate each other as trustworthy(positive) or not (negative) base on their reviews. This network contains131,828 vertices (users) and 841,372 edges (mutual attitudes).3.2.2 Analyzing Real-World NetworksTo analyze the signed-directed degree distributions in our studied real-world networks, we fit the power-law distribution p(d) ≈ d−γ to all ofthe four types signed-directed-degrees (i.e. positive-in-degree, negative-in-degree, positive-out-degree, and negative-out-degree) and calculate the val-ues of exponent γ with the corresponding p-value individually. We use theprocedure and implementations given by Clauset et al. [CSN09] to estimatethe exponents γ and the corresponding p-values. In this procedure use about2500 synthetic data sets to test the null-hypothesis against the given dataset and the corresponding p-value.The results of the fitting power law models are given in the Table 3.1.213.2. Motivation for Modeling Signed Directed Social Networks(a) (b)(c) (d)Figure 3.1: Cumulative signed-directed-degree distribution Pcum(d) of Wiki-RfA social network.223.2. Motivation for Modeling Signed Directed Social Networks(a) (b)(c) (d)Figure 3.2: Cumulative singed-directed-degree distribution Pcum(d) of Slash-dot social network.233.2. Motivation for Modeling Signed Directed Social Networks(a) (b)(c) (d)Figure 3.3: Cumulative signed-directed-degree distribution Pcum(d) of Epin-ions social network.243.2. Motivation for Modeling Signed Directed Social NetworksEmpirical data sets resultsDatasets |V |, |E| Dist. type n γ p-valueWiki-RfA 11381 pos-out-deg 9331 3.500 0.529185612 pos-in-deg 3036 3.500 0.013neg-out-deg 5170 2.640 0.010neg-in-deg 2860 3.250 0.218Slashdot 81867, pos-out-deg 42105 2.020 0.000545671 pos-in-deg 61894 2.830 0.906neg-out-deg 14611 2.000 0.000neg-in-deg 29297 2.200 0.000Epinions 131828, pos-out-deg 88180 1.730 0.000841372 pos-in-deg 69900 1.720 0.000neg-out-deg 18499 1.800 0.000neg-in-deg 31791 2.30 0.000Table 3.1: Power-law exponents γ and the corresponding p-values for differ-ent signed-directed-degree distributions for above empirical data sets.Also, the Fig.3.1 - Fig.3.3 show the cumulative signed-directed-degree dis-tributions in the studied real-world social networks.In Table 3.1, we can observe that the power-law model fitting on allof the four signed-directed-degree distributions in the Wiki-RfA network asbeing statistically significant (i.e. p-value ≥ 0.01). On the other hand, inSlashdot network, the power-law model fitting is statistically significant onlyfor the positive-in-degree distribution. But for the Epinions network, none ofthe power-law models for the signed-directed-distributions are statisticallysignificant.Again in the Table 3.1, we see that the values of the power law com-ponents γ are in the range 2.5 ≤ γ ≤ 3.5 for the Wiki-RfA and are in therange 2.0 ≤ γ ≤ 3.0 for the Slashdot. But the values of γ in Epinions areless than 2.5. Therefore, from the above observations, we can say that in anetwork in which the signed-directed-degree distributions has a tendency tofollow the power law the component γ should be in the range 2.0 ≤ γ ≤ 3.5.Let’s look at the evolving process of the Wiki-RfA network. This datawas collected over a period, and many candidates had requested for theadminship for several times during this time because of failing in the election.Therefore, there exist multiple edges between the same pair of members with253.2. Motivation for Modeling Signed Directed Social Networksthe same or different signs and directions. On the other hand, in the Slashdotand Epinions networks, there exist only one edge between a pair of members.From this above observations, we can conclude that the signed-directed-degree distributions in an attitude based signed directed social network withmultiple edge possibilities between a same pair of vertices follow the power-law.In Table 3.1, we can observe another property in the column n whichrepresents the number of vertices having positive/negative-in-degrees andpositive/negative-out-degrees in the Wiki-RfA and Slashdot networks. InWiki-RfA network, the numbers of vertices having positive-out-degrees andnegative-out-degrees are greater than the numbers of vertices having positive-in-degrees and negative-in-degrees respectively. That is the members in thisnetwork have more prone to give votes (either positive or negative) thanreceiving votes. On the other hand, in Slashdot network, the numbers ofvertices having positive-out-degrees and negative-out-degrees are less thanthe numbers of vertices having positive-in-degrees and negative-in-degreesrespectively. That is the members this network have more prone than votes(either positive or negative) compare to giving votes. But this inverse rela-tion between the numbers of in and out degree vertices (for both positiveand negative) is not visible in the case of the Epinions network.Again, from the first observation, we know that all of the four signed-directed-degree distributions in Wiki-RfA and only positive-in-degree distri-bution in Slashdot follow the power law property with the component γ inthe ranges 2.0 ≤ γ ≤ 3.5. But the signed-directed-degree distributions inEpinions network do not follow the power-law property. Now from our firstand second observation, we can say that a signed directed social networkin which the signed-directed-degree distributions have a tendency to followpower law has an inverse relation between the numbers of in and out degreevertices (for the case both positive and negative).Attributes/Properties Wiki-RfA Slashdot EpinionsA1 Power-law component in the range 2.0 ≤ γ ≤ 3.5 Yes Yes NoA2 # of in and out degree vertices are inversely related Yes Yes NoA3 Exists multiple edges in both directions Yes No NoTable 3.2: The list of observed attributes in the real-world networks. Wedenote these attributes by A1, A2, and A3 respectively.The summary of observed attributes in the real-world signed directednetworks is given in the Table 3.2.2. These observations inspired us todesign random models for signed directed networks (given in sections 3.4, 3.5,263.3. Literature Reviewand 3.6) with the above observed properties and with some other specifiedcontrolling features in the network structure.3.3 Literature ReviewThe study of classical models for random graphs or network dates backto Erdo˝s and Re´nyi with the series of papers [ER59, ER60, ER61]. Butthe first attempt to model a random network to explain real-world phe-nomena is observed in 1998 by Watts and Strogatz [WS98]. Their randommodel represents the Milgram’s [Mil67] ‘small-world’ properties which arehighly clustered and having a small diameter in social networks. In re-cent empirical studies suggests that the real-world complex networks mostlydemonstrate the ‘scale-free’ attributes. For example, the degree distribu-tions of the Internet router networks studied by Faloutsos, Faloutsos andFaloutsos [FFF99] and the telephone call networks studied by Aiello et al.[ACL00] both follows power-laws. Since then, different network models withpower-law degree distributions and with other structural features have beenproposed to duplicate the scale-free phenomena in the real-world complexnetworks.In 1999, Baraba´si and Albert [BA99] proposed a random model withpreferential attachment trait. Later, in 2003, Bolloba´s et al. [BR03] pre-sented a rigorous proof for the power-law degree distribution for this model.Preferential attachment models with adjustable parameter investigate byDorogovtsev et al. [DMS00], Aielllo et al. [ACL01] and Jordan [Jor06]. In2003, Cooper et al. [CF03] proposed a more generalized form of preferentialattachment model which removes the restrictions on the creation of edgesbetween two existing vertices and the number of new edges adding to thenetwork.Beside the preferential attachment models, Kumar et al. [KRR+00] pro-posed a random network model, called copying model, to capture the linkcopying property in creating a new web-page in the world-wide-web network.They showed that the degree distribution in this model follows the scale-freeproperty.A random model for the complex network with a well defined graphstructural property was proposed in [Gao09], which is called k-Tree randommodel. Later, Sridharan et al. [SGWN11] showed that the edge embedded-ness of k-tree random network follows a power-law distribution.In 2015, Ciotti et al.[CBC+15] proposed two models for signed socialnetworks: binomial degree distribution model and power-law degree distribu-273.4. Model A: Preferential Attachment Modeltion model. Both of these models follow two main steps to produce a signedrandom network. In the first step, models generate an unsigned network inwhich the (unsigned) degree follow either binomial or power-law degree dis-tribution. In the second step, determine sign to each edge of the simulatednetwork by dividing the vertices into two groups. If the endpoints of an edgeare in the same group then label this edge as a positive edge, otherwise; ifthe endpoints of an edge are in different groups then label this edge as anegative edge. Ciotti et al.[CBC+15] showed that the positive and negativedegree distributions in the simulated networks obey power-law. But they didnot present any analytical proof for the degree distribution in these models.The main features of these models are that the generated signed networksare structurally balanced, and that the each vertex is a member of one ofthe two mutually exclusive groups.3.4 Model A: Preferential Attachment Model3.4.1 Model DefinitionThe random process start with a signed directed initial network Gk0 =(V0, E0) of size |V0| = 2k + 1. Suppose there exist exactly k positive and knegative directed edges in Gk0. At time step t+ 1, we add a new vertex vt+1to construct Gkt+1. The new vertex vt+1 connects with k existing verticesfrom Gkt as their positive-out-neighbor with the probabilityP[vt+1 is positive-out-neighbor of v] =2d+outGkt(v)∑v d+Gkt(v); (3.1)where v ∈ Vt. Also, vt+1 connects with k existing vertices as their negative-out-neighbor with the probabilityP[vt+1 is negative-out-neighbor of v] =2d−outGkt(v)∑v d−Gkt(v); (3.2)where v ∈ Vt. On the other hand, the new vertex vt+1 also connects with kexisting vertices as their positive-in-neighbor and with k existing vertices astheir negative-in-neighbor with the following probabilities.P[vt+1 is positive-in-neighbor of v] =2d+inGkt(v)∑v d+Gkt(v); (3.3)283.4. Model A: Preferential Attachment ModelP[vt+1 is negative-in-neighbor of v] =2d−inGkt(v)∑v d−Gkt(v); (3.4)where v ∈ Vt.3.4.2 Degree DynamicsIn this section, we investigate the degree dynamics in Gkt≥1, which wesimply denote as Gkt . That is, we ignore the initial network Gk0. Let therandom model A for signed directed network generates the σ-algebra whichcan be denoted by Ft = σ(Gkt , t ≥ 1).First, we analyze the dynamics of the positive-in-degree in Gkt . Let,X+ind (t) be the random variable for the number of vertices with positive-in-degree d in Gkt . By the following lemma, we find the minimum positive-in-degree for a vertex and the total positive-degree (in and out) in Gkt .Lemma 3.1. For any vertex v ∈ Vt, positive-in-degree of v, d+inGkt (v) ≥ kand the total positive-degree in Gkt is∑v dGkt(v) = 4kt.Proof. By ignoring the initial vertices, when a new vertex enters to thenetwork, it selects k existing vertices as its positive-in-neighbors. Therefore,each vertex enters in the network with exactly k positive-in-degrees, i.e.d+inGkt(v) ≥ k, for all v ∈ Vt≥1.At time t, the new entering vertex adds k additional positive edges inthe both directions with respect to itself. That is, in total 2k new positiveedges are added at the time when the new vertex is entering to the existingnetwork Gkt . So, at the end of the time t, the total number of the positivedegrees in Gkt is increased by 4k. Therefore, if we ignore the initial network,the total number of positive degrees in Gkt is∑v dGkt(v) = 4kt.According to the model construction, at time t + 1, an existing vertexv ∈ Vt can only increase its positive-in-degree if the new vertex vt+1 connectsas a positive-in-neighbor of the vertex v. Therefore, the probability of avertex v ∈ V receives a positive-in-degree is (using Eq.(3.3))P[v receives a positive-in-degree] =2 d+inGkt(v)∑v∈Vt d+Gkt(v). (3.5)293.4. Model A: Preferential Attachment ModelAgain, for given d+inGkt(v) = d, the conditional probability that v ∈ Vt receivesa positive-in-degree is (using Eq.(3.5))P[v receives a positive-in-degree|d+inGkt(v) = d] =2d∑v∈Vt d+Gkt(v). (3.6)Since, vt+1 connects as positive-in-neighbor with k existing vertices, thenfor given Gkt , by using Eq.(3.6) and Lemma 3.1 , the expected number ofvertices with positive-in-degree d that receive a positive-in-degree in Gkt+1 is2kd4ktX+ind (t) =d2tX+ind (t) (3.7)which is independent of k.Let {β+ind } be a sequence of positive integers. We now show that,∣∣E[X+ind (t)]−tβ+ind ∣∣ is asymptotically bounded by a constant, where {β+ind }satisfies the following equationsβ+ind =d− 1d+ 2β+ind−1, β+ink ≈ 1, (3.8)as t→∞.Theorem 3.2. Let E[X+ind (t)]be the expected number of vertices withpositive-in-degree d in the random network Gkt generated by the Model A.Then ∣∣E[X+ind (t)]− β+ind t∣∣ ≤ C,where C is a constant and β+ind has a power-law bound d−3.Proof. First, consider the base case d = k. If we ignore the initial vertices,then according to the Lemma 3.1 , any vertex v ∈ Vt has at least k positive-in-degree in Gkt , i.e. d+inGkt(v) = k; ∀ v ∈ Vt. That is, if v ∈ Vt connectswith vt+1 as a positive-out-neighbor, then d+inGkt+1(v) = k + 1 . Therefore,from Eq.(3.7), for given Gkt , the expected number of vertices with positive-in-degree (k + 1) in Gkt+1 and k in Gkt isd2tX+ink . (3.9)Again, at the time step t+1, the new vertex vt+1 has exactly k-positive-in-degree and k-positive-out-degree in Gkt+1. Therefore, for given the value303.4. Model A: Preferential Attachment Modelof X+ink , the net difference in the number of vertices with positive-in-degreek from Gkt to Gkt+1 isE[X+ink (t+ 1)∣∣Ft]−X+ink (t) = 1− k2tX+ink . (3.10)By taking expectation on the both side of the Eq.(3.10), we getE[X+ink (t+ 1)]= 1− k2tE[X+ink (t)]+ E[X+ink (t)],≈ E[X+ink (t)]+ 1, (3.11)as t→∞.According to the model structure, at time t, exactly one vertex enters tothe network with positive-in-degree k. Now, if we ignore the initial network,then E[X+ink (t)]= 1 at t = 1. Then, by solving the above recurrenceEq.(3.11), we getE[X+ink (t)]= t+O(1). (3.12)Now, consider the general case d > k. Due to the model construction,we have to consider two cases to estimate the difference in the number ofvertices with positive-in-degree d during the transition from Gkt to Gkt+1.First, if the new vertex vt+1 connects with v ∈ Vt as a positive-in-neighbor,and if d+inGkt(v) = d − 1, then positive-in-degree of v increases to d in Gkt+1.Therefore, from Eq.(3.7) for given Gkt , the expected number of vertices withpositive-in-degree d in Gkt+1 and (d− 1) in Gkt isd− 12tX+ind−1. (3.13)Second, if vt+1 connects with v ∈ Vt as a positive-in-neighbor, and if d+inGkt (v) =d, then the positive-in-degree of v increases to (d + 1) in Gkt+1. Therefore,again from Eq.(3.7), for given Gkt , the expected number of vertices withpositive-in-degree (d+ 1) in Gkt+1 and d in Gkt isd2tX+ind . (3.14)Therefore, by using Eq.s (3.13) and (3.14), for given the value of X+ind , thenet difference in the number of vertices with positive-in-degree d from Gktto Gkt+1 isE[X+ind (t+ 1)∣∣Ft]−X+ind (t) = d− 12t X+ind−1 − d2tX+ind . (3.15)313.4. Model A: Preferential Attachment ModelBy taking expectation on the both side of the Eq.(3.15), we getE[X+ind (t+ 1)]= E[X+ind (t)]+d− 12tE[X+ind−1(t)]− d2tE[X+ind (t)]. (3.16)Assume β+ind = 0 for d < k in {β+ind }. Let E[X+ind (t)]can be approximatedby tβ+ind . Then, Eq.(3.16) satisfies(t+ 1)β+ind = tβ+ind +12t((d− 1)tβ+ind−1 − dtβ+ind ),β+ind =d− 1d+ 2β+ind−1. (3.17)Again, from the Eq.(3.11), as t→∞, we have(t+ 1)β+ink = 1 + tβ+ink ,β+ink = 1. (3.18)Here, Eq.s(3.17) and (3.18) give the recurrence equations, which are satisfiedby the sequence {β+ind }.Let ∆+ind (t) = E[X+ind (t)]−tβ+ind . To show that, E[X+ind (t)] can be approx-imated by tβ+ind , we need to proof by induction that∣∣∆+ind (t)∣∣ is boundedby a constant.For the base case d = k, from the Eq.s(3.12) and (3.18), we get∣∣∆+ink (t)∣∣+ tβ+ink = t+O(1),∣∣∆+ink (t)∣∣ = O(1), (3.19)i.e.∣∣∆+ink (t)∣∣ is bounded by a constant which is independent of d and t.Consider,∣∣∆+ind (t)∣∣ is also bounded by a constant which is independentof d and t. Thus, we get∣∣∆+ind (t)∣∣ = ∣∣E[X+ind (t)]− tβ+ind ∣∣ ≤ O(1). (3.20)Now, from the Eq.(3.16), we get∆+ind (t+ 1) + (t+ 1)β+ind = ∆+ind (t) + tβ+ind +12t((d− 1)(∆+ind−1(t)+ tβ+ind−1)− d(∆+ind (t) + tβ+ind )).323.4. Model A: Preferential Attachment ModelBy rearranging the above equation, we have∆+ind (t+ 1) =d− 12t∆+ind−1(t) +(1− d2t)∆+ind (t)+d− 12tβ+ind−1 −(1 +d2)β+ind . (3.21)Now, using the definition of β+ind from the Eq.(3.17), we getd− 12β+ind−1 −(1 +d2)β+ind =d− 12β+ind−1 −d+ 22β+ind=d− 12d+ 2d− 1β+ind −d+ 22β+ind= 0. (3.22)Then, from the Eq.s (3.21) and (3.22), we get∣∣∆+ind (t+ 1)∣∣ ≤ d− 12t ∣∣∆+ind−1(t)∣∣+ (1− d2t)∣∣∆+ind (t)∣∣,≤(d− 12t+2t− d2t)max(∣∣∆+ind−1(t)∣∣, ∣∣∆+ind (t)∣∣),=(1− 12t)max(∣∣∆+ind−1(t)∣∣, ∣∣∆+ind (t)∣∣), (3.23)Therefor, as t→∞ then, we get from the Eq.s (3.19), (3.20) and (3.23)∣∣∆+ind (t+ 1)∣∣ = ∣∣E[X+ind (t+ 1)]− (t+ 1)β+ind ∣∣ ≤ O(1), (3.24)i.e.,∣∣∆+ind (t+ 1)∣∣ is bounded by a constant which is independent of d and t.Hence, by using the induction hypothesis we can say that,∣∣E[X+ind (t)]−tβ+ind∣∣ is bounded by a constant which is independent of d and t, i.e.E[X+ind (t)]can be approximated by tβ+ind .To find the power-law bound for the positive-in-degree distribution in thesigned directed networks generated by the Model A, we get from the defini-tion of β+ind at Eq.(3.17), thatβ+ind =d∏i=1i− 1i+ 2≈ d−3, (3.25)which is independent of k.333.5. Model B: Edge Copying ModelSimilarly, to find the power-law bound for the positive-out-degree distri-bution, let {β+outd } be a sequence of the positive integers. Then by followingthe above procedure for positive-out-degree, we can prove the following the-orem.Theorem 3.3. Let E[X+outd (t)]be the expected number of vertices withpositive-out-degree d in the random network Gkt generated by the Model A.Then ∣∣E[X+outd (t)]− β+outd t∣∣ ≤ C,where C is a constant and β+outd has a power-law bound d−3.In similar manner, let {β−ind } and {β−outd } be two sequence of positiveintegers. Then we can also prove the following theorems for negative-in-degree and negative-out-degree distributions respectively.Theorem 3.4. Let E[X−ind (t)]be the expected number of vertices withnegative-in-degree d in the random network Gkt generated by the Model A.Then ∣∣E[X−ind (t)]− β−ind t∣∣ ≤ C,where C is a constant and β−ind has a power-law bound d−3.Theorem 3.5. Let E[X−outd (t)]be the expected number of vertices withnegative-out-degree d in the random network Gkt generated by the Model A.Then ∣∣E[X−outd (t)]− β−outd t∣∣ ≤ C,where C is a constant and β−outd has a power-law bound d−3.3.5 Model B: Edge Copying Model3.5.1 Model DefinitionInitially at t = 0, we start with a initial signed directed network Gk0 withthe vertex set of size |V0| = 2k + 1. The initial network Gk0 is connected byexactly k positive and k negative directed edges in such way that, there existexactly k number of vertices having each type of degrees: positive-in-degree,positive-out-degree, negative -in-degree, and negative-out-degree.At the time t+1, the new vertex vt+1 is added to the network to constructGkt+1. The vertex vt+1 connects with k existing vertices by copying k distinctedges from Gkt . The copy procedure obeys the following steps:343.5. Model B: Edge Copying Model(1) Selects a distinct (i.e. not copied already) signed-directed-edge e fromGkt uniformly at random (u.a.r). Let e[src], e[trg] and e[sign] are thee’s source, target vertices and e’s sign respectively.(2) Add a new signed-directed-edge(a) from vt+1 to e[trg] with probability (1 − α), by copying e’s signand source e[src] vertex.(b) or, from e[src] to vt+1 with probability α, by copying e’s sign andtarget e[trg] vertex.(3) Repeat the above procedure k times.3.5.2 Comparison with Preferential Attachment ModelBefore studying the degree dynamics in the signed directed network gen-erated by the edge copying model, first we analyze following facts.According to the edge copying model, at time t + 1, the new vertexvt+1 selects uniformly at random an edge, which to be copied, from Gkt .Regardless the signs and directions, at each time step, exactly k new edgesare added to the network. Therefore, the total number of edges in Gkt is∣∣Et∣∣ = kt+ 2k ≈ kt, (for large t) (3.26)where Et is the set of all edges in Gkt and∣∣E0∣∣ = 2k is the number of edgesin initial network Gk0.Again, according to the model construction, the new vertex vt+1 addsk new edges by randomly copying k vertices (either the source or target)from the selected k distinct edges in the existing network. Therefore, anyvertex v can only receives a positive-in-degree if v is the target vertex of theselected edge (v′, v) ∈ E+t for copying in which the source vertex v′ is copiedby the vt+1. That is, if vt+1 connects with v by the edge (vt+1, v), suchthat (vt+1, v) ∈ E+t , (v′, v) ∈ E+t and v′ is copied by vt+1, then v receives apositive-in-degree.Also, we know that the number of positive-directed-edges, in whichv ∈ Vt is the target vertex, is equal to the positive-in-degree of v in Gkt ,i.e. d+inGkt(v). Now, v receives a positive-in-degree only if vt+1 copied thesource vertex from the selected positive-directed-edge e, in which v is thetarget vertex. Therefore, in the process of copying k edges from the existingnetwork, there is a chance of copying one more edge in which v ∈ Vt is the353.5. Model B: Edge Copying Modeltarget vertex. That is, v may receive more than one positive-in-degree inthe transition from Gkt to Gkt+1.Therefore, the probability that the vertex v ∈ Vt receives exactly lpositive-in-degree in time t+ 1 isP[v receives exactly l positive-in-degree] =α(d+inGkt(v)l)(kt−d+inGkt(v)k−l)(ktk) , (3.27)where e ∈ E+t , l ≥ 1. In the Eq.(3.27), the term α is the probability ofcopying the source vertex for the selected edge (i.e. connecting vt+1 withthe target vertex) and the fractional term is the probability of selecting ledges in which v is the target vertex of e.For given d+inGkt(v) = d, the conditional probability that a vertex v ∈ Vtreceives exactly l positive-in-degree in time t+ 1 isP[v receives exactly l positive-in-degree |d+inGkt(v) = d]= α(dl)(kt−dk−l)(ktk) , (3.28)which is dependent on t, d and k.At this point, if we look at the above probability, then it is very unclearto have any preferential attachment property.3.5.3 NotationsWe introduce the following parameters:ak = 1− k − 1kt→ 1 as t→∞, (3.29)bd =k−2∏i=0(1− dkt− i)→ 1 as t→∞, (3.30)bd−1 =k−2∏i=0(1− d− 1kt− i)→ 1 as t→∞. (3.31)3.5.4 Degree DynamicsBefore investigating the degree dynamics, first, we analyze the evolutionof the number of positive and negative edges in Gkt . Consider the edge363.5. Model B: Edge Copying Modelcopying model for signed directed networks generates the σ-algebra whichis denoted by Ft = σ(Gkt , t ≥ 1).Let, X+e (t) and X−e (t) are the random variables for the number of pos-itive and negative edges in Gkt respectively. According to the model con-struction, there exist exactly k positive and k negative edges in the initialnetwork Gk0.At time t, the new vertex adds k edges (both positive and negative) tothe network by copying k distinct edges from the existing network. Thatis, a positive edge will add to network if the new vertex select a positiveexisting edge. Therefore, for given the value of X+e (t), the expected numberof newly added positive edges in Gkt+1 isk|Et|X+e (t). (3.32)Therefore, the net difference in the number of positive edges from Gkt toGkt+1 isE[X+e (t+ 1)∣∣Ft]−X+e (t) = k|Et|X+e (t). (3.33)By taking mathematical expectation on the both side of the Eq.(3.33), weget (using Eq.(3.26))E[X+e (t+ 1)]=(1 +kkt+ 2k)E[X+e (t)]=t+ 3t+ 2E[X+e (t)]. (3.34)Therefore, we can writeE[X+e (t)]=t+ 2t+ 1E[X+e (t− 1)]. (3.35)Since, E[X+e (0)]= k, by solving the Eq.(3.35), we get (using Eq.(3.26))E[X+e (t)]=t+ 22k =|Et|2. (3.36)Now, we focus on analyzing the dynamic of the positive-in-degree distribu-tion in Gkt .Let, X+ind (t) be random variable for the number of vertices with positive-in-degree d in Gkd generated by edge copying model. Now, we prove the fol-lowing lemmas for calculating the expected number of vertices with positive-in-degree d in Gkt+1.373.5. Model B: Edge Copying ModelLemma 3.6. For each d ≤ k, the expected number of vertices with positive-in-degree d in Gkt+1 satisfiesE[X+ind (t+ 1)] ≈ O(1) + (d− 1)αtakbd−1 E[X+ind−1(t)]+ (1− dαtakbd) E[X+ind (t)] +d∑l=2O(t−l);where E[X+ind−1(t)]and E[X+ind (t)]are the expected number of vertices withpositive-in-degree d − 1 and d in Gkt respectively. Also, bd and bd−1 aredefined in Eq’s(3.30) and (3.31) respectively.Proof. For the case d ≤ k, a vertex v ∈ Vt may receive at most k positive-in-degree in the transition Gkt to Gkt+1. To find the expected number ofpositive-in-degree d in Gkt+1, in this case, we have to consider the followingthree situations.The first situation is, if v ∈ Vt, with d+inGkt (v) = d− l; 1 ≤ l ≤ d, receivesexactly l positive-in-degree in Gkt+1. Then, for given Gkt , the expected num-ber of vertices with positive-in-degree d− l in Gkt and d in Gkt+1 is (by usingEq.s(3.26) and (3.28))d∑l=1α(d−ll)(kt−d+lk−l)(ktk) X+ind−l (t)= α(d−11)(kt−d+1k−1)(ktk) X+ind−1(t) + d∑l=2α(d−ll)(kt−d+lk−l)(ktk) X+ind−l (t)=α(d− 1)t− k−1kk−2∏i=0(1− d− 1kt− i)X+ind−1(t)+d∑l=2αl−1∏i=0((d− l − i)(k − i)(l − i)(kt− i)) k−l−1∏i=0(1− d− lkt− i)X+ind−l (t)=α(d− 1)takbd−1X+ind−1(t) +d∑l=2O(t−l)X+ind−l (t); (3.37)where ak = 1− k−1kt and bd−1 =k−2∏i=0(1− d−1kt−i).The second situation is, if v ∈ Vt, with d+inGkt (v) = d; 1 ≤ l ≤ d, re-ceives exactly l positive-in-degree in Gkt+1. Then, for given Gkt , the expected383.5. Model B: Edge Copying Modelnumber of vertices with positive-in-degree d in Gkt and d+ l in Gkt+1 isd∑l=1α(dl)(kt−dk−l)(ktk) X+ind (t)= α(d1)(kt−dk−1)(ktk) X+ind−1(t) + d∑l=2α(dl)(kt−dk−l)(ktk) X+ind (t)=αdt− k−1kk−2∏i=0(1− dkt− i)X+ind (t)+d∑l=2αl−1∏i=0((d− i)(k − i)(l − i)(kt− i)) k−l−1∏i=0(1− dkt− i)X+ind (t)=αdtakbdX+ind (t) +d∑l=2O(t−l)X+ind (t); (3.38)where ak = 1− k−1kt and bd =k−2∏i=0(1− dkt−i).The third situation is, whether the new vertex vt+1 has positive-in-degreed in Gkt+1 or not. The vertex vt+1 can achieve a positive-in-degree in Gkt+1 ifthe random process selects an edge (vi, vj) ∈ E+t in which the target vertexvj is selected for copying. That is, vt+1 receives a positive-in-degree in Gkt+1,if vt+1 connects with the source vertex vi from the randomly selected edge(vi, vj) ∈ E+t by the new edge (vi, vt+1) ∈ E+t+1. Therefore, according to themodel construction, the probability that vt+1 receives a positive-in-degreein Gkt+1 is(1− α)∣∣E+t ∣∣∣∣Et∣∣ , (3.39)where E+t and Et are the sets of all positive-edges and all edges in Gktrespectively.Since, vt+1 has k neighbors in Gkt+1, then we can write the expectation of theevent that the vertex vt+1 has exactly d positive-in-degree in Gkt+1, where1 ≤ d ≤ k, asE[Ikd (vt+1)∣∣Ft] = (kd)(1− (1− α)∣∣E+t ∣∣∣∣Et∣∣)k−d((1− α)∣∣E+t ∣∣∣∣Et∣∣)d=(kd)((|Et| − (1− α) |E+t |)k−d|Et|k)(1− α)d |E+t |d393.5. Model B: Edge Copying Model≤(kd)( |Et|k−d + (1− α)k−d|E+t |k−d|Et|k)(1− α)d |E+t |d=(kd)((1− α)d |E+t |d|Et|d + (1− α)k |E+t |k|Et|k); (3.40)where Ikd (vt+1) is the indicator function of the event that vertex vt+1 haspositive-in-degree d, such that 1 ≤ d ≤ k, in Gkt+1.Since, E[X+e (t)] be the expected number of positive edges in Gkt , then weget (using Eq.(3.36))|E+t | ≈ E[X+e (t)] =|Et|2. (3.41)Then, from Eq.s(3.40) and (3.41), we getE[Ikd (vt+1)∣∣Ft] ≤ (kd)(12d(1− α)d + 12k(1− α)k)= IM . (let), (3.42)where 1 ≤ d ≤ k. Here, IM , which is independent of t and equals to zero ford > k, is constant for a random process.Therefore, by using Eq.s(3.37), (3.38) and (3.42), for given the value ofX+ind (t), the net difference in the number of vertices with positive-in-degreed from Gkt to Gkt+1 can be approximated as (after rearranging)E[X+ind (t+ 1)∣∣Ft]−X+ind (t) ≈ IM + (d− 1)αtak bd−1X+ind−1(t)− dαtakbdX+ind (t) +d∑l=2O(t−l). (3.43)By taking mathematical expectation on the both side of Eq.(3.43), we getE[X+ind (t+ 1)] ≈ IM + (d− 1)αtakbd−1E[X+ind−1(t)]+(1− dαtakbd)E[X+ind (t)] +d∑l=2O(t−l); (3.44)where d ≤ k and IM is a constant which is independent of t.403.5. Model B: Edge Copying ModelLemma 3.7. For each d > k, the expected number of vertices with positive-in-degree d in Gkt+1 satisfiesE[X+ind (t+ 1)]=(d− 1)αtakbd−1E[X+ind−1(t)]+(1− dαtakbd)E[X+ind (t)] +k∑l=2O(t−l),where E[X+ind−1(t)]and E[X+ind (t)]are the expected number of vertices withpositive-in-degree d − 1 and d in Gkt respectively. Also, bd and bd−1 aredefined in Eq’s(3.30) and (3.31) respectively.Proof. For the case d > k, a vertex v ∈ Vt may receive at most k positive-in-degree in the transition from Gkt to Gkt+1. To find the expected number ofpositive-in-degree d in Gkt+1, in this case, we have to consider the followingtwo situations.The first situation is, if v ∈ Vt, with d+inGkt (v) = d− l; 1 ≤ l ≤ k, receivesexactly l positive-in-degree in Gkt+1. Then, for give Gkt , the expected numberof vertices with positive-in-degree d− l in Gkt and d in Gkt+1 isk∑l=1α(d−ll)(kt−d+lk−l)(ktk) X+ind−l (t)=α(d− 1)takbd−1X+ind−1(t) +k∑l=2O(t−l)X+ind−l (t), (3.45)where ak = 1− k−1kt and bd−1 =k−2∏i=0(1− d−1kt−i).The second situation is, if v ∈ Vt, with d+inGkt (v) = d; 1 ≤ l ≤ k, re-ceives exactly l positive-in-degree in Gkt+1. Then, for given Gkt , the expectednumber of vertices with positive-in-degree d in Gkt and d+ l in Gkt+1 isk∑l=1α(dl)(kt−dk−l)(ktk) X+ind (t)=αdtakbdX+ind (t) +k∑l=2O(t−l)X+ind (t), (3.46)where ak = 1− k−1kt and bd =k−2∏i=0(1− dkt−i).413.5. Model B: Edge Copying ModelTherefore, by using the Eq.s(3.45) and (3.46), for given the value ofX+ind (t), the net difference in the number of vertices with positive-in-degreed from Gkt to Gkt+1 is (after rearranging)E[X+ind (t+ 1)∣∣Ft]−X+ind (t) = (d− 1)αtak bd−1X+ind−1(t)− dαtakbdX+ind (t) +k∑l=2O(t−l). (3.47)By taking mathematical expectation on the both side of the Eq.(3.43), wegetE[X+ind (t+ 1)]=(d− 1)αtakbd−1E[X+ind−1(t)]+(1− dαtakbd)E[X+ind (t)] +k∑l=2O(t−l) (3.48)where d > k.Let {β+ind } be a sequence of positive integers. In next theorem, we showthat,∣∣E[X+ind (t)] − tβ+ind ∣∣ is asymptotically bounded by a constant where{β+ind } satisfies the following recurrence equationsβ+ind =d− 1d+ 1αβ+ind−1; d > k, and β+ink ' c, (3.49)as t→∞ and c is a constant.Theorem 3.8. Let E[X+ind (t)] be the expected number of vertices with positive-in-degree d in Gkt generated by Model B. If α is the probability of copingsource vertex from a randomly selected edge, then∣∣E[X+ind (t)]− tβ+ind ∣∣ ≤ O(1),where β+ind has a power-law bound d−(1+ 1α).Proof. First, we investigate the base case for d = 1. In the Eq.(3.44), thesecond term becomes zero for d = 1. Then, the expected number of verticeswith positive-in-degree 1(one) in Gkt+1 can be approximated asE[X+in1 (t+ 1)] ≈ IM + (1− αtakbd)E[X+in1 (t)] +d∑l=2O(t−l). (3.50)423.5. Model B: Edge Copying ModelBy using E[X+in1 (t)]= k at t = 0, we get from solving the Eq.(3.50)E[X+in1 (t)] ≈ IM t+O(1). (3.51)Now, we investigate the case 2 ≤ d ≤ k. We get from the Eq.(3.44), theexpected number of vertices with positive-in-degree d in Gkt+1 can be ap-proximated asE[X+ind (t+ 1)] ≈ IM + (d− 1)αtakbd−1E[X+ind−1(t)]+(1− dαtakbd)E[X+ind (t)] +d∑l=2O(t−l), (3.52)where 2 ≤ d ≤ k.Finally, for d > k the expected number of vertices with positive-in-degree din Gkt+1 is (using Eq.(3.48))E[X+ind (t+ 1)]=(d− 1)αtakbd−1E[X+ind−1(t)]+(1− dαtakbd)E[X+ind (t)] +k∑l=2O(t−l). (3.53)Assume, β+ind = 0 for d ≤ 0. Let, E[X+ind]can be approximate by tβ+ind .Then, the Eq.(3.50) satisfies(t+ 1)β+in1 ≈ IM +(1− αtakbd)tβ+in1 +d∑l=2O(t−l),(1 +dαbdak)β+in1 ≈ IM +d∑l=2O(t−l).Since, as t → ∞, then ak → 1, bd → 1, and∑dl=2O(t−l) → 0, and also IMis a constant. Therefore, from the above equation, we getβ+in1 ≈ IM ; as t→∞. (3.54)Also, from the Eq.(3.52), we get for 2 ≤ d ≤ k(t+ 1)β+ind ≈ IM +(d− 1)αtakbd−1tβ+ind−1+(1− dαtakbd)tβ+ind +d∑l=2O(t−l).433.5. Model B: Edge Copying ModelBy rearranging the above equation and using the facts that, as t→∞, thenak, bd−1, bd all approach to 1,∑dl=2O(t−l) → 0, and IM is a constant, wegetβ+ind ≈ O(1) +d− 1d+ 1αβ+ind−1; 2 ≤ d ≤ k (3.55)as t→∞.Also, from the Eq.(3.53), we get for d > k(t+ 1)β+ind =(d− 1)αtakbd−1tβ+ind−1 +(1− dαtakbd)tβ+ind +k∑l=2O(t−l) (3.56)Since, as t→∞, then ak, bd−1, bd all approach to 1,∑dl=2O(t−l)→ 0, thenby rearranging above equation we getβ+ind =d− 1d+ 1αβ+ind−1; d > k, (3.57)as t→∞.Let, ∆+ind (t) = E[X+ind (t)] − tβ+ind . To show that, E[X+ind (t)] can beapproximated by tβ+ind , we have to prove by induction that,∣∣∆+ind (t)∣∣ isbounded by a constant.For d = 1, we get from the Eq.s(3.51) and (3.54)∆+in1 (t) + tβ+in1 = IM t+O(1),∆+in1 (t) =αIM t1 + α+O(1), (3.58)Therefore, for d = 1 we get|∆+in1 (t)| ≤ O(1). (3.59)Consider∣∣∆+ind (t)∣∣ is also bounded by a constant. Thus, we can write∣∣∆+ind (t)∣∣ = ∣∣E[X+ind (t)]− tβ+ind ∣∣ ≤ O(1). (3.60)Now, from the Eq.(3.52), we get∆+ind (t+ 1) + (t+ 1)β+ind ≈ IM +(d− 1)αtakbd−1(∆+ind−1(t) + tβ+ind−1)+(1− dαtakbd)(∆+ind (t) + tβ+ind)+d∑l=2O(t−l).443.5. Model B: Edge Copying ModelBy rearranging the above equation, we get∆+ind (t+ 1) = IM +(d− 1)αbd−1tak∆+ind−1(t) +tak − dαbdtak∆+ind (t)+(d− 1)αbd−1akβ+ind−1 −dαbd + akakβ+ind +d∑l=2O(t−l). (3.61)Now, by using the definition of β+ind from the Eq.(3.55), we can write(d− 1)αbd−1akβ+ind−1 −dαbd + akakβ+ind=(d− 1)αbd−1akdα+ 1(d− 1)αβ+ind −dαbd + akakβ+ind +O(1)=1ak((dα+ 1)bd−1 − (dαbd + ak))β+ind +O(1)=(bd−1ak+dα(bd−1 − bd)ak− 1)β+ind +O(1)= Aβ+ind +O(1), (3.62)where A =(bd−1ak+dα(bd−1−bd)ak− 1).Since, bd−1 − bd < 0 and also bd−1 → 1, ak → 1 for t → ∞. Hence,A→ 0 as t→∞. Therefore, from the Eq.s(3.61) and (3.62), we get∣∣∆+ind (t+ 1)∣∣ ≤ IM + (d− 1)αbd−1tak ∣∣∆+ind−1(t)∣∣+ tak − dαbdtak ∣∣∆+ind (t)∣∣+Aβ+ind +O(1)≤ IM +((d− 1)αbd−1tak+tak − dαbdtak)max(∣∣∆+ind−1(t)∣∣, ∣∣∆+ind (t)∣∣)+Aβ+ind +O(1)= IM +(1− αbd−1tak+dα(bd−1 − bd)tak)max(∣∣∆+ind−1(t)∣∣, ∣∣∆+ind (t)∣∣)+Aβ+ind +O(1)= IM +B max(∣∣∆+ind−1(t)∣∣, ∣∣∆+ind (t)∣∣)+Aβ+ind +O(1),(3.63)where B =(1− αbd−1tak +dα(bd−1−bd)tak).453.5. Model B: Edge Copying ModelSince, bd−1 − bd < 0 and also bd−1 → 1, ak → 1 for t → ∞. Hence,B → 1 as t→∞. Hence, from the Eq.s(3.59), (3.60), and (3.63), we get∣∣∆+ind (t+ 1)∣∣ = ∣∣E[X+ind (+1)]− (t+ 1)β+ind ∣∣ ≤ O(1). (3.64)Therefore, by using the induction hypothesis, we can say that, E[X+ind (t)−tβ+ind]is bounded by a constant.Now, to find the the power-low bound for the positive-in-degree distribu-tion in GKt generated by edge copying model, we have to solve the followingrecurrence equationβ+ind =d− 1(d+ 1α)β+ind−1; for d > k, (3.65)with the initial conditionsβ+in1 ≈ IM ; for d = 1, (3.66)β+ind ≈ O(1) +d− 1(d+ 1α)β+ind−1; for 2 ≤ d ≤ k, (3.67)when t→∞.From the Eq.(3.42), we know that, IM , which is independent of t, is constantfor a random process. Therefore, in Eq.(3.66), β+in1 is also a constants.Again, from the Eq.(3.67), the first term is constant for 2 ≤ d ≤ k.Therefore, for 1 ≤ d ≤ k, we can write (using Eq.(3.66))β+ink = K (3.68)where K is a constant.Therefore, from the Eq.(3.65), we getβ+ind = Kd∏i=ki− 1i+ 1α= KΓ(k + 1α)Γ(k − 1)Γ(d)Γ(d+ 1α + 1) (3.69)By using Stirling’s approximation in the above equation, we can write β+ind ≈d−(1+ 1α).463.6. Model C: Clique Copying ModelNow, we analyze the dynamics of the positive-out-degree distribution inGkt . The model parameter 1 − α is for probability of the copying targetvertex.Let {β+outd } be a sequence of positive integers. Then, by following theabove procedure for positive-out-degrees, we can prove the following theoremfor the positive-out-degree distribution in Gkt .Theorem 3.9. Let E[X+outd (t)] be the expected number of vertices withpositive-out-degree d in Gkt generated by Model B. If 1− α is the probabilityof coping target vertex from a randomly selected edge, then∣∣E[X+outd (t)]− tβ+outd ∣∣ ≤ O(1),where β+outd has a power-law bound d−(1+ 11−α).In similar manner, let {β−ind } and {β−outd } be two sequences of positiveintegers. Then we can also prove the following theorems for the negative-in-degree and the negative-out-degree distributions respectively.Theorem 3.10. Let E[X−ind (t)] be the expected number of vertices withnegative-in-degree d in Gkt generated by Model B. If α is the probability ofcoping source vertex from a randomly selected edge, then∣∣E[X−ind (t)]− tβ−ind ∣∣ ≤ O(1),where β−ind has a power-law bound d−(1+ 1α).Theorem 3.11. Let E[X−outd (t)] be the expected number of vertices withnegative-out-degree d in Gkt generated by Model B. If 1−α is the probabilityof coping target vertex from a randomly selected edge, then∣∣E[X−outd (t)]− tβ−outd ∣∣ ≤ O(1),where β−outd has a power-law bound d−(1+ 11−α).3.6 Model C: Clique Copying Model3.6.1 Model DefinitionIn this model, we try to generalize our edge copying model for signeddirected networks. According to the edge copying model, at each time, anew vertex enters to the network and copy an existing edge u.a.r to connect473.6. Model C: Clique Copying Modelwith a vertex (either source or target) from the selected edge. Alternatively,we can consider an edge as a k-clique where k = 2 and a vertex is a (k− 1)-clique. Therefore, in other words, we can express the general form of theedge copying model as follows.Initially, we start with an arbitrarily signed directed clique Gk0 of size∣∣Vt∣∣ = k+ 1. At the time t+ 1, the new vertex vt+1 enters to the network toconstruct Gkt+1. The vertex vt+1 connects with k − 1 existing vertices andcreates a new k-clique in the following ways:(1) Select a k-clique uniformly at random form Gkt .(2) Select a vertex v from the selected k-clique uniformly at random. Thisprocess gives a (k − 1)-clique in which the vertex v does not belong.(3) Connect vt+1 with the vertices in (k − 1)-clique by copying signed-directed-edges between the vertex v and the (k − 1)-clique vertices.3.6.2 Structural BalancedThe signed directed network generated by the clique copying modelshows following structural property.Theorem 3.12. If the initial network Gk0 is structurally balanced, then thesigned directed network Gkt = (Vt, Et) generated by clique copying model isalso structurally balanced.Proof. Let, at any time t − 1, the network Gkt−1 = (Vt−1, Et−1) is struc-turally balanced. Therefore, according to the balanced theory, we can find apartition in Vt−1 such that the end vertices of a positive edge belong to thesame group, and the end vertices of a negative edge belong to two differentgroups.According to the clique model, at time t, a new vertex vt enters to thenetwork Gkt−1 and connects with all vertices in a (k − 1)-clique by copy-ing their one of the common vertices v. That is, the signed-directed-edgesbetween v and the (k− 1)-clique vertices are copied by the signed-directed-edges between vt and the (k − 1)-clique vertices. Let, V (Ck−1) is the set ofvertices in the selected (k − 1)-clique.First, assume the existing network Gkt−1 is structurally balanced. Let,k = 2, i.e., k − 1 = 1. Therefore, there exist only one vertex, let vi, in theset V (Ck−1). Then, if the edge between v and vi ∈ V (Ck−1) is positive thenthe new edge between vt and vi is also positive. In that case, vt join the vi’s483.6. Model C: Clique Copying Modelbalanced partition in Gkt . Again, if the edge between v and vi is negativethen the new edge between vt and vi is also negative. In that case, vt cre-ates a new vertex partition in Gkt . In both cases, Gkt preserves its structuralbalance.Let, k > 2. Therefore, there exist more than one vertex in the setV (Ck−1). Since Gkt−1 is structurally balanced, then any two vertices vi, vj ∈V (Ck−1) and their common neighbor vertex v are in the same partition ifthe edges among vi, vj and v are positive. If vt connects with vi and vjby copying two positive edges (v, vi) and (v, vj), then vt is also in the samepartition with vi and vj in Gkt . This addition of the new vertex preservesthe balanced state of Gkt .Again, since Gkt−1 is structurally balanced, if vi, vj and v are in twopartitions there exist exactly one positive edge among these vertices. Thenvt may copies one positive and one negative edges or both negative edges.If vt copies one positive and one negative edges, then the edge between viand vj must be a negative edge, i.e. vi and vj are in different partitions.Therefore, vt enters either vi or vj ’s partition in Gkt based on the new positiveedge. In this case, Gkt is also structurally balanced.Again, if vi, vj and v are in three different partitions there exist nopositive edge among these vertices. Then, vt copies two negative edges toconnect with vi and vj , which are already in different partitions. Therefore,vt creates a new partition in Vt. This case also preserve the balanced stateof Gkt .Next assume Gkt−1 is not balanced. Therefore, there exist at least threevertices v, vi and vj such that they are connected by exactly two positiveedges and one negative edge. Now, let the vertex vt connect with vi andvj by copying the edges (v, vi) and (v, vj). If vt copies both positive edgesthen the edge between vi and vj must be negative, which leads Gkt is notstructurally balanced.Again, if vt copies one positive and one negative edge, then edges betweenvi and vj is positive, i.e. vi and vj are in same partition. Now, vt has apositive edge and a negative edge with two vertices from the same partition,which leads Gkt is not structurally balanced.Therefore, if Gkt−1 is balanced, then Gkt is also balanced. By using backinduction, we conclude that, if the initial networkGk0 is balanced, then at anytime the network generated by the clique copying model is also structurallybalanced.493.7. Simulation and Results DiscussionModel A: Preferential Attachment Model|V |, |E| Param.’s Dist. type n γ p-value10000, k = 2 pos-out-deg 9999 2.860 0.53379968 pos-in-deg 9999 2.830 0.480neg-out-deg 9999 2.730 0.019neg-in-deg 9999 2.790 0.72410000, k = 3 pos-out-deg 9999 2.930 0.510119928 pos-in-deg 9999 2.840 0.688neg-out-deg 9999 2.840 0.449neg-in-deg 9999 2.800 0.33710000, k = 4 pos-out-deg 9999 2.800 0.840159872 pos-in-deg 9999 2.870 0.249neg-out-deg 9999 2.870 0.353neg-in-deg 9999 2.800 0.023Table 3.3: Power-law exponents γ and the corresponding p-values for differ-ent signed-directed-degree distributions in the synthetic networks generatedby preferential attachment model.3.7 Simulation and Results DiscussionIn the preferential attachment model (Model A), at each time, k numberof positive and negative directed-edges are added in the both directions (inand out) with respect to the new vertex. So according to the Theorem 3.2-Theorem 3.5, all of the signed-directed-degrees follow the same power-lawdistribution with a exponent in the range γ ≈ 3. In Table 3.3, the valuesof the exponent γ for the power-law model fitting for the signed-directed-degree distributions in the random networks generated by the preferentialattachment model is u 2.8 which supports the theoretical argument.Again, compare to our empirical study, the preferential attachment modelonly captures the observing property that the signed-directed-degree distri-butions follow the power-law with exponents in the range 2.0 ≤ γ ≤ 3.5.But this model fails to capture the another observing property of havingthe inverse relationship between the number of vertices with in-degree andout-degree (for both positive and negative). This is because, in this model,the new vertex enters to the existing network with equal numbers of all thefour types of signed-directed-degrees and there is no parameter to controlthe direction or sign of the newly added edges.503.7. Simulation and Results DiscussionModel B: Edge copying model|V |, |E| Param.’s Dist. type n γ p-value100000, k = 2, pos-out-deg 42005 2.240 0.439299996 α = 0.25 pos-in-deg 87386 3.500 0.000neg-out-deg 23300 2.310 0.146neg-in-deg 57930 3.500 0.000100000, k = 2, pos-out-deg 40771 2.850 0.165299996 α = 0.50 pos-in-deg 40797 2.770 0.119neg-out-deg 71259 2.760 0.003neg-in-deg 71379 2.780 0.012100000, k = 2, pos-out-deg 89861 3.500 0.000299996 α = 0.75 pos-in-deg 44679 2.270 0.323neg-out-deg 51817 3.500 0.000neg-in-deg 20179 2.220 0.769100000, k = 3, pos-out-deg 49327 2.260 0.246399995 α = 0.25 pos-in-deg 92175 3.500 0.000neg-out-deg 32638 2.270 0.541neg-in-deg 73278 3.500 0.000100000, k = 3, pos-out-deg 54980 2.850 0.479399995 α = 0.50 pos-in-deg 54617 2.770 0.086neg-out-deg 78642 2.890 0.251neg-in-deg 78460 2.940 0.495100000, k = 3, pos-out-deg 94961 3.500 0.000399995 α = 0.75 pos-in-deg 53777 2.250 0.654neg-out-deg 63809 3.500 0.000neg-in-deg 26806 2.270 0.870Table 3.4: Power-law exponents γ and the corresponding p-values for differ-ent signed-directed-degree distributions in the network instances generatedby edge copying model.513.7. Simulation and Results DiscussionModel C: Clique copying model|V |, |E| Param.’s Dist. type n γ p-value100000, k = 2 pos-out-deg 16705 2.510 0.012100000 pos-in-deg 16780 2.890 0.141neg-out-deg 8430 2.560 0.007neg-in-deg 8231 2.650 0.070100000, k = 3 pos-out-deg 24138 2.300 0.003199998 pos-in-deg 24611 2.420 0.146neg-out-deg 16908 2.360 0.954neg-in-deg 16902 2.370 0.200100000, k = 4 pos-out-deg 28719 2.330 0.365299995 pos-in-deg 31431 2.210 0.851neg-out-deg 19835 2.280 0.919neg-in-deg 19724 2.170 0.262Table 3.5: Power-law exponents γ and the corresponding p-values for differ-ent signed-directed-degree distributions in the network instances generatedby clique copying model.From the results given in the Table 3.4 for the edge copying model, wecan observe that the power-law model fitting for the signed-directed-degreedistributions in the random networks generated by this model are mostlystatistically significant (p-value ≥ 0.01) with components in the range 2.0 ≤γ ≤ 3.5. Therefore, this model captures the real-world signed directedsocial networks property of having signed-directed-degree distributions witha component in the range 2.0 ≤ γ ≤ 3.5.Again, in the edge copying model (Model B), the signed-directed-degreedistributions depend on the parameter α which is the probability of copyingthe target vertex from the randomly selected edge. That is, when α → 1,more vertices receive signed-out-degrees (both positive and negative) com-pare to the number of vertices that receive singed-in-degrees (both positiveand negative). In Table 3.4, the values in the column n support this ar-gument. Therefore, the edge copying model captures the inverse propertybetween the number of in and out degree vertices (positive and negative) ofthe real-world signed directed social networks.The results are given in Table 3.5, show that power law can also char-acterize the signed-directed degree distributions in random networks gener-523.7. Simulation and Results Discussionated by clique copying model with an exponent in the range 2.0 ≤ γ ≤ 3.5.Since there is no parameter for controlling the in and out direction of newlyadded edges, this model also does not capture the inverse property betweenthe number of in and out degree vertices (positive and negative) of the real-world signed directed social networks.The summary of the capturing our observed attributes (from Table 3.2.2)by the proposed random models for signed directed networks is given in thefollowing Table 3.7.Features Model A Model B Model C(+/−)-out-deg∗ d−3 d−(1+ 11−α ) No(+/−)-in-deg∗ d−3 d−(1+ 1α ) No2 ≤ γ ≤ 3.5 Yes Yes YesCaptured Attributes A3 A1, A2, A3 A1, A2, A3Table 3.6: Summary of capturing observed attributes by the proposed ran-dom models.53Chapter 4Heuristic Algorithm forCorrelation ClusteringProblems4.1 Correlation Clustering ProblemOn a given signed weighted network (either undirected or directed), inwhich each edge is labeled by either a positive or a negative sign, the Corre-lation Clustering problem is to find a partition P in the vertex set that isconsistent with the edge-sign labels as much as possible. This problem can,equivalently, be expressed in terms of two different objectives: maximumagreements and minimum disagreements. A positive edge can be regardedas a clustering agreement if the both end-vertices are in the same cluster,whereas, it can be regarded as a clustering disagreement if the end-verticesare in different clusters. On the other hand, a negative edge can be re-garded as a clustering agreement if the both end-vertices are in differentclusters, whereas, it can be regarded as clustering disagreement if the bothend-vertices are in the same cluster. For the case of maximizing agreements,the correlation clustering problem looks at the total weight of positive (+)edges inside clusters, and negative (−) edges between the clusters. On theother hand, for the case of minimizing disagreements, the correlation clus-tering problem looks at the total weight of negative (−) edges inside theclusters and positive (+) edges between the clusters. In this chapter, we de-fine the maximizing agreements and minimizing disagreements correlationclustering problems as Max-Agree-CC and Min-Agree-CC respectively.Let G = (V,E, s) be a signed network with n vertices, where every edgee = (i, j) in E has a non-negative weight wij . We also define the weightof an edge e = (i, j) equivalently as we = wij . Assume every edge e ∈ Eis labeled by a sign function s : E → {+,−}. An edge (i, j) labeled withpositive-sign (+) suggests that the vertices i and j are similar and shouldbelong to the same cluster, whereas an edge (i, j) labeled with negative-sign544.1. Correlation Clustering Problem(−) suggests that the vertices i and j are different and should be in differentclusters. Let E+ and E− denote the set of all positive and negative edges inG respectively. Therefore, we can write E = E+ ∪ E− and E+ ∩ E− = ∅.Let P = {P1, P2, ..., Pk} be a partition of V . Let C(i) be the set ofvertices in the same cluster as i. Then we can define the total weight ofthe positive and negative edges inside the clusters due to the partition Prespectively asW+IC(P) =∑{wij : (i, j) ∈ E+, i ∈ C(j)},W−IC(P) =∑{wij : (i, j) ∈ E+, i ∈ C(j)}.Similarly, the total weight of the positive and negative edges between theclusters due to the partition P respectively asW+BC(P) =∑{wij : (i, j) ∈ E+, i /∈ C(j)},W−BC(P) =∑{wij : (i, j) ∈ E+, i /∈ C(j)}.Therefore, the total weight of the positive edges insider the clusters andnegative edges between the clusters due to the partition P isfw(P) = W+IC(P) +W−BC(P) (4.1)Similarly, the total weight of the positive edges between clusters and negativeedges inside clusters due to the partition P isgw(P) = W+BC(P) +W−IC(P) (4.2)Based on the definition of Bansal et al. [BBC04], we can formulate theMax-Agree-CC and Min-Disagree-CC problems as follows:Problem 4.1 Max-Agree-CC Problem.Instance: A weighted signed graph G = (V,E, s), where |V | = n ands : E → {+,−}.Task: Find a partition P∗ of vertices such thatfw(P∗) = maxP fw(P).554.2. Literature ReviewProblem 4.2 Min-Disagree-CC Problem.Instance: A weighted signed graph G = (V,E, s), where |V | = n ands : E → {+,−}.Task: Find a partition P∗ vertices such thatgw(P∗) = minP gw(P).In this chapter, we focus on Problem 4.2 , i.e. minimizing disagreementsdue to the partition. In the following sections of this chapter we referMin-Disagree-CC Problem equivalently as Correlation ClusteringProblem or Clustering Problem.4.2 Literature ReviewThe term Correlation Clustering was first used by Doreian andMrvar [DM96] as a criteria for analyzing the structural balance in socialnetworks. In 2003, Charikar et al. [CGW03] investigate the correlationclustering editing problem on both complete and general graphs. They alsoproved that this editing problem is APX-hard on complete graphs. Bansal etal. [BBC04], in 2004, formalized the Correlation Clustering problemas an optimization problem and showed that this is a special case of Clus-tering Editing problem defined on signed network. They also showedthat this problem is a NP-hard and can be formulated in two differentways: maximum agreements (Max-Agree) and minimum disagreements(Min-Disagree). Since then, two distinct traits can be seen to solve thisproblem. Bansal et al. [BBC04] first presented a polynomial time approxi-mation scheme (PTAS) for the Max-Agree problem when the edge weightsof the signed networks are ±1. In 2006, Giotis et al. [GG06] proposed an-other PTAS for the Max-Agree problem to signed network with ±1 edgeweights to find a partition P in which the maximum number of clustersin P is fixed, say k. Coleman et al. [CSW08] presented an efficient local-search approximation for this problem when k = 2. A 0.766-approximationalgorithm for the Max-Agree problem to signed network with arbitraryedge weights was proposed by Charikar et al. [CGW05]. In 2015, Ahn etal. [ACG+15] introduced a Max-Agree of the Correlation Cluster-ing problem in the dynamic data stream model and presented a polynomialtime O(n.polylog n)-space approximation algorithm.On the other hand, Charikar et al. [CGW05] first proposed an approxi-mate algorithm to solve the Min-Disagree correlation clustering problems564.3. Heuristic Algorithm for Correlation Clustering Problemsin 2005. In 2006, Demaine et al. [DEFI06] studied this problem on gen-eral weighted graphs and presented an O(log n)-approximation algorithmbased on linear programming rounding and region growing technique. Anagent-based heuristic algorithm of the correlation clustering problems wasproposed by Yang et al. [YCL07], in which no prior knowledge on hiddencommunity structure is needed. A 3-approximation and implementable inthe computational model such as MapReduce was introduced by Chierichettiet al. [CDK14].The Correlation Clusterings problem is important in network sci-ence as well as other scientific areas [MMP12]. In social networks, thisproblem becomes a natural way to identify communities [CBGV+12] andpredicting missing edge sign in the link classification problem [CSX12]. Forexample, Figueiredo and Moura [FM13] used this problem to evaluate bal-anced partition in signed directed social networks by ignoring the edges di-rections. The Correlation Clustering problem has an significant use inthe area of machine learning and data mining [CDK14, GMT07, ACG+15],portfolio analysis in risk management [FF14, HLW02], biological systemnetworks [HBN07, DESZ07] etc..4.3 Heuristic Algorithm for CorrelationClustering Problems4.3.1 Integer Linear Programming FormulationIn this section, we restate the integer linear programming formulation ofcorrelation clustering problem on general weighted signed graph proposed byDemaine et al. [DEFI06]. We also used Gro¨tschel and Wakabayashi [GW89]integer linear programming formulation of clustering editing problem forsimplifying the constraints, which later studied by Charikar et al. [CGW03],and Bo¨cker et al. [BBK11].Consider a set of(n2)binary decision variables X = (xij ; 1 ≤ i < j ≤ n)to represent each pair of vertices in G. Then, for a given clustering partitionP, set xij = 0 if i and j are in a same cluster, and xij = 1 otherwise.Here, the solution matrix X for a given partition P can be represented asan underlying induced undirected and unsigned graph GX with the same setof vertices as G. We can define this underlying graph GX by the followingdefinition.Definition 4.1 (X-Induced Graph). A graph GX = (V,EX) is said to beX-Induced for a given matrix X if and only if (i, j) ∈ EX , i, j ∈ V then574.3. Heuristic Algorithm for Correlation Clustering Problemsxij = 0.Therefore, we can draw the relation among the signed graph G, a givensolution matrix X, and the underlying X-Induced graph GX in such waythat,if xij = 0, =⇒ (i, j) ∈ EX ,=⇒ i and j are in the same cluster in GAlternatively, we can express this relation as for a given partition P in asigned graph G if vertices i and j are in the same cluster then (i, j) ∈ GX .Now by definition, we know that if 1− xij = 1 then vertices i and j arein the same cluster, and 1 − xij = 0 then they are in the different clusters.Thus, we can express gw(P) given Eq.(4.2) as follows:gw(P) =∑(i,j)∈E+wijxij +∑(i,j)∈E−wij(1− xij), (4.3)Described in Demaine et al. [DEFI06], the integer linear programmingformulation for the Correlation Clustering Problems, given in theProblem 4.2 which minimizes the objective function given in the Eq.(4.3),can be defined as follows:min∑(i,j)∈E+wijxij +∑(i,j)∈E−wij(1− xij); ∀i, j ∈ V, (4.4)subject to: xij + xjk ≥ xik; ∀i, j, k ∈ V, (4.5)xij = xji; ∀i, j ∈ V, (4.6)xij ∈ {0, 1}; ∀i, j ∈ V. (4.7)The inequality constraint, in Eq.(4.5), enforces the condition that any dis-tinct vertices i, j, k ∈ V such that, if i and j are in a same cluster then kis also in this cluster. This is also called triangle inequality constraint. Theequality constraint, in Eq.(4.6), is to represent the undirected edge con-straint.Therefore, our goal is to solve the integer linear programming problemsgiven in Eq.s(4.4)-(4.7) to find the solution matrix X which leads us to avertex partition P. The underlying X-induced graph GX induced by thissolution matrix X will be a collection of disjoint maximal clique, in whichthe vertices set corresponding to each maximal clique represents a cluster inP.584.3. Heuristic Algorithm for Correlation Clustering Problems4.3.2 Relaxed-ILPIn this step, we relax the integer constraint in the Eq.(4.7). Then we getthe following linear programming problem:min∑(i,j)∈E+wijxij +∑(i,j)∈E−wij(1− xij); ∀i, j ∈ V, (4.8)subject to: xij + xjk ≥ xik; ∀i, j, k ∈ V, (4.9)xij = xji; ∀i, j ∈ V, (4.10)xij ∈ [0, 1]; ∀i, j ∈ V. (4.11)Based on Gro¨tschel and Wakabayashi [GW89], the linear programming for-mulation for the Correlation Clustering Problems, the above relaxedlinear programming problem, given in Eq.s(4.8)-(4.11), equivalently can bewritten as follows :min∑(i,j)||(j,i)∈E+wijxij +∑(i,j)||(j,i)∈E−wij(1− xij); ∀ 1 ≤ i < j ≤ n,(4.12)subject to: xij + xjk ≥ xik; ∀ 1 ≤ i < j < k ≤ n, (4.13)xij + xik ≥ xjk; ∀ 1 ≤ i < j < k ≤ n, (4.14)xjk + xik ≥ xij ; ∀ 1 ≤ i < j < k ≤ n, (4.15)0 ≤ xij ≤ 1; ∀ 1 ≤ i < j ≤ n. (4.16)This relaxed problem, given in Eq.s(4.12)-(4.16), can be solved by using anystandard linear programming algorithm by the time polynomial of the inputsize.Let XR = (xij ; 1 ≤ i < j ≤ n) be the solution of the above relaxedproblem. Here, we may consider XR as a distance matrix in which x :V × V → [0, 1] is the distance function with the following properties:0 ≤ xij ≤ 1; ∀ i, j ∈ V, (4.17)xij = xji; ∀ i, j ∈ V, (4.18)xij + xjk ≥ xik; ∀ i, j, k ∈ V. (4.19)Therefore, after solving the linear programming problem, given in Eq.s(4.12)-(4.16), we get a complete weighted graph GXR induced by the solution(distance) matrix XR in which all entries (distances) satisfies the above con-ditions Eq.s(4.17)-(4.19) and lies between [0, 1].594.3. Heuristic Algorithm for Correlation Clustering ProblemsAt this point, our goal is to calculate a distance matrix X∗ = (x∗ij ; 1 ≤i < j ≤ n), which is closest to the solution distance matrix XR and is afeasible solution of the integer program problem given in Eq.s(4.8)-(4.11).Thus the distance matrixX∗ satisfy the constraints given in Eq.s(4.9)-(4.11),the underlying X-Induced graph GX∗(V,EX∗) induced by X∗ satisfy , ifx∗ij = 0, (i, j) ∈ V then (i, j) ∈ EX∗ and otherwise, and then all of theconnected components in GX∗ can be interpreted as approximate cluster inthe signed graph G.4.3.3 Ultrametric Distance MatrixIn this step, we calculate the ultrametric distance matrix UX for thegiven solution distance matrix XR which is the solution of the relaxed linearprogram of the Correlation Clustering Problem. An ultrametric onthe set V is defined as follows.Definition 4.2 (Ultrametric). A distance function u : V ×V → IR+0 is saidto be ultrametric ifmax{uij , ujk} ≥ uik; ∀ i, j, k ∈ V, (4.20)where, uij is the distance between i and j for all i, j ∈ V .Ultrametric Definition as Linear Inequality: Consider the above dis-tance function as u : V × V → {0, 1}. Then the ultrametric condition givenin Eq.(4.20) can be written as:uij + ujk ≥ uik (4.21)which is equivalence to the triangle inequality constraint, given in Eq.(4.5),in theCorrelation Clustering Problem given in Eq.s(4.4)-(4.7). Basedon the Definition 4.2 and Eq.(4.21), we define the following definition.Definition 4.3 (0-1 Ultrametric Distance Matrix). A distance matrix Uis said to be 0-1 Ultrametric Distance Matrix if each of the elements inU satisfies the linear inequality conditions given in Eq.(4.21), where u :V × V → {0, 1}.From the above Definition 4.3 and Eq.(4.21), we can say that any fea-sible solution matrix of the integer linear programming formulation for theCorrelation Clustering Problem problem given in Eq.s(4.4)-(4.7) isalso a 0-1 Ultrametric Distance Matrix.604.3. Heuristic Algorithm for Correlation Clustering ProblemsTherefore, here, our goal to find the closest 0-1 Ultrametric DistanceMatrix UX for a given solution distance matrix XR in which all entries sat-isfy the distance function x : V × V → [0, 1]. Here, XR is the solution ofthe relaxed linear programming problem given in Eq.s(4.12)-(4.16). Thisproblem can be formulated as follows:Problem 4.3. 0-1 Ultrametric Distance Matrix.Instance: A distance matrix XR with x : V × V → [0, 1].Task: Find a 0-1 Ultrametric Distance Matrix UX , where u : V × V →{0, 1}, in minimum distance (cost).In the above problem, finding UX with minimum cost is hard. We cansolve this issue by using two following steps: approximation and rounding.In the first step, we solve the Problem 4.3 to approximate the closest ul-trametric distance matrix by relaxing the integer constraint. The relaxedversion of the Problem 4.3 can be described as follows:Problem 4.4. Closest Ultrametric.Instance: A distance matrix XR with x : V × V → [0, 1].Task: Find the closest ultrametric distance matrix UR with u : V × V →[0, 1].After finding the ultrametric distance matrix UR (relaxed) by solving Prob-lem 4.4 , we can use a rounding method by using a given threshold k todetermine the 0-1 ultrametric distance matrix UX . The rounding problemcan be formulated as follows:Problem 4.5. Rounding.Instance: A distance matrix UR = (uij) with u : V × V → [0, 1] and agiven threshold k.Task: Find a distance matrix UX = (u∗ij), such that u∗ : V × V → [0, 1] byusing a rounding process.4.3.4 Closest UltrametricIn this section, we focus on solving the Problem 4.4 , which is a closestultrametric problem. The complexity and algorithm for finding the closestultrametric from V × V , where V is set of vertices of a complete weightedgraph G′ = (V,E), depends on the type of distortion we are looking for.The Problem 4.4 , which is finding and ultrametric u which is closest to x614.3. Heuristic Algorithm for Correlation Clustering Problemson V can be formulated under lp-distortion as follows:Problem 4.6. Closest Ultrametric (lp-Distortion).Instance: A distance matrix XR with x : V × V → [0, 1].Task: Find a ultrametric distance matrix UX with u : V × V → [0, 1], suchthatminu∈Umaxi,j∈V( ∑i,j∈V|uij − xij |p)1/p,where u : V × V → IR+0 and U is the set of all ultrametrics on V .Krˇiva´nek and Mora´vek[KM86] proved that this problem is NP-hard forp = 1, i.e. for the case of additive distortion. Later, Harb et al.[HKM05]proved it is APX-hard for any fixed p ≥ 1.Again, the Problem 4.4 of finding the closest ultrametric u on V canbe formulated under l∞-distortion as follows:Problem 4.7. Closest Ultrametric (l∞-Distortion).Instance: A distance matrix XR with x : V × V → [0, 1].Task: Find a ultrametric distance matrix UX with u : V × V → [0, 1], suchthatminu∈Umaxi,j∈V|uij − xij |,where u : V × V → IR+0 and U is the set of all ultrametrics on V .Krˇiva´nek [Krˇi88] showed that the complexity of the algorithm to solve Prob-lem 4.7 , i.e. to find closest ultrametric u on V from x under l∞-distortion isO(n3). In Krˇiva´nek’s algorithm, the ultrametric distance between verticesi, j ∈ V are adjusted by the ‘bottleneck ’ in the minimum spanning tree T onG′. This bottleneck in T can be defined asmaxe∈T (i,j)xe, (4.22)where T (i, j) is the path between the vertices i and j in T . It can be notedthat, the graph G′ may have more than one minimum spanning tree, butthe value in Eq.(4.22) is independent of the selection of T . Krˇiva´nek[Krˇi88]proved the following theorem:Theorem 4.4 ([Krˇi88]). If T be a minimum spanning tree on a complete624.3. Heuristic Algorithm for Correlation Clustering Problemsweighted graph G, then2 minu∈Umaxi,j∈V|xij − uij | = maxi,j∈V{xij − maxe∈T (i,j)xe} (4.23)where T (i, j) is the edge set of the path between i to j in T , xe is the edgeweight (distance) of an edge e ∈ T (i, j), and U is the set of all ultrametricson V .Krˇiva´nek[Krˇi88] also proved that, for a given weighted completed graphG = (V,E) and a minimum spanning tree T = (V,ET ) on G, an ultrametricu∗ : V × V → IR+0 on V such thatu∗ij =12maxe∈T (i,j){xe + x′e}; for all i, j ∈ V, (4.24)satisfies Eq.(4.23), where x′e is the adjustment valuation can be defined byx′ij = maxe∈T (i,j)xe; for each (i, j) /∈ ET and, (4.25)x′e = max(i,j)/∈ET{x′i,j , xe}; for each e ∈ ET . (4.26)In this point, by using Eq.s(4.24)-(4.26), our goal is to find the closest ul-trametric distance matrix UX on V in which u∗ : V × V → IR+0 from thegiven distance matrix XR obtained from the solution of the relaxed linearprogram given in Eq.s(4.12)-(4.16).4.3.5 RoundingIn this step, our focus to solve the Problem 4.5 to get an 0-1 ultrametricdistance matrix UX = (u∗ij); ∀i, j ∈ V , from the calculated relaxed ultra-metric distance matrix UR with the distance function u : V × V → [0, 1],and a given threshold k. We use a simple rounding process such that, takeu∗ij = 0 if uij ≤ k for each i, j ∈ V , otherwise u∗ij = 1.4.3.6 The Algorithm and Implementation: SummaryThe algorithmic steps and implementations procedures of the above al-gorithm for solving the Correlation Clustering Problems are givenin the follows:Step 1: Solve the relaxed linear program problem, given in Eq.s(4.12)-(4.16). LetGXR = (V,EXR) be the underlyingX-induced complete weighted-graph by the solution (distance) matrix XR. Each real number xij ∈ XRrepresents the weight (distance) corresponds to the edge (i, j) ∈ EXR .634.4. Experimental ResultsStep 2: Find a minimum spanning-tree TXR = (V,ET ) of GXR . We useKruskal’s[Kru56] algorithm for minimum spanning-tree.Step 3: Compute adjust valuation x′ij for all i, j ∈ V from Eq.s(4.24)-(4.26). This can be implemented as follows:Algorithm 4: u∗ : V × V → [0, 1].Data: Complete, weighted graph G(V,E) and a spanning treeT (V,ET ) on G.Result: Ultrametric distance matrix UX .for each (i, j) /∈ ET doT (i, j)← set of edges in the path between i and j;x′ij ← maxe∈T (i,j) xe;endfor each e ∈ ET dox′e ← max{x′ij ; (i, j) /∈ ET & x′ij = xe}endfor each (i, j) ∈ E doT (i, j)← set of edges in the path between i and j;u∗ij ← 12 maxe∈T (i,j){x′e + xe};endStep 4: Find UX = (u∗ij) from UR = (uij) by using the given threshold kand return the partition P such that the vertices i and j are in same clusterif u∗ij <= k; ∀i, j ∈ V . Otherwise, i and j are in different clusters.Corollary 4.1. (Complexity) The proposed heuristic algorithm runs inpolynomial-time of the input network size.Proof. The complexity of the step 1 for solving relaxed linear program ispolynomial with the input graph size [Meg86]. In step 2 and 3, the com-plexity of finding the closest ultrametric distance matrix from the solutionmatrix XR is O(n3) [Krˇi88]. Finally, the complexity of a straight forwardimplementation of the rounding in step 4 is O(n2). 4.4 Experimental ResultsEvaluation Platform: We implements the proposed algorithm in Javaand use the IBM CPLEX V.12.1 solver for solving the relaxed ILP problem.We also use graph package jGrapht to deal with the graph properties. The644.4. Experimental Resultsrunning times we calculated on a system with Intel Core i5 @ 1.70 GHz, 64bit and 8GB memory.4.4.1 Random G(n, e, p) Signed Networks:We generate random G(n, e, p) network instances, in which n the numberof vertices and e is the probability of connecting two vertices, and if thereis an edge, then p is the probability for that edge is positive. Therefore, theprobability of connecting two vertices with a positive edge is ep and with anegative edge is e(1−p). The experimental results of the proposed heuristicalgorithm are given in Fig.4.1 for different random network instances.According to Corollary 4.1 any striaght forward implementation of theproposed algorithm should run in polynomial time. In Fig.4.1(a), it lookslike the running time graphs for changing networks size in different networkinstances are polynomial except for the case when e = 0.7, p = 0.7. For thiscase the run time graph seems like increasing exponentially. This exceptionmay arise due to some issues in our implementation which we failed toidentify.In the proposed heuristic algorithm, after solving the ILP-relaxed prob-lem, we deal with the complete induced weighted network to determine the0 − 1-ultrametric distance matrix. Also in the ILP-relaxed problem, thenumber of decision variable only depends on the size of the vertex set andindependent from the size of the edge set. Therefore, according to our hy-pothesis, the run time should be independent from the edge density (forboth positive and negative edges). The Fig.4.1(b) support this hypothesisfor the cases when e ≥ 0.4. With the same argument, the runtime shouldbe independent from the ratio of positive or negative edge densities. TheFig.4.1(c) also supports the argument for the cases p ≤ 0.6.Next, we tested the variations of the minimum disagreements due to thepartition with the changing of the given threshold. For do this we havetested the variations in ten random signed G(n, e, p) networks with fixedn = 100, e = 0.5, p = 0.5. The results, in Fig.4.2, shows an inconclusiveargument on the relation between the minimum disagreement due to thepartition and user-given threshold.654.4. Experimental Results(a)(b)(c)Figure 4.1: (a) Runtime (in sec.) for changing n when e and p are fixed. (b)Runtime (in sec.) for changing e when n = 50 and p are fixed. (c) Runtime(in sec.) for changing p when n = 50 and e are fixed.664.4. Experimental ResultsFigure 4.2: Minimum disagreement due the changing of threshold in tenrandom signed network instances G1, ..., G10 with fixed n = 100, e = 0.5, p =0.5s.4.4.2 International Bilateral Trade Growth Rate NetworkThe International Trade Centre (ITC)1 is an auxiliary group of the WorldTrade Organization (WTO)2 and the United Nations Conference on Tradeand Development (UNCTAD)3. It provides trade-related technical assistanceand bilateral trading data between its member countries and economical ter-ritories. We have collected bilateral average trade growth rate between theyears 2011-2015 among 231 countries and territories. The network con-sist 231 vertices (country or territory) and 16,356 edges. Each signed andweighted edge indicates the average of the total trade rate (import and ex-port) between two members. The sign of each edge depends on the positive-negative growth rate. The summary of the trade growth rate network isgiven in the following Table 4.3 .We have solved to find the partition in the country set by using theproposed heuristic algorithm and different thresholds. At threshold = 0.45the algorithm return 189 clusters. Most of these clusters include a singlecounty or economic territory except few. The clusters with more than threecountries are given in Fig.4.3. Again by using threshold = 0.5 the algorithmreturns only 5 (five) clusters, in which all of the countries are in one clusterexcepts the countries: Iran, Kazakhstan, Greenland, Syrian Arab Republic.These four countries are in four separate clusters. From this result, the only1http://www.intracen.org/2https://www.wto.org3http://www.unctad.org674.4. Experimental ResultsNumber of Countries 231Edges 16356Positive edges 7471Negative edges 8885Balance triangles 340495Imbalance triangles 353107Table 4.1: Summary of the International Bilateral Trade Growth Rate Net-work 2011-2015.Figure 4.3: Clusters of countries when threshold = 0.45.information we can predict that due to the UN economic sanction on Iranand recent Syrian civil war the bilateral trading with these two countrieswith rest of the world has been drastically decreased in the period 2011-2015. The algorithm returns a single cluster for the threshold > 0.5 andputs each countries in separate clusters for the cases threshold ≤ 0.4.68Chapter 5ConclusionIn this thesis, we attempted to study the two prominent areas of networkscience: the evolution of the signed directed social network, e.g. Wikipedia’srequest for adminship (Wiki-RfA), etc. and to design a heuristic algorithmfor the Correlation Clustering Problems in the signed networks.Those works are presented in Chapter 3 and Chapter 4 respectively.5.1 Random Models for Signed Directed SocialNetworksIn Chapter 3 , to the best of our knowledge, we have studied (for the firsttime) the signed-directed-degree distributions in the real-world web-basedsigned directed social networks and proposed three random models: pref-erential attachment model, edge copying model, and clique copying model.Our analysis and simulation results suggest that the signed-directed degreedistributions in the networks simulated by the proposed models follow apower law with an exponent in the range 2.0 ≤ γ ≤ 3.5. For the cliquecopying model, we have proved that if the initial network is structurallybalanced, then the signed directed networks generated by this model is alsostructurally balanced.Future Works: We have presented theoretical proof for the power-lawsigned-directed degree distributions in the networks generated by prefer-ential attachment and edge copying models. Despite this theoretical jus-tification, we still need to prove that the number of vertices of degree dconcentrates on its expectation. For the clique copying model, one also re-quires a theoretical analysis for its power-law signed-directed distributions.Also, an empirical experiment is needed to justify for the balance networktheorem in this model.695.2. Heuristic Algorithm for Correlation Clustering Problems5.2 Heuristic Algorithm for CorrelationClustering ProblemsIn Chapter 4 , we have proposed a heuristic algorithm for the Corre-lation Clustering Problem which is a NP-hard problem. Our exper-imental results for random signed G(n, e, p) network instances have shownthat the runtime of this algorithm is independent of the case when e ≥ 0.4or when p ≤ 0.6. The limitation of this algorithm is that it can not give anyconclusive argument for the changing of the minimum disagreements due tothe variation of given threshold.Future Works: To improve the runtime performance of this algorithm wecan apply a data reduction technique to reduce the input graph size. Theprocess given in [BBK11] and [GHK+10] may lead us to this research. Again,after solving the closest ultrametric problem, we use simple rounding basedon the given threshold. We would also like to improve an efficient roundingtechnique to get a better result.70Bibliography[AAELvZ12] Nir Ailon, Noa Avigdor-Elgrabli, Edo Liberty, and Anke vanZuylen. Improved approximation algorithms for bipartite cor-relation clustering. SIAM Journal on Computing, 41(5):1110–1121, 2012. → pages 3[ACG+15] KookJin Ahn, Graham Cormode, Sudipto Guha, Andrew Mc-Gregor, and Anthony Wirth. Correlation clustering in datastreams. In Proceedings of the 32nd International Conferenceon Machine Learning, ICML, pages 6–11, 2015. → pages 56,57[ACL00] William Aiello, Fan Chung, and Linyuan Lu. A random graphmodel for massive graphs. In Proceedings of the thirty-secondannual ACM symposium on Theory of computing, pages 171–180. ACM, 2000. → pages 27[ACL01] William Aiello, Fan Chung, and Linyuan Lu. A randomgraph model for power law graphs. Experimental Mathematics,10(1):53–66, 2001. → pages 27[BA99] Albert-La´szlo´ Baraba´si and Re´ka Albert. Emergence of scalingin random networks. Science, 286(5439):509–512, 1999. →pages 2, 7, 8, 27[BBC04] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlationclustering. Machine Learning, 56(1-3):89–113, 2004. → pages3, 55, 56[BBK11] Sebastian Bo¨cker, Sebastian Briesemeister, and Gunnar W.Klau. Exact algorithms for cluster editing: Evaluation andexperiments. Algorithmica, 60(2):316–334, 2011. → pages 57,7071Bibliography[BE05] Ulrik Brandes and Thomas Erlebach. Network Analysis:Methodological Foundations, volume 3418. Springer Science,2005. → pages 4, 6[Bel58] Richard Bellman. On a routing problem. Quarterly of AppliedMathematics, 16(1):87–90, 1958. → pages 12[BK73] Coen Bron and Joep Kerbosch. Algorithm 457: finding allcliques of an undirected graph. Communications of the ACM,16(9):575–577, 1973. → pages 15[BR03] Be´la Bolloba´s and Oliver M. Riordan. Mathematical results onscale-free random graphs. Handbook of graphs and networks:from the genome to the internet, pages 1–34, 2003. → pages27[BS06] Stefan Bornholdt and Heinz Georg Schuster. Handbook ofgraphs and networks: from the genome to the internet. JohnWiley & Sons, 2006. → pages 4[CBC+15] V. Ciotti, G. Bianconi, A. Capocci, F. Colaiori, and P. Pan-zarasa. Degree correlations in signed social networks, 2015. →pages 2, 11, 27, 28[CBGV+12] Nicolo Cesa-Bianchi, Claudio Gentile, Fabio Vitale, GiovanniZappella, et al. A correlation clustering approach to link clas-sification in signed networks. In COLT, pages 34–1, 2012. →pages 57[CDK14] Flavio Chierichetti, Nilesh Dalvi, and Ravi Kumar. Corre-lation clustering in mapreduce. In Proceedings of the 20thACM SIGKDD international conference on Knowledge discov-ery and data mining, pages 641–650. ACM, 2014. → pages 3,57[CF03] Colin Cooper and Alan Frieze. A general model of web graphs.Random Structures & Algorithms, 22(3):311–335, 2003. →pages 8, 9, 27[CGW03] Moses Charikar, Venkatesan Guruswami, and Anthony Wirth.Clustering with qualitative information. In Foundations ofComputer Science, 2003. Proceedings. 44th Annual IEEESymposium on, pages 524–533. IEEE, 2003. → pages 3, 56, 5772Bibliography[CGW05] Moses Charikar, Venkatesan Guruswami, and Anthony Wirth.Clustering with qualitative information. Journal of Computerand System Sciences, 71(3):360–383, 2005. → pages 56[CH56] Dorwin Cartwright and Frank Harary. Structural balance:a generalization of heider’s theory. Psychological review,63(5):277, 1956. → pages 1, 2, 19[CH73] V. Chva´tal and P. Hammer. Set packing and threshold graphs,univ. Waterloo Res. Report, pages 73–21, 1973. → pages 17[Cor09] Thomas H. Cormen. Introduction to Algorithms. MIT press,2009. → pages 14[CSN09] Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ New-man. Power-law distributions in empirical data. SIAM review,51(4):661–703, 2009. → pages 21[CSW08] Tom Coleman, James Saunderson, and Anthony Wirth. Alocal-search 2-approximation for 2-correlation-clustering. InEuropean Symposium on Algorithms, pages 308–319. Springer,2008. → pages 56[CSX12] Yudong Chen, Sujay Sanghavi, and Huan Xu. Clusteringsparse graphs. In Advances in neural information processingsystems, pages 2204–2212, 2012. → pages 57[Dav77] James A. Davis. Clustering and structural balance in graphs.Social networks. A developing paradigm, pages 27–34, 1977. →pages 19, 20[DEFI06] Erik D. Demaine, Dotan Emanuel, Amos Fiat, and Nicole Im-morlica. Correlation clustering in general weighted graphs.Theoretical Computer Science, 361(2):172–187, 2006. → pages3, 57, 58[DESZ07] Bhaskar DasGupta, German Andres Enciso, Eduardo Sontag,and Yi Zhang. Algorithmic and complexity results for decom-positions of biological networks into monotone subsystems.Biosystems, 90(1):161–178, 2007. → pages 3, 57[Dij59] Edsger W. Dijkstra. A note on two problems in connexionwith graphs. Numerische Mathematik, 1(1):269–271, 1959. →pages 1273Bibliography[DM96] Patrick Doreian and Andrej Mrvar. A partitioning approachto structural balance. Social Networks, 18(2):149–168, 1996.→ pages 56[DMS00] Sergey N. Dorogovtsev, Jose´ Fernando F Mendes, and Alexan-der N Samukhin. Structure of growing networks with prefer-ential linking. Physical Review Letters, 85(21):4633, 2000. →pages 27[ER59] Paul Erdo˝s and Alfre´d Re´nyi. On random graphs. Publ. Math.Debrecen, 6:290–297, 1959. → pages 6, 27[ER60] Paul Erdo¨s and Alfre´d Re´nyi. On the evolution of randomgraphs. Publ. Math. Inst. Hung. Acad. Sci, 5(17-61):43, 1960.→ pages 27[ER61] Paul Erdo˝s and Alfre´d Re´nyi. On the strength of connected-ness of a random graph. Acta Mathematica Hungarica, 12(1-2):261–267, 1961. → pages 27[EZE85] M. El-Zahar and P. Erdo˝s. On the existence of two non-neighboring subgraphs in a graph. Combinatorica, 5(4):295–300, 1985. → pages 17[FF14] Rosa Figueiredo and Yuri Frota. The maximum bal-anced subgraph of a signed graph: Applications and solu-tion approaches. European Journal of Operational Research,236(2):473–487, 2014. → pages 3, 57[FFF99] Michalis Faloutsos, Petros Faloutsos, and Christos Faloutsos.On power-law relationships of the internet topology. In ACMSIGCOMM computer communication review, volume 29, pages251–262. ACM, 1999. → pages 2, 27[Flo62] Robert W. Floyd. Algorithm 97: shortest path. Communica-tions of the ACM, 5(6):345, 1962. → pages 12[FM13] Rosa Figueiredo and Gisele Moura. Mixed integer program-ming formulations for clustering problems related to structuralbalance. Social Networks, 35(4):639–651, 2013. → pages 57[FT87] Michael L. Fredman and Robert Endre Tarjan. Fibonacciheaps and their uses in improved network optimization algo-74Bibliographyrithms. Journal of the ACM (JACM), 34(3):596–615, 1987. →pages 13[Gao09] Yong Gao. The degree distribution of random k-trees. Theo-retical Computer Science, 410(8):688–695, 2009. → pages 10,11, 27[Gao14] Yong Gao. Community structure-lecture 07. Course Note,Computer Science, The University of British Columbia,Okanagan, 2014. → pages 16, 17[GG06] Ioannis Giotis and Venkatesan Guruswami. Correlation clus-tering with a fixed number of clusters. In Proceedings of theseventeenth annual ACM-SIAM symposium on Discrete algo-rithm, pages 1167–1176. Society for Industrial and AppliedMathematics, 2006. → pages 56[GHK+10] Jiong Guo, Sepp Hartung, Christian Komusiewicz, Rolf Nie-dermeier, and Johannes Uhlmann. Exact algorithms and ex-periments for hierarchical tree clustering. In Proceedings of the24th AAAI Conference on Artificial Intelligence (AAAI10),pages 457–462, 2010. → pages 70[GJ79] Michael R. Garey and David S. Johnson. A guide to the theoryof np-completeness. WH Freemann, New York, 1979. → pages15[GMT07] Aristides Gionis, Heikki Mannila, and Panayiotis Tsaparas.Clustering aggregation. ACM Transactions on Knowledge Dis-covery from Data (TKDD), 1(1):4, 2007. → pages 3, 57[Gol78] Martin Charles Golumbic. Trivially perfect graphs. DiscreteMathematics, 24(1):105–107, 1978. → pages 17[GW89] Martin Gro¨tschel and Yoshiko Wakabayashi. A cutting planealgorithm for a clustering problem. Mathematical Program-ming, 45(1-3):59–96, 1989. → pages 57, 59[Har59] Frank Harary. On the measurement of structural balance. Be-havioral Science, 4(4):316–323, 1959. → pages 19[HBN07] Falk Hu¨ffner, Nadja Betzler, and Rolf Niedermeier. Optimaledge deletions for signed graph balancing. In International75BibliographyWorkshop on Experimental and Efficient Algorithms, pages297–310. Springer, 2007. → pages 3, 57[Hei46] Fritz Heider. Attitudes and cognitive organization. The Jour-nal of Psychology, 21(1):107–112, 1946. → pages 1, 18, 20[HKM05] Boulos Harb, Sampath Kannan, and Andrew McGregor. Ap-proximating the best-fit tree under l p norms. In Approxi-mation, Randomization and Combinatorial Optimization. Al-gorithms and Techniques, pages 123–133. Springer, 2005. →pages 62[HLW02] Frank Harary, Meng-Hiot Lim, and Donald C Wunsch. Signedgraphs for portfolio analysis in risk management. IMA Journalof management mathematics, 13(3):201–210, 2002. → pages3, 57[HPK11] Jiawei Han, Jian Pei, and Micheline Kamber. Data mining:concepts and techniques. Elsevier, 2011. → pages 16[JGG+15] Songwei Jia, Lin Gao, Yong Gao, James Nastos, Yijie Wang,Xindong Zhang, and Haiyang Wang. Defining and identifyingcograph communities in complex networks. New Journal ofPhysics, 17(1):013044, 2015. → pages 18[Jor06] Jonathan Jordan. The degree sequences and spectra ofscale-free random graphs. Random Structures & Algorithms,29(2):226–242, 2006. → pages 27[Jun78] Heinz A. Jung. On a class of posets and the correspondingcomparability graphs. Journal of Combinatorial Theory, Se-ries B, 24(2):125–133, 1978. → pages 17[KM86] Mirko Krˇiva´nek and Jaroslav Mora´vek. Np-hard problems inhierarchical-tree clustering. Acta Informatica, 23(3):311–323,1986. → pages 18, 62[Krˇi88] Mirko Krˇiva´nek. The complexity of ultrametric partitions ongraphs. Information Processing Letters, 27(5):265–270, 1988.→ pages 62, 63, 64[KRR+00] Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan,D. Sivakumar, Andrew Tomkins, and Eli Upfal. Stochastic76Bibliographymodels for the web graph. In Foundations of Computer Sci-ence, 2000. Proceedings. 41st Annual Symposium on, pages57–65. IEEE, 2000. → pages 2, 9, 10, 27[Kru56] Joseph B Kruskal. On the shortest spanning subtree of a graphand the traveling salesman problem. Proceedings of the Amer-ican Mathematical society, 7(1):48–50, 1956. → pages 14, 64[KT06] Jon Kleinberg and Eva Tardos. Algorithm Design. PearsonEducation India, 2006. → pages 4[LHK10] Jure Leskovec, Daniel Huttenlocher, and Jon Kleinberg.Signed networks in social media. In Proceedings of the SIGCHIConference on Human Factors in Computing Systems, pages1361–1370, New York, NY, USA, 2010. ACM. → pages 1, 20[LWGC12] Yunlong Liu, Jianxin Wang, Jiong Guo, and Jianer Chen.Complexity and parameterized algorithms for cograph edit-ing. Theoretical Computer Science, 461:45–54, 2012. → pages18[Meg86] Nimrod Megiddo. On the complexity of linear programming.IBM Thomas J. Watson Research Division, 1986. → pages 64[Mil67] Stanley Milgram. The small world problem. Psychology Today,2(1):60–67, 1967. → pages 2, 27[MM65] John W. Moon and Leo Moser. On cliques in graphs. Israeljournal of Mathematics, 3(1):23–28, 1965. → pages 15[MMP12] Kevin T. Macon, Peter J. Mucha, and Mason A Porter. Com-munity structure in the united nations general assembly. Phys-ica A: Statistical Mechanics and its Applications, 391(1):343–361, 2012. → pages 3, 57[Mor34] Jacob Levy Moreno. Who shall survive?: A new approachto the problem of human interrelations. Nervous and MentalDisease Publishing Co, 1934. → pages 1[MWW89] S. Ma, WD. Wallis, and J. Wu. Optimization problems onquasi-threshold graphs. J. Comb. Inf. Syst. Sci, 14:105–110,1989. → pages 1777Bibliography[NG13] James Nastos and Yong Gao. Familial groups in social net-works. Social Networks, 35(3):439–450, 2013. → pages 18[Pri57] Robert Clay Prim. Shortest connection networks and somegeneralizations. Bell Labs Technical Journal, 36(6):1389–1401,1957. → pages 14[SGWN11] Ajay Sridharan, Yong Gao, Kui Wu, and James Nastos. Sta-tistical behavior of embeddedness and communities of over-lapping cliques in online social networks. In INFOCOM, 2011Proceedings IEEE, pages 546–550. IEEE, 2011. → pages 11,27[SMKS03] Roded Sharan, Adi Maron-Katz, and Ron Shamir. Click andexpander: a system for clustering and visualizing gene expres-sion data. Bioinformatics, 19(14):1787–1799, 2003. → pages16[Sum74] David P. Sumner. Dacey graphs. Journal of the AustralianMathematical Society, 18(04):492–502, 1974. → pages 17[TCAL16] Jiliang Tang, Yi Chang, Charu Aggarwal, and Huan Liu. Asurvey of signed network mining in social media. ACM Com-puting Surveys (CSUR), 49(3):42, 2016. → pages 1[WL93] Zhenyu Wu and Richard Leahy. An optimal graph theoreticapproach to data clustering: Theory and its application toimage segmentation. IEEE transactions on pattern analysisand machine intelligence, 15(11):1101–1113, 1993. → pages16[Wol65] ES. Wolk. A note on” the comparability graph of a tree”.Proceedings of the American Mathematical Society, 16(1):17–20, 1965. → pages 17[WPLP14] Robert West, Hristo S. Paskov, Jure Leskovec, and Christo-pher Potts. Exploiting social network structure for person-to-person sentiment analysis. arXiv preprint arXiv:1409.2450,2014. → pages 20[WS98] Duncan J. Watts and Steven H. Strogatz. Collective dynamicsof small-worldnetworks. Nature, 393(6684):440–442, 1998. →pages 2, 7, 2778Bibliography[YCC+96] Jing-Ho Yan, Jer-Jeong Chen, Gerard J. Chang, et al. Quasi-threshold graphs. Discrete Applied Mathematics, 69(3):247–255, 1996. → pages 17[YCL07] Bo Yang, William Cheung, and Jiming Liu. Community min-ing from signed social networks. IEEE transactions on knowl-edge and data engineering, 19(10):1333–1348, 2007. → pages5779

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0347257/manifest

Comment

Related Items