Mining Unstructured Social Streams: Cohesion, Context and Evolution

by

Pei Li

B.Eng., Huazhong University of Science & Technology, 2007
M.Eng., Renmin University of China, 2010

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in The Faculty of Graduate and Postdoctoral Studies (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

March 2017

© Pei Li 2017

Abstract

As social websites like Twitter greatly influence people's digital life, unstructured social streams have become prevalent: fast-surging streams of textual posts without formal structure or schema between posts or inside the post content. Modeling and mining unstructured social streams on Twitter is a challenging and fundamental problem in social web analysis, which leads to numerous applications, e.g., recommending social feeds like "what's happening right now?" or "what are related stories?". Current social stream analysis merely returns an overwhelming list of posts in response to queries, with little aggregation or semantics. The design of the next generation of social stream mining algorithms faces various challenges, especially the effective organization of meaningful information from noisy, unstructured, and streaming social content.

The goal of this dissertation is to address the most critical challenges in social stream mining using graph-based techniques. We model a social stream as a post network, and use "event" and "story" to capture groups of aggregated social posts presenting similar content at different granularities, where an event may contain a series of stories. We highlight our contributions to social stream mining from a structural perspective as follows. We first model a story as a quasi-clique, which is cohesion-persistent regardless of the story size, and propose two solutions, DIM and SUM, to search for the largest story containing given query posts, by deterministic and stochastic means, respectively. To detect all stories in the time window of a social stream and support context-aware story-telling, we propose CAST, which defines a story as a (k, d)-Core in the post network and tracks the relatedness between stories. We propose Incremental Cluster Evolution Tracking (ICET), an incremental computation framework for event evolution on evolving post networks, with the ability to track the evolution patterns of social events as time rolls on. The approaches in this dissertation are based on two hypotheses: users prefer correlated posts to individual posts in post stream modeling, and a structural approach is better than frequency/LDA-based approaches in event and story modeling. We verify these hypotheses by crowdsourcing-based user studies.

Preface

The modeling of social streams discussed in Chapter 3 is primarily based on our publication at KDD 2013 (KeySee [42], a system that supports keyword search on social events). The work on cohesion-persistent story search presented in Chapter 4 is based on our publication at SDM 2016 [41]. The context-aware story-teller (CAST) discussed in Chapter 5 is mainly based on our publication at CIKM 2014 [43], and partly based on our publication at ICDE 2012 [45]. The work on incremental cluster evolution tracking (ICET) presented in Chapter 6 is based on our paper at ICDE 2014 [44]. All of the above-mentioned publications [41–45] are collaborations with my supervisor, Prof. Laks V. S. Lakshmanan of the University of British Columbia, Canada. Publications [42–44] are also collaborations with Prof. Evangelos Milios of Dalhousie University, Canada.
Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
1 Introduction
  1.1 Unstructured Social Streams
    1.1.1 Concept and Modeling
    1.1.2 Mining Unstructured Social Streams
  1.2 Opportunities and Challenges
    1.2.1 Cohesion-Persistent Story Search
    1.2.2 Story Context Mining
    1.2.3 Event Evolution Tracking
    1.2.4 Evaluation of Mining Tasks
  1.3 Contributions and Research Plan
    1.3.1 Query-Driven Quasi-Clique Maximization
    1.3.2 Context-Aware Story-Telling
    1.3.3 Incremental Event Evolution Tracking
    1.3.4 Targeted Crowdsourcing
  1.4 Thesis Outline
2 Related Work
  2.1 Graph Mining
    2.1.1 Graph Clustering
    2.1.2 Community Detection and Search
    2.1.3 Dense Subgraph Mining
    2.1.4 Subgraph Relatedness Computation
  2.2 Social Media Analytics
    2.2.1 Frequency Based Peak Detection
    2.2.2 Entity Network Based Approaches
  2.3 Topic Detection and Tracking
3 Modeling Unstructured Social Streams
  3.1 Motivation
  3.2 Social Stream Preprocessing
  3.3 Post Network Construction
    3.3.1 Post Similarity Computation
    3.3.2 Post Network
    3.3.3 Linkage Search
  3.4 Stories and Events
    3.4.1 Stories and Events in Social Streams
    3.4.2 Stories and Events in Post Network
  3.5 Context and Evolution
  3.6 Comparing with Other Modeling Methods
  3.7 Discussion and Conclusion
4 Cohesion-Persistent Story Search
  4.1 Introduction
  4.2 Problem Overview
  4.3 Pre-solution on Core Tree
    4.3.1 Core Tree: Definition and Properties
    4.3.2 Search for Pre-Solution
  4.4 Query-Driven Quasi-Clique Maximization
    4.4.1 Objective and Operations
    4.4.2 Efficient Solution Search and Rank
    4.4.3 Iterative Maximization Algorithms
  4.5 Experimental Study
    4.5.1 Core Tree Construction
    4.5.2 Performance Evaluation
  4.6 Discussion and Conclusion
5 Context-Aware Story-Telling
  5.1 Introduction
  5.2 Story Vein
  5.3 Transient Story Discovery
    5.3.1 Defining a Story
    5.3.2 Story Formation
  5.4 Story Context Tracking
    5.4.1 Story Relatedness Dimensions
    5.4.2 Story Context Search
    5.4.3 Interpretation of Story Vein
  5.5 Experimental Study
    5.5.1 Tuning Post Network
    5.5.2 Quality Evaluation
    5.5.3 Performance Testing
  5.6 Discussion and Conclusion
6 Incremental Event Evolution Tracking
  6.1 Introduction
  6.2 Problem Formalization
  6.3 Incremental Tracking Framework
  6.4 Skeletal Graph Clustering
    6.4.1 Node Prioritization
    6.4.2 Skeletal Cluster Identification
  6.5 Incremental Cluster Evolution
    6.5.1 Fading Time Window
    6.5.2 Network Evolution Operations
    6.5.3 Skeletal Graph Evolution Algebra
    6.5.4 Incremental Cluster Evolution
  6.6 Incremental Algorithms
  6.7 Experiments
    6.7.1 Tuning Skeletal Graph
    6.7.2 Cluster Evolution Tracking
    6.7.3 Running Time of Evolution Tracking
  6.8 Discussion and Conclusion
7 Crowdsourcing-Based User Study
  7.1 Introduction
  7.2 Related Work
  7.3 Hypotheses in Social Stream Mining
  7.4 Quality Control for Crowdsourcing
  7.5 Experiments
  7.6 Discussion and Conclusion
8 Summary and Future Research
  8.1 Summary
  8.2 Future Research
Bibliography
List of Tables

3.1 Event evolution patterns
4.1 Notation
4.2 Three maximization operations
4.3 Running time for core tree construction
4.4 Aggregated quality scores for different methods from 100 tests
5.1 Major notations
5.2 The number of edges in the post network, and the number of stories and relatedness links in the story vein, as the temporal proximity function changes; a story is defined as a (5, 3)-Core
5.3 Top 20 stories detected by LDA from Tech-Full, described by high-frequency words; the top 100 posts of each topic are treated as a story
5.4 The accuracy of RCS as the number of simulations n grows; n is proportional to the neighboring post size of story S, and accuracy is measured based on DCS
6.1 Notation
6.2 Tuning post network
List of Figures

1.1 The illustration of major components in a social website
1.2 The similarity between posts in a time window of a social stream is captured by a post network; each story or event in the social stream can be modeled as a specifically defined subgraph of the post network
1.3 An illustration of an event and its included stories; this event is about "MH370 Search" and has a clear evolution history (emerge, grow and decay) in a time span of three months, and includes several related stories
4.1 The query S is marked by solid nodes, with λ = 0.5; a maximal quasi-clique and the query-driven maximum quasi-clique (QMQ) are circled by a dotted ellipse and a solid ellipse, respectively
4.2 Architecture of the efficient query-driven maximum quasi-clique search by DSG tree
4.3 A graph G with k-Cores identified recursively in (a) and its corresponding core tree in (b)
4.4 (a) A small graph and (b) its corresponding core tree, showing the deepest tree nodes containing graph nodes v1, ..., v5; G5 and G6 in (b) are annotated by dotted circles in (a)
4.5 (a) An example of QMQ maximization; (b) the inflection node on F(x), where solutions with rank x ≤ x* are preferred
4.6 The number of maximal cores first rapidly increases and then slowly decreases with the coreness k on three data sets
4.7 (a) Running time of 1000 pre-solution queries as the query set size increases; (b) portion of pre-solutions chosen from Gl, out of all pre-solutions, for different query set sizes; λ = 0.9
4.8 (a) Running time of Add and Add-MC for different solution sizes, with query node size |S| = 3; (b) ratio between the search spaces of Add and Add-MC, given the current solution, as query size increases; (c) average number of iterations for various methods on the LiveJournal data set; (d) average running time of different methods on LiveJournal as query size increases; λ = 0.9 by default
5.1 The workflow of StoryVein, with three major steps: (1) post network construction, (2) transient story discovery and (3) story context search
5.2 (a) A 3-Core without similarity witness between p1 and p2; (b) the generation of a (3, 1)-Core from a 3-Core
5.3 Top 10 results of HashtagPeaks, MentionPeaks and EntityPeaks on the Tech-Lite dataset; the ground truth for precision and recall is the top 10 major stories selected from mainstream technology news websites
5.4 Top 10 transient stories generated by our proposed story-teller on Tech-Lite; each story is represented as a (5, 3)-Core and rendered as an entity cloud for human reading; some related transient stories are linked by lines, and the curve on the bottom is the breakdown of tweet frequency for each day in January 2012
5.5 A fragment of the story vein tracked from CNN-News, spanning January 1 to June 1, 2014
5.6 An example illustrating our context-aware story-teller; each tag cloud is a single story identified from the post network, and sets of stories with higher relatedness are grouped together in rectangles to aid readability
5.7 (a) Running time of (d+1)-Core generation and (d+1, d)-Core generation (Zigzag, NodeFirst); (b) the number of connected components generated by (d+1)-Cores and (d+1, d)-Cores; (c) running time of different context search approaches; all experiments run on the Tech-Full dataset with the time window set to one week
6.1 The post network captures the correlation between posts in the time window at each moment, and evolves as time rolls on; the skeletal graph is shown in bold; from moment t to t+1, the incremental tracking framework maintains clusters and monitors the evolution patterns on the fly
6.2 (a) The commutative diagram between dynamic networks Gt, Gt+1 and cluster sets St, St+1; the "divide-and-conquer" baseline and our incremental tracking are annotated by dotted and solid lines, respectively; (b) the workflow of the incremental tracking module, which tracks cluster evolution dynamics by consuming only the updating subgraph ΔGt+1
6.3 The functional relationships between the different types of objects defined in this work, e.g., the arrow from Gt to its skeletal graph with label Ske; refer to Table 6.1 for notation
6.4 An illustration of the fading time window from time t to t+1, where post priority may fade w.r.t. the end of the time window; Gt is updated by deleting subgraph Gold and adding subgraph Gnew
6.5 (a) The relationships between primitives and evolutions; each box represents an evolution object and the arrows between them describe inputs/outputs; (b) the evolutionary behavior table for clusters when adding or deleting a core post p
6.6 The trends of the number of core posts, core edges and events when increasing δ from 0.3 to 0.8, with δ = ε = 0.3 as the 100% basis
6.7 Examples of Google Trends peaks in January 2012; we validate the events generated by cTrack by checking for volume peaks at a nearby time moment in Google Trends; although these peaks can detect bursty events, Google Trends cannot discover the merging/splitting patterns
6.8 Lists of the top 10 events detected from the Twitter Technology stream in January 2012 by the baselines HashtagPeaks, UnigramPeaks and Louvain, and by our incremental tracking approach eTrack
6.9 The merging and splitting of "SOPA" and "Apple"; at each moment, an event is annotated by a word cloud; Baselines 1 and 2 only detect newly emerging events and cannot track merging and splitting dynamics; the evolution trajectories of eTrack and Baseline 3 are depicted by solid and hollow arrows, respectively
6.10 The running time on two datasets as the time window length and step length are adjusted
7.1 Crowdsourcing task for Hypothesis 1
7.2 Crowdsourcing task for Hypothesis 2
7.3 The workflow of quality control steps
7.4 (a) A voting example with 3 workers and 3 questions, where each question has two options, A and B; (b) the computation of the normalized quality vector q on each iteration, where quality scores on iteration 0 are obtained from the qualification test; (c), (d) the distributions of quality weights among options for each question on iterations 1 and 15, respectively
7.5 Qualification test sample used before the real crowdsourcing tasks
7.6 Running EMQ to verify Hypotheses 1 and 2; in (b), the percentage of weighted votes on options D/E increases from 74.4% (before EMQ) to 83.4% (after EMQ); in (c), the percentage of weighted votes on option D increases from 56.7% (before EMQ) to 75.4% (after EMQ)
Chapter 1

Introduction

As social streaming websites like Twitter become popular and gradually dominate people's digital life, the current Web has entered the social age. The information propagation channel in the social web age can be viewed as post streams along the timeline, where each post is, for instance, a tweet on Twitter. We call these post streams unstructured social streams, since there is no formal structure or schema between these posts or inside a post. Modeling and mining unstructured social streams has become a fundamental problem in social web analysis, which leads to numerous applications [48, 55, 66, 77], e.g., answering "what's happening now?" on Twitter. In this chapter, we first provide an overview of the modeling of unstructured social streams. Then, we discuss the opportunities, challenges, and our contributions in various mining tasks over unstructured social streams.

1.1 Unstructured Social Streams

In this section, we first scope out the definition and position of unstructured social streams in the architecture of a social website, and then briefly introduce the modeling of a social stream as a post network. Following that, we discuss the major tasks and challenges in social stream mining, and explain why we adopt a structural approach rather than other social stream mining approaches, such as content- or frequency-based ones.

1.1.1 Concept and Modeling

Social Website Components. While a precise and widely recognized definition of social streams is still missing in the literature, we try to refine its scope in the context of a social website. Here, a social website generally refers to a website that supports the interaction among people in a social network, in which they create and share information.
Twitter and Weibo are two typical social websites, where users are connected in virtual online communities, and short textual posts are created and shared among them.

[Figure 1.1: The illustration of major components in a social website.]

In Figure 1.1, we decompose a social website into two major components: the post stream and interactions. Each of them is explained below.

• Post Stream. In social websites, posts are the atomic form of the content created by users and propagated across social networks. The length of each post is usually very short, e.g., a tweet has at most 140 characters. Post streams are also unstructured, since the majority of posts do not have any relationships between them, and there is no schema for the content of a post. In the most popular social websites, the volume of posts surges very quickly; e.g., Twitter reached a peak of 143,199 tweets per second on August 3, 2013 (https://blog.twitter.com/2013/new-tweets-per-second-record-and-how).

• Interactions. The interactions in a social website may happen between different types of objects. The typical interaction happens between a user and a post, e.g., a user creates a post. Interactions may also happen between two posts. For example, a "reply" is an interaction between two posts, and two posts without a "reply" relationship may still interact semantically if they tell the same story.
Unstructured Social Streams. Conceptually, we call the textual post streams on a social website, e.g., daily updates, social streams. We describe a social post by three types of information: author, text content and creation time. We define a social stream as a first-in-first-out queue of posts ordered by time of creation, in which each post is associated with its text content and a timestamp. An illustration of a social stream in a time window can be found in Figure 1.2. Mining social streams is highly challenging, with the difficulties originating from two important properties of social streams:

[Figure 1.2: The similarity between posts in a time window of a social stream is captured by a post network. Each story or event in the social stream can be modeled as a specifically defined subgraph of the post network.]

• Unstructured. The unstructured nature of social streams has two aspects. First, the text content inside a post is unstructured. Social posts such as tweets are usually written in an informal way with lots of grammatical errors; even worse, a correctly written post may have no significance and be just noise. The design of a processing strategy that can quickly judge what a post talks about is a challenging problem. Second, the relationships between posts are usually unstructured. Our statistics on the Twitter Tech-Full data set (described in Section 5.5) show that only 5% of tweets are retweets or replies, making the majority of tweets independent of each other, even though they may talk about similar events or stories on the social website. The effective organization of these unstructured posts in social streams is a very challenging problem.

• Streaming. The streaming nature of social streams also poses two challenges. First, in popular social websites, new posts emerge quickly every second; e.g., Twitter generates thousands of tweets per second.
The design of social stream mining algorithms should be efficient enough to handle the quick surge of new posts in a timely manner; e.g., it should use a single-scan algorithm with linear scalability. Second, since it is impossible and unnecessary to process all historical posts, old posts should naturally become outdated as time rolls on. The design of a time window with a proper time-decay effect is essential for handling the streaming data. This time window can be regarded as the window of observation.

Modeling Social Streams as Post Networks. In this thesis, we propose an innovative approach to quickly transform an unstructured social stream into a well-defined structure, called a post network. As illustrated in Figure 1.2, we monitor social streams using a sliding time window with length Len. At any moment t, all posts generated in the time window [max{t − Len, 0}, t] constitute the snapshot of the current social stream. We transform a social stream into a post network by the following rule: the post network at moment t is a graph Gt(Vt, Et), where each node p ∈ Vt is a post in the snapshot, and an edge (pi, pj) ∈ Et is constructed if the similarity S(pi, pj) between pi and pj is higher than a given threshold ε. As the time window moves forward, new posts flow in and old posts fade out, so Gt(Vt, Et) is dynamically updated at each moment, with new nodes/edges added and old nodes/edges removed. This transformation makes Gt(Vt, Et) an evolving network as time rolls on.

The key challenge in constructing the post network is how to compute S(pi, pj) effectively and efficiently. Traditional similarity measures such as TF-IDF based cosine similarity, the Jaccard coefficient and Pearson correlation [53] only consider the post content. However, timestamps should also play an important role in determining post similarity, since posts created close together in time are more likely to discuss the same story than posts created at very different moments. We therefore define the similarity between a pair of posts pi and pj by combining both content similarity and temporal proximity. The exact form of the similarity function will be discussed in Chapter 3. The similarity score S(pi, pj) falls into the interval [0, 1], and the similarity threshold ε is set empirically, where 0 < ε < 1.
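Since the exact similarity function is deferred to Chapter 3, the following minimal Python sketch only illustrates the construction rule above: it combines a bag-of-words cosine with an exponential temporal-proximity factor and thresholds the product at ε. The whitespace tokenizer, the exponential decay form and the scale parameter tau are placeholder assumptions for illustration, not the dissertation's actual choices.

    import math
    from collections import Counter
    from itertools import combinations

    def post_similarity(p, q, tau=3600.0):
        # Content part: cosine similarity over bag-of-words counts.
        a = Counter(p["text"].lower().split())
        b = Counter(q["text"].lower().split())
        dot = sum(c * b[w] for w, c in a.items())
        norm = (math.sqrt(sum(c * c for c in a.values()))
                * math.sqrt(sum(c * c for c in b.values())))
        content = dot / norm if norm > 0 else 0.0
        # Temporal part: proximity decays with the time gap (scale tau, assumed).
        proximity = math.exp(-abs(p["time"] - q["time"]) / tau)
        return content * proximity  # S(p, q) stays in [0, 1]

    def build_post_network(posts, eps=0.3):
        # Nodes are post indices; an edge exists iff S(p_i, p_j) > eps.
        adj = {i: set() for i in range(len(posts))}
        for i, j in combinations(range(len(posts)), 2):
            if post_similarity(posts[i], posts[j]) > eps:
                adj[i].add(j)
                adj[j].add(i)
        return adj

Each post here is a plain dict such as {"text": "crimea votes to join russia", "time": 1394064000}; a realistic implementation would at least use TF-IDF weights and an inverted index to avoid the quadratic pairwise loop.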
1.1.2 Mining Unstructured Social Streams

Major Tasks. People easily feel overwhelmed by the information deluge coming from social streams, which flow in from channels like Twitter, Facebook, LinkedIn, forums, blog websites and mailing lists. There is thus an urgent need for tools which can automatically extract and summarize significant information from highly dynamic social streams, e.g., report emerging bursty events, or track the evolution of one or more specific events in a given time span. Imagine that a user called Bob follows a few news media (e.g., the CNN Breaking News channel), which feed him a social stream consisting of thousands of tweets per day. Bob does not have time to digest these tweets one by one, but he wants to stay synced up with the new emerging stories diffused on these information channels. The major tasks of social stream mining can be viewed as answering the following typical queries on social streams:

• What's trending now? Every morning at breakfast, Bob wants to stay updated on the new events or stories emerging in his social streams. He got thousands of new tweets pushed from his followees during the night, and he does not have the time or interest to read so many short and informally written posts.

• Tell me related news. One morning in 2014, Bob is reading breaking news about the "Annexation of Crimea by Russia". Bob is unclear about the context of this story and wants to check more related stories.

• How're things going? Bob is interested in the event "Ukraine Crisis" and follows it for several months. Every week, new stories related to the Ukraine crisis happen. In January 2014, Bob is on vacation and does not follow this event any more. When he is back from his vacation, Bob wants to know how the Ukraine crisis event evolved during the past month.

Answering these key queries requires scoping out two questions: (1) how do we effectively organize the meaningful information in posts? (2) How do we capture the behavior of this meaningful information in social streams? For the first question, we define story and event as two structures that organize the meaningful information in posts. For the second question, we introduce cohesion, context and evolution to capture the evolution behaviors of events and stories. They are elaborated below.

Story and Event. Before developing intelligent algorithms to answer these queries, we need to define several basic concepts that support them. In this dissertation, story and event are the two basic concepts we use to aggregate and organize posts in social streams that carry similar meaningful information. In related work [55, 66, 67], there are many different definitions of story and event, and some studies even use the two terms interchangeably, without proper distinction. In the following, we clearly distinguish them and define story and event at the conceptual level.

• Story. A story is a set of posts that talk about very similar information in a short time span. Usually, the information carried by a story can be described by a few sentences or a small set of keywords. Since posts in social streams are usually very short, we consider that a post describes at most one story in most cases. However, a story can be described by many posts; e.g., the two posts "Crimea was annexed by the Russian Federation" and "Crimea voted on 6 March to formally accede as part of the Russian Federation" are talking about the same story, the annexation of Crimea by the Russian Federation in March 2014.

[Figure 1.3: An illustration of an event and its included stories. This event is about "MH370 Search" and has a clear evolution history, i.e., emerge, grow and decay, in a time span of three months. The event includes several stories which are related.]

• Event. An event is a post cluster which contains many highly related stories, where the relatedness between stories is measured by both content similarity and time closeness. Typically, an event has a relatively long time span and follows clear evolution patterns, e.g., emerging, growing and decaying. A typical event contains lots of details and cannot be described by a short post.
For example, "Ukrainian Crisis" is an event, which consists of a series of related stories, e.g., "Crimean Crisis", "War in Donbass", "Ukraine Presidential Elections", etc.

Since a social stream can be modeled as a post network, stories and events correspond to subgraphs in the post network. Figure 1.3 illustrates the relationship between an event and its included stories, where the stories are related to each other. Notice that in this section, stories and events are defined at the conceptual level; the detailed technical definitions of events and stories on the post network will be given in Chapter 3.

Cohesion, Context and Evolution. To capture the behaviors of events and stories, we briefly introduce story cohesion, story context and event evolution, as explained below. Their detailed studies appear in Chapters 4, 5 and 6.

• Story Cohesion. Given a set of posts which may correspond to a story, the cohesion of this post set measures the likelihood that the posts tell the same story. Since we can measure the similarity between any two posts, the cohesion of a post set can be computed by aggregating the pairwise post similarities inside the post set. For example, a common way to define cohesion is the ratio between the minimum degree of a post in the story and the story size. We only treat a post set with sufficiently high cohesion as a story. Obviously, cohesion is an intrinsic feature of a story.

• Story Context. In contrast to cohesion, story context is an extrinsic feature defined between stories. Story context tries to measure how strongly two given stories are related. Since we can compute the similarity between two posts in different stories, a natural way to measure the relatedness between two stories is to assess the post similarity between them, potentially normalized by the sizes of the stories.

• Event Evolution. An event usually consists of a series of stories over a relatively long time span (e.g., weeks or even months), with evolution patterns. Event evolution tries to capture the evolution path of an event; typical evolution patterns include emerge, grow, decay, disappear, etc.

Difficulties in Social Stream Mining. Social stream mining is a category of difficult data mining problems. The aim of social stream mining is to fulfill people's information-seeking needs on social streams. Typical social stream mining tasks include detecting new stories, searching for related stories and tracking the evolution of trending events. The design of social stream mining algorithms faces the following difficulties:

• Dynamics. The quick surge of social post streams makes the data update very frequently, so social stream mining algorithms should handle the dynamic nature of social streams effectively and efficiently.

• Quality. A large number of posts in social streams are noise or redundant. An effective social stream mining algorithm should be able to combat noise, condense redundant content, and aggregate the small pieces of information conveyed by individual short posts.

• Scalability. The huge number of posts generated by users every day raises major challenges for the scalability of social stream mining algorithms.
1.2 Opportunities and Challenges

Unstructured social streams are the primary data source for information exchange on social websites. They are noisy and surge quickly, with meaningful information hidden deep inside informally written text content. In this section, we discuss the opportunities and challenges in mining unstructured social streams. First, we define a story as a cohesion-persistent subgraph in the post network, which ensures that all social posts in the same story are highly related to each other. We then propose story context search to find the related stories of a given story, which allows us to build the relatedness between stories in social streams. An event is defined as a post cluster which contains many highly related stories. Tracking event evolution patterns is an interesting problem, and we discuss the challenges of performing event evolution tracking in social streams. These approaches are supported by two hypotheses: users prefer correlated results to individual results in social stream modeling, and a structural approach is better than frequency/LDA-based approaches in event and story modeling. We discuss the challenges in the user studies that verify these hypotheses.

1.2.1 Cohesion-Persistent Story Search

Since a post in a social stream can be modeled as a node and two posts with high similarity are connected by an edge, a social stream can be transformed into a post network. As a result, a story in a social stream corresponds to a connected subgraph of the post network. In the post network, we define a story as a connected subgraph Gi(Vi, Ei) with sufficiently high cohesion, where the cohesion C(Gi) is defined as the ratio between the minimum degree and the maximum possible degree in Gi. Writing deg(v, Gi) for the degree of node v in Gi, we have C(Gi) = min_{v ∈ Vi} deg(v, Gi) / (|Vi| − 1).

There are several reasons why we use cohesion to define a story. The root cause is that similar posts are connected in the post network. If two posts are similar in both text content and creation time, it is likely that they are telling the same story, and they will be connected by an edge in the post network. A story in a social stream is a subgraph of the post network; if the nodes in a subgraph are highly connected with each other, it is very likely that the posts corresponding to this subgraph tell the same story. We argue that cohesion is an effective way to guarantee that every post in a story is similar to the majority of posts in the same story, regardless of the story size.

Clearly, a story is a special kind of dense subgraph. In related work, there are many definitions of dense subgraphs, e.g., quasi-clique, k-Plex, k-Core [47] and k-Truss [34]. Writing N(p) for the neighbor set of node p ∈ Vi in Gi(Vi, Ei), these dense subgraphs are defined as follows (a brute-force checker for these conditions is sketched below):

• Quasi-clique: |N(p)| ≥ λ(|Vi| − 1) for every post p ∈ Vi, where 0 < λ ≤ 1;
• k-Plex: |N(p)| ≥ |Vi| − k for every post p ∈ Vi;
• k-Core: |N(p)| ≥ k for every post p ∈ Vi;
• k-Truss: |N(pi) ∩ N(pj)| ≥ k for every edge (pi, pj) ∈ Ei.

The cohesion of a k-Plex can be very low if its size is very close to k. The cohesion of a k-Core can also be very low if its size is much larger than k, and similarly for a k-Truss. Among these definitions, only the λ-quasi-clique keeps cohesion at least λ as the subgraph size changes. The intuition of a cohesion-persistent story is thus best captured by a λ-quasi-clique, as the required node degree grows with the number of nodes in a quasi-clique.
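To make these conditions concrete, here is a direct Python checker for the cohesion measure and for the quasi-clique, k-Core and k-Truss conditions on a candidate post set. It reuses the adjacency-map representation from the earlier sketch; the exhaustive set operations are for exposition only, not how Chapter 4 evaluates candidates.

    def cohesion(nodes, adj):
        # C(G_i) = (minimum degree inside the subgraph) / (|V_i| - 1).
        nodes = set(nodes)
        if len(nodes) < 2:
            return 1.0
        return min(len(adj[v] & nodes) for v in nodes) / (len(nodes) - 1)

    def is_quasi_clique(nodes, adj, lam):
        # lambda-quasi-clique: every node is adjacent to >= lam * (|V_i| - 1)
        # members, i.e., cohesion stays at least lam regardless of story size.
        return cohesion(nodes, adj) >= lam

    def is_k_core(nodes, adj, k):
        # k-Core: every node keeps at least k neighbors inside the subgraph.
        nodes = set(nodes)
        return all(len(adj[v] & nodes) >= k for v in nodes)

    def is_k_truss(nodes, adj, k):
        # k-Truss: the endpoints of every internal edge share >= k common
        # neighbors inside the subgraph (assumes comparable node ids).
        nodes = set(nodes)
        return all(len(adj[u] & adj[v] & nodes) >= k
                   for u in nodes for v in (adj[u] & nodes) if u < v)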
Given several posts already read by the user as input, finding the cohesion-persistent story containing these posts is called the cohesion-persistent story search problem. On the post network, the cohesion-persistent story search problem is in fact a query-driven quasi-clique search problem.

To the best of our knowledge, there are no existing studies on the query-driven quasi-clique search problem. Existing studies on quasi-clique maximization (without any query) are mainly based on local search [15, 33, 62], in which a solution moves to the best neighboring solution iteratively, updated node by node. To apply local search to the query-driven quasi-clique search problem, we face two challenges: (1) efficiently finding an initial solution, and (2) efficient iterative maximization approaches for searching for better neighboring solutions. In addition, existing local search methods are usually based on deterministic heuristics, which can easily trap the optimization process in a local maximum. Thus, we face the new challenge of developing randomized algorithms to find the largest query-driven quasi-cliques.

1.2.2 Story Context Mining

There are many previous studies [48, 55, 66, 77] on detecting newly emerging stories from social streams. Since the stories detected by these studies are not cohesion-persistent, we proposed the cohesion-persistent story in Section 1.2.1. However, all these story detection approaches only serve the need of answering "what's happening now?" in social streams, and are not able to find the relatedness between stories. In reality, stories usually do not happen in isolation, and the recommendation of related stories will greatly enhance the user's experience and improve participation. For example, "Crimea votes to join Russia" (on March 6, 2014) and "President Yanukovych signs compromise" (on February 21, 2014) are two separate stories, but they are actually highly related under the same event "Ukraine Crisis". Context-aware story-telling for streaming social content not merely detects trending stories in a given time window of observation, but also builds the "context" of each story by measuring its relatedness with other stories on the fly. As a result, context-aware story-telling has an advantage in answering advanced user queries like "tell me related stories", which is crucial for digesting large-volume social streams. Context-aware story-telling on social streams raises the following challenges:

• Identification of transient stories from the time window. Story detection should be robust to noisy posts and efficient enough to support single-pass tracking, which is essential in the streaming environment.

• Story context search on the fly. Story relatedness computation should be efficient, to find the related stories of a given story, and interpretable, to build a story graph that supports story-telling to users.

To the best of our knowledge, there is no publicly available training data set for context-aware story-telling on social streams, which makes the existing studies [59, 68] on Story Link Detection (SLD) inapplicable, because SLD is trained on well-written news articles. Furthermore, we cannot apply topic tracking techniques (e.g., [32]) to story context search, because topic tracking is usually formulated as a classification problem [4], under the assumption that topics are predefined before tracking, which is unrealistic for story context search on social streams. All these constraints make the mining of story contexts on social streams an extremely challenging problem.
1.2.3 Event Evolution Tracking

People easily feel overwhelmed by the information deluge coming from highly dynamic social streams. There is thus an urgent need to provide users with tools which can automatically extract and summarize significant information from social streams, e.g., report emerging bursty events, or track the evolution of one or more specific events in a given time span. In this thesis, an event is defined as a group of posts that contains many related stories. There are several previous studies [10, 26, 48, 55, 66, 67] on detecting newly emerging events from text streams, designed to answer simple queries like "what's trending now?". However, in many scenarios, users are dissatisfied with being shown only newly emerging events. Instead, users may want to know the evolution history of an event, and like to issue advanced queries like "how're things going?". The ideal answer to such a query would be a "panoramic view" of the event, which greatly improves the user experience. Here, the panoramic view of an event means the whole evolution life cycle of the event, including primitive operations like emerge, grow, decay and disappear, and composite operations like merge and split. Technically, we can model social streams as dynamically evolving post networks and model events as clusters in these networks, obtained by means of a clustering approach that is robust to the large amount of noise present in social streams. Accordingly, we consider the above kind of queries as an instance of the event evolution tracking problem, which aims to track the evolution patterns of events at each moment in such dynamic post networks.

In many scenarios, social streams are large-scale and evolve quickly. There are several major challenges in event evolution tracking:

• The first challenge is the effective design of an incremental computation framework for event evolution tracking. Traditional approaches (e.g., [38]) based on decomposing a dynamic network into snapshots and processing each snapshot independently from scratch are prohibitively expensive. An efficient single-pass incremental computation framework is essential for event evolution tracking over social streams that exhibit very large throughput rates. To our knowledge, the event evolution problem has not yet been studied.

• The second challenge is the formalization and tracking of event evolution operations under an incremental computation framework, as the network evolves. Most related work reports event activity by volume over the time dimension [48, 55]. While certainly useful, this is simply not capable of showing composite evolution behaviors, for instance how events split or merge.

• The third challenge is the handling of bulk updates. Since dynamic post networks may change rapidly, a node-by-node approach to incremental updating leads to poor performance. A subgraph-by-subgraph approach to incremental updating is critical for achieving good performance over very large, fast-evolving dynamic networks such as post networks. But this in turn brings the challenge of incremental cluster maintenance against bulk updates on dynamic networks.
1.2.4 Evaluation of Mining Tasks

The approaches proposed in this dissertation are based on two hypotheses: users prefer related results to individual results in social stream modeling, and a structural approach is better than frequency/LDA-based approaches in event and story modeling. A user study is an effective way to verify a hypothesis about user preferences and satisfaction. Traditional user studies involve a lot of human labor. With the rise of the Internet, crowdsourcing has recently become a popular mechanism behind user studies, and Amazon Mechanical Turk (MTurk) is the most popular crowdsourcing marketplace. As recorded in 2014 (http://docs.aws.amazon.com/AWSMechTurk/latest/RequesterUI/OverviewofMturk.html), it has over 500,000 workers from over 190 countries. Besides the normal MTurk workers, it is a well-known fact that there are many spammers and bots among Amazon workers.

In this user study, we perform hypothesis verification for social stream mining through crowdsourcing tasks on MTurk. Given that there are a large number of bots and spammers on MTurk, the most critical challenge is quality control in crowdsourcing. Existing techniques for quality control include majority voting, minimum time constraints, etc. However, none of them solves the "smart spammer" problem, in which a worker passes the qualification test but then performs like a spammer, simply to get the reward with minimal work. In particular, since these smart spammers are still qualified workers for crowdsourcing tasks, none of the existing approaches can detect them effectively. All these facts make worker quality control in crowdsourcing a very challenging problem.

1.3 Contributions and Research Plan

Section 1.2 explained the challenges in cohesion-persistent story search, story context mining, event evolution tracking and quality control in user studies. In this section, we briefly introduce the main ideas behind the solutions we propose to conquer these challenges, and summarize the contributions made in this dissertation.

1.3.1 Query-Driven Quasi-Clique Maximization

As discussed in Section 1.2.1, the cohesion of a k-Plex can be very low if its size is very close to k, and the cohesion of a k-Core or k-Truss can also be very low if its size is much larger than k. Thus, the λ-quasi-clique is the best definition for cohesion-persistent stories in the post network, since every post in a λ-quasi-clique is similar to at least a fraction λ (e.g., λ = 0.9) of the other posts. Recall that two posts are similar if they talk about similar content and are generated at close time moments. Given a query S which is a set of posts, the problem of finding the largest cohesion-persistent story that contains the query S is formalized as the query-driven maximum quasi-clique (QMQ) search, which aims to find the largest λ-quasi-clique containing S. The QMQ search problem is a new graph mining problem not studied before, and we prove it to be NP-Hard and inapproximable. To solve this problem, we propose the notion of a core tree to organize dense subgraphs recursively, which reduces the search space and effectively helps find a solution within a few tree traversals. To optimize a currently available solution into a better one, we introduce three maximization operations: Add, Remove and Swap (a simplified flavor of this local search is sketched below). We propose two iterative maximization algorithms, DIM and SUM, to approach QMQ by deterministic and stochastic means, respectively. With extensive experiments on real datasets, we demonstrate that our algorithms significantly outperform the state-of-the-art algorithms in running time and/or quality.

We make the following contributions:

• We define the problem of query-driven maximum quasi-clique search, a novel cohesive subgraph query not studied before, to solve the cohesion-persistent story search problem.

• We propose the core tree as a recursive representation of a graph, which helps quickly find a tentative solution to the QMQ search problem within a few tree traversals by greatly reducing the solution search space.

• We introduce Add, Remove and Swap to search for new solutions and efficiently optimize a tentative solution into a better neighboring solution. Building on this, we propose deterministic and stochastic iterative maximization algorithms for QMQ search: DIM and SUM.

• We perform an extensive experimental study on three real datasets, which demonstrates that our algorithms significantly outperform several baselines in running time and/or quality.

We will discuss the details of the QMQ search in Chapter 4.
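The actual DIM and SUM algorithms, with their core-tree pre-solutions and ranked Add/Remove/Swap operations, are developed in Chapter 4. Purely to convey the local-search flavor, the sketch below applies greedy Add steps only: it keeps absorbing a frontier node as long as the enlarged set remains a λ-quasi-clique containing the query. It reuses is_quasi_clique from the earlier sketch, and this naive loop can stall in exactly the local maxima that Remove, Swap and randomization are designed to escape.

    def grow_quasi_clique(query, adj, lam, seed=None):
        # Start from the query posts (optionally plus a pre-solution seed) and
        # greedily Add frontier nodes while the lambda-quasi-clique condition holds.
        solution = set(query) | set(seed or [])
        improved = True
        while improved:
            improved = False
            frontier = set().union(*(adj[v] for v in solution)) - solution
            for cand in frontier:
                if is_quasi_clique(solution | {cand}, adj, lam):
                    solution |= {cand}
                    improved = True
                    break  # rescan from the enlarged solution
        return solution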
1.3.2 Context-Aware Story-Telling

Mining the transient stories and their relatedness implicit in social streams is a challenging task, since these streams are noisy and surge quickly. To address the challenges discussed in Section 1.2.2, we propose CAST [43], a Context-Aware Story-Teller specifically designed for streaming social content. CAST takes a noisy social stream as input, and outputs a "story vein", a human-digestible and evolving summarization graph that links highly related stories together. More precisely, we model the social stream as a post network, and define stories by a new cohesive subgraph type called the (k, d)-Core in the post network, in which every node must have at least k neighbors and the two end nodes of every edge must have at least d common neighbors. The (k, d)-Core is more compact than the k-Core, and thus a better definition for stories. Unlike quasi-cliques, whose decision problem is NP-Hard, (k, d)-Cores can be detected in polynomial time (a naive peeling sketch is given below). We propose deterministic and randomized context search to support the iceberg query, which builds the relatedness between stories as social streams flow; we call the resulting relatedness graph between stories a story vein. We perform a detailed experimental study on real Twitter streams, and the results demonstrate the creativity and value of our approach.

The main contributions of CAST are summarized below:

• We define a new cohesive subgraph called the (k, d)-Core to represent transient stories, and propose two efficient algorithms, Zigzag and NodeFirst, to identify maximal (k, d)-Cores from the post network;

• Given a story, we propose deterministic and randomized context search to support the iceberg query for highly related stories, which builds the story vein on the fly;

• Our experimental study on real Twitter streams shows that the story vein can be digested and effectively helps build an expressive context-aware story-teller on streaming social content.

We will discuss the details of CAST in Chapter 5.
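Chapter 5's Zigzag and NodeFirst are the efficient algorithms; the naive fixpoint peeling below is only meant to make the (k, d)-Core definition operational. It repeatedly drops edges whose endpoints share fewer than d common neighbors and nodes with fewer than k remaining neighbors, until both conditions hold; the adjacency-map representation follows the earlier sketches, and comparable node ids are assumed.

    def kd_core(adj, k, d):
        # Peel the graph down to its maximal (k, d)-Core (possibly empty):
        # every surviving node has >= k neighbors, and the endpoints of every
        # surviving edge have >= d common neighbors, inside the result.
        g = {v: set(nbrs) for v, nbrs in adj.items()}
        changed = True
        while changed:
            changed = False
            for u in list(g):                        # edge condition first
                for v in list(g[u]):
                    if u < v and len(g[u] & g[v]) < d:
                        g[u].discard(v)
                        g[v].discard(u)
                        changed = True
            for v in [x for x in g if len(g[x]) < k]:  # then node condition
                for u in g[v]:
                    g[u].discard(v)
                del g[v]
                changed = True
        return g

Deleting a node can invalidate the common-neighbor condition of surviving edges (and vice versa), which is why the two peeling passes iterate to a fixpoint.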
1.3.3 Incremental Event Evolution Tracking

In this thesis, an event is a post cluster which may contain many related stories. Since an event usually has a relatively long life cycle (e.g., weeks or even months), tracking event evolution patterns in social streams can greatly help understand and summarize the event over time. To implement event evolution tracking, we propose Incremental Cluster Evolution Tracking (ICET, [44]), which focuses on tracking the evolution patterns of clusters in highly dynamic networks. There are several previous works on data stream clustering that use a node-by-node approach to maintaining clusters. However, the handling of bulk updates, i.e., a subgraph at a time, is critical for achieving acceptable performance over very large, highly dynamic networks. We therefore propose a subgraph-by-subgraph incremental tracking framework for cluster evolution. To effectively illustrate the techniques in our framework, we consider the event evolution tracking task in social streams as an application, where a social stream and an event are modeled as a dynamic post network and a dynamic cluster, respectively. Monitoring through a fading time window, we introduce a skeletal graph to summarize the information in the dynamic network, and formalize cluster evolution patterns using a group of primitive evolution operations and their algebra (a coarse snapshot-based illustration is sketched at the end of this subsection). Two incremental computation algorithms are developed to maintain clusters and track evolution patterns as time rolls on and the network evolves. Our detailed experimental evaluation on large Twitter datasets demonstrates that our framework can effectively track the complete set of cluster evolution patterns in highly dynamic networks on the fly.

In summary, the problem we study is captured by the following question: how can we incrementally and efficiently track the evolution behaviors of clusters in large-scale networks which are noisy and highly dynamic? Our main contributions are the following:

• We propose an incremental computation framework for cluster evolution on highly dynamic networks;

• We filter out noise by introducing a skeletal graph, based on which we define a group of primitive evolution operations for nodes and clusters, and introduce their algebra for incremental tracking;

• We leverage incremental computation by proposing two algorithms based on bulk updating: ICM for incremental cluster maintenance and eTrack for cluster evolution tracking;

• Our application to event evolution tracking in large Twitter streams demonstrates that our framework can effectively track all kinds of cluster evolution patterns in highly dynamic networks in real time.

The details of ICET will be discussed in Chapter 6.
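ICET itself works incrementally, consuming only the updating subgraph ΔG rather than full snapshots. As a coarse, snapshot-based illustration of what the evolution patterns mean, the sketch below matches the cluster sets of two consecutive moments by node overlap and labels emerge, grow, decay, disappear, merge and split. The overlap threshold and the labeling rules are illustrative simplifications, not ICET's evolution algebra.

    def label_evolution(prev_clusters, curr_clusters, share=0.3):
        # Clusters are sets of post ids; a cluster at time t is a "parent" of a
        # cluster at t+1 if it contributes at least `share` of the child's posts.
        patterns = []
        for c in curr_clusters:
            parents = [p for p in prev_clusters if len(p & c) >= share * len(c)]
            if not parents:
                patterns.append(("emerge", c))
            elif len(parents) > 1:
                patterns.append(("merge", parents, c))
            elif len(c) >= len(parents[0]):
                patterns.append(("grow", parents[0], c))
            else:
                patterns.append(("decay", parents[0], c))
        for p in prev_clusters:
            children = [c for c in curr_clusters if len(p & c) >= share * len(p)]
            if not children:
                patterns.append(("disappear", p))
            elif len(children) > 1:
                patterns.append(("split", p, children))
        return patterns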
1.3.4 Targeted Crowdsourcing

The approaches proposed in this dissertation are based on two hypotheses: users prefer correlated posts to individual posts in social stream modeling, and a structural approach is better than frequency/LDA-based approaches in event and story modeling. We use crowdsourcing-based user studies to verify these two hypotheses. To ensure that the crowdsourcing-based user studies have high quality, we use multiple techniques to control the quality of workers, as explained below (an EM-style sketch of the final step follows the list).

• In the beginning, we set a Minimum Time Constraint and an Approval Rate Constraint to filter out the MTurk workers who provide answers within a very short time or have a very low historical approval rate. Most bots and spammers are removed in this step.

• For the remaining workers, we perform a qualification test, which is a series of questions with known answers. These qualification questions are treated as the gold standard, and a worker's qualification is measured by the ratio of questions answered correctly. If the ratio is higher than a predefined threshold, the worker is treated as a qualified worker.

• All qualified workers then submit their work on the real crowdsourcing tasks. Since there are no known answers for these tasks, the quality of workers is measured by cross-comparison with peers in an iterative way, which is captured by Expectation-Maximization with Qualification (EMQ). EMQ is capable of measuring a user's quality in crowdsourcing and punishing smart spammers among the qualified workers by assigning low quality scores to them. By interpreting these probabilities as users' quality scores, EMQ achieves better performance than other competing approaches.

The details of targeted crowdsourcing will be explained in Chapter 7.
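Chapter 7 defines the actual EMQ updates; the sketch below is only a generic EM-style reweighting in the same spirit: option weights on each question are the quality-weighted votes, and a worker's quality becomes her average agreement with that weighted consensus, initialized from the qualification test. It assumes every worker answered at least one question, and the variable names are illustrative.

    def emq(votes, init_quality, iters=15):
        # votes[w][q] = option chosen by worker w on question q;
        # init_quality[w] = score from the qualification test (iteration 0).
        quality = dict(init_quality)
        for _ in range(iters):
            # E-step: weight each option of each question by its supporters' quality.
            consensus = {}
            for w, answers in votes.items():
                for q, opt in answers.items():
                    consensus.setdefault(q, {}).setdefault(opt, 0.0)
                    consensus[q][opt] += quality[w]
            # M-step: a worker's new quality is her mean share of the weight that
            # landed on the options she picked; workers who consistently disagree
            # with the weighted majority (smart spammers) drift toward low scores.
            for w, answers in votes.items():
                shares = [consensus[q][opt] / sum(consensus[q].values())
                          for q, opt in answers.items()]
                quality[w] = sum(shares) / len(shares)
        return quality

For example, emq({"w1": {"q1": "A", "q2": "B"}, "w2": {"q1": "A", "q2": "A"}, "w3": {"q1": "A", "q2": "A"}}, {"w1": 0.9, "w2": 0.8, "w3": 0.8}) pulls w1's score down as the weighted consensus on q2 settles on option A.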
DStream [18] uses an online component, which maps each input data record into a grid, and an offline component, which generates grid clusters based on density. Another related work is by Kim et al. [38], which first clusters individual snapshots into quasi-cliques and then maps them over time by looking at the density of bipartite graphs between quasi-cliques in adjacent snapshots. Although [38] can handle the birth/growth/decay/death of clusters, it is not incremental, and split and merge patterns are not supported. In contrast, our event tracking approach on dynamic post networks is incremental and is able to track composite behaviors like merging and splitting.

2.1.2 Community Detection and Search

Communities in graphs can be defined from global or local perspectives [25]. A community in the global sense is a dense subgraph with very few ties to the outside of the subgraph, measured by "modularity", a function which evaluates the goodness of a graph partitioning. The Louvain method [14], based on modularity optimization, is the state-of-the-art community detection approach and outperforms others. However, the Louvain method is not robust to noise, such as meaningless posts like "Good night :)" in social streams. Local definitions of communities focus on the subgraph under study while neglecting the rest of the graph. A clique is a very strict definition of community, in which every member is a friend of every other member. However, finding cliques in a graph is an NP-complete problem. It is possible to relax the notion of clique by defining a community as a clique-like structure, as we discuss further in the related work on dense subgraph mining.

On the application level, our query-driven quasi-clique search shares some intuitions with community search ([19, 71]), which finds the communities containing a querying set of people. However, since the definitions of communities in [71] and [19] are very different from a quasi-clique, a comparison between them and our query-driven quasi-clique search is not applicable.

2.1.3 Dense Subgraph Mining

Typical dense subgraphs studied in the literature include the densest subgraph [73], clique, k-Plex, quasi-clique and k-Core [47]. The densest subgraph is a subgraph that maximizes the average degree. It can be found in polynomial time by solving a parametric maximum-flow problem [73], while finding the densest subgraph with a fixed size is known to be NP-Hard [7]. The maximum clique problem is a well-known NP-Hard problem. In real applications, the clique definition is too strict, making it unlikely for large cliques to exist in practical graphs; e.g., there may be a large subgraph where most, but not all, node pairs are adjacent. This has motivated relaxations of cliques, popular examples of which include k-Plex, quasi-clique, and k-Core. Since the degree constraints of k-Plex and quasi-clique are correlated with the size |Vi|, the NP-Hardness of the maximum clique problem carries over to k-Plexes and quasi-cliques [8]. In contrast, k-Core generation takes polynomial time [47]. However, if the size of a k-Core is distinctly larger than k, the average degree of the k-Core may be very low, and in this case the k-Core is not compact enough to describe a story. In this dissertation, we define a new dense subgraph called (k, d)-Core to overcome these challenges.

Maximal/Maximum Quasi-clique Detection.
By definition, a maximal quasi-clique is a quasi-clique that is not a subgraph of any other quasi-clique in the given graph. The maximum quasi-clique is the quasi-clique with the largest number of nodes among all quasi-cliques in the given graph. The size of a maximal quasi-clique may be much smaller than the size of the maximum quasi-clique.

A negative breakthrough result by Arora et al. [6], together with results of Feige et al. [23] and, more recently, Hastad [31], shows that no polynomial time algorithm can approximate the maximum clique problem within a factor of n^{1−ε} (ε > 0), unless P = NP. Thus, it is very unlikely that general heuristic algorithms can provide results with guaranteed optimality for the maximum clique problem. Related prior work on maximal/maximum quasi-clique detection is typically based on local search ([1, 15, 33, 50, 61, 62]). In particular, Abello et al. [1] developed efficient semi-external memory algorithms for GRASP [64] to extract maximal quasi-cliques. Brunato et al. [15] extended two existing stochastic local search algorithms for the classical maximum clique problem to the maximal quasi-clique problem. Pattillo et al. [62] established several fundamental properties of the maximum quasi-clique problem, but their quasi-clique is defined using edge density: D(Gi) = 2|Ei| / (|Vi|(|Vi| − 1)) ≥ λ. Based on depth-first search, Liu et al. [50] proposed an efficient algorithm called Quick to find maximal quasi-cliques using several pruning techniques, and Uno et al. [74] proposed a reverse search method to enumerate all quasi-cliques. However, none of them studied the query-driven quasi-clique problem as defined in Chapter 4. To the best of our knowledge, this thesis is the first study of the query-driven quasi-clique search problem.

Notice that there is another definition of quasi-clique based on edge density, D(Gi) = 2|Ei| / (|Vi|(|Vi| − 1)) ≥ λ, studied in [1, 61, 74]. We do not adopt this definition because it has the potential to introduce undesired low-degree nodes into quasi-cliques: a quasi-clique with edge density D(Gi) ≥ λ cannot prevent the occurrence of a node in Gi with very low degree. In contrast, we use the definition that every node in a quasi-clique should have degree at least λ · (|Vi| − 1). It is easy to show that a λ-quasi-clique under our node degree definition is at least a λ-quasi-clique under the edge density definition, but the converse does not hold. Thus, our definition is stronger than the edge density-based definition. Besides, Tsourakakis et al. [73] proposed a new dense subgraph called the optimal quasi-clique, and defined the constrained optimal quasi-clique problem of finding the optimal quasi-clique containing a given node. However, their optimal quasi-clique (Problem 2 in [73]) is defined as the subgraph Gi(Vi, Ei) that maximizes |Ei| − λ · |Vi|(|Vi| − 1)/2, which is a fundamentally different problem from popular definitions of quasi-cliques based on edge density or node degree.

2.1.4 Subgraph Relatedness Computation

In this thesis, we model a story as a dense subgraph of posts, and the relatedness between stories can be measured by the relatedness between subgraphs. The relatedness between nodes in a graph is a well-studied problem, with popular algorithms such as HITS, Katz and Personalized PageRank [11]. For the relatedness between dense subgraphs, traditional measures like the Jaccard coefficient, cosine similarity and Pearson's correlation [53] are not effective if the dense subgraphs have no overlapping nodes.
Recently, a propagation and aggregation process was used to simulate the information flow between nodes, which was studied in the context of top-k structural similarity search in [45] and authority ranking in [49].

2.2 Social Media Analytics

In related work, there are two main research directions in social media analysis. The first direction is frequency-based approaches, which treat each social post as a statistical unit and use histogram-based analysis to mine patterns in social media, e.g., bursty events. The second direction is entity-based approaches, which extract entities from each social post and build a network of entities from social media for event identification. They are introduced separately below.

2.2.1 Frequency Based Peak Detection

Most previous works detect events by discovering topic bursts from a document stream. Their major techniques either detect frequency peaks of event-indicating phrases over time in a histogram, or monitor the formation of a cluster from a structural perspective. A feature-pivot clustering is proposed by Fung et al. [26] to detect bursty events from text streams. Sarma et al. [67] design efficient algorithms to discover events from a large graph of dynamic relationships. Weng et al. [77] build signals for individual words and apply wavelet analysis on word frequencies to detect events from Twitter. A framework for tracking short, distinctive phrases (called "memes") that travel relatively intact through on-line text was developed by Leskovec et al. [48]. Twitinfo [55] represents an event it discovers from Twitter by a timeline of related tweets. Marcus et al. [54] present TweeQL, a streaming SQL-like interface to the Twitter API, making common tweet processing tasks simpler. Sakaki et al. [66] investigated the real-time interaction of events such as earthquakes in Twitter and proposed an algorithm to monitor tweets and detect a target event based on classifiers.

2.2.2 Entity Network Based Approaches

Recently, Agarwal et al. [2] discover events that are unraveling in microblog streams by modeling events as correlated keyword graphs. Angel et al. [5] study the maintenance of dense subgraphs with size smaller than a threshold (e.g., 5) under streaming edge weight updates. Both [2] and [5] model the social stream as an evolving entity graph, but suffer from certain drawbacks: (1) many posts are ignored in the entity recognition phase; (2) post attributes like time and author cannot be integrated; (3) they cannot handle subgraph-by-subgraph bulk updates, which are key to efficiency. These drawbacks are addressed by the post network defined in this dissertation.

2.3 Topic Detection and Tracking

Topic detection and tracking is an extensively studied field [51], with the most common approaches based on Latent Dirichlet Allocation (LDA) [13]. Techniques for topic detection and tracking cannot be applied to story relatedness tracking, because they are usually formulated as a classification problem [4] under the assumption that topics are predefined before tracking, which is unrealistic for social streams. Recent works (e.g., [32]) suffer from this problem. Besides, the lack of training data for story relatedness tracking on noisy social streams renders the existing works [59, 68] on Story Link Detection (SLD) inapplicable, because SLD is trained on well-written news articles. Jin et al.
[37] present Topic Initiator Detection (TID) to automatically find which web document initiated a topic on the Web. In text streams, Hierarchical Dirichlet Processes (HDP, [27]) have been proposed to track and connect topics incrementally. Since HDP is computed from the document-word matrix, it is difficult to integrate HDP-based approaches with other signals, e.g., time stamps, GPS, authors, etc.

There is less work on evolution tracking. A framework for tracking short, distinctive phrases (called "memes") that travel relatively intact through on-line text was developed in [48]. The evolution of communities in dynamic social networks is tracked in [79]. However, these existing works cannot track composite evolution patterns of communities, e.g., the merging or splitting of communities. Moreover, all existing works have to re-compute communities from each network snapshot, which is time-consuming and results in a lot of redundant computation. Unlike them, we focus on the incremental tracking of cluster evolution patterns in highly dynamic networks, where we maintain each cluster by gradually adding or removing nodes.

Chapter 3

Modeling Unstructured Social Streams

This chapter focuses on the modeling of social streams, which is the first step in social stream mining. We introduce the social stream preprocessing techniques in Section 3.2. The construction of the post network from a social stream is discussed in Section 3.3. We define stories and events, from both the social stream perspective and the post network perspective, in Section 3.4. We provide a comparison between the structural modeling used in this dissertation and other modeling methods in Section 3.6.

3.1 Motivation

Social websites like Twitter have a great impact on many people's digital lives. As social content streams fast, it can easily lead to "information anxiety", the gap between the information we receive and the information we are able to digest [40]. The current generation of information-seeking on social media works just like traditional search on web pages: users input several keywords, and the output is a long list of tweets or posts containing the keywords, ranked by time freshness. For instance, Twitter Search (https://twitter.com/search-home) returns a huge list of posts for a given keyword query, with little aggregation or semantics, and leaves it to the users to sift through the large collection of results to figure out the very small portion of useful information. Since a post like a tweet only contains a small piece of information, users are required to manually aggregate and digest search results, which is time-consuming and painful. The noisy and redundant nature of social streams degrades the user experience further.

On the other hand, since a post like a tweet only conveys a very small piece of information, it would be ideal if we could group the posts talking about the same information together. In this dissertation, we define "story" and "event" as two kinds of post structures that organize posts telling similar information together. We aim to build the next generation of social stream mining technologies, which provide users an organized and summarized view of what's happening in the social world.
Instead of showing users a long list of posts, our new social stream mining technologies present users with well-organized stories and events.

3.2 Social Stream Preprocessing

In a social media service like Twitter, new posts emerge every second and old posts naturally become outdated as time rolls on. Posts in social streams, such as tweets, are usually written in an informal way. To design a processing strategy that can quickly and robustly extract the meaningful information of a post, we focus on the entity words contained in a post, since entities depict the topic. For example, given the tweet "iPad 3 battery pointing to thinner, lighter tablet?", the entities are "iPad", "battery" and "tablet". However, traditional Named Entity Recognition tools [21] only support a narrow range of entities like locations, persons and organizations; [5] reported that only about 5% of tweets have more than one named entity. NLP parser based approaches [39] are not appropriate due to the informal writing style of posts and the need for high processing speed. To broaden the applicability, we treat each noun in the post text as a candidate entity. Technically, we obtain nouns from a post text using a Part-Of-Speech Tagger (http://nlp.stanford.edu/software/tagger.shtml), and if a noun is plural (POS tag "NNS" or "NNPS"), we obtain its singular form. In practice, we find this preprocessing technique to be robust and efficient. In the Twitter dataset we used in experiments (see Section 6.7), each tweet contains 4.9 entities on average. We describe a post p as a triple (L, τ, u), where pL is the list of entities, pτ is the time stamp, and pu is the author. We formally define a post and a social stream as follows; a preprocessing sketch follows the definitions.

Definition 1 (Post). A post p is a triple (L, τ, u), where L is the list of entities in the post, τ is the time stamp when the post is generated, and u is the user who created it.

We let pL denote L in the post p for simplicity, and analogously for pτ and pu. We use |pL| to denote the number of entities in p.

Definition 2 (Social Stream). A social stream is a first-in-first-out queue of posts ordered by time of arrival, in which each post p is represented as a triple (L, τ, u) as defined in Definition 1.
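As an illustration of this preprocessing step, the following is a minimal sketch of noun-based entity extraction. NLTK is used here as a convenient stand-in for the Stanford POS Tagger cited above, and the plural-to-singular step is a naive suffix rule standing in for proper lemmatization; both substitutions are assumptions for illustration only.

```python
# A minimal sketch of the preprocessing step: extract candidate entities
# (nouns) from a tweet. NLTK stands in for the Stanford POS Tagger here,
# and the plural handling is a naive placeholder for a real lemmatizer.
import nltk  # assumes the 'punkt' and perceptron-tagger models are installed

def extract_entities(text):
    tokens = nltk.word_tokenize(text)
    entities = []
    for word, tag in nltk.pos_tag(tokens):
        if tag in ("NN", "NNP"):          # singular nouns kept as-is
            entities.append(word.lower())
        elif tag in ("NNS", "NNPS"):      # plural nouns -> naive singular form
            entities.append(word.lower().rstrip("s"))
    return entities

# e.g., extract_entities("iPad 3 battery pointing to thinner, lighter tablet?")
# -> ['ipad', 'battery', 'tablet'] (up to tagger differences)
```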
3.3 Post Network Construction

Our modeling method for social streams in this dissertation is based on constructing a network of posts and maintaining the network over a moving time window, as posts stream in and fade out. This network is used for subsequent analysis. Essentially, our modeling method is based on the hypothesis that the correlation between posts should be considered, and that grouped posts provide a better user experience than individual posts. In this section, we describe the construction of the post network based on post correlations.

3.3.1 Post Similarity Computation

Our initial data analysis shows that post topics do not have a strong correlation with authors on Twitter. In other words, a typical Twitter user may create posts with different topics at different moments. Fortunately, we found that post topics are highly correlated with entities and time: posts talking about the same topic typically have similar entities and very close time stamps.

Traditional similarity measures such as TF-IDF based cosine similarity, the Jaccard coefficient and Pearson correlation [53] only consider the post content. However, time stamps should clearly play an important role in determining post similarity, since posts created closer together in time are more likely to discuss the same event. We introduce the notion of fading similarity to capture both content similarity and time proximity. For example, with the Jaccard coefficient as the underlying content similarity measure, the fading similarity is defined as

    SF(pi, pj) = |pLi ∩ pLj| / (|pLi ∪ pLj| · e^{|pτi − pτj|})    (3.1)

We use an exponential function to incorporate the decaying effect of the time lapse between the posts. The unit of time difference is ∆t, typically in hours. It is easy to see that 0 ≤ SF(pi, pj) ≤ 1 and that SF(pi, pj) is symmetric. Fading similarity thus scores the similarity between two posts by a value between 0 and 1, assessed from both post content and time: if two posts share many common entities and their posting times are very close, they are similar.
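Eq. (3.1) translates directly into code. Below is a small sketch, assuming posts are (L, τ, u) triples per Definition 1 with τ expressed in the time unit ∆t (hours); the sample posts are made-up values for illustration.

```python
# A direct sketch of Eq. (3.1): Jaccard content similarity discounted by an
# exponential in the time gap. Posts follow Definition 1 as (L, tau, u)
# triples; tau is assumed to be measured in hours (the unit delta-t above).
import math

def fading_similarity(p1, p2):
    L1, tau1, _ = p1
    L2, tau2, _ = p2
    L1, L2 = set(L1), set(L2)
    union = len(L1 | L2)
    if union == 0:
        return 0.0
    return len(L1 & L2) / (union * math.exp(abs(tau1 - tau2)))

p1 = (["ipad", "battery", "tablet"], 0.0, "alice")
p2 = (["ipad", "tablet", "apple"], 0.5, "bob")
print(fading_similarity(p1, p2))  # 2 / (4 * e^0.5) ~ 0.30, always in [0, 1]
```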
3.3.2 Post Network

To find the correlations between posts, we build a post network G(V,E) based on the following rule: if the fading similarity between two posts (pi, pj) is higher than a given threshold λ, we create an edge e(pi, pj) between them and set the edge similarity s(pi, pj) = SF(pi, pj). Obviously, a lower λ retains more semantic similarities but results in much higher computation cost; we set λ = 0.3 empirically on Twitter streams to balance edge sparsity against information richness. Consider a time window of observation and the post network at its beginning. As we move forward in time and new posts appear and old posts fade out, G(V,E) is dynamically updated at each moment, with new nodes/edges added and old nodes/edges removed. On the scale of Twitter streams, with millions of tweets per hour, G(V,E) is truly a large and fast-changing dynamic network. The formal definition of the post network is given below.

Definition 3 (Post Network). Given two posts pi, pj in a social stream Q and a threshold ε (0 < ε < 1), there is an edge between pi and pj if the post similarity s(pi, pj) ≥ ε. The post network corresponding to Q is denoted as G(V,E), where each node p ∈ V is a post in Q, and each edge (pi, pj) ∈ E is constructed if the similarity s(pi, pj) ≥ ε.

Intuitively, an edge in the post network connects two posts if they are similar enough. The post network can be viewed as a structural representation of the original unstructured social stream, organizing meaningful information out of noisy buzz. In particular, posts with very few edges can essentially be treated as noise and ignored.

3.3.3 Linkage Search

Removing a node and its associated edges from G(V,E) is an easy operation. In contrast, when a new post pi appears, it is impractical to compare pi with each node pj in Vt to verify whether SF(pi, pj) > λ, since the number of nodes |Vt| can easily grow to millions. To solve this problem, we first construct a post-entity bipartite graph, and then perform a two-step random walk process to get the hitting counts. The main idea of linkage search is to let a random surfer start from post node pi and walk to any entity node in pLi on the first step, and then walk back to posts other than pi on the second step. All the posts visited on the second step form the candidate set of pi's neighbors. Supposing the average number of entities in each post is d1 and the average number of posts mentioning each entity is d2, linkage search can find the neighbor set of a given post in time O(d1d2). In our Twitter dataset, d1 and d2 are usually below 10, which supports the construction of a post network on the fly.
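To make the mechanics concrete, here is a minimal sketch of linkage search. An entity-to-posts inverted index plays the role of the post-entity bipartite graph, the two-step walk amounts to collecting the posts reachable through shared entities, and candidates are then verified against the threshold λ. It reuses fading_similarity from the sketch above; the class and method names are hypothetical.

```python
# A sketch of linkage search: an inverted index from entities to posts plays
# the role of the post-entity bipartite graph, and the two-step walk amounts
# to collecting posts reachable through shared entities, then verifying the
# fading-similarity threshold lambda.
from collections import defaultdict

class PostIndex:
    def __init__(self, lam=0.3):
        self.lam = lam
        self.posts = {}                       # post id -> (L, tau, u)
        self.by_entity = defaultdict(set)     # entity -> ids of posts using it

    def neighbors(self, pid, post):
        """Candidates via the two-step walk, filtered by fading similarity."""
        L, _, _ = post
        candidates = set()
        for e in L:                           # step 1: post -> entities
            candidates |= self.by_entity[e]   # step 2: entities -> posts
        candidates.discard(pid)
        return {q for q in candidates
                if fading_similarity(post, self.posts[q]) > self.lam}

    def add(self, pid, post):
        nbrs = self.neighbors(pid, post)      # O(d1 * d2) on average
        self.posts[pid] = post
        for e in post[0]:
            self.by_entity[e].add(pid)
        return nbrs                           # the post's new network edges
```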
3.4 Stories and Events

Story and event are two concepts we use in this dissertation to aggregate and organize posts carrying similar meaningful information. In related work, there are many different definitions for story and event, and some studies even use the two terms interchangeably, without properly distinguishing them. In this section, we distinguish and define stories and events dually on two levels: the social stream level and the post network level, as explained below.

3.4.1 Stories and Events in Social Streams

In social streams, both a story and an event are a set of posts talking about very similar information. Their main difference is in the granularity of the information they carry: a story is assumed to talk about only a single thing, while an event is assumed to talk about many things with high relatedness. Thus, we can say that an event contains many highly related stories. For example, we consider "Ukrainian crisis" an event happening from November 2013 to May 2014, which contains many stories such as "the annexation of Crimea by the Russian Federation" and "War in Donbass" in March 2014. In the following, we give the definitions of stories and events in social streams.

Definition 4 (Story in Social Stream). A story is a set of posts that talk about a single topic in a short time span.

Definition 5 (Event in Social Stream). An event is a set of posts that talk about a series of highly related topics.

Usually, the information carried by a story can be described by a few sentences or a small set of keywords. Since posts in social streams are usually very short, we consider that a post describes at most one story in most cases. However, a story can be described by many posts; e.g., the two posts "Crimea was annexed by the Russian Federation" and "Crimea voted on 6 March to formally accede as part of the Russian Federation" are talking about the same story. An event is a post cluster which contains many highly related stories, where the relatedness between stories is measured by both content similarity and time closeness. Typically, an event has a relatively long time span and follows clear evolution patterns, e.g., emerging, growing and decaying. A typical event contains lots of details and cannot be described by a short post. For example, "Ukrainian crisis" is an event which consists of a series of related stories, e.g., "Crimean Crisis", "War in Donbass", "Ukraine President Elections", etc.

Definitions 4 and 5 are given at the conceptual level. There are several open questions for these two definitions, some of which are listed below:

• Given a set of posts, how do we determine whether these posts talk about a story, an event, or neither?
• How do we quantify the relatedness of two stories?
• How do we measure the evolution patterns of an event?

Since social streams are unstructured and difficult to analyze, the answers to these questions depend on the modeling of social streams. In previous sections, we discussed the modeling of a social stream as a post network. In the following, we explain stories and events with reference to a post network.

3.4.2 Stories and Events in Post Network

In Section 3.3, we discussed the construction of a post network from a social stream: a post in the social stream is modeled as a node, and two posts with high similarity are connected by an edge. As a result, a story or an event in a social stream corresponds to a subgraph in the post network. The subgraph corresponding to a story or an event should be connected, since we assume posts in a story or an event carry very similar information.

Cohesion. Let us start from the definition of a story in the post network. Given a connected subgraph Gi(Vi, Ei) of the post network, how can we determine whether this subgraph corresponds to a story or not? The answer is to make use of a measure called "cohesion", as defined below.

Definition 6 (Cohesion). Given a connected subgraph Gi(Vi, Ei), its cohesion C(Gi) is defined as the ratio between the minimum degree and the maximum possible degree in Gi. Supposing deg(v,Gi) is the degree of node v in Gi, we have

    C(Gi) = min_{v∈Vi} deg(v,Gi) / (|Vi| − 1)    (3.2)

Why is cohesion so important for defining a story? The reason is that similar posts are connected in the post network. If two posts are similar to each other in both text content and creation time, it is likely that the two posts are telling the same story, and they will be connected by an edge in the post network. A story in a social stream is a subgraph in the post network; if the nodes in a subgraph are highly connected with each other, it is very likely that the posts corresponding to this subgraph tell the same story. To guarantee that posts in the same story are similar to each other, we consider the coreness, edge density and cohesion of the post subgraph. Among these three, the cohesion defined in Eq. (3.2) is the most effective in ensuring that every post in the story is similar to the majority of posts in the same story. Compared with the coreness K(Gi) of a k-Core, where K(Gi) ≥ k, and the edge density D(Gi) = 2|Ei| / (|Vi|(|Vi| − 1)), cohesion ensures that every node connects to the majority of other nodes regardless of the story size, while K(Gi) and D(Gi) cannot guarantee that.
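As a small sketch of Eq. (3.2), the function below computes the cohesion of a subgraph given as an adjacency dict restricted to the subgraph's nodes; the representation is an assumption chosen here for illustration.

```python
# A one-function sketch of Eq. (3.2): cohesion = minimum degree divided by
# the maximum possible degree |Vi| - 1 of the connected subgraph.
def cohesion(adj):
    """adj: {node: set of neighbors inside the subgraph}."""
    n = len(adj)
    if n < 2:
        return 0.0
    return min(len(nbrs) for nbrs in adj.values()) / (n - 1)

triangle = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
print(cohesion(triangle))  # 1.0: a clique has the maximum possible cohesion
```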
Cohesion-Persistent Story. The first instantiation defines a story as a connected subgraph Gi(Vi, Ei) with sufficiently high cohesion. Clearly, a story is a special kind of dense subgraph. In related work, there are many definitions of dense subgraphs, e.g., quasi-clique, k-Plex, k-Core [47] and k-Truss [34]. Supposing deg(v,Gi) is the degree of node v in Gi, these dense subgraphs are defined as:

• Quasi-clique: deg(v,Gi) ≥ λ(|Vi| − 1) for every node v ∈ Vi and λ ∈ (0, 1];
• k-Plex: deg(v,Gi) ≥ |Vi| − k for every node v ∈ Vi;
• k-Core: deg(v,Gi) ≥ k for every node v ∈ Vi;
• k-Truss: for every edge (vi, vj) ∈ Ei, |N(vi) ∩ N(vj)| ≥ k.

Of these, only for λ-quasi-cliques is the cohesion always at least λ as the subgraph size changes. The intuition of a cohesion-persistent story is therefore best captured by a λ-quasi-clique, as the required node degree grows with the node count of a quasi-clique. Given several posts already read by the user as input, finding the cohesion-persistent story containing these posts is called the cohesion-persistent story search problem, discussed in Chapter 4. Formally, we define a cohesion-persistent story below.

Definition 7 (Cohesion-Persistent Story). A cohesion-persistent story Gi(Vi, Ei) in a post network G(V,E) is defined as a λ-quasi-clique, where 0 < λ ≤ 1 and, for every node v ∈ Vi, we have deg(v,Gi) ≥ λ(|Vi| − 1).

(k, d)-Core Story. Definition 7 provides a theoretically ideal way to define a story. However, notice that finding the maximum clique or quasi-clique in a graph is an NP-Hard problem [8, 15]. Even worse, [31] proved that there are no polynomial algorithms that provide any reasonable approximation to the maximum clique problem. Since a clique is a special case of a quasi-clique or k-Plex, these hardness results carry over to the maximum quasi-clique and maximum k-Plex problems. There are many heuristic algorithms that provide no theoretical guarantee on quality [1, 8, 15]. Since most of them are based on local search [15], they do not scale to large networks, because local search optimizes a solution by iteratively moving to a better neighboring solution in an exponential search space.

Although quasi-cliques are theoretically ideal for instantiating a story, the intractability of quasi-clique computation makes it difficult to apply this definition to real-world large post networks. In Chapter 4, we will discuss the cohesion-persistent story search problem, which aims to approach the maximum quasi-clique containing given nodes using heuristic rules. Due to its performance cost, however, this approach cannot solve the problem of finding all stories in the post network. These challenges motivate us to seek an alternative instantiation of a story, one which is cohesive enough and efficient to compute on real-world post networks.

The good news is that k-Cores can be found exactly in polynomial time. By adjusting k, we can generate k-Cores with any desired cohesion score k/(|VS| − 1). For example, increasing k improves the cohesion, because the minimum degree is increased and |VS| is decreased at the same time. However, k-Cores are not always capable of capturing our intuition about stories: although each post has at least k neighbors, the fact that two posts are connected by a single capillary may not be strong enough evidence that they tell the same story. Sometimes posts share some common words but discuss different stories, e.g., "Google acquires Motorola Mobility" and "Bell Mobility acquires Virgin Mobile". To address this challenge, we make a key observation: the existence of more common neighbors between two connected posts suggests a stronger commonality in story-telling. Supposing pi and pj are connected by an edge and N(pi) and N(pj) are the neighbor sets of pi and pj respectively, if there exists a post pl ∈ N(pi) ∩ N(pj), we call pl a witness for the post similarity s(pi, pj). We capture this intuition in the following definition, where we formalize a story as a (k, d)-Core; a peeling sketch follows the definition.

Definition 8 ((k, d)-Core Story). A (k, d)-Core story in the given post network G(V,E) is defined by a maximal (k, d)-Core Gi(Vi, Ei), where k, d are numbers with k > d > 0 and
• Gi(Vi, Ei) is a connected subgraph;
• For every post p ∈ Vi, |N(p)| ≥ k;
• For every edge (pi, pj) ∈ Ei, |N(pi) ∩ N(pj)| ≥ d.
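The following is a sketch of a peeling routine matching the two local conditions of Definition 8: it repeatedly drops edges with fewer than d witnesses (common neighbors) and nodes with fewer than k neighbors until both conditions hold. Extracting the maximal connected (k, d)-Cores would then be a connected-components pass over what remains, which is omitted here; this is an illustrative sketch, not the optimized algorithm of Chapter 5.

```python
# A peeling sketch for the (k, d)-Core conditions of Definition 8.
def kd_core(adj, k, d):
    """adj: {node: set of neighbors}; returns the pruned adjacency dict."""
    adj = {u: set(nbrs) for u, nbrs in adj.items()}  # work on a copy
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            for v in list(adj[u]):
                if len(adj[u] & adj[v]) < d:         # too few witnesses
                    adj[u].discard(v)
                    adj[v].discard(u)
                    changed = True
        for u in list(adj):
            if len(adj[u]) < k:                      # degree below k
                for v in adj.pop(u):
                    adj[v].discard(u)
                changed = True
    return adj
```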
Density-Based Event. An event organizes a set of posts talking about many related things; in particular, an event may include multiple related stories. In the post network, an event is also modeled as a connected subgraph. Unfortunately, the various cohesive subgraph options we discussed for story modeling are not suitable for defining an event, since these cohesive subgraphs have a clear center and are designed to describe mainly a single thing. An ideal definition of events should allow multiple centers, and these centers should be highly related.

In this dissertation, we consider a graph cluster (also called a partition) the best way to define an event. The intuition is that, since the post network is constructed by pairwise post similarity, a graph cluster with high internal connectivity and low external connectivity indicates that the posts inside the cluster talk about very related things. Various clustering approaches can be applied to a post network to extract clusters; among them, we choose density-based clustering [30] as the best modeling for events. In density-based clustering (e.g., DBSCAN [20]), the threshold MinPts is the minimum number of nodes in an ε-neighborhood required to form a cluster. We adapt this and use a weight threshold δ as the minimum total weight of neighboring nodes required to form a cluster. The reason we choose density-based approaches is that, compared with partitioning-based approaches (e.g., K-Means [30]) and hierarchical approaches (e.g., BIRCH [30]), density-based methods such as DBSCAN define clusters as areas of higher density than the remainder of the data set, which is effective in finding arbitrarily-shaped clusters and is robust to noise. Moreover, density-based approaches are easy to adapt to single-pass clustering. In the post network, we treat ε as a similarity threshold that decides connectivity, which can be used to define the weight of a post in density-based clustering. Following density-based clustering, nodes in the post network are distinguished into three types (core posts, border posts and noise posts) by a threshold δ applied to the post weight. Formally, we define an event on the post network below; a classification sketch follows the definition.

Definition 9 (Density-Based Event). An event in the post network G(V,E) is a cluster C obtained by density-based clustering using density parameters ε and δ, where ε is a similarity threshold that removes an edge if its similarity is lower than ε, and δ is the minimum weight of a core node.

The details of density-based clustering for event identification can be found in Section 6.4.1.
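Below is a sketch of the DBSCAN-style post classification behind Definition 9: edges below ε are ignored, a post whose neighboring edge weights sum to at least δ is a core post, a non-core neighbor of a core post is a border post, and everything else is noise. The exact post-weight definition used in Chapter 6 may differ; summing edge similarities is an assumption made here for illustration.

```python
# A sketch of the core/border/noise classification behind Definition 9.
def classify_posts(edges, eps, delta):
    """edges: {(u, v): similarity}; returns {post: 'core'|'border'|'noise'}."""
    weight, nbrs = {}, {}
    for (u, v), s in edges.items():
        if s < eps:
            continue                      # epsilon prunes weak edges
        for a, b in ((u, v), (v, u)):
            weight[a] = weight.get(a, 0.0) + s
            nbrs.setdefault(a, set()).add(b)
    # posts with no retained edge never appear here, i.e., implicit noise
    label = {p: ("core" if w >= delta else "noise") for p, w in weight.items()}
    for p in list(label):
        if label[p] != "core" and any(label.get(q) == "core" for q in nbrs[p]):
            label[p] = "border"           # non-core neighbor of a core post
    return label
```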
3.5 Context and Evolution

As discussed in Chapter 1, detecting stories with high cohesion and tracking story context and event evolution patterns are the most important problems we focus on in this dissertation. The notion of cohesion has been defined in Definition 6. In this section, we introduce context for stories and evolution for events, respectively.

Story Context. Transient stories identified from the post network may be highly related; e.g., the two stories "the launch of Blackberry 10" and "BlackBerry Super Bowl ad" are highly related. We exploit and integrate signals from different perspectives (e.g., content and time) to compute the relatedness between stories. Here, we introduce the relatedness dimensions, which capture story relatedness from three perspectives and quantify it by a value in [0, 1].

• Content Similarity. By viewing a story as a document and a post entity as a term, existing document similarity measures can be used to assess story relatedness. However, TF-IDF based cosine similarity fails to be effective, since TF vectors of stories tend to be very sparse.
• Temporal Proximity. Stories that happen closer in time are more likely to correlate.
• Edge Connectivity. Edges between the posts of two stories can be used to determine story relatedness. This approach calculates the strength of edge connectivity between the posts of the two stories. These edges serve as a bridge between the stories, and the connectivity strength can be measured in various ways, e.g., by the Jaccard coefficient.

Story context search aims at efficiently finding the neighboring stories of a given story in the post network. In the database research literature, an iceberg query describes a kind of query that finds results with scores above a given threshold. We introduce story context search, which is a kind of iceberg query on stories, as stated below.

Definition 10 (Story Context Search). Given a set of stories 𝒮, a threshold γ (0 < γ < 1) and a story S, the story context search for S is to find the subset of stories 𝒮′ ⊆ 𝒮, where for each S′ ∈ 𝒮′, the relatedness Cor(S, S′) ≥ γ.

The computation of Cor(S, S′) will be discussed in Chapter 5.

Event Evolution. Event evolution happens as the time window moves over the social stream. We use E to denote an event and e to denote a snapshot of E at a specific moment; for simplicity, when we talk about event e, we mean event E at moment t. Let St denote the set of events at moment t. We analyze the evolutionary process of events at each moment and abstract it into four primitive patterns and two composite patterns. The four primitive patterns are emerge, disappear, grow and decay. The two composite patterns are merge and split, which can be decomposed into a series of emerge and disappear patterns. From moment t to t + 1, they are defined below.

• emerge: add event e to the event set St;
• disappear: remove event e from the event set St;
• grow: increase the size of e by adding new posts;
• decay: decrease the size of e by removing old posts;
• merge: remove a list of events {e1, e2, · · · , en} from St and add a new event e, where e = e1 + e2 + · · · + en;
• split: remove an old event e from St and add a list of events {e1, e2, · · · , en}, where e = e1 + e2 + · · · + en.

Compared with the primitive patterns, merge and split are not very common in event evolution. For a specific event, emerge/disappear and merge/split can only happen once, but grow/decay may happen at every moment.

In the following, we explain how to track event evolution incrementally as the post network gets updated; a case-analysis sketch follows. Conceptually, we call a post a core post if it is similar to many other posts. Suppose a post p is added to the post network. If p is a noise post, we simply ignore it. If p is a border post of a neighboring event e, grow e. If p is a core post with no neighboring event, a new event emerges. If p is a core post that is a neighbor of exactly one event e, grow e. If p is a core post that is a neighbor of multiple events {e1, e2, · · · , en}, merge them into a new event. The analysis of event evolution patterns for removing a post p from the post network is symmetric. We show the evolution patterns of the various cases in Table 3.1.

Cases                                                             Add      Remove
p is a noise post                                                 –        –
p is a border post of the neighboring event e                     grow     decay
p is a core post with no neighboring event                        emerge   disappear
p is a core post with exactly one neighboring event e             grow     decay
p is a core post with multiple neighboring events {e1, ..., en}   merge    split

Table 3.1: Event evolution patterns
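As a sketch of the add-post case analysis of Table 3.1, the function below returns the evolution pattern that the insertion of a post p triggers. The helper predicates is_core and neighboring_events are assumed to be supplied by the surrounding clustering machinery and are hypothetical names.

```python
# A sketch of the add-post cases in Table 3.1; is_core(p) and
# neighboring_events(p) are assumed helpers from the clustering layer.
def on_add(p, is_core, neighboring_events):
    events = neighboring_events(p)             # events adjacent to post p
    if not is_core(p):
        # a border post grows its neighboring event; pure noise is ignored
        return ("grow", events[0]) if events else ("ignore", None)
    if not events:
        return ("emerge", {p})                  # a new event appears around p
    if len(events) == 1:
        return ("grow", events[0])
    return ("merge", events)                    # several events fuse into one
```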
3.6 Comparing with Other Modeling Methods

In this thesis, we view each post as a node and the relationship between posts as an edge. In this way, a post stream can be transformed into a network evolving with time, which we call a post network. Events and stories can be viewed as substructures in this post network. However, in related work there are other alternative perspectives for social media search, in which content and frequency based approaches are the major players [13, 48]. In this section, we briefly compare them and show the advantages of the structure based approach. Note that an experimental study of the structural perspective versus the content and frequency perspectives will be extensively discussed in Chapters 5 and 6.

• Structure vs. LDA Approaches. From the content perspective, topic detection and tracking is an extensively studied field [51], with the most common approaches based on Latent Dirichlet Allocation (LDA) [13]. Techniques for topic tracking are usually formulated as a classification problem [4] under the assumption that topics are predefined before tracking, which is unrealistic for social streams. Recent works [27, 32] fall prey to this problem. The lack of training data for story relatedness tracking on noisy social streams renders the existing works [59, 68] on Story Link Detection (SLD) inapplicable, because SLD is trained on well-written news articles. In text streams, Hierarchical Dirichlet Processes (HDP, [27]) were proposed to track and connect topics incrementally. However, since HDP is computed from the document-word matrix, it is difficult to integrate HDP-based approaches with other signals, e.g., time stamps, GPS signals, authors, etc.

• Structure vs. Frequency. In related work, frequency based approaches are commonly used in story and event detection. Weng et al. [77] build signals for individual words by applying wavelet analysis to frequency based word signals to detect events from Twitter. A framework for tracking short, distinctive phrases (called "memes") that travel relatively intact through on-line text was developed in [48]. Twitinfo [55] represents an event it discovers from Twitter by a timeline of related tweets. [66] investigated the real-time interaction of events such as earthquakes in Twitter and proposed an algorithm to monitor tweets and detect a target event based on classifiers. To summarize, the common technique behind the above is detecting popular items based on frequency ranking, where the items are typically terms, hashtags or short phrases. Compared with frequency based approaches, our structural approach represents a story as a cohesive subgraph in the post network, which has a compact internal structure to describe the story and contains rich information. For example, frequency based approaches cannot track the merge or split of two events, while the structural approach can, by capturing the merge or split of two events as the merge or split of the underlying subgraphs in the post network.

We also perform a detailed user study using crowdsourcing in Chapter 7 to verify the hypothesis that the structural method is better than content and frequency based methods.
3.7 Discussion and Conclusion

Our modeling method for social streams in this dissertation is based on constructing a network of posts and maintaining the network over a moving time window. To track stories and events in social streams, our modeling method comprises the following steps (see the end-to-end sketch after this list):

• Post information extraction, in which we extract post text content and time stamps from social streams;
• Entity extraction, where entities in the post content are extracted using NLP tools;
• Post similarity computation based on entities and time stamps;
• Graph based algorithms to mine social patterns like stories and events.

A limitation of this modeling workflow is that we discard some information at each step. For example, in post information extraction, we ignore the author of a post, and in entity extraction, we use entities to represent post content. However, based on data analysis, we made sure the information we discard is less meaningful than the information we keep. For example, we ignore authors because data analysis shows that the post topics of a Twitter author are roughly random. We use entities to represent post content, rather than keywords or hashtags, because we found that keywords generate post networks with too many edges (i.e., too high an edge density), while hashtags generate post networks that are too sparse.

One drawback of our post similarity computation is that we ignore the word ambiguity problem. For example, "apple" may mean the Apple company or a kind of fruit; "Microsoft" and "MSFT" may mean the same thing. Disambiguating words in informally written tweets is a very hard problem in the NLP community, and we consider deep analysis of this problem beyond the research scope of this dissertation.

There is still room for improvement. First, the entities returned by the Stanford NLP tools are currently mostly nouns, which suggests that better entity recognition approaches could generate entities of higher quality; e.g., emotions and sentiments could be extracted as attributes of entities and provide more input to the similarity function. Second, deeper analysis of the sentence structure of tweets may yield a better post content similarity computation method. Third, since the current post similarity computation only combines content similarity and time proximity, better similarity functions could incorporate more meaningful signals. Fourth, it is possible to combine structural approaches with frequency or LDA-based approaches, which may result in a hybrid approach that is better than pure structural approaches.
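As a recap of the workflow above, the following hypothetical end-to-end snippet chains the earlier sketches (extract_entities, PostIndex) to turn a few raw tweets into post-network edges; the post ids, texts and time stamps are made-up values for illustration.

```python
# A hypothetical end-to-end run of the modeling steps above, chaining the
# earlier sketches: extract entities, form (L, tau, u) posts, and build
# post-network edges on the fly via linkage search.
tweets = [
    ("t1", "iPad 3 battery pointing to thinner, lighter tablet?", 0.0, "u1"),
    ("t2", "New iPad tablet rumored to ship with a bigger battery", 0.4, "u2"),
    ("t3", "Good night :)", 0.5, "u3"),   # likely noise: few shared entities
]

index = PostIndex(lam=0.3)
for pid, text, tau, user in tweets:
    post = (extract_entities(text), tau, user)
    edges = index.add(pid, post)          # neighbors found by linkage search
    print(pid, "->", edges)               # e.g., t2 -> {'t1'}, t3 -> set()
```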
Chapter 4

Cohesion-Persistent Story Search

A cohesion-persistent subgraph is a subgraph whose cohesion is higher than a given threshold λ (0 < λ ≤ 1). Recall that cohesion is defined as the ratio between the minimum degree and the maximum possible degree in a given subgraph (Definition 6). By this definition, a cohesion-persistent subgraph with threshold λ is a λ-quasi-clique. The quasi-clique is an appealing way to model a story in the post network, with the benefit that the cohesion of a λ-quasi-clique is always higher than λ (0 < λ ≤ 1), regardless of the size of the story. We call a story defined by a quasi-clique a cohesion-persistent story. In this chapter, we are interested in the search problem for cohesion-persistent stories: given a query set of posts, we try to find the maximum quasi-clique that contains this query set of posts. In applications, the query can be posts the user liked or found interesting/useful, and the answer will be the most popular story related to the query posts.

4.1 Introduction

Just like other real-world networks, the post network commonly has a skewed degree distribution, with the result that it features "dense subgraphs". Since we model a social stream as a post network, finding dense subgraphs of the post network corresponds to the story detection problem on social streams. While several alternative definitions have been proposed for dense subgraphs (e.g., see [25, 47]), quasi-cliques constitute an appealing way to model a story, since their cohesion is guaranteed to be above a given threshold. Specifically, a λ-quasi-clique is defined as a connected subgraph in which the ratio between the degree of each node and the highest possible degree is at least λ, where λ ∈ (0, 1]. There has been some prior work on finding quasi-cliques in graphs [15, 62, 63]. However, none of it can handle the quasi-clique search problem, which has not been studied before and has many real-world applications, especially cohesion-persistent story search on social streams. In particular, we focus on a new graph mining problem in this chapter, namely query-driven maximum quasi-clique search. Given a graph G(V,E), a set of query nodes S ⊆ V, and a parameter λ ∈ (0, 1], we are interested in the problem of finding the largest λ-quasi-clique containing the node set S.

The conceptual definition of a story is a set of posts telling the same thing. In the post network, this requires that every node have a sufficiently high connection strength to the other nodes in the same subgraph. To help determine whether a post set tells the same thing, we use the notion of "cohesion" for the ratio between the minimum degree and the highest possible degree in a connected subgraph. Clearly, the ideal definition of a story requires the cohesion to be persistent, or robust, regardless of the size of the subgraph. There are many definitions of dense subgraphs, e.g., densest subgraph [73], k-Plex, quasi-clique, k-Core [47] and k-Truss [34]. Of these, only for λ-quasi-cliques is the cohesion always at least λ as the subgraph size changes. Finding the largest cohesion-persistent subgraph containing a given node set can be modeled as the Query-driven Maximum Quasi-clique (QMQ) search problem. Since the maximum clique problem is NP-Hard [61] and no polynomial time algorithm can approximate it within a factor of n^{1−ε} (ε > 0) [31], it is not surprising that finding the maximum quasi-clique is also NP-Hard and not approximable in polynomial time. As for heuristic approaches, existing studies on quasi-clique maximization (without any query) are mainly based on local search [15, 33, 62], in which a solution moves to the best neighboring solution iteratively, updated node-by-node. However, none of the existing approaches is designed for quasi-clique search, and none can efficiently handle the QMQ problem. A detailed comparison with these and other related works appears in Section 2.1.3.
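Before discussing how to search for the QMQ, it helps to pin down what qualifies as a feasible solution. The sketch below checks the query-driven λ-quasi-clique condition for a candidate node set; the adjacency-dict representation is an assumption chosen for illustration.

```python
# A sketch verifying the query-driven quasi-clique condition: the candidate
# set must contain the query S, satisfy the degree bound, and be connected.
from collections import deque

def is_query_quasi_clique(adj, nodes, query, lam):
    """adj: {node: set of neighbors} for the full graph G; nodes, query: sets."""
    nodes = set(nodes)
    if not query <= nodes:
        return False                                  # must contain S
    # degree condition: deg(v) >= lam * (|VQ| - 1) inside the subgraph
    if any(len(adj[v] & nodes) < lam * (len(nodes) - 1) for v in nodes):
        return False
    # connectivity via BFS restricted to the subgraph
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u] & nodes:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == nodes
```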
Challenges. To the best of our knowledge, this chapter is the first study of the quasi-clique search problem. Solving this difficult problem raises the following challenges:

• Given a set of query nodes S, how do we find a λ-quasi-clique containing S as efficiently as possible?
• How can we design an effective and uniform objective function over solutions that guides the maximization of the λ-quasi-clique containing S?
• Given a solution, how do we devise efficient iterative maximization techniques to search for a better solution?
• As the iterative maximization is a local search process, how can we prevent the algorithm from being trapped in a local maximum?

Main Idea. In this chapter, we propose an efficient framework for the QMQ search problem, which has two components: an offline component that builds a tree representation T of the given graph G, and an online component that responds to the maximum quasi-clique search for a given set of query nodes S. In the offline component, we propose the core tree, a carefully designed hierarchical structure that helps find a pre-solution to the QMQ search problem in a few tree traversal operations, where the pre-solution is a maximal k-Core for some k, which contains the query nodes S and whose cohesion is very close to λ. The idea is that a λ-quasi-clique can be obtained from the pre-solution very quickly. The notion of k-Core, as defined in [47], is a connected subgraph in which every node has degree at least k. In our proposed core tree, G is the root and each tree node is a maximal k-Core, for some k; if a tree node Ti is a subgraph of another tree node Tj, we say Ti is a child of Tj. The core tree can be built recursively and efficiently, and the pre-solution can be quickly retrieved from the core tree.

In the online component, we introduce three maximization operations: Add, Remove and Swap. By scheduling these operations with different strategies, we propose two iterative maximization algorithms for QMQ search, called Deterministic Iterative Maximization (DIM) and Stochastically Updated Maximization (SUM). DIM is a deterministic approach that, in each iteration, greedily moves from the current solution to the best available neighboring solution. SUM, on the other hand, is a stochastic approach that moves from the current solution to one of its good neighboring solutions with probability proportional to the "marginal gain" associated with the move. Intuitively, SUM has the potential to escape local maxima and thus find better results than DIM, at the expense of potentially more iteration steps. Both DIM and SUM are efficient: DIM takes O(Tn) time, while SUM takes O(Tn^2) time per simulation, where T is the total number of iterations and n is the average solution size per iteration.

Contributions. We make the following contributions in this chapter:

• We define the problem of query-driven maximum quasi-clique search, a novel cohesive subgraph query with many real-world applications (Section 4.2).
• We propose the core tree as a recursive representation of a graph, which helps quickly find a pre-solution to the QMQ search problem within a few tree traversals by reducing the solution search space (Section 4.3).
• We introduce Add, Remove and Swap to search for new solutions and efficiently optimize a pre-solution into a λ-quasi-clique, as well as the current solution into a better neighboring solution. Building on this, we propose deterministic and stochastic iterative maximization algorithms for QMQ search, DIM and SUM (Section 4.4).
• We perform an extensive experimental study on three real datasets, which demonstrates that our algorithms significantly outperform several baselines in running time and/or quality (Section 4.5).

Section 4.6 concludes this chapter. The major notations used are listed in Table 4.1.

G(V,E)          the given network
S               the query node set
Gi(Vi, Ei)      a connected subgraph of G
QλS(VQ, EQ)     a query-driven quasi-clique
Q̄λS(V̄Q, ĒQ)     the query-driven maximum quasi-clique
T, Ti           core tree, tree node
deg(v, Gi)      degree of node v in subgraph Gi
F(Gi)           objective function for subgraph Gi

Table 4.1: Notation.
4.2 Problem Overview

Problem. Given a set of query nodes S, we define a query-driven quasi-clique w.r.t. S as a quasi-clique that contains all nodes in S. The formal definition follows, where for a subgraph Gi ⊆ G and a node v in Gi, we use deg(v,Gi) to denote the degree of v in Gi.

Definition 11 (Query-Driven Quasi-Clique). Given an undirected graph G(V,E), a set of query nodes S ⊆ V, and a parameter 0 < λ ≤ 1, a query-driven quasi-clique w.r.t. S and λ is a subgraph QλS(VQ, EQ) of G such that:
• QλS is connected and S ⊆ VQ ⊆ V;
• For every node v ∈ VQ, deg(v,QλS) ≥ λ · (|VQ| − 1).

For simplicity, we use QλS and Q interchangeably. The problem of query-driven maximum quasi-clique search is to find the largest λ-quasi-clique containing the query nodes S, as formalized below.

Figure 4.1: The query S is marked by solid nodes, with λ = 0.5. A maximal quasi-clique and the query-driven maximum quasi-clique (QMQ) are circled by a dotted ellipse and a solid ellipse, respectively.

Problem 1 Given an undirected graph G(V,E), a set of query nodes S ⊆ V and a parameter 0 < λ ≤ 1, find the subgraph Q̄λS(V̄Q, ĒQ) in G(V,E) such that

    Q̄λS(V̄Q, ĒQ) = arg max_Q { |VQ| | Q = QλS(VQ, EQ) }    (4.1)

We denote by Q̄λS(V̄Q, ĒQ) the query-driven maximum quasi-clique (QMQ) for the query S and parameter λ. Notice that it is possible that no qualified λ-quasi-clique contains all nodes in S, in which case we return NULL.

Hardness. It is well known that the problem of finding the maximum clique is NP-Hard [61]. Even worse, a negative breakthrough result by Arora et al. [6], together with results of Feige et al. [23] and, more recently, Hastad [31], implies that there is unlikely to be any polynomial time approximation algorithm for this problem within a factor of n^{1−ε} (ε > 0). According to Abello et al. [1], general heuristics which provide answers with guaranteed approximation to the maximum clique are unlikely to exist. Since a clique is a special case of a quasi-clique, these hardness results carry over to the maximum quasi-clique problem [62]. In particular, query-driven maximum quasi-clique (QMQ) search is also NP-Hard and inapproximable, since when S = ∅, QMQ search reduces to the traditional maximum quasi-clique problem.

Maximum vs. Maximal Quasi-Cliques. The QMQ search problem should not be confused with the extensively studied maximal quasi-clique problem. A maximal λ-quasi-clique cannot be a subgraph of any other λ-quasi-clique. However, its size may be arbitrarily smaller than that of the maximum quasi-clique. Fig. 4.1 illustrates these ideas with an example of a maximal λ-quasi-clique and the QMQ, both containing the query set S.
Solution Overview. The hardness of the QMQ search problem indicates that polynomial-time algorithms which provide answers with guaranteed optimality to QMQ are unlikely to exist. The hardness and inapproximability results for QMQ naturally motivate the quest for good heuristics. In related work, most algorithms for finding maximum as well as maximal quasi-cliques are based on local search [15, 33, 62]. However, traditional local search does not scale to large networks, because the search space of neighboring solutions in the iterative optimization increases exponentially with the network size. Thus, using traditional approaches, it is difficult to quickly find an initial solution corresponding to a quasi-clique containing S.

To address these challenges, in this chapter we propose an efficient framework for the QMQ search problem on large-scale networks. The architecture of this framework is illustrated in Fig. 4.2. There are two major components: an offline component which recursively identifies and organizes dense subgraphs in the given network, and an online component which responds to the QMQ search for query nodes S. The offline component, discussed in Section 4.3, transforms the network G into a dense subgraph (DSG) tree to support efficient retrieval of query-driven quasi-cliques. The online component quickly locates the tree node with the cohesion closest to λ (called the pre-solution) on the DSG tree for query nodes S and starts the iterative maximization process to approach the QMQ, as discussed in Section 4.4. In case no qualified solution is found, the algorithm returns NULL.

To simplify the presentation, we assume throughout the chapter that S ≠ ∅, i.e., the query node set is non-empty. This assumption does not result in any loss of generality. For the special case S = ∅, we can augment the graph G = (V,E) to a new graph G′ by adding a new node v to G with edges connecting v to every node in G. Let S′ = {v}. Then QλS is a λ-quasi-clique in G iff QλS′ is a λ-quasi-clique in G′, where QλS′ = QλS + {v}.

4.3 Pre-solution on Core Tree

In this section, we propose the core tree, a convenient representation of a graph in which each tree node is a certain dense subgraph. As we will see in Section 4.3.2, the core tree facilitates the quick look-up of a pre-solution, which is a connected subgraph that contains the query set S and has cohesion very close to λ. Meanwhile, we are able to confine the size of the QMQ to a small range between the size of a tree node and that of its parent, which accelerates the QMQ search process.

Figure 4.2: Architecture of the efficient query-driven maximum quasi-clique search by DSG tree.

4.3.1 Core Tree: Definition and Properties

In the following, we calibrate a dense subgraph using three measures and define the core tree for a given graph.

Measures. As remarked earlier, degree distributions of real-world networks tend to be skewed, making it common for such networks to have embedded dense subgraphs. Several notions of dense subgraphs have been proposed in the literature, including densest subgraphs, k-cores, k-plexes, and quasi-cliques [47, 73]. In principle, we could have defined a "dense subgraph (DSG) tree" representation of a graph using any of the above notions of dense subgraphs.
We provide the rationale for using k-Cores to define our DSG tree later in this section. To quantify the properties of a dense subgraph, we make use of the CCD measure, namely Coreness, Cohesion and Density, defined below.

Definition 12 (CCD Measure). Given a connected subgraph G_i(V_i, E_i), the CCD measure of G_i is defined as
• Coreness: K(G_i) = min_{v ∈ V_i} deg(v, G_i);
• Cohesion: C(G_i) = K(G_i) / (|V_i| − 1);
• Density: D(G_i) = |E_i| / C(|V_i|, 2) = 2|E_i| / (|V_i|(|V_i| − 1)).

Cohesion is the ratio between the minimum degree and the maximum possible degree. Clearly, a λ-quasi-clique is a connected subgraph with cohesion at least λ. Density, also called the local clustering coefficient [76], can be viewed as the ratio between the average degree and the maximum possible degree. Both cohesion and density fall in the range [0, 1].

Figure 4.3: A graph G with k-Cores identified recursively in (a) and its corresponding core tree in (b).

Definition. Conceptually, a core tree is a representation of the given graph as a recursively inclusive tree structure. Each tree node in a core tree is a maximal k-Core.

Definition 13 (Maximal k-Core). A maximal k-Core is a connected subgraph G_i(V_i, E_i) such that:
• K(G_i) ≥ k;
• G_i is not a subgraph of any other k-Core.

Definition 14 (Core Tree). Given an undirected graph G(V,E), the core tree of G is a tree T whose nodes correspond to maximal k-Cores of G, for some k, such that:
• Root: the root of T corresponds to G and has depth 0;
• Parent–Child: whenever T_i is a tree node of T at depth k corresponding to a maximal k-Core G_i, and G_j is a maximal (k+1)-Core that is a subgraph of G_i, then T_i has a child T_j corresponding to G_j; T_j has depth k + 1.

We show an example in Fig. 4.3 to illustrate the core tree. When k = 1, isolated nodes are removed. As k increases, the size of the k-Core decreases along the path from the root, making the cohesion monotonically increasing. Note that G is trivially a maximal 0-Core. It follows that every node of T at depth k is a maximal k-Core. It is possible that multiple tree nodes of T correspond to the same subgraph of G; e.g., if G has no isolated nodes, then G is a maximal 0-Core as well as a maximal 1-Core. This apparent redundancy in the core tree can be eliminated by concise storage, as introduced below.

Concise Storage. The height of a core tree T is the largest k for which G has a k-Core (called the degeneracy in [47]). This is usually a small number (e.g., below 100) for large real-world graphs [28]. Conceptually, since a tree node is a subgraph of its parent, a subgraph may appear repeatedly at multiple levels and introduce redundancy. However, there is no need to materialize any subgraphs. Instead, we can represent a tree node T_i corresponding to G_i(V_i, E_i) in the format

    T_i = (k, V_i, parent, children, leftover)

where k is the depth of T_i, parent and children respectively denote the ID and the set of IDs of the parent and children of the tree node T_i, V_i is the set of nodes in the subgraph of G that T_i corresponds to, and leftover is the set of nodes that are in V_i but not in any child. We can always induce G_i from G by V_i, and for simplicity we use T_i and G_i interchangeably. In addition, as an inverted index, for every node v in G we also remember the ID of the tree node T_i with the highest k that contains v.
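As an illustration of Definition 12 and of the concise storage just described, the sketch below is our own reconstruction; the field layout follows the (k, V_i, parent, children, leftover) record above, with tree-node IDs taken to be plain integers.

from dataclasses import dataclass, field
import networkx as nx

def ccd(G, nodes):
    """Definition 12: (coreness K, cohesion C, density D) of the subgraph
    of G induced by `nodes` (assumed connected, with at least 2 nodes)."""
    H = G.subgraph(nodes)
    n, m = H.number_of_nodes(), H.number_of_edges()
    K = min(dict(H.degree()).values())
    return K, K / (n - 1), 2 * m / (n * (n - 1))

@dataclass
class TreeNode:
    """Concise storage of one core-tree node: no subgraph is materialized;
    G_i is induced from G on demand via the node set V."""
    k: int                                          # depth = coreness level
    V: set                                          # node set of the maximal k-Core
    parent: int = -1                                # ID of the parent tree node (-1 at the root)
    children: list = field(default_factory=list)    # IDs of child tree nodes
    leftover: set = field(default_factory=set)      # nodes of V that are in no child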
Properties. Core trees enjoy many desirable properties by virtue of effectively organizing dense subgraphs at different granularity levels. These properties facilitate efficient search for the QMQ. We state and establish them below.

Proposition 1 If T_i and T_j are siblings on the core tree T, then V_i ∩ V_j = ∅.

Proof: Since T_i and T_j are siblings, their corresponding subgraphs G_i and G_j are both maximal k-Cores. Suppose V_i ∩ V_j ≠ ∅ and let G_k(V_k, E_k) be the union of G_i and G_j, i.e., V_k = V_i ∪ V_j and E_k = E_i ∪ E_j. Then G_k satisfies the definition of a k-Core, which contradicts the maximality of G_i and G_j; hence V_i ∩ V_j = ∅. □

Proposition 2 The cohesion of a tree node is non-decreasing along any root-to-leaf path of the core tree.

Proof: Based on Def. 14, K(G_i) is non-decreasing and |V_i| is non-increasing along such a path, so cohesion is non-decreasing. □

Proposition 3 Given any two nodes T_i and T_j on the core tree, if V_i ∩ V_j ≠ ∅ and |V_i| ≤ |V_j|, then V_i ⊆ V_j.

Proof: Since V_i ∩ V_j ≠ ∅, T_i and T_j are on the same path from the root to a leaf. Since |V_i| ≤ |V_j|, G_i must be a subgraph of G_j, so V_i ⊆ V_j. □

Prop. 1 states a disjointness property, which helps prune the search space: once we find that the query set S is contained in a tree node T_i, we can reduce the search space to the branch of T_i and safely ignore tree nodes on different root-to-leaf paths. Prop. 2 shows the monotonicity of cohesion along any path from the root to a leaf of the core tree. Prop. 3 presents an inclusion property: if two tree nodes share common graph nodes, they must be on the same root-to-leaf path. Both Prop. 2 and Prop. 3 are essential for quasi-clique maximization.

Why Core Trees? Popular dense subgraph definitions include the densest subgraph, k-Plex, clique, quasi-clique, k-Core and k-Truss ([34, 47, 73]). In the following, we briefly discuss each of them and argue why the k-Core is the best fit for instantiating a dense subgraph (DSG) tree. First, the monotonicity of cohesion shown in Prop. 2 is crucial for the fast search of the QMQ. The densest subgraph is a subgraph maximizing the average degree, and the cohesion of densest subgraphs is not necessarily monotone as the average degree increases. A k-Plex is defined as a subgraph G_i = (V_i, E_i) with K(G_i) ≥ |V_i| − k, so that C(G_i) = (|V_i| − k)/(|V_i| − 1) = 1 − (k − 1)/(|V_i| − 1). A natural way to define a k-Plex tree is to start with G as the root with k = |V|, the number of nodes in G. When we decrease k, the resulting (k−1)-Plexes are guaranteed to be subgraphs of the subgraph corresponding to the given k-Plex. However, depending on the size of the (k−1)-Plexes, the cohesion may be more or less than that of the parent. Specifically, consider a k-Plex H with n nodes and a (k−1)-Plex subgraph H′ with n′ nodes. We can show that C(H′) ≥ C(H) iff n′ ≥ 1 + ((k−2)/(k−1))(n − 1); the short derivation is given after this paragraph. Since in general there is no guarantee on the value of n′, cohesion is not monotone on a root-to-leaf path of a DSG tree defined using k-Plexes. Cliques and quasi-cliques are special cases of query-driven quasi-cliques with S = ∅, so we cannot assume they are already available. Moreover, all the aforementioned dense subgraph problems are NP-Hard and hard to approximate. In contrast, maximal k-Cores can be found exactly in polynomial time [57], so the construction cost of the core tree is very low. A k-Truss [34] is a special kind of (k−1)-Core with the added constraint that every edge be contained in at least (k − 2) triangles. The time complexity of k-Truss detection is much higher than that of k-Core detection – O(|V| · |E|) vs. O(|V| + |E|) – and it offers no distinct advantage over k-Cores for QMQ search. We thus conclude that k-Cores are the ideal choice for instantiating a DSG tree.
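The k-Plex inequality quoted above follows from a two-line computation; we reconstruct the omitted step here from the cohesion formula C(G_i) = 1 − (k−1)/(|V_i|−1):

\[
C(H) = \frac{n-k}{n-1} = 1 - \frac{k-1}{n-1},
\qquad
C(H') = \frac{n'-(k-1)}{n'-1} = 1 - \frac{k-2}{n'-1},
\]
\[
C(H') \ge C(H)
\;\Longleftrightarrow\;
\frac{k-2}{n'-1} \le \frac{k-1}{n-1}
\;\Longleftrightarrow\;
n' \ge 1 + \frac{k-2}{k-1}\,(n-1).
\]

Since a (k−1)-Plex child can be larger or smaller than this bound, its cohesion can fall on either side of the parent's, which is exactly the non-monotonicity argued above.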
Algorithm. Given a graph G(V,E), the procedure for generating a core tree from G is shown in Algorithm 1. Specifically, given a k-Core G_i, we generate all (k+1)-Cores by recursively removing nodes with degree less than k + 1; these (k+1)-Cores are added as the children of G_i in the core tree T. Since the highest k is the degeneracy of G, the time complexity of core tree generation is O(|E| + |V|), similar to the graph degeneracy computation [56].

Algorithm 1: Core Tree Generation
  Input: G(V,E)
  Output: Core tree T
1   set G as the root of T;
2   Set = {(G, 0)};
3   while Set ≠ ∅ do
4       get (G_i, k) from Set and remove it from Set;
5       recursively remove nodes from G_i whose degree is less than k + 1;
6       for each connected component G_j of the remaining graph of G_i do
7           set G_j as a child of G_i in T;
8           add (G_j, k + 1) into Set;
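A compact Python rendering of Algorithm 1 follows (a sketch using networkx; tree nodes are kept as plain (k, node set, parent index) triples rather than the concise record above, for brevity).

import networkx as nx

def build_core_tree(G):
    """Algorithm 1 sketch: peel nodes of degree < k+1 from each maximal
    k-Core to obtain its maximal (k+1)-Core children. Returns a list of
    (k, frozenset_of_nodes, parent_index) triples; entry 0 is the root G."""
    tree = [(0, frozenset(G.nodes), -1)]
    work = [(G.copy(), 0, 0)]            # (current subgraph, k, tree index)
    while work:
        H, k, idx = work.pop()
        # Recursively remove nodes whose degree is below k + 1.
        while True:
            low = [v for v in H.nodes if H.degree(v) < k + 1]
            if not low:
                break
            H.remove_nodes_from(low)
        for comp in nx.connected_components(H):
            # A child may equal its parent's node set (cf. the redundancy
            # remark above); the concise storage of Section 4.3.1 removes
            # this duplication in practice.
            tree.append((k + 1, frozenset(comp), idx))
            work.append((H.subgraph(comp).copy(), k + 1, len(tree) - 1))
    return tree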
4.3.2 Search for Pre-Solution

Henceforth, by a solution we mean a λ-quasi-clique containing the query nodes S. By a pre-solution we mean a maximal k-Core containing S whose cohesion is closest to that of a λ-quasi-clique. It may fall shy of the cohesion required of a solution, hence the term pre-solution. A graph G may or may not have a λ-quasi-clique containing S; if it does, one can be obtained efficiently from the pre-solution. Below, we give the formal definition of a pre-solution.

Definition 15 (Pre-Solution). Given a core tree T, λ and a query set S, the pre-solution for the QMQ Q^λ_S is a tree node T_i with a set of graph nodes V_i such that

    PreSolution(λ, S) = arg min_{T_i} { |C(T_i) − λ| : S ⊆ V_i }        (4.2)

Core trees offer two benefits for accelerating QMQ search: (1) given a query node set S and parameter λ, the pre-solution for the QMQ search can be generated efficiently by traversing the core tree; (2) given the pre-solution or the current solution for the QMQ, we can find a better neighboring solution efficiently by reducing the search space, exploiting the core tree. We use the terms graph node and tree node to distinguish between nodes of G and nodes of the core tree T. Recall that each tree node corresponds to a subgraph of G that is a maximal k-Core, for some k.

Main Idea. Given a graph G, λ, and S, let T be the corresponding core tree. The major steps in the search for a pre-solution, which we denote by T_Q, are:
1. For each graph node s ∈ S, find the tree node T_s that contains s and has the largest depth in T. Let T_S denote the lowest common ancestor in T of the set of tree nodes {T_s | s ∈ S};
2. If T_S is not a λ-quasi-clique, the pre-solution is simply T_S, i.e., T_Q = T_S;
3. If T_S is a λ-quasi-clique, we climb up the core tree T from T_S until we find an ancestor T_l that is a λ-quasi-clique whose parent T_u is not;
4. If |C(T_l) − λ| ≤ |C(T_u) − λ|, set the pre-solution T_Q = T_l; otherwise, set T_Q = T_u.

Notice that T_Q is sometimes a λ-quasi-clique (i.e., a solution) and sometimes not (a non-solution). Notice also that T_s is not necessarily a leaf node. For example, in Fig. 4.3, supposing s is a node in G1 but not in G2, T_s is G1. Going from the leaves to the root, {T_s | s ∈ S} is the set of tree nodes containing the first appearance of the query nodes; these are the maximal k-Cores with the highest k containing the query nodes. Since the root of the core tree T is the whole graph G, the lowest common ancestor T_S of the tree node set {T_s | s ∈ S} always exists for any query S. If T_S is not a λ-quasi-clique, it is still possible to find a subgraph of T_S that is a query-driven λ-quasi-clique for S. In this case, we let T_Q = T_u = T_S and jump to the optimization phase discussed in Section 4.4. Otherwise, T_S is a λ-quasi-clique, and we climb up the core tree from T_S to find an ancestor T_l that is a λ-quasi-clique whose parent T_u is not. According to Prop. 2, such a tree node pair T_l and T_u always exists for given T_S and λ. In the extreme case where T_l = Root and T_u = NULL, we return G directly as the final solution. Otherwise, the pre-solution T_Q is chosen from T_l and T_u, depending on which tree node has cohesion closer to λ. The pre-solution T_Q is the starting point of the iterative optimization discussed in Section 4.4.

Lower/Upper Bounds on the QMQ. By leveraging the core tree, we can confine the size of the QMQ Q̄^λ_S to a small range. Prop. 4 reveals the relationship between T_l, T_u and Q̄^λ_S.

Figure 4.4: (a) A small graph and (b) its corresponding core tree, showing the deepest tree nodes containing graph nodes v1, ..., v5; G5 and G6 in (b) are annotated by dotted circles in (a).

Proposition 4 Given tree nodes T_l and T_u on the core tree corresponding to subgraphs G_l and G_u, both containing the query S, if C(G_l) ≥ λ ≥ C(G_u), then the QMQ Q̄^λ_S(V̄_Q, Ē_Q) for S and λ satisfies

    |V_l| ≤ |V̄_Q| ≤ |V_u|        (4.3)

Proof: Since G_l contains S and C(G_l) ≥ λ, G_l is a query-driven λ-quasi-clique for S; so Q̄^λ_S(V̄_Q, Ē_Q), being a maximum λ-quasi-clique for the given S and λ, satisfies |V_l| ≤ |V̄_Q|. Notice that C(Q̄^λ_S) ≥ C(G_u), so supposing |V̄_Q| > |V_u|, we get that the minimum degree K(Q̄^λ_S) > K(G_u). This implies Q̄^λ_S is a K(Q̄^λ_S)-Core, and given that V̄_Q ∩ V_u ⊇ S ≠ ∅, Q̄^λ_S should be a subgraph of G_u. This contradicts the assumption that |V̄_Q| > |V_u|, so we have |V̄_Q| ≤ |V_u|. □

Prop. 4 indicates that the size of the QMQ is bounded by the sizes of G_l and G_u. Although in general the QMQ Q̄^λ_S shares many common graph nodes with G_l and G_u, Q̄^λ_S is not necessarily a supergraph of G_l or a subgraph of G_u. E.g., in Fig. 4.4(a), for S = {v2, v4, v5} and λ = 0.4, the 4-Core G5 containing S has cohesion 2/3, the QMQ (denoted by hollow nodes) has cohesion 3/7, and the former is not a subgraph of the latter. However, as a maximal k-Core containing S with cohesion very close to λ, the pre-solution picked from G_l and G_u serves as an ideal candidate for the iterative maximization towards Q̄^λ_S.

Example. The search for a pre-solution is illustrated in Fig. 4.4. Suppose λ = 0.2 and S = {v1, v2}. Then G5 is the lowest common ancestor T_S. We climb up the core tree to find G_l = G3 and G_u = G2 satisfying C(G_l) ≥ λ ≥ C(G_u). According to Prop. 4, the QMQ will be a subgraph of G whose size is "sandwiched" between those of G3 and G2. The pre-solution T_Q is set to G2 or G3, depending on which has cohesion closer to λ. As another example, suppose λ = 0.5 and S = {v1, v3}. We then get T_S = G2, which is not a λ-quasi-clique. In this case, we set the pre-solution T_Q = G2 and the iterative maximization process will check for the existence of the QMQ inside G2.
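The four steps above can be traced in the following sketch. It is our own illustration, assuming the (k, node set, parent) triples produced by the Algorithm 1 sketch and an inverted index inv mapping each graph node to the index of its deepest tree node, as maintained in Section 4.3.1; cohesion is computed exactly from the induced subgraph (assumed to have at least two nodes).

def find_pre_solution(tree, inv, G, S, lam):
    """Steps 1-4 of the pre-solution search (sketch). Returns the index of
    the pre-solution tree node."""
    def cohesion(idx):
        H = G.subgraph(tree[idx][1])
        return min(d for _, d in H.degree()) / (H.number_of_nodes() - 1)
    def ancestors(idx):                  # root -> ... -> idx
        path = []
        while idx != -1:
            path.append(idx)
            idx = tree[idx][2]
        return path[::-1]
    # Step 1: lowest common ancestor of the deepest tree nodes of S.
    paths = [ancestors(inv[s]) for s in S]
    T_S = 0                              # the root always contains S
    for level in zip(*paths):
        if len(set(level)) == 1:
            T_S = level[0]
        else:
            break
    # Step 2: if T_S is not a lambda-quasi-clique, it is the pre-solution.
    if cohesion(T_S) < lam:
        return T_S
    # Step 3: climb while the parent still has cohesion >= lambda.
    T_l = T_S
    while tree[T_l][2] != -1 and cohesion(tree[T_l][2]) >= lam:
        T_l = tree[T_l][2]
    T_u = tree[T_l][2]
    if T_u == -1:
        return T_l                       # extreme case: G itself qualifies
    # Step 4: pick the node whose cohesion is closest to lambda.
    return T_l if abs(cohesion(T_l) - lam) <= abs(cohesion(T_u) - lam) else T_u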
4.4 Query-Driven Quasi-Clique Maximization

In this section, we discuss our iterative maximization techniques for searching for a QMQ based on a given pre-solution.

4.4.1 Objective and Operations

Objective. Let G_i(V_i, E_i) be any connected subgraph of G with S ⊆ V_i. G_i may or may not be a λ-quasi-clique. For convenience, if G_i is a λ-quasi-clique, we call G_i a solution; otherwise, we call it a non-solution. Recall that a pre-solution may be a solution or a non-solution. Our objective, given such a graph G_i, is to find a solution and maximize its node size. In order to facilitate our search, we propose a new objective function below. Recall that D(G_i) denotes the density of G_i (see Def. 12).

    F(G_i) = |V_i| + D(G_i)   if G_i is a λ-quasi-clique
           = C(G_i)           otherwise                        (4.4)

F(G_i) is designed carefully and has the following benefits:
• When G_i is a λ-quasi-clique, since 0 ≤ D(G_i) ≤ 1, the term |V_i| dominates the objective function F(G_i) and F(G_i) > 1. Thus, maximizing F(G_i) prefers a larger solution. On the other hand, the density part D(G_i) prefers, among subgraphs of the same cardinality, the one with higher density; clearly, a subgraph with higher density has a higher potential to attract more nodes. In this case, F(G_i) is optimized to make the size of G_i as large as possible.
• When G_i is not a λ-quasi-clique, the cohesion C(G_i) steers the optimization process by rewarding subgraphs with higher cohesion. In this case, F(G_i) < λ ≤ 1 and F(G_i) is designed to transform G_i from a non-solution into a solution as soon as possible.
• As another benefit, the maximization of F(G_i) naturally prevents the cycling of quasi-cliques, i.e., a previous quasi-clique appearing again after a series of add and remove operations (defined in the next paragraph). To appreciate this point, note that in related work, Reactive Local Search (RLS, [9]) relies on an extra parameter T to prevent cycling: every time a node is added to or removed from the current solution, it cannot be considered for removal or addition for the next T iterations, where T needs to be tuned for different graphs, a tedious process.

Operations. The QMQ search can be viewed as a process of maximizing the objective function F(G_i), where G_i is initialized to the pre-solution T_Q, which may or may not be a λ-quasi-clique. We introduce three operations, Add, Remove and Swap, defined in Table 4.2 and explained below.

Operation                        Explanation
Add: G'_i = G_i + {v}            add a new node v into a λ-quasi-clique G_i such that F(G'_i) > F(G_i). Goal: increase the cardinality |V_i|
Remove: G'_i = G_i − {v}         remove an existing node v (v ∉ S) from G_i such that F(G'_i) > F(G_i). Goal: increase the cohesion C(G_i)
Swap: G'_i = G_i + {v1} − {v2}   add a new node v1 into a λ-quasi-clique G_i and remove another node v2 (v2 ∉ S) such that F(G'_i) > F(G_i). Goal: increase the density D(G_i)

Table 4.2: Three maximization operations.

• Add: G'_i = G_i + {v}. The Add operation is applied when both G_i and G'_i are λ-quasi-cliques.
• Remove: G'_i = G_i − {v}. If G_i is not a λ-quasi-clique, the Remove operation is applied to improve the cohesion. To make F(G'_i) > F(G_i), we need to ensure K(G'_i) ≥ K(G_i), i.e., we cannot remove a node v whose removal decreases the coreness.
• Swap: G'_i = G_i + {v1} − {v2}. Swap is applied when adding a node v1 makes G_i lose the λ-quasi-clique property, but removing another node v2 at the same time restores it. Instead of viewing Swap as a combination of Add and Remove, we view it as an atomic operation, since that allows us to move from one λ-quasi-clique to another. The motivation for swapping nodes is that the node swapped in may increase the density. Note that the edges associated with a node are added or removed automatically with the node.
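Eq. (4.4) drives all three operations and is cheap to evaluate directly. A minimal sketch (ours, assuming networkx and that `nodes` induces a connected subgraph, as in the definition of G_i):

import networkx as nx

def objective(G, nodes, lam):
    """Eq. (4.4): |V_i| + D(G_i) for a lambda-quasi-clique, else C(G_i).
    A score > 1 therefore always identifies a solution."""
    H = G.subgraph(nodes)
    n, m = H.number_of_nodes(), H.number_of_edges()
    K = min(d for _, d in H.degree())
    cohesion = K / (n - 1)
    if cohesion >= lam:                  # H is a lambda-quasi-clique
        density = 2 * m / (n * (n - 1))
        return n + density
    return cohesion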
If the pre-solution T_Q = G_l, which is a λ-quasi-clique, then to ensure F(G'_i) > F(G_i), Add and Swap will be used in the iterative maximization. If the pre-solution T_Q = G_u, then since C(G_u) < λ, on each iteration before C(G_i) ≥ λ is satisfied, G_i is a subgraph of G_u with coreness K(G_i) ≥ K(G_u). Prop. 5 covers this case and shows that Add cannot improve the F(G_i) score, and similarly for Swap. Thus, when the optimization process starts from G_u, we can only use Remove to improve the F(G_i) score, until G_i becomes a λ-quasi-clique.

Proposition 5 Suppose the iterative optimization starts from T_Q = G_u and the current state G_i is a non-solution, obtained by zero or more Remove operations applied to G_u. Then adding a node v to G_i cannot make F(G_i + {v}) > F(G_i).

Proof: Recall that a Remove operation is applied to a current state G_i only when the resulting F score does not decrease. We prove the result by induction on the number of Remove operations applied to T_Q = G_u.
Base case: zero Remove operations are applied to G_u. Then, since G_i = G_u is a maximal k-Core, adding a node into G_u will decrease the coreness and make F(G_i + {v}) < F(G_i).
Induction: suppose G_i is a non-solution obtained from G_u after m Remove operations. Notice that if any of the predecessors of G_i obtained during the Remove sequence were G_l, we could not have applied Remove to it while keeping the F score non-decreasing. From this, it follows that G_i cannot be a subgraph of G_l. By the rule for applying the Remove operation, K(G_i) ≥ K(G_u). Let G'_i = G_i + {v}. By Eq. (4.4), since G_i is not a λ-quasi-clique, we have F(G_i) = C(G_i) = K(G_i)/(|V_i| − 1). To make F(G'_i) = K(G'_i)/|V_i| > K(G_i)/(|V_i| − 1), we have to make K(G'_i) > K(G_i). Since K(G_l) = K(G_u) + 1, we have K(G'_i) ≥ K(G_l). Noticing that G'_i and G_l share the node set S, we conclude by Proposition 1 that G'_i should be a subgraph of G_l. This conflicts with our inference that G_i is not a subgraph of G_l, so F(G_i + {v}) > F(G_i) does not hold for any v. □

4.4.2 Efficient Solution Search and Ranking

Given a connected subgraph G_i, which may be a solution or a non-solution containing S, we now discuss efficient techniques for identifying neighboring subgraphs of G_i with better F scores, using the three search operations. According to Eq. (4.4), if G_i is not a λ-quasi-clique, improving its F score amounts to improving its cohesion. If it is, then improving its F score amounts to increasing its size and, if the size cannot be increased, improving the density. We let E(V_1, V_2) denote the set of edges connecting the node sets V_1 and V_2. Below, we discuss the solution search using Add, Swap and Remove, respectively.
Search by Add. Add is designed to increase the size of an existing quasi-clique. The solution set for Add is

    Add(G_i) = {v | F(G'_i) > F(G_i) > 1, G'_i = G_i + {v}}        (4.5)

That is, Add(G_i) is the set of single nodes that can be added to G_i to improve its F score. According to Eq. (4.4) and Def. 12, since |V'_i| = |V_i| + 1 is fixed, the only quantity that discriminates between candidate nodes v ∈ Add(G_i) in terms of F score is |E'_i|. Thus, we can start from G_i and rank neighboring nodes v ∉ V_i by the edge count |E(G_i, {v})|, provided G'_i is still a λ-quasi-clique. The node v with the highest |E(G_i, {v})| satisfying C(G_i + {v}) ≥ λ produces the best solution G'_i, with the highest F score. Notice that we do not need to check every neighboring node of G_i to find the best solution: since G_i is a subgraph of some tree node T_I with K(T_I) ≤ K(G_i), if there is a graph node v in T_I but not in G_i with |E(G_i, {v})| ≥ K(T_I), then for any graph node v′ not in T_I we have |E(G_i, {v′})| < |E(G_i, {v})|. Otherwise, if |E(G_i, {v′})| ≥ K(T_I), then |E(T_I, {v′})| ≥ |E(G_i, {v′})| ≥ K(T_I), making T_I + {v′} a K(T_I)-Core and violating the maximality of T_I. For example, in Fig. 4.4(b), supposing G_i is a subgraph of the tree node G3, if there is a graph node v in G3 but not in G_i having |E(G_i, {v})| ≥ 3, then we cannot find a node v′ outside G3 with |E(G_i, {v′})| ≥ |E(G_i, {v})|. Thus, starting from the deepest tree node that contains G_i, we can climb up the core tree level by level and check whether the current tree node T_I contains a graph node v not in G_i with |E(G_i, {v})| ≥ K(T_I). Once such a node is found, we stop the solution search, and the current best solution is guaranteed to be the best among all solutions in G(V,E). This approach can be extended to find the exact top-n solutions (shown in Alg. 3). Since we do not need to check every neighbor of G_i, the search space of Add is distinctly smaller than that of existing approaches [1, 15], with guaranteed quality.

Search by Swap. Let G'_i = G_i + {v1} − {v2} and G''_i = G_i + {v1}. Swap is designed to improve the density when G''_i is not a λ-quasi-clique but both G_i and G'_i are. The solution set of Swap is

    Swap(G_i) = {(v1, v2) | F(G'_i) > F(G_i) > 1 ≥ λ > F(G''_i),
                 G'_i = G_i + {v1} − {v2}, G''_i = G_i + {v1}}        (4.6)

Swap implies deg(v2, G_i) ≥ λ(|V_i| − 1), deg(v2, G''_i) < λ|V_i| and deg(v1, G'_i) > deg(v2, G_i). In practice, if Add(G_i) = ∅ and there is a node v2 (v2 ∉ S) in G_i satisfying deg(v2, G''_i) < λ|V_i| and deg(v1, G'_i) > deg(v2, G_i), a swap operation may happen. We rank the node pairs (v1, v2) by deg(v1, G'_i) − deg(v2, G_i) for Swap solutions.

Search by Remove. Remove is designed to improve the cohesion of a non-solution G_i. We define the neighboring (solution or non-solution) set of Remove as

    Remove(G_i) = {v | F(G'_i) > F(G_i), F(G_i) < λ ≤ 1, G'_i = G_i − {v}}        (4.7)

Clearly, v is a node in G_i. We can rank candidates v in G_i by the coreness change ∆K = K(G'_i) − K(G_i) resulting from removing v from G_i. If ∆K < 0, then v ∉ Remove(G_i), since the F score decreases. Thus, we cannot simply remove the node with the minimum degree inside G_i: suppose v1 and v2 both have the minimum degree in G_i and they are connected; then removing v1 makes the coreness of the remaining graph lower than before, which results in a lower F score. If multiple candidate nodes have the same coreness change ∆K, we rank them further by their degree in G_i. Removing the node with the highest ∆K and lowest degree in G_i maximizes the gain in F and density, and a higher density has a higher potential to increase the cohesion.
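Of the three searches, Add benefits most from the core tree. The level-by-level climb can be sketched as follows (our illustration only; start_idx, the deepest core-tree node containing G_i, is assumed to be supplied by the caller, and K(T_I) is recomputed exactly rather than cached):

def best_add_candidate(G, tree, Gi_nodes, lam, start_idx):
    """Search-by-Add sketch. Candidates v are ranked by |E(G_i, {v})|;
    once the best feasible candidate inside the current tree node T_I
    reaches K(T_I), no node outside T_I can beat it (the argument given
    above), so the climb stops."""
    def edges_to(v):
        return sum(1 for u in G.neighbors(v) if u in Gi_nodes)
    def keeps_quasi_clique(v):
        # Min-degree >= lam * (|V_i'| - 1) > 0 also guarantees connectivity.
        new = Gi_nodes | {v}
        H = G.subgraph(new)
        return min(d for _, d in H.degree()) >= lam * (len(new) - 1)
    idx, best, best_score = start_idx, None, -1
    while idx != -1:
        k, nodes, parent = tree[idx]
        K_TI = min(d for _, d in G.subgraph(nodes).degree())   # K(T_I)
        for v in nodes - Gi_nodes:       # rescanning deeper nodes is wasteful
            s = edges_to(v)              # but harmless in this sketch
            if s > best_score and keeps_quasi_clique(v):
                best, best_score = v, s
        if best_score >= K_TI:
            break                        # bound reached: stop climbing
        idx = parent
    return best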
4.4.3 Iterative Maximization Algorithms

Since the maximum clique problem is NP-Hard and inapproximable within a factor of n^{1−ε} ([6, 23, 31]), designing heuristic algorithms for QMQ search with guaranteed quality is unrealistic. In this section, we propose two iterative maximization algorithms, DIM and SUM, to approach the QMQ. DIM (Deterministic Iterative Maximization) is a deterministic approach that greedily moves to the best neighboring solution or non-solution on each iteration. In contrast, SUM (Stochastically Updated Maximization) is a stochastic approach that moves, with some probability, to one of several good neighboring solutions or non-solutions, and thus has the potential to find better results than DIM. Both DIM and SUM are based on local search over the core tree. In practice, we find that both of them find solutions of high quality whenever a QMQ containing the query nodes exists.

If the current state G_i is a non-solution, then according to Prop. 5, both DIM and SUM will try to remove nodes from G_i to make it become a solution (i.e., a λ-quasi-clique containing S). Once G_i is a solution, DIM and SUM will try to apply Add or Swap to grow it. Since F(G_i) provides a uniform way to evaluate G_i whether or not it is a solution, for ease of presentation we use the term "answer" as a generic term that may refer to a solution or a non-solution.

Local Search. Readers are referred to [33, 52] for complete details of local search. Here, we briefly revisit the major steps:
1. Given the current answer, find feasible candidate answers in the neighborhood by one of the operations;
2. Evaluate the candidates using the objective function F;
3. If the marginal gain is positive, move to the best candidate and jump to step 1. Otherwise, return the current solution, or NULL if the current answer is a non-solution.

Deterministic Iterative Maximization. DIM follows the greedy strategy of always moving to the best available neighboring answer. The major steps of the DIM approach are shown in Alg. 2. The pre-solution T_Q is the starting point of the maximization. If G_i is a non-solution (C(G_i) < λ), we repeat Remove operations until G_i becomes a λ-quasi-clique or it is determined that there is no qualified λ-quasi-clique containing S in G (lines 3–6). If G_i is already a solution, we repeat Add and Swap operations iteratively until F(G_i) cannot be improved any more (lines 7–20). In particular, we start with Add operations, exploring neighbors on the core tree level by level, and if qualified nodes are found, we pick the best node to add (lines 8–15); otherwise, we try to Swap nodes (lines 16–20). We repeat Add and Swap until the F score cannot be improved. An example of the scheduling of operations is shown in Fig. 4.5(a), in which Remove operations are performed first if T_Q is not a λ-quasi-clique, followed by Add or Swap operations. Supposing the total number of iterations of DIM is T and, on average, n answers are explored per iteration, the time complexity of DIM is O(Tn).

Algorithm 2: DIM
  Input: Pre-solution T_Q, S, λ
  Output: the QMQ Q̄^λ_S
1   if T_Q is the root node G and C(T_Q) ≥ λ then return G;
2   G_i = T_Q;
3   while G_i is not a λ-quasi-clique do
4       rank each node v in G_i by ∆ = K(G_i − {v}) + D(G_i − {v}) − K(G_i) − D(G_i);
5       remove the node not in S with the highest positive ∆;
6       if G_i is not connected or is unchanged then return NULL;
7   while true do
8       AddSet = ∅ and T_I = the deepest tree node containing G_i;
9       while max |E(G_i, v)| < K(T_I) for v in T_I but not in G_i and AddSet do
10          add v into AddSet if C(G_i + {v}) ≥ λ;
11          if T_I is not the root then
12              T_I = the parent of T_I;
13          else break WHILE loop;
14      if AddSet ≠ ∅ then
15          add v = arg max{|E(G_i, v)| : v ∈ AddSet} into G_i;
16      else
17          v1 = arg max{|E(G_i, v)| : v ∈ V − V_i};
18          G''_i = G_i + {v1}, G'_i = G_i + {v1} − {v2};
19          if there exists v2 ∈ V_i − S such that deg(v2, G''_i) < λ|V_i| and deg(v1, G'_i) > deg(v2, G_i) then G_i = G'_i;
20          else break WHILE loop and return G_i;

DIM always picks the best available neighboring answer on each iteration. While this greedy heuristic is effective and is followed by many local search methods, it may converge to a local maximum. In the following, we propose SUM (Stochastically Updated Maximization), a stochastic approach that moves with some probability to one of the good neighboring answers. SUM has the potential to break the local-maximum limitation and find better results than DIM.
Stochastically Updated Maximization. Given the current answer G_i, we define the marginal gain ∆F = F(G'_i) − F(G_i) for a neighboring answer G'_i. As discussed in Section 4.4.2, for each operation the neighboring answers are ranked by ∆F in descending order, and we use the function F(x) to denote the marginal gain of the answer ranked at the x-th place; e.g., F(1) denotes the maximum marginal gain.

In related work, GRASP [1] is a well-known randomized local search method based on the threshold δ = min F(x) + c(max F(x) − min F(x)), where c ∈ [0, 1]. The new answer is picked uniformly at random from all answers with F(x) ≥ δ. If c = 1, GRASP always moves to the best answer, like DIM; if c = 0, GRASP performs a random selection over all neighboring answers. However, GRASP has two drawbacks: (1) the parameter c is difficult to tune; (2) GRASP totally ignores the actual distribution of marginal gains as x varies. In our proposed Stochastically Updated Maximization (SUM) algorithm, we overcome these limitations by picking the new answer based on the "inflection point" (x*, F(x*)) of F(x). The inflection point is the point where the change in F(x) is maximized. That is,

    x* = arg max_x (F(x) − F(x + 1))        (4.8)

Figure 4.5: (a) An example of QMQ maximization, scheduling Remove, Add and Swap operations. (b) Illustrating the inflection point of F(x), where solutions with rank x ≤ x* are preferred.

The inflection point splits F(x) into a high-marginal-gain part and a low-marginal-gain part. Fig. 4.5(b) illustrates the inflection point of F(x). SUM draws an answer from all high-marginal-gain answers with rank x ≤ x*, using the probability

    P(x) = F(x) / Σ_{F(x′) ≥ F(x*)} F(x′)        (4.9)

P(x) can be understood as the ratio between answer x's marginal gain and the sum of all marginal gains at least F(x*). It is possible that an answer other than the best is picked as the new answer, though its probability of being picked is lower than that of the best answer. On the other hand, since non-greedy answers can be selected with non-zero probability, SUM has the potential to avoid being trapped in a local maximum. Clearly, the output of DIM is just one possible output of SUM, and SUM may produce much better results than DIM.

The major steps of SUM are shown in Alg. 3. As can be seen, SUM and DIM have a similar scheduling strategy for the maximization operations; however, SUM probabilistically picks the new answer from a set of good answers. The preparation of the solution set SolGain can be found in lines 4–10 for Remove, lines 12–19 for Add and lines 21–28 for Swap. Supposing the total number of iterations of SUM is T and, on each iteration, the marginal gain is evaluated on n answers, the time complexity of each simulation of SUM is O(Tn²).
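The inflection-point rule of Eqs. (4.8) and (4.9) is simple to implement. A sketch (ours; it assumes the candidate answers have already been generated and that only positive marginal gains are passed in):

import random

def stochastic_pick(gains):
    """SUM's selection rule: locate the inflection point x* (Eq. 4.8) over
    the descending-sorted gains, then sample one answer of rank <= x* with
    probability proportional to its gain (Eq. 4.9). Returns the chosen
    answer's index in `gains`."""
    order = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)
    ranked = [gains[i] for i in order]          # F(1) >= F(2) >= ...
    if len(ranked) == 1:
        return order[0]
    drops = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    x_star = max(range(len(drops)), key=drops.__getitem__)
    head = order[:x_star + 1]                   # answers with rank <= x*
    weights = ranked[:x_star + 1]
    return random.choices(head, weights=weights, k=1)[0]

Note that the greedy choice (the top-ranked answer) always carries the largest sampling weight, which is what makes DIM's output one possible output of SUM.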
4.5 Experimental Study

All experiments are conducted on a computer with a 2.90 GHz Intel CPU and 8 GB RAM. All algorithms are implemented in Java. We conduct experiments on three real data sets, described below.

LiveJournal Social Network. LiveJournal is an online blogging website with a friendship network. We model each user as a node and friendship ties as edges in the LiveJournal social network. Available at SNAP (http://snap.stanford.edu/data/com-LiveJournal.html), this dataset has 3,997,962 nodes and 34,681,189 edges in total. A social circle in the LiveJournal social network is best modeled by a quasi-clique, since every user in such a circle declares friendship with the majority of its members. The query-driven maximum quasi-clique corresponds to the largest social circle containing the query users, which supports applications like social community search.

DBLP Co-Authorship Network. To simulate social collaborations, we use the DBLP (http://dblp.uni-trier.de/xml/) co-authorship network. Each author is modeled as a node, and two authors are linked if they have collaborated on at least one paper. In total, the DBLP co-authorship network has 1,206,881 nodes and 8,894,745 edges. A quasi-clique in the co-authorship network is a group of authors working closely with each other. Given a set of query researchers, query-driven maximum quasi-clique search seeks the largest collaboration network among them.

Youtube Sharing Network. This dataset, available at SNAP (http://snap.stanford.edu/data/com-Youtube.html), is based on Youtube's social network, where two people are connected if they have similar interests in videos and become friends. There are 1,134,890 nodes and 2,987,624 edges in total. Given a set of users, the query-driven maximum quasi-clique is the largest video sharing group containing these query users.

4.5.1 Core Tree Construction

Core tree construction is an offline step. The running time of core tree construction on the three datasets is shown in Table 4.3, where we also report the highest k (i.e., the degeneracy [47]) for each dataset. Although the LiveJournal and DBLP datasets have 4M and 1.2M nodes respectively, the DBLP dataset takes more time than the LiveJournal dataset, which indicates that apart from the graph size, the graph structure is a significant factor in the performance of core tree construction.

Figure 4.6: The number of maximal cores first rapidly increases, and then slowly decreases, with the coreness k on the three data sets.

Specifically, in Fig. 4.6 we show the number of maximal k-Cores for each k on the three data sets. As we can see, from the root to the leaves, the number of maximal cores first rapidly increases and then slowly decreases with the coreness k for all datasets. In particular, there is a huge number of 2-Cores and 3-Cores in the DBLP co-authorship network, because many graduate students publish papers only with their supervisors or a few collaborators. This is a dominant factor in the core tree construction time, as it means the core tree for DBLP has many more nodes than those of the other datasets.
4.5.2 Performance Evaluation

Baselines. To the best of our knowledge, there is no previous study on query-driven maximum quasi-clique search. Therefore, we adapt previous approaches for maximum quasi-clique detection (without a query) as baselines. In our evaluation, we designed two categories of baselines based on related work: operation baselines and optimization baselines. As operation baselines we use Add-MC and Remove-MC, which are discussed in [15] and extended from the traditional maximum clique problem. These baselines are compared with the Add, Remove and Swap operations discussed in Section 4.4.2.
• Add-MC: Call a node u in a λ-quasi-clique G_i a critical node if λ(|V_i| − 1) ≤ deg(u, G_i) < λ|V_i|. Add the node v with the maximum deg(v, G_i + {v}) to the quasi-clique G_i such that: (i) deg(v, G_i + {v}) ≥ λ|V_i| and (ii) v is adjacent to every critical node of G_i.
• Remove-MC: Remove the node v from G_i whose removal results in the largest set of Add-MC nodes.

As optimization baselines we use RLS (Reactive Local Search) [15], GRASP [1], and DFS-Tree [50, 74], which are compared with DIM and SUM.
• RLS: At each step, choose the best Add-MC node. If no node can be added, choose the best Remove-MC node. Every time a node is added or removed, it cannot be removed or added again for the next move.
• GRASP: A randomized local search method based on thresholds. We set c = 0.5 and adapt GRASP to the core tree by picking a new solution uniformly at random from all solutions with objective scores higher than the average.
• DFS-Tree: Build depth-first search trees of quasi-cliques by setting each query node as a root, and return the largest tree node among these trees that contains all query nodes, adapted from [50, 74].

Figure 4.7: (a) Running time of 1000 pre-solution queries as the query set size increases. (b) Portion of pre-solutions chosen from G_l, out of all pre-solutions, for different query set sizes; λ = 0.9.

Pre-Solution on Core Tree. Finding a pre-solution is one of the key steps of our method. In this section, we evaluate the efficiency of our algorithm for finding a pre-solution. We generate queries with q nodes by randomly choosing q nodes from a subgraph of radius 2, itself randomly chosen from the original graph. The core tree built in the offline step provides an extremely efficient way to find the pre-solution for a given query. In Fig. 4.7(a), we show the running time of 1000 pre-solution queries for different query sizes. With the help of the core tree, the pre-solution search for λ and S reduces to a few tree traversal operations and is extremely efficient: 1000 pre-solution searches complete within a few milliseconds, on data sets with millions of nodes. Except for two special cases – the whole graph is a λ-quasi-clique, or no tree node is a λ-quasi-clique containing S – a pre-solution is chosen from either G_l or G_u, depending on the proximity of their cohesion to λ. As shown in Fig. 4.7(b), the majority of pre-solutions are chosen from G_l, for which Add and/or Swap are applicable.
Maximization. The efficiency of an iterative maximization process is determined by two key aspects: the cost of a single iteration and the convergence rate. We evaluate the performance of iterative maximization from these two aspects.

Fig. 4.8(a) shows the running time for searching neighboring solutions using Add and Add-MC with different sizes (i.e., numbers of nodes) of the current solution G_i on LiveJournal. Owing to the given graph structure, different G_i may have very different numbers of neighbors, resulting in a skewed distribution of running times. On average, the Add operation is 2.6 times faster than Add-MC: each Add takes 24 ms, while each Add-MC takes 62 ms. This is explained by the observation that an exploration using Add has a remarkably smaller search space than Add-MC, since Add searches for solutions by exploring neighboring nodes level by level and stops immediately once the top n solutions are found on the core tree, with n set to a small number, e.g., 10. Add-MC, in contrast, needs to check every neighboring node to find the top solutions. Notice that for SUM, Add finds the top-n solutions, whereas for DIM only the top-1 solution is needed. Fig. 4.8(b) compares the ratio of the search space of Add to that of Add-MC. As the query size increases, this ratio decreases, indicating that Add needs to search only a fraction of Add-MC's search space as the query grows. The added value of Add over Add-MC in the quality of results is discussed in detail below (see Table 4.4).

Next, given a pre-solution as the starting point of optimization, we measure the number of iterations taken by each optimization method before convergence. For every query size, we repeat the experiment 100 times and show in Fig. 4.8(c) the average number of iterations for each of the four maximization methods: the numbers show a common decreasing trend as the query size increases. This makes sense because a larger query size is more likely to result in a larger solution, which sets a higher bar for adding new nodes. Fig. 4.8(c) also shows that SUM takes the highest number of iterations, but we will see below that SUM also succeeds in finding larger quasi-cliques more often than all other methods.

Total Running Time. Unlike RLS and GRASP, the DFS-Tree baseline is not an iterative optimization method. Instead, "DFS-Tree" represents a category of methods [50, 74] that find maximal quasi-cliques by exploring the space of quasi-cliques in a depth-first search tree, in which each tree node corresponds to a quasi-clique. Since the DFS-Tree method is not originally designed for the QMQ search problem, we adapt it as a baseline by applying pruning techniques and returning the largest node of the DFS tree (corresponding to a quasi-clique) containing all query nodes. In Fig. 4.8(d), we show the average running time of these methods on LiveJournal as the query size |S| increases. RLS, GRASP, DIM, and SUM have comparable running times, with RLS slower than the other three; this is explained by the fact that RLS uses Add-MC and Remove-MC in place of the faster Add and Remove. DFS-Tree quickly explodes as the query size increases; in fact, its running time trend is opposite to that of DIM and SUM. This is because the DFS-Tree method needs to build a DFS tree for every query node in S, and the cost of searching for the largest tree node containing all query nodes increases with |S|, making DFS-Tree based methods consume more memory and perform worse than our core tree based methods, DIM and SUM. The running time of GRASP is comparable to that of SUM.
Quality Comparison. As discussed in Section 4.2, there is unlikely to be any heuristic algorithm with a guaranteed approximation to the optimal QMQ. To measure the quality of the proposed approaches empirically, we designed the following experiment: given the same pre-solution, we run RLS, GRASP, DFS-Tree, DIM and SUM independently to get the QMQ w.r.t. S and λ. We rank these QMQs by size, and each method producing the largest QMQ earns a hit score of one point. Randomly selecting queries of size 3, we repeat the experiment 100 times and list the final score for each method in Table 4.4. Notice that multiple methods may generate the same final QMQ and thus may all earn a hit score in a given round, so the sum of the scores is higher than 100. The score of GRASP is the lowest, because it uniformly selects a neighboring solution from a set of above-average solutions, and in many cases GRASP ends up with a final solution that is not the best. In theory, DFS-Tree should yield a high overall score, but since we have to prune the search path to control its memory consumption [50, 74], it achieves an overall score only slightly higher than GRASP. Both RLS and DIM greedily move to the new solution with the highest marginal gain, but since DIM uses Swap to improve the density while RLS provides no Swap, DIM achieves a higher quality score than RLS. Finally, SUM achieves the highest overall score, because not only is SUM capable of using Swap to improve the density, it also strikes a balance between greed and opportunity: while it is more likely to move to a new solution with a higher marginal gain, it retains the potential to jump out of a local maximum and thus reach a better final solution, thanks to the non-zero probability of choosing a neighboring solution that is not (locally) the best.

4.6 Discussion and Conclusion

In this chapter, we discussed the cohesion-persistent story search problem. Given a query consisting of a set of nodes S in a graph G(V,E) and a parameter λ ∈ (0, 1], we focus on finding the largest λ-quasi-clique containing S, which has many applications in real-world networks. This problem is NP-Hard and hard to approximate, calling for clever heuristic solutions. To quickly find a dense subgraph containing S with cohesion close to λ, as the pre-solution of the final solution, we propose the notion of a core tree, built by recursively organizing the maximal cores of G. We make use of three optimization operations: Add, Remove and Swap. We then propose two iterative maximization algorithms, DIM and SUM, which approach the query-driven maximum quasi-clique by deterministic and stochastic means, respectively. With extensive experiments on three real datasets, we demonstrate that our algorithms significantly outperform several natural baselines based on the state of the art, in running time and/or the quality of the solution found. This work raises a number of open questions. It is interesting to ask whether the operations can be applied in bulk mode, at the level of subgraphs, instead of the node-by-node mode that we have taken. It is also important to investigate whether our techniques can be extended and generalized to search for other kinds of dense subgraphs, such as k-Plexes or optimal quasi-cliques [73].
Algorithm 3: SUM
  Input: Pre-solution T_Q, S, λ, n
  Output: the QMQ Q̄^λ_S
1   if T_Q is the root node G and C(T_Q) ≥ λ then return G;
2   G_i = T_Q;
3   while G_i is not a λ-quasi-clique do
4       set SolGain = ∅;
5       for each node v in G_i but not in S do
6           Solution = G_i − {v}, Gain = F(Solution) − F(G_i);
7           add (Solution, Gain) into SolGain;
8       if SolGain ≠ ∅ then
9           G_i = stochastically pick an answer from SolGain with gain higher than F(x*);
10      else return NULL;
11  while true do
12      AddSet = ∅ and T_I = the deepest tree node containing G_i;
13      while the top n-th |E(G_i, v)| ≤ K(T_I) for v in T_I but not in G_i and AddSet do
14          add v into AddSet if C(G_i + {v}) ≥ λ;
15          if T_I is not the root then
16              T_I = the parent of T_I;
17          else break WHILE loop;
18      if AddSet ≠ ∅ then
19          stochastically pick v from the nodes in AddSet with gain higher than F(x*) and let G_i := G_i + {v};
20      else
21          set SolGain = ∅;
22          for each node v1 ∈ V − V_i with |E(G_i, v1)| > deg(v2, G_i) + 1 do
23              G''_i = G_i + {v1}, G'_i = G_i + {v1} − {v2};
24              if there exists v2 ∈ V_i − S such that deg(v2, G''_i) < λ|V_i| and deg(v1, G'_i) > deg(v2, G_i) then
25                  add (G'_i, F(G'_i) − F(G_i)) into SolGain;
26          if SolGain ≠ ∅ then
27              G_i = stochastically pick an answer from SolGain with gain higher than F(x*);
28          else break WHILE loop and return G_i;

Data set            LiveJournal   DBLP   Youtube
Time (seconds)      346           486    29
Highest coreness    361           119    52

Table 4.3: Running time of core tree construction.

Figure 4.8: (a) Running time of Add and Add-MC for different solution sizes, with query node size |S| = 3; (b) ratio between the search spaces of Add and Add-MC, given the current solution, as the query size increases; (c) average number of iterations for the various methods on the LiveJournal data set; (d) average running time of the different methods on LiveJournal as the query size increases; λ = 0.9 by default.

Method   SUM   DIM   DFS-Tree   GRASP   RLS
Score    77    41    11         9       17

Table 4.4: Aggregated quality scores of the different methods over 100 tests.

Chapter 5

Context-Aware Story-Telling

Stories identified from the post network may be highly related; e.g., the two stories "the launch of Blackberry 10" and "BlackBerry Super Bowl ad" are highly related. The detection of story relatedness, also called story context search, is a fundamental task in social stream mining. Story context search aims at efficiently finding the neighboring stories of a given story in the post network. In this chapter, we exploit and integrate signals from different perspectives (e.g., content or time) to compute the relatedness between stories. Building on that, we propose effective and efficient story context search algorithms to construct a network of stories in social streams.

5.1 Introduction

Many previous studies [48, 55, 66, 77] focus on detecting newly emerging stories from social streams. They serve the need of answering "what's happening now?". In reality, however, stories usually do not happen in isolation, and existing studies fail to track the relatedness between them. For example, "Crimea votes to join Russia" (on March 6, 2014) and "President Yanukovych signs compromise" (on February 21, 2014) are two separate stories, but they are actually highly related under the same event, "Ukraine Crisis". In this chapter, our goal is to design a context-aware story-teller for streaming social content, which not merely detects trending stories in a given time window of observation, but also builds the "context" of each story by measuring its relatedness to other stories on the fly.
As a result, our story-teller has the advantage of being able to respond to advanced user queries like "tell me related stories", which is crucial for digesting social streams.

Building a context-aware story-teller over streaming social content is a highly challenging problem, as explained below:
• Effective organization of social content. It is well known that posts such as tweets are usually short and written informally with many grammatical errors; even worse, a correctly written post may have no significance and be just noise.
• Identification of transient stories in a time window. Story detection should be robust to noise and efficient enough to support the single-pass tracking essential in a streaming environment.
• Story context search on the fly. Story relatedness computation should be interpretable and efficient enough to support online queries.

To the best of our knowledge, no training data set for context-aware story-telling on social streams is publicly available, which renders the existing works [59, 68] on Story Link Detection (SLD) inapplicable, because SLD is trained on well-written news articles. Furthermore, we cannot apply topic tracking techniques [27, 32] to story context search, because topic tracking is usually formulated as a classification problem [4], with the assumption that topics are predefined before tracking, which is unrealistic for story context search on social streams.

To address the above challenges, we propose CAST, a Context-Aware Story-Teller specifically designed for social streams. CAST takes a noisy social stream as input and outputs a "story vein", which is a human-digestible and evolving summarization graph linking highly related stories. The major workflow of CAST is illustrated in Figure 5.1. First of all, we treat each post as a node and add an edge between two posts if they are similar enough. For example, "Australian authorities update search for MH370" and "new MH370 search area is closer to Australia" are two similar posts, so we add an edge between them. In this way, a post network is constructed for the social posts in the same observation time window. Second, we propose the new notion of a (k, d)-Core to define a transient story, which is a cohesive subgraph of the post network. In a (k, d)-Core, every node has at least k neighbors and the two end nodes of every edge have at least d common neighbors. We propose two algorithms, Zigzag and NodeFirst, for the efficient discovery of maximal (k, d)-Cores from the post network. After that, we define the iceberg query, which finds the highly related stories of a given story. Two approaches, deterministic context search (DCS) and randomized context search (RCS), are proposed to implement the iceberg query with high efficiency. The story vein is constructed based on the iceberg query and serves as the backend of the context-aware story-teller on social streams, which discovers new stories and recommends related stories to users at each moment. Typical users of CAST are daily social stream consumers, who receive an overwhelming (noisy) buzz and wish to digest it through an intuitive and summarized representation in real time.

The main contributions of this chapter are summarized below:
• We define a new cohesive subgraph called the (k, d)-Core to represent stories, and propose two efficient algorithms, Zigzag and NodeFirst, to identify maximal (k, d)-Cores from the post network (Section 5.3);
• We propose deterministic and randomized context search to support the iceberg query for highly related stories, which builds the story vein on the fly (Section 5.4);
• Our experimental study on real Twitter streams shows that StoryVein can effectively digest streaming social content and build an expressive context-aware story-teller (Section 5.5).

We conclude this chapter in Section 5.6. Major notations are shown in Table 5.1.

Figure 5.1: Illustrating the workflow of StoryVein, which has three major steps: (1) post network construction, (2) transient story discovery and (3) story context search.

Q               a social stream
s(p_i, p_j)     the similarity between posts p_i and p_j
G(V, E)         post network
𝒢(𝒱, ℰ)         story vein
S(V_S, E_S)     a story S
Cor(S, S′)      the relatedness between stories S and S′

Table 5.1: Major notations.

5.2 Story Vein

In this chapter, we propose CAST, an effective context-aware story-teller for social streams. As social posts flow in, CAST is able to discover new stories and track the "vein" between stories, in which each story is a group of highly similar posts telling the same thing in the social stream, and each vein is a relatedness link between two stories. We define the story vein as follows, where we use Cor(S_i, S_j) to denote the relatedness between stories S_i and S_j.

Definition 16 (Story Vein) Given a post network G(V,E) and a threshold γ (0 < γ < 1), the output of CAST can be represented by a directed graph 𝒢(𝒱, ℰ), where each node S ∈ 𝒱 is a story and each edge (S_i, S_j) ∈ ℰ means the story relatedness Cor(S_i, S_j) ≥ γ and S_i happens earlier than S_j.

The motivation behind CAST is that real-world stories are not isolated but commonly correlated with each other. Intuitively, the context of a story S in 𝒢(𝒱, ℰ) is the set of neighboring stories of S. In particular, we use upper and lower context to denote the stories connected by the incoming and outgoing edges of S, respectively. The goal of CAST is to discover new stories and build the contexts of these stories. As the social stream Q gets updated over time, the story vein 𝒢(𝒱, ℰ) is also updated in real time. The dynamic nature of context-aware story-telling raises challenges both in the discovery of new stories and in the tracking of story context. In the following, we address these challenges by discussing transient story discovery in Section 5.3 and story context tracking in Section 5.4.

5.3 Transient Story Discovery

In this section, we describe the motivation and algorithms for identifying transient stories as a new kind of cohesive subgraph, the (k, d)-Core, in the post network.

5.3.1 Defining a Story

Edge weights in a post network have very natural semantics: the higher the post similarity, the more likely two posts are to talk about the same story. It is well known that a cohesive subgraph is a substructure with high edge density and very low internal commute distance [75]. Suppose S(V_S, E_S) is a cohesive subgraph of G(V,E). Since the nodes in S(V_S, E_S) are densely connected to each other, it is very likely that all nodes in S(V_S, E_S) share the same content and tell the same story. Based on this observation, we model a story in a social stream as a cohesive subgraph of the post network.
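To make the first step of Figure 5.1 concrete, here is a minimal post-network construction sketch. The word-set Jaccard similarity and the quadratic pairwise loop are stand-ins for illustration only; the actual post similarity s(p_i, p_j) and the streaming, index-based construction are those defined in Chapter 3.

import networkx as nx

def build_post_network(posts, threshold):
    """Sketch: one node per post, an edge whenever similarity exceeds
    `threshold`. Jaccard over word sets is a stand-in for s(p_i, p_j)."""
    def sim(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0
    G = nx.Graph()
    G.add_nodes_from(range(len(posts)))
    for i in range(len(posts)):
        for j in range(i + 1, len(posts)):
            s = sim(posts[i], posts[j])
            if s >= threshold:
                G.add_edge(i, j, weight=s)
    return G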
There are many alternative ways to define a cohesive subgraph. The clique may be the best candidate in spirit, because any two of its nodes then have a sufficiently high similarity and thus share the same content. However, in real-world data sets, cliques are too restrictive a story definition, and this calls for relaxations. There are several choices of clique relaxations, e.g., quasi-clique, k-Core, k-Plex, etc. [47]. Given a connected subgraph S(V_S, E_S), and letting N(p) be the neighbor set of node p ∈ V_S in S(V_S, E_S), these clique relaxations are defined as:
• Quasi-clique: |N(p)| ≥ λ(|V_S| − 1) for every post p, with 0 < λ ≤ 1;
• k-Plex: |N(p)| ≥ |V_S| − k for every post p ∈ V_S;
• k-Core: |N(p)| ≥ k for every post p ∈ V_S.

Notice that finding the maximum clique, quasi-clique or k-Plex in a graph is an NP-Hard problem [8, 15]. Even worse, [31] proved that there is no polynomial algorithm providing any reasonable approximation to the maximum clique problem. Since a clique is a special case of a quasi-clique or a k-Plex, these hardness results carry over to the maximum quasi-clique and maximum k-Plex problems. There are many heuristic algorithms that provide no theoretical guarantee on quality [1, 8, 15]. Since most of them are based on local search [15], they do not scale to large networks, because local search optimizes a solution by iteratively moving to a better neighboring solution in an exponential search space.

The good news is that k-Cores can be found exactly in polynomial time. By adjusting k, we can generate k-Cores with the desired edge density 2|E_S|/|V_S|; for example, increasing k improves the edge density, because the minimal degree is increased while |V_S| decreases at the same time. However, k-Cores are not always capable of capturing our intuition about stories: although each post has at least k neighbors, the fact that two posts are connected by a single edge may not be strong enough evidence that they tell the same story. Sometimes posts share some common words but discuss different stories, e.g., "Google acquires Motorola Mobility" and "Bell Mobility acquires Virgin Mobile". We show an example of a 3-Core in Figure 5.2(a), where p1 and p2 are connected but belong to two separate cliques. To address this challenge, we make the key observation that the existence of more common neighbors between two edge-connected posts suggests a stronger commonality in story-telling. Supposing p_i and p_j are connected by an edge and there exists a post p_l ∈ N(p_i) ∩ N(p_j), we call p_l a witness for the post similarity s(p_i, p_j). We capture this intuition in the following definition, where we formalize a story as a maximal (k, d)-Core.

Definition 17 (Story in CAST) A story in a social stream Q is defined by a maximal (k, d)-Core S(V_S, E_S) in the post network G(V,E) associated with Q, where k, d are numbers with k > d > 0 and
• S(V_S, E_S) is a connected subgraph;
• for every post p ∈ V_S, |N(p)| ≥ k;
• for every edge (p_i, p_j) ∈ E_S, |N(p_i) ∩ N(p_j)| ≥ d.

Figure 5.2: (a) A 3-Core without a similarity witness between p1 and p2. (b) Illustration of generating a (3, 1)-Core from a 3-Core.
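Definition 17 can be checked directly. A minimal sketch (ours, using networkx):

import networkx as nx

def is_kd_core(G, nodes, k, d):
    """Definition 17 checker: the induced subgraph must be connected, every
    node must have degree >= k, and the endpoints of every edge must share
    at least d common neighbors (witnesses) inside the subgraph."""
    H = G.subgraph(nodes)
    if H.number_of_nodes() == 0 or not nx.is_connected(H):
        return False
    if any(H.degree(v) < k for v in H.nodes):
        return False
    return all(len(list(nx.common_neighbors(H, u, v))) >= d
               for u, v in H.edges)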
However, compared with k-Cores, (k, d)-Cores have a more cohesive internal structure, enhanced by at least d common neighbors serving as witnesses of the commonality between any two nodes connected by an edge. This enhancement makes the posts in a (k, d)-Core more likely to tell the same story. We show an example of a 3-Core and a (3,1)-Core in Figure 5.2(b). As we can see, p4 is in the 3-Core but not in the (3,1)-Core, because the similarity between p1 and p4 is not witnessed by any other post. Besides, [65] defines a kind of cohesive subgraph called k-Dense, in which every edge has at least k witnesses. It is easy to infer that a k-Dense is a (k+1, k)-Core. Thus, k-Dense is a special case of (k, d)-Core, but provides no flexibility to adjust k and d independently. Therefore, our proposed (k, d)-Core is better suited than k-Core and k-Dense to capturing a story.

Non-overlap Property. Similar to maximal k-Cores, a maximal (k, d)-Core cannot be a subgraph of any other (k, d)-Core. Maximal (k, d)-Cores have the following important property.

Proposition 6 The maximal (k, d)-Cores of a graph are pairwise disjoint.

Proof: Suppose Si(Vi, Ei) and Sj(Vj, Ej) are two different maximal (k, d)-Cores and Vi ∩ Vj ≠ ∅. Then we can construct a connected subgraph S(Vi ∪ Vj, Ei ∪ Ej). Since each node in S has degree at least k and each edge in S has at least d similarity witnesses, S satisfies the definition of a (k, d)-Core. This contradicts the assumption that Si and Sj are maximal. □

5.3.2 Story Formation

We now discuss the computation of (k, d)-Cores. We first review the k-Core generation process [47]: given a network G(V, E), iteratively remove the nodes with degree less than k from G(V, E), until all remaining nodes have degree at least k. The result is a set of maximal k-Cores, obtained in polynomial time. The k-Core generation process forms the basis of our (k, d)-Core generation algorithms. We propose the first solution for (k, d)-Core generation in Algorithm 4. We call it the Zigzag algorithm, because the basic idea is to repeatedly switch the current network between two states: the first state is the k-Core set G1, obtained by removing nodes recursively, and the second state is the (d + 1, d)-Core set G2, obtained by removing edges recursively. This Zigzag process terminates when each connected component in the result set is a (k, d)-Core, or when the result set is empty. It is easy to see that Algorithm 4 takes polynomial time and that the result is exact.

Algorithm 4: (k, d)-Core Generation: Zigzag
Input: G(V, E), k, d
Output: All maximal (k, d)-Cores
1  Generate a set of k-Cores G1 by removing nodes with degree less than k recursively from G(V, E);
2  while G1 is not empty do
3    Generate a set of (d + 1, d)-Cores G2 by removing edges with fewer than d witnesses recursively from G1;
4    if G1 equals G2 then
5      break while loop;
6    Generate a set of k-Cores G1 by removing nodes with degree less than k recursively from G2;
7    if G1 equals G2 then
8      break while loop;
9  return G1;

We notice that the transition costs between the two states are not symmetric: the computational costs of G1 and G2 are very different. If we can reduce the frequency of the transition with the higher overhead, the overall performance of Zigzag can be optimized. The following proposition formalizes this property.

Proposition 7 Given a network G(V, E) with |V| < |E| and integers k, d (k > d), for each iteration in Zigzag, the computation of the k-Core set G1 is more efficient than the computation of the (d + 1, d)-Core set G2.

Proof: To compute G1, we need to check the degree of every node. To compute G2, we need to check the common-neighbor size of the two end nodes of every edge. Given that |V| < |E| in most networks and that checking a node degree is cheaper than checking common neighbors, the claim follows. □

Proposition 7 suggests that if we reduce the computation frequency of the (d + 1, d)-Core set G2, the performance will improve. Following this, we propose Algorithm 5, which improves on Algorithm 4 by applying node deletions as much as possible.
The heuristic is: whenever an edge is deleted, we check whether this deletion makes the degree of either of its end nodes smaller than k, and if it does, a recursive node deletion process starting from those end nodes is performed (Line 7). We call Algorithm 5 NodeFirst because it greedily invokes the node deletion process whenever possible. Since NodeFirst avoids performing a complete edge deletion pass as Zigzag does, the network converges very quickly to the set of maximal (k, d)-Cores. The following propositions show that NodeFirst produces exactly the same result as Zigzag, but with better performance.

Algorithm 5: (k, d)-Core Generation: NodeFirst
Input: G(V, E), k, d
Output: All maximal (k, d)-Cores
1  Generate a set of k-Cores G′ by removing nodes with degree less than k recursively from G(V, E);
2  while G′ is not empty do
3    Find an edge e(pi, pj) with fewer than d witnesses;
4    if e(pi, pj) exists then
5      delete (pi, pj) from G′;
6      if Deg(pi) < k or Deg(pj) < k then
7        Remove nodes with degree less than k recursively from G′;
8    else
9      break while loop;
10 return G′;

Proposition 8 Given a network G(V, E) and numbers k, d (k > d), both Zigzag and NodeFirst generate the same result, which is the set of all maximal (k, d)-Cores in G(V, E).

Proof: Suppose S(k,d) is the set of all maximal (k, d)-Cores in G(V, E). Both Zigzag and NodeFirst only remove nodes or edges in each iteration, so the numbers of nodes and edges decrease monotonically. With this observation, the proof follows the algorithm logic: Zigzag converges to S(k,d) because the algorithm stops the first time a k-Core set and a (d + 1, d)-Core set are equal (Line 7 in Algorithm 4); NodeFirst converges to S(k,d) because it stops the first time no edge with fewer than d witnesses can be found in a k-Core set (Line 8 in Algorithm 5). □

Proposition 9 In each iteration, NodeFirst is more efficient than Zigzag.

Proof: We measure efficiency by the number of node-degree checks or edge similarity-witness checks. Per iteration, Zigzag takes |V| + |E| operations, while NodeFirst takes |V| + 1 operations. Thus, NodeFirst is more efficient than Zigzag. □
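To make the story formation step concrete, the following is a minimal Python sketch of NodeFirst-style (k, d)-Core generation over an undirected post network stored as an adjacency dictionary. The function names (prune_nodes, kd_cores) and the input representation are our own illustrative choices, not artifacts of the thesis implementation.

from collections import deque

def prune_nodes(adj, k):
    """Recursively remove nodes with degree < k (k-Core generation)."""
    queue = deque(p for p in adj if len(adj[p]) < k)
    while queue:
        p = queue.popleft()
        if p not in adj:
            continue
        for q in adj[p]:
            adj[q].discard(p)
            if len(adj[q]) < k:
                queue.append(q)
        del adj[p]

def kd_cores(adj, k, d):
    """NodeFirst-style (k, d)-Core generation: greedily delete low-degree
    nodes; delete an edge only when it has fewer than d witnesses."""
    adj = {p: set(qs) for p, qs in adj.items()}   # defensive copy
    prune_nodes(adj, k)
    while adj:
        weak = next(((p, q) for p in adj for q in adj[p]
                     if len(adj[p] & adj[q]) < d), None)
        if weak is None:
            break                       # every remaining edge has >= d witnesses
        p, q = weak
        adj[p].discard(q)
        adj[q].discard(p)
        if len(adj[p]) < k or len(adj[q]) < k:
            prune_nodes(adj, k)         # cascade node deletions first
    return adj                          # union of all maximal (k, d)-Cores

The connected components of the returned adjacency map are the maximal (k, d)-Cores, i.e., the transient stories.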
5.4 Story Context Tracking

Transient stories identified from the post network may be highly correlated, e.g., "the launch of BlackBerry 10" and "BlackBerry Super Bowl ad". In this section, we exploit and integrate signals from different measures for story relatedness computation, and propose deterministic context search (DCS) to construct the veins for a story S on the fly. The running time of DCS is proportional to the number of S's neighboring posts. For the case when the number of neighboring posts is huge, we propose randomized context search (RCS) to boost performance.

We remark that DCS and RCS are two general computation frameworks that can be applied to different definitions of stories. As long as a story is a connected subgraph in a post network, DCS and RCS can find the context between stories. That is to say, stories defined as λ-quasi-cliques and as (k, d)-Cores work seamlessly as two instantiations of the story concept in the same context tracking framework. For simplicity, in this section we take the (k, d)-Core as the structure instantiating a story and explain its context search.

5.4.1 Story Relatedness Dimensions

Recall that stories correspond to (k, d)-Cores in the post network, and let Si(Vi, Ei) and Sj(Vj, Ej) denote two stories. Here we introduce different types of relatedness dimensions, which capture story relatedness from different perspectives and quantify it by a value in [0, 1]. Notice that node overlap is a very common signal for assessing the relatedness between two subgraphs [69]. However, since Proposition 6 shows that the (k, d)-Cores generated by Zigzag or NodeFirst are pairwise node-disjoint, node overlap is not useful for story relatedness computation.

Dimension 1: Content Similarity. By viewing a story as a document and a post entity as a term, existing document similarity measures can be used to assess story relatedness. However, TF-IDF based cosine similarity fails to be effective, since the TF vectors of stories tend to be very sparse. We exploit another popular measure, the LDA-based symmetric KL-divergence [59]. Supposing d is the document representation of story S and θ(d) is the topic distribution of d produced by LDA, we have

    C1(Si, Sj) = [ KL(θ(di) ‖ θ(dj)) + KL(θ(dj) ‖ θ(di)) ] / 2        (5.1)

where KL(θ(di) ‖ θ(dj)) = Σx θx(di) log( θx(di) / θx(dj) ).

Dimension 2: Temporal Proximity. Stories that happen closer in time are more likely to be correlated. Given a story Si(Vi, Ei), we can build the histogram of its post volume along the time dimension, with each bin of size equal to the time window sliding step ∆t. Supposing Dist is the normalized distribution (summing to 1) of story S's histogram, we take the complement of the L1 norm as an example measure of the temporal proximity between Si and Sj:

    C2(Si, Sj) = 1 − Σt=1..len |Disti(t) − Distj(t)|        (5.2)

where len is the length of the observation time window.

Dimension 3: Edge Connectivity. Capillaries associated with the posts of two stories can be used to determine story relatedness. This approach calculates the strength of edge connectivity between the posts of Si and Sj in the post network. These edges serve as the bridge between two stories, and the connectivity strength can be measured in various ways, e.g., by the Jaccard coefficient based on the portion of bridge edges:

    C3(Si, Sj) = |E(Vi, Vj)| / |E(Vi, V) ∪ E(Vj, V)|        (5.3)

where E(A, B) is the set of edges between node sets A and B.

Baseline: Hybrid Relatedness Model. As a baseline approach, story relatedness can be computed by a hybrid model over all relatedness dimensions, of the form:

    CorH(Si, Sj) = Πk=1..3 Ck(Si, Sj)        (5.4)

It is easy to see that 0 ≤ CorH(Si, Sj) ≤ 1. One drawback of the hybrid relatedness model is its performance. First, LDA computation for a large corpus is expensive [70]. Second, to construct the context of story S in the story vein, the hybrid relatedness model needs to compute CorH(S, S′) between S and every other story S′ in 𝒢(𝒱, ℰ), which is redundant and time-consuming.
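The three relatedness dimensions are straightforward to compute once the per-story inputs are available. Below is a minimal Python sketch, assuming the LDA topic distributions, the normalized temporal histograms and the post adjacency (with sets as values) are precomputed; all function names are illustrative.

import math

def symmetric_kl(theta_i, theta_j, eps=1e-12):
    """Dimension 1 (Eq. 5.1): symmetric KL-divergence between two
    LDA topic distributions, with a small eps for numerical safety."""
    kl = lambda a, b: sum(x * math.log((x + eps) / (y + eps))
                          for x, y in zip(a, b))
    return 0.5 * (kl(theta_i, theta_j) + kl(theta_j, theta_i))

def temporal_proximity(dist_i, dist_j):
    """Dimension 2 (Eq. 5.2): complement of the L1 distance between
    the two normalized post-volume histograms."""
    return 1.0 - sum(abs(a - b) for a, b in zip(dist_i, dist_j))

def edge_connectivity(adj, Vi, Vj):
    """Dimension 3 (Eq. 5.3): Jaccard coefficient of bridge edges,
    where Vi and Vj are the post sets of the two stories."""
    bridge = {frozenset((p, q)) for p in Vi for q in adj.get(p, ()) if q in Vj}
    touching = {frozenset((p, q)) for p in Vi | Vj for q in adj.get(p, ())}
    return len(bridge) / len(touching) if touching else 0.0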
5.4.2 Story Context Search

Story context search aims at efficiently finding the neighbors of a story in the story vein 𝒢(𝒱, ℰ). However, simply applying the hybrid relatedness model of Eq. (5.4) results in an inefficient pairwise computation. To overcome this challenge, we introduce the iceberg query on the story vein, as stated below.

Definition 18 (Iceberg Query) Given a story vein 𝒢(𝒱, ℰ), a threshold γ (0 < γ < 1) and a story S (S ∉ 𝒱), the iceberg query for S finds the subset of stories 𝒱S ⊆ 𝒱 such that Cor(S, S′) ≥ γ for each S′ ∈ 𝒱S.

The iceberg query is the key technique supporting CAST. It grows the story vein on the fly: for each new story S, it can quickly build the relatedness links between S and the remaining stories. The iceberg query does not need to explore all stories in 𝒱 to generate the exact 𝒱S, which improves performance.

In this section, we propose two approaches to implement the iceberg query on the story vein. The first approach is deterministic; it integrates content similarity, temporal proximity and edge connectivity by aggregating all capillaries between two stories. The second approach is randomized; it improves the performance of the deterministic approach by filtering the flow from S to S′. They are discussed separately below.

Deterministic Context Search. The basic idea of deterministic context search (DCS) follows a propagation and aggregation process. To initialize, we treat story S(VS, ES) as the source and every other story S′(VS′, ES′) as a target. In the propagation phase, we start from each post in the source and broadcast the post similarity to neighboring posts along capillaries. In the aggregation phase, we aggregate the post similarity received by the posts in each target, denoted PAD(S, S′). In detail, we have

    PAD(S, S′) = Σ s(p, p′)   over p ∈ VS, p′ ∈ VS′, (p, p′) ∈ E        (5.5)

Since s(p, p′) is computed by combining content similarity and temporal proximity (see Eq. (3.1)), PAD(S, S′) naturally integrates the three story relatedness dimensions discussed above. The relatedness between S and S′ is assessed by the ratio of the post similarity aggregated by S′ to that propagated by S:

    CorD(S, S′) = PAD(S, S′) / PAD(S, G)        (5.6)

where PAD(S, G) is the sum of post similarities over all capillaries associated with posts in S. When CorD(S, S′) ≥ γ, there is a story link between S and S′ in the story vein 𝒢(𝒱, ℰ). Let Sτ denote the average time stamp of S, i.e., Sτ = (1/|VS|) Σp∈VS pτ. If Sτ < S′τ, the edge direction is from S to S′; otherwise, the edge direction is from S′ to S.

The sketch of DCS is shown in Algorithm 6. Notice that post similarity is not aggregated on every story: some stories receive zero or negligible post similarity and can be omitted. On average, a source story visits at most min{ |V|, |VS|·|E|/|V| } neighboring posts in DCS. Compared with the hybrid relatedness model, which needs to access the whole post network G(V, E) in a pairwise computation, DCS improves performance significantly.

Algorithm 6: Deterministic Context Search (DCS)
Input: G(V, E), 𝒢(𝒱, ℰ), new story S(VS, ES), γ
Output: In-neighbor set NI(S), out-neighbor set NO(S)
1  PAD(S, G) = 0, N(S) = ∅, NI(S) = ∅, NO(S) = ∅;
2  Sτ = (1/|VS|) Σp∈VS pτ;
3  for each p ∈ VS do
4    for each neighbor p′ of p in G(V, E) do
5      if p′ is in a story S′ then
6        if S′ ∉ N(S) then
7          add S′ into N(S);
8        add s(p, p′) into PAD(S, G) and PAD(S, S′);
9  for each S′ ∈ N(S) do
10   CorD(S, S′) = PAD(S, S′) / PAD(S, G);
11   if CorD(S, S′) ≥ γ then
12     S′τ = (1/|VS′|) Σp′∈VS′ p′τ;
13     if Sτ < S′τ then add S′ into NO(S);
14     else add S′ into NI(S);
15 return NI(S) and NO(S);

The propagation and aggregation process is also extensively discussed in the structural similarity search problem [45]. In [45], the similarity between two nodes is modeled as the authority score of the corresponding node pair, and the propagation and aggregation process on node pairs is used to compute the similarity score accurately.
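A minimal Python sketch of the propagation-and-aggregation idea behind Algorithm 6 follows. It assumes the post network as an adjacency dictionary adj, a similarity map sim that holds every capillary keyed by a post pair, a map story_of from posts to story ids, and a map story_time of average story time stamps; these containers are illustrative assumptions, not part of the thesis artifacts.

def dcs(adj, sim, story_of, S_id, S_posts, story_time, gamma):
    """Deterministic context search: propagate post similarity from the
    source story and aggregate it per target story."""
    def w(p, q):
        return sim[(p, q)] if (p, q) in sim else sim[(q, p)]
    pad, total = {}, 0.0                 # PAD(S, S') and PAD(S, G)
    for p in S_posts:
        for q in adj.get(p, ()):
            s_prime = story_of.get(q)    # story containing post q, if any
            if s_prime is None or s_prime == S_id:
                continue
            pad[s_prime] = pad.get(s_prime, 0.0) + w(p, q)
            total += w(p, q)
    n_in, n_out = [], []
    for s_prime, mass in pad.items():
        if total > 0 and mass / total >= gamma:      # CorD(S, S') >= gamma
            # direct the vein edge from the earlier story to the later one
            if story_time[S_id] < story_time[s_prime]:
                n_out.append(s_prime)
            else:
                n_in.append(s_prime)
    return n_in, n_out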
Randomized Context Search. To further improve the performance of deterministic context search, we propose randomized context search (RCS), based on random walks. The motivation behind RCS is that, in the propagation phase, it is preferable to visit only the neighboring posts with a high post similarity. Given a post p in story S, suppose max(p) and min(p) are the maximum and minimum post similarities associated with p, respectively. Technically, we use α (0 ≤ α ≤ 1) as a throttle to propagate only high post similarities: only if s(p, p′) ≥ min(p) + α(max(p) − min(p)) does RCS propagate s(p, p′) to p′, with uniformly random probability.

There are two special cases in RCS:
• α = 0: post p propagates to every neighboring post;
• α = 1: post p propagates only to the neighboring post with the highest post similarity.

We empirically choose α = 0.5. Suppose N is the total number of simulations we run for the source story S. In each simulation, a random surfer starts from a post p in S and randomly visits a neighbor p′ such that s(p, p′) ≥ min(p) + α(max(p) − min(p)). The post similarity propagated from source S to target S′ in simulation i can be described as

    Ri(S, S′) = s(p, p′)   if p ∈ VS and p′ ∈ VS′;   0 otherwise        (5.7)

The post similarity aggregated by S′ over N simulations is denoted PAR(S, S′):

    PAR(S, S′) = Σi=1..N Ri(S, S′)        (5.8)

We show the sketch of RCS in Algorithm 7. Similar to DCS, the relatedness between S and S′ is computed by CorR(S, S′) = PAR(S, S′) / PAR(S, G). The vein threshold and direction are determined in the same way as in DCS. Clearly, in RCS a source story visits at most N neighboring posts, and usually N ≪ |V|. The value of N is decided by the time budget for story context search in a real CAST system. Compared with DCS, RCS achieves better performance by visiting a smaller number of neighboring posts, connected by capillaries with higher post similarity.

Algorithm 7: Randomized Context Search (RCS)
Input: G(V, E), 𝒢(𝒱, ℰ), new story S(VS, ES), γ, α, n
Output: In-neighbor set NI(S), out-neighbor set NO(S)
1  PAR(S, G) = 0, N(S) = ∅, NI(S) = ∅, NO(S) = ∅, count = 0;
2  Sτ = (1/|VS|) Σp∈VS pτ;
3  while count < n do
4    randomly select a node p ∈ VS;
5    randomly select a neighbor p′ of p in G(V, E);
6    while s(p, p′) < min(p) + α(max(p) − min(p)) do
7      randomly select a neighbor p′ of p in G(V, E);
8    if p′ is in a story S′ then
9      if S′ ∉ N(S) then
10       add S′ into N(S);
11     add s(p, p′) into PAR(S, G) and PAR(S, S′);
12   count = count + 1;
13 for each S′ ∈ N(S) do
14   CorR(S, S′) = PAR(S, S′) / PAR(S, G);
15   if CorR(S, S′) ≥ γ then
16     S′τ = (1/|VS′|) Σp′∈VS′ p′τ;
17     if Sτ < S′τ then add S′ into NO(S);
18     else add S′ into NI(S);
19 return NI(S) and NO(S);
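The random-walk variant can be sketched along the same lines, under the same illustrative containers as the DCS sketch above. One deviation worth noting: instead of resampling a neighbor until the throttle condition holds, as the rejection loop in Algorithm 7 does, the sketch below filters the qualifying neighbors up front and samples among them, which has the same effect but a bounded cost per step.

import random

def rcs(adj, sim, story_of, S_id, S_posts, story_time,
        gamma, alpha=0.5, n=1000):
    """Randomized context search: n random-walk steps that only follow
    capillaries whose similarity passes the alpha throttle."""
    def w(p, q):
        return sim[(p, q)] if (p, q) in sim else sim[(q, p)]
    par, total = {}, 0.0                 # PAR(S, S') and PAR(S, G)
    posts = list(S_posts)
    for _ in range(n):
        p = random.choice(posts)
        nbrs = list(adj.get(p, ()))
        if not nbrs:
            continue
        lo = min(w(p, q) for q in nbrs)
        hi = max(w(p, q) for q in nbrs)
        cand = [q for q in nbrs if w(p, q) >= lo + alpha * (hi - lo)]
        q = random.choice(cand)          # cand is never empty (hi qualifies)
        s_prime = story_of.get(q)
        if s_prime is None or s_prime == S_id:
            continue
        par[s_prime] = par.get(s_prime, 0.0) + w(p, q)
        total += w(p, q)
    n_in, n_out = [], []
    for s_prime, mass in par.items():
        if total > 0 and mass / total >= gamma:
            if story_time[S_id] < story_time[s_prime]:
                n_out.append(s_prime)
            else:
                n_in.append(s_prime)
    return n_in, n_out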
5.4.3 Interpretation of Story Vein

Since the story vein is described in graph notation, it is important to explain it to end users, who may be non-technical. There are various options for presenting a (k, d)-Core in human-consumable form. The first option is annotating a (k, d)-Core with its most representative posts, e.g., the posts with the highest degree. The second option is rendering a (k, d)-Core into a word cloud: since we extract entities from each post, the frequency of an entity in a (k, d)-Core can be used to set the font size of that entity in the word cloud. In this chapter, we provide both options to aid human perception. We show some interpretation examples in the experimental study.

5.5 Experimental Study

All experiments are conducted on a computer with an Intel 2.66 GHz CPU and 8 GB RAM, running 64-bit Windows 7. All algorithms are implemented in Java. We empirically set the thresholds ε = γ = 0.2 for the construction of the post network and the story vein, and set α = 0.5 to filter out capillaries with low post similarity in RCS.

All data sets are crawled from Twitter.com via the Twitter4J API (http://twitter4j.org/), with a time span from Jan. 1 to Feb. 1, 2012. Although our story-teller CAST works regardless of the domain, we make the data sets domain-specific in order to facilitate the evaluation of the generated results.

CNN-News. To simulate a social news stream, we collect tweets created in the first half of 2014 from CNN channels, which include @cnn, @cnnbrk, @cnnmoney, @cnnlive and @cnni. This data set has 10,872 tweets and serves for the quality analysis.

Tech-Lite. We built a technology-domain dataset called Tech-Lite by aggregating the timelines of users listed in the Technology category of "Who to follow" (http://twitter.com/who_to_follow/interests) and their retweeted users. Tech-Lite has 352,328 tweets and 1,402 users, and its streaming rate is 11,700 tweets/day.

Tech-Full. Based on the intuition that users followed by users in the Technology category are most likely also in the technology domain, we obtained a larger technology social stream called Tech-Full by collecting all the timelines of users that are followed by users in the Technology category. Tech-Full has 5,196,086 tweets created by 224,242 users, and its streaming rate is about 7,216 tweets/hour.

5.5.1 Tuning Post Network

The post similarity computation in Eq. (3.1) directly influences the structure of the post network. We can tune the content similarity function sT(pTi, pTj) and the temporal proximity function sτ(|pτi − pτj|) to make the post network concise and expressive. Many set-based similarity measures, such as the Jaccard coefficient [53], can be used to compute the similarity sT(pTi, pTj) between posts. Since an entity usually appears only once in a tweet, similarity measures such as cosine similarity and Pearson correlation [53] degenerate to forms very similar to Jaccard, so we use Jaccard as our similarity function and omit further discussion of the alternatives.

The temporal proximity function sτ(|pτi − pτj|) determines how similarity to older posts is penalized, compared to recent posts. We compared three different functions: (1) reciprocal fading ("Reci-Fading") with D(|pτi − pτj|) = 1 / (|pτi − pτj| + 1); (2) exponential fading ("Exp-Fading") with D(|pτi − pτj|) = e^(−|pτi − pτj|); and (3) no fading ("No-Fading") with D(|pτi − pτj|) = 1. For any posts pi, pj, clearly e^(|pτi − pτj|) ≥ |pτi − pτj| + 1 ≥ 1. Exp-Fading penalizes the posts in the old part of the time window severely (see Table 5.2): the numbers of capillaries and stories generated under Exp-Fading are lower than under the other approaches. Since No-Fading does not penalize old posts in the time window, too many capillaries and stories are generated, without considering recency. Reci-Fading is in between, being a more gradual penalty function than Exp-Fading. In the rest of the experiments we use Exp-Fading, since it generates the most concise post network, with an emphasis on recent posts.

Table 5.2: The number of edges in the post network, and the numbers of stories and relatedness links in the story vein, as the temporal proximity function changes. We define a story as a (5, 3)-Core.
  Method        Capillaries    Stories    Vein Links
  No-Fading     1,159,364      373        495
  Exp-Fading    327,390        87         275
  Reci-Fading   357,132        123        384
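For reference, the three fading schemes can be stated in a few lines of Python; the function below is a sketch that assumes the time gap dt is expressed in the same units as the time window.

import math

def fading(dt, scheme="exp"):
    """Temporal fading factor D(|p_tau_i - p_tau_j|) for a gap dt >= 0,
    under the three schemes compared in Table 5.2."""
    if scheme == "exp":            # Exp-Fading: strongest recency bias
        return math.exp(-dt)
    if scheme == "reci":           # Reci-Fading: a more gradual penalty
        return 1.0 / (dt + 1.0)
    return 1.0                     # No-Fading: recency is ignored

# e.g., for a 3-unit gap: exp gives about 0.05, reci gives 0.25, none gives 1.0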
5.5.2 Quality Evaluation

In this section, we evaluate the quality of CAST on two tasks, story discovery and context search, by comparing with baselines.

Ground Truth. We build the ground truth of tech stories in January 2012 by browsing news articles on mainstream technology websites such as CNET.com, TechCrunch.com and ReadWrite.com. We pick headlines with high occurrence and build a ground truth of the top 10 stories. They include CES-related stories, SOPA & PIPA related stories, the Facebook IPO, Yahoo co-founder Jerry Yang's resignation, the RIM CEO change, etc.

Baseline 1: Peak Detection. Some recent work on event detection [48, 55, 67] can be used for story discovery. These methods share the same spirit: aggregate the frequency of topic-indicating phrases at each moment to build a histogram, and generate stories by detecting volume peaks in the histogram. We design three baselines to capture the major techniques used by these approaches:
• HashtagPeaks: aggregate frequent hashtags;
• MentionPeaks: aggregate frequent mentions;
• EntityPeaks: aggregate frequent entities.

  Rank       HashtagPeaks      MentionPeaks     EntityPeaks
  1          #CES              @CNETNews        google
  2          #SOPA             @thatdrew        ces
  3          #EngadgetCES      @m4tt            apple
  4          #opengov          @TNWapple        video
  5          #gov20            @TNWinsider      sopa
  6          #CES2012          @TNWapps         twitter
  7          #PIPA             @TGW             year
  8          #opendata         @jonrussell      facebook
  9          #StartupAmerica   @CNET            app
  10         #win7tech         @harrisonweber   iphone
  Precision  0.5               0.2              0.7
  Recall     0.4               0.2              0.5

Figure 5.3: Top 10 results of HashtagPeaks, MentionPeaks and EntityPeaks on the Tech-Lite dataset. The ground truth for precision and recall is the top 10 major stories selected from mainstream technology news websites.

We show the top 10 results of HashtagPeaks, MentionPeaks and EntityPeaks on the Tech-Lite dataset, with time stamps between Jan 1 and Feb 1, 2012, in Figure 5.3, together with their precision and recall against the ground truth. Notice that multiple peaks may correspond to the same story in the ground truth, in which case precision and recall differ. As we can see, MentionPeaks is the worst, because most mentions are not topic-indicating phrases. Hashtags are the "Twitter way" to indicate an event, but they require manual assignment by humans. EntityPeaks is the best of the three baselines, since entities are extracted in the social stream preprocessing stage to annotate stories. Although these baselines are able to surface some words about a story, they do not qualify as a successful story-teller, because the internal and external structure of the story is missing. In particular, they cannot show how users interact with each other, how posts are clustered, or how the story is distributed along the time dimension.

Baseline 2: Topic Modeling. Topic modeling is a typical way to detect topics from a text corpus. As mentioned earlier, existing work on topic detection and tracking is formulated as a classification problem, and prevalent topic models like Latent Dirichlet Allocation (LDA) [13] are mainly trained on formally written news articles. We design a baseline using LDA and treat the top 100 posts of each topic as a story. Post time stamps are totally ignored in LDA.
We set the number of topics to 50, and after rendering topics into stories, we show the top 20 stories in Table 5.3. As we can see, the topic cohesion of LDA-detected stories is not very high: posts sharing some common words are easily classified into the same topic, even when they are not talking about the same story. Besides, the LDA model cannot deal with noisy posts, so its quality is compromised on social streams.

Table 5.3: Top 20 stories detected by LDA from Tech-Full, described by high-frequency words. We treat the top 100 posts of each topic as a story.
  1   blog post, great buy in january, start sign
  2   report amazon kindle fire
  3   check real youtube interview video series
  4   win chance south beach trip
  5   small business company story, local community
  6   make great start, things learn
  7   bad thing in life, feel good
  8   send email hey, glad to hear, follow tweet
  9   jan travel, days ago, back home tonight
  10  president obama, iran war killed
  11  daily top stories, tech news, health care
  12  apple tv iphone ipad, ces event, app store
  13  watch night show, tonight fun, miss baby
  14  twitter mobile app, android apps, tweet free
  15  Iowa caucus results, Santorum and Romney
  16  play half game, nyc new york city, winter fans
  17  super bowl, man football party
  18  happy new year, hope wonderful year, friends family
  19  love word, perfect friend, people
  20  google internet search ad, https link

CAST on Tech-Lite. Recall that our proposed story-teller CAST uses a (k, d)-Core to represent a story, which is a cohesive subgraph in the post network. Figure 5.4 shows the top 10 transient stories generated by CAST, each represented as a (5,3)-Core. It is worth noting that there are various ways to present stories in the user interface; in this experiment, we render each story into an entity cloud, annotated with a short description. Interestingly, we observe that the curve of tweet frequency goes down every weekend, which reflects real-life habits. Compared with the ground truth, our story-teller achieves a precision of 0.9 and a recall of 0.8.

[Figure 5.4: Top 10 transient stories generated by our proposed story-teller on Tech-Lite, including "hug/hope new year", "CES tomorrow", "CES", "SOPA & PIPA", "Google anti-SOPA", "Wiki anti-SOPA blackout", "Yahoo Jerry Yang resigns", "Supreme court GPS decision", "RIM changes CEO" and "Facebook IPO". Each story is represented as a (5,3)-Core and rendered as an entity cloud; related stories are linked by lines. The curve at the bottom is the breakdown of tweet frequency on each day in January 2012.]

For precision, the only case where we fail is that "hug/hope new year" is not a story in the ground truth. The reason may be that many people tweet "happy new year", but this is not formally reported on technology news websites, since it is a well-known fact. For recall, the ground-truth story missing from our top 10 is "Apple iBooks 2 Release" on Jan 19; it is ranked in the 13th position by our story-teller.

We can also see that stories may be related to each other. In Figure 5.4, there are multiple stories about CES (Consumer Electronics Show) 2012, held from Jan 8 to Jan 13. Meanwhile, the stories related to SOPA & PIPA are highly related, even though each of them tells a relatively different story.

When inspecting the top 50 transient stories returned by CAST, we observe two major advantages compared with news reports.
First, our story-teller can identify stories that are barely noticeable on news websites. For example, "Rupert Murdoch joins Twitter" is a story widely spread in social streams, but not commonly reported in news articles. Second, the formation of a transient story in social streams generally occurs earlier than the first publication of the corresponding news articles. [48] also observed this phenomenon, and the typical time lag is around 2.5 hours.

CAST on Tech-Full. Figure 5.6 shows an example illustrating the story vein on Tech-Full. To aid understanding, we group stories densely connected by vein links into rectangles, where each rectangle can be viewed as a story series. As we can see, "CES" and "SOPA & PIPA" are two very different story series, but they can be connected by indirect links via posts such as "CES 2012: Microsoft Keynote With Steve Ballmer" and "Microsoft opposes SOPA". CAST tracks the relatedness between stories and builds a highly informative story vein for the story-teller.

[Figure 5.6: An example illustrating our context-aware story-teller. Each tag cloud is a single story identified from the post network. Sets of stories with higher relatedness are grouped together in rectangles to aid readability.]

CAST on CNN-News. CNN-News simulates the daily social stream received by a Twitter user and has a time span of six months. We show a fragment of the story vein tracked from CNN-News in Figure 5.5, with stories such as "shooter at mall in Columbia, Maryland", "Malaysia Airlines flight lost", "Putin and Crimea", "Australian authorities update the search for MH370", "landslide in Washington state", "shooting at Fort Hood", "South Korean ferry deaths" and "sponsor backlash against Clippers owner Donald Sterling". In this example, "MH370 lost and search" is a story series with an internal structure that organizes individual stories together. Users can click on each story to see the details, or traverse the story veins to read highly related stories happening earlier or later. Compared with Twitter Trends Search (https://twitter.com/search-home), which shows trending topics, CAST has the distinct advantage of providing both the internal structure and the external context of a specific story.

[Figure 5.5: A fragment of the story vein tracked from CNN-News, which has a time span from January 1 to June 1, 2014.]

5.5.3 Performance Testing

In this section, we test the performance of the proposed algorithms. All tests are performed on the Tech-Full dataset with the time window set to one week. On average, the post network has about 710,000 nodes and 4,020,000 edges.

Story Identification. Recall that both k-Core and (k, d)-Core generation take polynomial time. We test the running time of generating all maximal k-Cores and (k, d)-Cores from the post network and show the results in Figure 5.7(a), with k = d + 1. As d increases, the running time of k-Core generation drops. The reason is that, although a bigger k means more nodes need to be removed in each iteration, the total number of iterations drops even more quickly. The running time for (k, d)-Cores is nearly stable, but much higher than for k-Core generation. This observation implies that NodeFirst should be much faster than Zigzag, since NodeFirst performs k-Core generation whenever possible. This conclusion is supported by the experimental results in Figure 5.7(a).
The relationship between d and the number of connected components found in the post network is shown in Figure 5.7(b). With increasing d, the numbers of both (d + 1)-Cores and (d + 1, d)-Cores drop. Since a (d + 1, d)-Core is at least a (d + 1)-Core but more cohesive, we may find multiple (d + 1, d)-Cores inside one (d + 1)-Core by removing the edges that do not satisfy the (k, d)-Core definition. Thus, given the same post network, the number of (d + 1, d)-Cores is usually higher than the number of (d + 1)-Cores.

Iceberg Queries. In CAST, we use the iceberg query to construct possible vein links between a given story and all other stories in the post network. As discussed in Section 5.4, we treat the given story as the source and all other stories as targets, and the story relatedness threshold γ governs the construction of story vein links.

In Figure 5.7(c), we treat each story in the post network as a query and show the total running time of iceberg queries over all stories. As d increases, the number of (d + 1, d)-Cores decreases gradually (Figure 5.7(b)). This makes the total size of the transient stories smaller, and naturally the total running time of the iceberg queries decreases. In the experiments, we can see that the performance of RCS is remarkably better than that of DCS. For quality, we show the accuracy of RCS in Table 5.4 as the number of simulations n grows. We define n as a number proportional to the number of neighboring posts of story S; as we can see, 20% simulations already produce an accuracy higher than 95%.

[Figure 5.7: (a) Running time of (d + 1)-Core generation and (d + 1, d)-Core generation (Zigzag, NodeFirst). (b) The number of connected components generated as (d + 1)-Cores and (d + 1, d)-Cores. (c) Running time of different context search approaches (DCS, RCS, hybrid relatedness model). All experiments run on the Tech-Full dataset with the time window set to one week.]

Table 5.4: The accuracy of RCS as the number of simulations n grows. We define n as a number proportional to the number of neighboring posts of story S, and measure accuracy against DCS.
  n / |neighboring posts of S|                          10%    20%    30%    40%
  Avg( 1 − |CorD(S, S′) − CorR(S, S′)| / CorD(S, S′) )  0.91   0.95   0.98   0.99

5.6 Discussion and Conclusion

In this chapter, we focused on two problems: (1) efficiently identifying transient stories from fast streaming social content; (2) performing iceberg queries to build the structural context between stories. To solve the first problem, we transform the social stream in a time window into a post network, and model transient stories as (k, d)-Cores in the post network. Two polynomial-time algorithms are proposed to extract maximal (k, d)-Cores. For the second problem, we propose deterministic context search and randomized context search to support the iceberg query, which allows context search without pairwise comparison. We performed a detailed experimental study on real Twitter streams, and the results demonstrate the effectiveness and value of our proposed context-aware story-teller CAST. In future work, we are interested in mining opinions from transient stories, e.g., the sentiment on a political event.
Besides, we are interested in various visualization techniques to present transient stories to end users in a friendly way.

Chapter 6

Incremental Event Evolution Tracking

In Chapter 3, we discussed the modeling of an event as a density cluster. Density-based clustering is superior to other clustering methods, such as K-Means or hierarchical clustering, for event identification, since it is robust to noisy posts and can quickly find core posts in social streams. As the post network changes over time, maintaining density clusters without recomputation from scratch is very important, and it is a highly challenging task. In this chapter, we focus on tracking event evolution patterns, which corresponds to incremental density-based cluster maintenance at the post network level.

6.1 Introduction

People easily feel overwhelmed by the information deluge coming from social websites. There is an urgent need to provide users with tools that can automatically extract and summarize significant information from highly dynamic social streams, e.g., report emerging bursty events, or track the evolution of one or more specific events in a given time span. There are many previous studies [10, 26, 48, 55, 66, 67] on detecting newly emerging events from text streams; they serve the need of answering the query "what's trending now?" over social streams. However, in many scenarios, users may want to know more details about an event and may like to issue advanced queries like "how're things going?". For example, for the event "SOPA (Stop Online Piracy Act) protest" happening in January 2012, existing event detection approaches can discover bursty activities at each moment, but cannot answer queries like "how has the SOPA protest evolved in the past few days?". An ideal answer to such an evolution query would be a "panoramic view" of the event history, which improves the user experience. From a graph perspective, we can model social streams as dynamically evolving post networks and model events as clusters over these networks, obtained by means of a clustering approach that is robust to the large amount of noise present in social streams.

[Figure 6.1: The post network captures the correlation between posts in the time window at each moment, and evolves as time rolls on. The skeletal graph is shown in bold. From moment t to t + 1, the incremental tracking framework maintains the clusters (e.g., C1: decay, C2: growth, C3: birth) and monitors the evolution patterns on the fly.]

Accordingly, we consider the above kind of query an instance of the cluster evolution tracking problem, which aims to track the cluster evolution patterns at each moment in such dynamic networks. Typical cluster evolution patterns include birth, death, growth, decay, merge and split. Event detection can be viewed as a subproblem of cluster evolution tracking in social streams.

In this chapter, we propose an incremental tracking framework for cluster evolution over highly dynamic networks. To illustrate the techniques in this framework, we consider the event evolution tracking task in social streams as an application, where a social stream and an event are modeled as a dynamic post network and a post cluster, respectively. The reasons we deploy our framework on this application are: social streams usually surge very quickly, making them ideal for performance evaluation, and events are human-readable, making it convenient to assess quality.
In detail, since a significant portion of social posts (like tweets) are just noise, we first define a skeletal graph as a compact summary of the original post network, from which post clusters can be generated. Then, as we discuss later, we monitor the network updates with a fading time window, and capture the evolution patterns of networks and clusters by a group of primitive evolution operations and their algebra. Moreover, we extend node-by-node evolution to subgraph-by-subgraph evolution to boost the performance of cluster evolution tracking. Figure 6.1 gives an overview of the major modules we use for cluster evolution tracking in social streams.

We notice that, at a high level, our method resembles previous work on density-based clustering over streaming data, e.g., DenStream [17], DStream [18] and the cluster maintenance in [2] and [5]. However, there are several major differences from this body of work. First, our approach works on highly dynamic networks and provides users the flexibility of choosing the scope for tracking by means of a fading time window. Second, the existing work can only process the addition of nodes/edges one by one, while our approach can handle adding, deleting and fading of nodes in bulk mode, i.e., subgraph by subgraph. This is an important requirement for dealing with the high throughput rate of dynamic networks. Third, the focus of our approach is tracking and analyzing the cluster evolution dynamics over the whole life cycle. By contrast, the previous works focus on clustering streaming data, which is a sub-task of our problem.

On the application side, compared with topic tracking approaches, we note that they are usually formulated as a classification problem [4]: when a new story arrives, compare it with the topic features in the training set by decision trees or k-NN [78], and if it matches sufficiently, declare it to be on a topic. Since these approaches assume that topics are predefined before tracking, we cannot simply apply them to event evolution tracking in social streams. Compared with existing event detection and tracking approaches [48, 55, 66, 67], our framework has advantages in tracking the whole life cycle and in capturing composite evolution behaviors such as merging and splitting.

The problem is formalized in Sec. 6.2. For convenience, we summarize the major notations used in this chapter in Table 6.1.

Table 6.1: Notation.
  SF(p1, p2)     the fading similarity between posts p1 and p2
  (ε, δ)         similarity threshold, priority threshold
  wt(p)          the priority of post p at moment t
  Gt(Vt, Et)     the post network at moment t
  Gold           the old subgraph that lapses at moment t + 1
  Gnew           the new subgraph that appears at moment t + 1
  Ḡt(V̄t, Ēt)     the skeletal graph at moment t
  C̄, S̄t          a component, a component set in Ḡt
  C, St          a cluster, a cluster set in Gt
  N(p)           post p's neighbor set with similarity larger than ε
  Nc(p)          the cluster set of post p's neighboring core posts

6.2 Problem Formalization

We formally define dynamic networks and dynamic clusters here, and then introduce the problem this chapter seeks to solve.

Dynamic Network. A dynamic network is a network with node and edge updates over time. We define a snapshot of a dynamic network at moment t as a weighted graph Gt(Vt, Et), where an edge e(u, v) ∈ Et connects nodes u, v ∈ Vt and s(u, v) is the similarity between them. For the problem studied in this chapter, we assume a dynamic network is the input and s(u, v) ∈ (0, 1] is a value usually set by a specific similarity function.
From time t to t + 1, we use ∆Gt+1 to describe the updating subgraph applied to Gt, i.e., Gt + ∆Gt+1 = Gt+1. In real applications, the size of ∆Gt+1 is typically much smaller than the size of Gt or Gt+1, and we make this an assumption throughout this chapter. Naturally, a dynamic network G1:i from moment 1 to i can be described as a continuous sequence of updating subgraphs applied at each moment. The formal definition is given below.

Definition 19 A dynamic network Gi:j from moment i to j is denoted as (Gi; ∆Gi+1, · · · , ∆Gj), where Gi(Vi, Ei) is a weighted graph with node set Vi and edge set Ei, and ∆Gt+1 (i ≤ t < j) is an updating subgraph at moment t + 1 such that Gt + ∆Gt+1 = Gt+1.

When the network evolves from Gt to Gt+1, we reasonably assume that at moment t + 1 only a small portion of Gt is incrementally updated, i.e., |Vt+1 − Vt| + |Vt − Vt+1| ≪ |Vt| and |Et+1 − Et| + |Et − Et+1| ≪ |Et|. This assumption generally holds in practice, and when it does not, we can shorten the moment interval sufficiently to make it hold. For simplicity, we express ∆Gt+1 as a sequence of node additions and deletions, e.g., ∆Gt+1 := +v1 − v2 means adding node v1 and all the edges incident to v1 in a single operation, and, analogously, deleting node v2 and its incident edges in a subsequent operation.

Dynamic Clusters. Suppose Ct is a subgraph of Gt, and isCluster(Ct) is a Boolean function that validates whether Ct is a cluster or not, with the exact definition given in Sec. 6.4.2. In the following, we define a dynamic cluster.

Definition 20 A dynamic cluster Ci:j from moment i to j is denoted as (Ci; ∆Ci+1, · · · , ∆Cj), where isCluster(Ci) = True, and ∆Ct+1 (i ≤ t < j) is an updating subgraph at moment t + 1 that makes Ct + ∆Ct+1 = Ct+1 and isCluster(Ct+1) = True.
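To fix ideas, a minimal Python sketch of a dynamic network under the node-level updates of Definition 19 follows; the class and method names are illustrative assumptions, and the similarity values are assumed to be supplied with each added node.

class DynamicNetwork:
    """A sketch of the dynamic post network: a snapshot Gt plus the bulk
    update Gt + dG(t+1) = G(t+1), expressed as node deletions followed
    by node additions (each node carries its incident weighted edges)."""

    def __init__(self):
        self.adj = {}                           # node -> {neighbor: similarity}

    def add_node(self, v, edges):
        """+v: add node v together with its incident edges {u: s(u, v)}."""
        self.adj[v] = dict(edges)
        for u, s in edges.items():
            self.adj.setdefault(u, {})[v] = s

    def del_node(self, v):
        """-v: delete node v and all its incident edges."""
        for u in self.adj.pop(v, {}):
            self.adj[u].pop(v, None)

    def apply_delta(self, added, removed):
        """Apply an updating subgraph: Gt -> G(t+1)."""
        for v in removed:
            self.del_node(v)
        for v, edges in added.items():
            self.add_node(v, edges)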
The traditional approaches for tracking dynamic network relatedproblems usually follow a “divide-and-conquer” spirit [38], which consists ofthree components: (1) decompose a dynamic network into a series of snap-shots for each moment, (2) apply graph mining algorithms on each snapshotto find useful patterns, (3) match patterns between different moments to gen-erate a dynamic pattern sequence. Applied to our problem, to track clusterevolution patterns, these steps are:Step 1©: At moment t, identify the cluster set St from Gt;Step 2©: At moment t + 1, as the network evolves, generate Gt+1 fromGt using ∆Gt+1;Step 3©: Again, identify the cluster set St+1 from Gt+1;Step 4©: Generate cluster evolution patterns from time t to t + 1 bytracing the correspondence between St and St+1.However, this approach suffers from both performance and quality. Firstly,repeated extraction of clusters from large networks from scratch is a very ex-pensive operation (steps 1© and 3©), and tracing the correspondence betweencluster sets at successive moments is also expensive (step 4©). Secondly, thestep of tracing correspondence, since it is done after two cluster sets aregenerated, may lead to loss of accuracy. In contrast, the method we proposeis incremental tracking of cluster evolution, which corresponds to step 5© inFigure 6.2(a). The workflow of this incremental tracking from time t to t+1is illustrated in Figure 6.2(b). More precisely, for the very first snapshot ofthe dynamic network, say G0, our approach will generate the correspondingevent set S0 from scratch. After this, we update the existing cluster setby the changed parts and move to the next moment recursively, by only ap-plying step 5©, i.e., we incrementally derive St+1 from St and ∆Gt+1. Theexperiments on real data set show that our incremental tracking approachoutperforms the traditional baselines in both performance and quality.6.4 Skeletal Graph ClusteringThe functional relationships between different types of objects defined inthis paper are illustrated in Figure 6.3. As an example, the arrow from Gtto Gt with label Ske means Gt is derived from Gt by function Ske, i.e.,Gt = Ske(Gt). See Table 6.1 for notations used. The various objects andtheir relationships will be explained in the rest of the paper.946.4. Skeletal Graph ClusteringtGtGtStSSkeCluCCSke GenSke Gen  Skeletal Cluster Skeletal Graph Dynamic NetworkFigure 6.3: The functional relationships between different types of objectsdefined in this paper, e.g., the arrow from Gt to Gt with label Ske meansGt = Ske(Gt). Refer to Table 6.1 for notations.6.4.1 Node PrioritizationIn reality, many posts tend to be just noise, so it is essential to identify thosenodes that play a central role in clusters. On web link graph analysis, thereis a lot of research on node authority ranking, e.g., HITS and PageRank[53]. However, most of these methods are iterative and not applicable tothe single-pass computation on streaming data. Node prioritization is atechnique to quickly differentiate and rank the processing order of nodes bytheir roles in a single pass. It is extremely useful in big graph mining, wherethere are too many nodes to be processed and many of them are of littlesignificance. However, to the best of our knowledge, there is insufficientstudy on single-pass node prioritization in a streaming environment.In this paper, we perform node prioritization based on density parameters(ε, δ), where 0 < ε < 1, and ε ≤ δ. 
In density-based clustering (e.g.,DBSCAN [20]), the threshold MinPts is used as the minimum number ofnodes in an ε-neighborhood, required to form a cluster. We adapt this anduse a weight threshold δ as the minimum total weight of neighboring nodes,required to form a cluster. The reason we choose density-based approachesis that, compared with partitioning-based approaches (e.g., K-Means [30])and hierarchical approaches (e.g., BIRCH [30]), density-based methods suchas DBSCAN define clusters as areas of higher density than the remainderof the data set, which is effective in finding arbitrarily-shaped clusters andis robust to noise. Moreover, density-based approaches are easy to adaptto support single-pass clustering. In the post network, we consider ε to bea similarity threshold to decide connectivity, and can be used to define thepost priority.956.4. Skeletal Graph ClusteringDefinition 21 Given a post p = (L, τ, a) in post network Gt(Vt, Et) andsimilarity threshold ε, the priority of p at time t (t ≥ pτ ), is defined aswt(p) =1e|t−pτ |∑q∈N (p)SF (p, q) (6.1)where N (p) is the subset of p’s neighbors with SF (p, q) > ε.Notice that post priority decays as time moves forward. Thus, postpriority needs to be continuously updated. In practice, we only store thesum∑q∈N (p) SF (p, q) with p to avoid frequent updates and compute wt(p)on demand.Skeletal Graph. With post priority computed, we use δ as a prioritythreshold to differentiate nodes in Gt(Vt, Et):• A post p is a core post if wt(p) ≥ δ;• It is a border post if wt(p) < δ but there exists at least one core postq ∈ N (p);• It is a noise post if it is neither core nor border, i.e., wt(p) < δ and thereis no core post in N (p).Intuitively, a post is a core post if it shares enough common entities withmany other posts. Neighbors of a core post are at least border posts, if notcore posts themselves. Core posts play a central role: if a core post p is foundto be a part of a cluster C, its neighboring (border or core) posts will alsobe a part of C. This property can be used in the single-pass clustering: if anincoming post p is “reachable” from an existing core post q, post p will beassigned to the cluster with q. Core posts connected by edges with similarityhigher than ε will form a summary of Gt(Vt, Et), that we call the skeletalgraph.Definition 22 Given post network Gt(Vt, Et) and density parameters (ε, δ),we define the skeletal graph as the subgraph of Gt(Vt, Et) induced by postswith wt(p) ≥ δ and edges with similarity higher than ε. We write Gt =Ske(Gt).Ideally, Gt(V t, Et) will retain important information in Gt. Empirically,we found that adjusting the granularity of (ε, δ) to make the size |V t| roughlyequal to 20% of |Vt| leads to a good balance between the quality of the skeletalgraph in terms of the information retained and its space complexity. Moretuning details can be found in Section 6.7.1.966.5. Incremental Cluster Evolution6.4.2 Skeletal Cluster IdentificationOne of the key ideas in our incremental cluster evolution tracking approach isto use the updating subgraph ∆Gt+1 between successive moments to main-tain the skeletal clusters. Post clusters are constructed from these skeletalclusers. 
Maintaining skeletal clusters can be done efficiently since the skele-tal graph is much smaller in size than the post graph it’s obtained from.Besides efficiency, skeletal cluster has the advantage of giving the correspon-dence between successive post clusters in a very small cost.Definition 23 Given Gt(Vt, Et) and the corresponding skeletal graph Gt(V t, Et),a skeletal cluster C is a connected component of Gt. A post cluster is a setof core posts and border posts generated from a skeletal cluster C, written asC = Gen(C), using the following expansion rules:• All posts in C form the core posts of C.• For every core post in C, all its neighboring border posts in Gt form theborder posts in C.In what follows, by cluster, we mean a post cluster, distinguished fromthe explicit term skeletal cluster. By definition, a core post only appears inone (post) cluster. If a border post is associated with multiple core posts indifferent clusters, this border post will appear in multiple (post) clusters.6.5 Incremental Cluster EvolutionIn this section, we discuss the incremental evolution of skeletal graph andpost clusters under the fading time window.6.5.1 Fading Time WindowFading (or decay) function and sliding time window are two common aggre-gation schemes used in time-evolving graphs (e.g., see [17]). Fading schemeputs a higher emphasis on newer posts, as captured by fading similarity inEq. (3.1). Sliding time window scheme (posts are first-in, first-out) is essen-tial because it provides a scope within which a user can monitor and trackthe evolution.Since clusters evolve quickly from moment to moment, even within agiven time window, it is important to highlight new posts and degrade oldposts using the fading scheme. Thus, we combine these two schemes and in-troduce a fading time window, as illustrated in Figure 6.4. In practice, users976.5. Incremental Cluster EvolutionUser Space1t toldG newGLen1t old t newG G G G  t'MomentsFigure 6.4: An illustration of the fading time window from time t to t + 1,where post priority may fade w.r.t. the end of time window. Gt will beupdated by deleting subgraph Gold and adding subgraph Gnew.can specify the length of the time window to adjust the scope of monitoring.Users can also choose different fading functions to penalize old posts andhighlight new posts in different ways. Let ∆t denote the interval betweenmoments. For simplicity, we abbreviate the moment (t + i · ∆t) as (t + i).When the time window slides from moment t to t + 1, the post networkGt(Vt, Et) will be updated to be Gt+1(Vt+1, Et+1). Suppose Gold(Vold, Eold)is the old subgraph (of Gt) that lapses at moment t+1 and Gnew(Vnew, Enew)is the new subgraph (of Gt+1) that appears (see Figure 6.4). Clearly,Gt+1 = Gt −Gold +Gnew (6.2)Let Len be the time window length. We assume Len > 2∆t, which makesVold ∩ Vnew = ∅. This assumption is reasonable in applications, e.g., we setLen to 1 week and ∆t to 1 day.6.5.2 Network Evolution OperationsWe analyze the evolution process of networks and clusters at each momentand abstract them into five primitive operators: +, −, , ↑, ↓. We classifythe operators based on the objects they manipulate: nodes or clusters, anddefine them below.Definition 24 Primitive node operations:• Gt+p: add a new post p into Gt(Vt, Et) where p 6∈ Vt. All the new edgesassociated with p will be constructed automatically by linkage search (ex-plained in Sec. 3.3);• Gt − p: delete a post p from Gt(Vt, Et) where p ∈ Vt. 
All the existingedges associated with p will be automatically removed from Et.986.5. Incremental Cluster Evolution• Gt: update the post priority scores in Gt.Composite node operations:• Gt ⊕ p = (Gt + p): add a post p into Gt(Vt, Et) where p 6∈ Vt andupdate the priority of related posts;• Gt 	 p = (Gt − p): delete a post p from Gt(Vt, Et) where p ∈ Vt andupdate the priority of related posts.Definition 25 Primitive cluster evolution operations:• +C: generate a new cluster C;• −C: remove an old cluster C;• ↑ (C, p): increase the size of C by adding post p;• ↓ (C, p): decrease the size of C by removing post p.Composite cluster evolution operations:• Merge(S) = +C−S: merge a set of clusters S into a new single clusterC and remove S;• Split(C) = −C + S: split a single cluster C into a set of new clustersS and remove C.In particular, composite node operations are designed to convenientlydescribe the adding/deleting of posts with priority scores updated in thesame time, and composite cluster operations are designed to capture theadvanced evolution patterns of clusters. Each operator defined above on asingle object can be extended to a set of objects, i.e., for a node set X ={p1, p2, · · · , p}, Gt +X = Gt + p1 + p2 + · · ·+ p. This is well defined since +is associative and commutative. We use the left-associative convention for‘−’: that is, we write A−B−C to mean (A−B)−C. These operators willbe used later in the formal description of the evolution procedures. Figure6.5(a) depicts the role played by the primitive operators in the tracking ofcluster evolutions from dynamic networks.6.5.3 Skeletal Graph Evolution AlgebraThe updating of skeletal graphs from Gt to Gt+1 is the core task in clusterevolution tracking. If we ignore the node priorities for a moment, the follow-ing formula shows different ways to compute the overlapping part in Gt+1and Gt, as illustrated in Figure 6.4:Gt+1 −Gnew = Gt −Gold = Gt+1 	Gnew = Gt 	Gold (6.3)996.5. Incremental Cluster EvolutionHowever, at the skeletal graph level, Ske(Gt+1−Gnew) 6= Ske(Gt−Gold):some core posts in Gt−Gold may no longer be core posts due to the removalof edges incident with nodes in Gold or simply due to the passing of time;some non-core posts may become core posts because of the adding of edgeswith nodes in Gnew. To measure the changes in the overlapping part, wedefine the following three components.Definition 26 Updated components in overlap:• S+ = Ske(Gt+1 −Gnew)− Ske(Gt+1 	Gnew): components of non-coreposts in Gt − Gold that become core posts in Gt+1 − Gnew due to theadding of Gnew;• S− = Ske(Gt − Gold) − Ske(Gt 	 Gold): components of core posts inGt−Gold that become non-core posts in Gt+1−Gnew due to the removingof Gold;• S = Ske(Gt	Gold)−Ske(Gt+1	Gnew): components of core posts inGt−Gold that become non-core posts in Gt+1−Gnew due to the passingof time.Based on Definition 26, from moment t to t + 1, the changes of coreposts in the overlapping part, i.e., Gt+1−Gnew (equivalently, Gt−Gold – seeFigure 6.4), can be updated using the components S+, S− and S. That is,Ske(Gt+1 −Gnew)− Ske(Gt −Gold)= (Ske(Gt+1 −Gnew)− Ske(Gt+1 	Gnew))−(Ske(Gt −Gold)− Ske(Gt 	Gold))−(Ske(Gt 	Gold)− Ske(Gt+1 	Gnew))= S+ − S− − S (6.4)Let Sold and Snew denote the sets of skeletal clusters in Gold and Gnewrespectively. 
Let Ṡold and Ṡnew denote the sets of skeletal clusters in Gold and Gnew, respectively. The following theorem characterizes the iterative and incremental updating of skeletal graphs from moment t to t + 1, and plays a central role in cluster evolution.

Theorem 1 From moment t to t + 1, the skeletal graph evolves by removing core posts in Gold, adding core posts in Gnew and updating core posts in the overlapping part. That is,

Ṡt+1 = Ṡt − Ṡold − Ṡ− − Ṡ◦ + Ṡnew + Ṡ+    (6.5)

Proof: Since operator '−' does not update post priority, we have Ske(Gt+1 − Gnew) = Ske(Gt+1) − Ske(Gnew) = Ṡt+1 − Ṡnew and Ske(Gt − Gold) = Ske(Gt) − Ske(Gold) = Ṡt − Ṡold. Then Ṡt+1 − Ṡnew − Ṡt + Ṡold = Ṡ+ − Ṡ− − Ṡ◦, and the conclusion follows. □

Figure 6.5: (a) The relationships between primitives and evolutions. Each box represents an evolution object and the arrows between them describe inputs/outputs. (b) The evolutionary behavior table for clusters when adding or deleting a core post p:

                       |Nc(p)| = 0    |Nc(p)| = 1    |Nc(p)| ≥ 2
Add a core post p      +              ↑              Merge
Delete a core post p   −              ↓              Split

Theorem 1 indicates that we can incrementally maintain the skeletal clusters Ṡt+1 from Ṡt. Since we define (post) clusters based on skeletal clusters, this incremental updating of skeletal clusters essentially enables the incremental updating of cluster evolution.

6.5.4 Incremental Cluster Evolution

Let St = Clu(Gt) denote the set of clusters obtained from the post network Gt. Notice that noise posts in Gt do not appear in any cluster, so the number of posts in St is typically smaller than |Vt|. Next, we explore the incremental cluster evolution problem at two levels: node-by-node updating and subgraph-by-subgraph updating.

Node-by-Node Evolution. The basic operations underlying cluster evolution are the cases where St is modified by the addition or deletion of a single post. In the following, we analyze the evolution of clusters when adding or deleting a post p. When adding p, we let Nc(p) denote the set of clusters that p's neighboring core posts belong to before p is added. When deleting p, let Nc(p) denote the set of clusters that p's neighboring core posts belong to after p is removed. |Nc(p)| = 0 means p has no neighboring core posts. Notice that Merge and Split are composite operations and can be decomposed into a series of primitive cluster operations. We show the evolution behaviors of clusters in Figure 6.5(b), explain the details below, and give a code sketch of the dispatch after the two case lists.

(a) Addition: St + {p}
If p is a noise post after being added to Gt, ignore p. If p is a border post, add p to each cluster in Nc(p). Else, p is a core post and we do the following:
• If |Nc(p)| = 0: apply +C, where C = {p} ∪ N(p);
• If |Nc(p)| = 1: apply ↑(C, {p} ∪ N(p)), where C is the lone cluster in Nc(p);
• If |Nc(p)| ≥ 2: apply Merge = +C − Σ_{C′ ∈ Nc(p)} C′, where C = Nc(p) ∪ {p} ∪ N(p).

(b) Deletion: St − {p}
If p is a noise post before being deleted from Gt, ignore p. If p is a border post, delete p from each cluster in Nc(p). Else, p is a core post and we do the following:
• If |Nc(p)| = 0: apply −C, where p ∈ C;
• If |Nc(p)| = 1: apply ↓(C, {p} ∪ N(p));
• If |Nc(p)| ≥ 2: apply Split = −C + Σ_{C′ ∈ Nc(p)} C′, where p ∈ C before the deletion.
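The addition row of Figure 6.5(b) can be written as a small dispatcher; a minimal sketch, assuming clusters are stored as a dict from cluster id to post set and `cluster_of` maps each core post to its cluster id:

def evolve_on_core_add(p, neighbors, cluster_of, clusters):
    # Node-by-node dispatch for adding a core post p (Figure 6.5(b),
    # addition row).  `neighbors` is p's neighbor set N(p) in the post
    # network.  The deletion row (-, ↓, Split) is symmetric.
    nc = {cluster_of[q] for q in neighbors if q in cluster_of}   # Nc(p)
    members = {p} | set(neighbors)                               # {p} ∪ N(p)
    if len(nc) == 0:                       # no neighboring core posts: +C
        cid, op = max(clusters, default=-1) + 1, '+'
        clusters[cid] = members
    elif len(nc) == 1:                     # one neighboring cluster: grow (↑)
        cid, op = nc.pop(), 'grow'
        clusters[cid] |= members
    else:                                  # several neighboring clusters: Merge
        cid, op = max(clusters, default=-1) + 1, 'merge'
        clusters[cid] = members.union(*(clusters.pop(c) for c in nc))
        # a full implementation would also remap cluster_of for the core
        # posts absorbed from the merged clusters
    cluster_of[p] = cid
    return op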
Subgraph-by-Subgraph Evolution. When dynamic networks such as post networks in social streams surge quickly, node-by-node processing of cluster evolution leads to poor performance. To accelerate it, we consider a subgraph-by-subgraph updating approach. Let Clu(Gnew) = Snew and Clu(Gold) = Sold be the cluster sets of the graphs Gnew and Gold, and let St be the set of all clusters at moment t. As the time window moves forward to moment t + 1, if we add Gnew to the network Gt, clusters evolve as follows:

Clu(Gt + Gnew) = Gen(Ske(Gt + Gnew))
  = Gen(Ske(Gt) + Ske(Gnew) + Ṡ+ − Ṡ◦)    (Definition 26)
  = St + Snew + S+ − S◦    (6.6)

where S+ = Gen(Ṡ+) and S◦ = Gen(Ṡ◦). Similarly, if we remove Gold from the network Gt, clusters evolve as follows:

Clu(Gt − Gold) = Gen(Ske(Gt − Gold))
  = Gen(Ske(Gt) − Ske(Gold) − Ṡ−)    (Definition 26)
  = St − Sold − S−    (6.7)

where S− = Gen(Ṡ−). Based on Equations (6.6) and (6.7), from moment t to t + 1, the set of clusters can be incrementally updated by the iterative computation

St+1 = Clu(Gt+1) = Clu(Gt − Gold + Gnew)
  = St − Sold − S− + Snew + S+ − S◦    (6.8)

Equation (6.8) can also be verified by applying the Gen function to both sides of Equation (6.5). Naturally, Equation (6.8) provides the theoretical basis for the incremental computation of cluster evolution: as the post network evolves from Gt to Gt+1, we do not compute St+1 from Gt+1 from scratch. Instead, we incrementally update St by means of the five cluster sets appearing in Equation (6.8), using simple set operations. Since Gold and Gnew are usually very small compared with Gt, these five cluster sets are also small, so we can generate St+1 from them quickly. The details of the incremental computation are discussed in Section 6.6.

6.6 Incremental Algorithms

The traditional approach of decomposing an evolving graph into a series of snapshots suffers in both quality and performance, since clusters are generated from scratch and matched heuristically at each moment. To overcome this limitation, we propose an incremental tracking framework, as introduced in Section 6.5 and illustrated in Figure 6.2(b). In this section, we realize our incremental computation with Algorithm 8 for incremental cluster maintenance (ICM) and Algorithm 9 for cluster evolution tracking (eTrack). Since at each moment |Vold| + |Vnew| ≪ |Vt|, our algorithms save a great deal of computation by adjusting clusters incrementally rather than generating them from scratch.

Bulk Updating. Traditional incremental computation on dynamic graphs usually treats the addition/deletion of nodes or edges one by one [18, 22]. However, in a real scenario, since social posts arrive at high speed, post-by-post incremental updating leads to very poor performance. In this paper, we speed up the incremental computation of St by bulk updating. Clearly, updating St with a single node {p} is a special case of bulk updating. Here, a bulk corresponds to a cluster of posts, and we "lift" the post-by-post updating of St to the bulk level. Recall that Nc(p) is the neighboring cluster set of a core post p. To understand the bulk updating in Algorithm 8, for a cluster C we define Nc(C) as the neighboring cluster set of the posts in C, i.e., Nc(C) = ∪p∈Ċ Nc(p), where Ċ = Ske(C). When C is added to or deleted from St as a bulk, the size of Nc(C) decides the evolution pattern of clusters from moment t to t + 1, as shown in Figure 6.5(b). Since C is usually a small subgraph, we consider that a bulk operation can be done in constant time. Since the internal edges of a subgraph C can be ignored when determining Nc(C), bulk updating is more efficient than node-by-node updating.
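A sketch of the Nc(C) computation that drives the bulk dispatch; internal edges of the bulk are skipped, so only boundary edges are examined (`neighbors` and `cluster_of` are the same hypothetical stores as above):

def bulk_neighbor_clusters(bulk_core_posts, neighbors, cluster_of):
    # Nc(C) for a bulk C: the clusters of the core posts adjacent to C,
    # ignoring C's internal edges.  |Nc(C)| then selects the pattern of
    # Figure 6.5(b): 0 -> +/-, 1 -> grow/shrink, >= 2 -> Merge/Split.
    nc = set()
    for p in bulk_core_posts:
        for q in neighbors[p]:
            if q in bulk_core_posts:      # internal edge of the bulk: skip
                continue
            if q in cluster_of:           # neighboring core post
                nc.add(cluster_of[q])
    return nc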
Incremental Cluster Maintenance (ICM). The steps of incremental cluster maintenance (ICM) from any moment t to t + 1 are summarized in Algorithm 8. The ICM algorithm follows the iterative computation of Equation (6.8), that is, St+1 = St − Sold − S− − S◦ + Snew + S+. As analyzed in Section 6.5.4, each bulk addition and bulk deletion has three possible evolution behaviors, decided by the size of Nc(C). Lines 3-13 deal with deleting a bulk C, handling the three patterns {−, ↓, Split}. Lines 15-26 deal with adding a bulk C, handling the other three patterns {+, ↑, Merge}. Supposing there are n bulk updates in ICM, the time complexity of ICM is O(n). Since a bulk operation is generally completed in constant time, ICM is an efficient single-pass incremental computation algorithm.

Cluster Evolution Tracking (eTrack). Given a dynamic network Gi:j and the set of clusters Si at the start moment i, the eTrack algorithm tracks the primitive cluster evolution operations at each moment, working on top of the ICM algorithm (Line 3). We summarize the steps of eTrack in Algorithm 9. Basically, eTrack monitors the changes of clusters effected by ICM at each moment. If a cluster is unchanged, eTrack takes no action; otherwise, eTrack determines the corresponding case and outputs the cluster evolution patterns (Lines 4-12). Notice that in Lines 5-8, if a cluster C in St has ClusterId id, we use the convention that St(id) = C to access C by id, and St(id) = ∅ means there is no cluster in St with ClusterId id. In particular, Lines 7-8 mean that a cluster in St evolves into a cluster in St+1 by first deleting the posts in St(id) − St+1(id) and then adding the posts in St+1(id) − St(id). As an efficient monitoring algorithm, once we obtain St+1 incrementally by ICM, the time complexity of eTrack is linear in the number of clusters in St and St+1 at each moment.
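The monitoring loop of eTrack amounts to diffing consecutive cluster snapshots by ClusterId once ICM has produced St+1. A minimal sketch, with illustrative operation labels:

def track_evolution(prev, curr):
    # `prev` and `curr` map ClusterId -> set of posts (S_t and S_{t+1});
    # assumes ICM keeps ids stable for surviving clusters.
    ops = []
    for cid, posts in curr.items():
        if cid in prev:
            removed, added = prev[cid] - posts, posts - prev[cid]
            if removed:
                ops.append(('shrink', cid, removed))   # ↓(C, S_t(id) - S_{t+1}(id))
            if added:
                ops.append(('grow', cid, added))       # ↑(C, S_{t+1}(id) - S_t(id))
        else:
            ops.append(('birth', cid))                 # +C
    for cid in prev.keys() - curr.keys():
        ops.append(('death', cid))                     # -C
    return ops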
6.7 Experiments

In this section, we first discuss how to tune the construction of the post network and skeletal graph to find the best entity extraction and density parameters. Then, we test the quality and performance of the cluster evolution tracking algorithms on two social streams, Tech-Lite and Tech-Full, that we crawled from Twitter. Our event detection baseline covers the major techniques reported in [48, 55, 66, 67]. Our evolution tracking baseline captures the essence of the state-of-the-art algorithms reported in [38]. All algorithms are implemented in Java. We use the graph database Neo4J (http://neo4j.org/) to store and manipulate the post network.

Datasets. All datasets are crawled from Twitter via the Twitter API. Although our cluster evolution tracking algorithm works regardless of the domain, in order to facilitate evaluation, we make the datasets domain specific. The crawling proceeds as follows. We built a technology-domain dataset called Tech-Lite by aggregating all the timelines of users listed in the Technology category of "Who to follow" (http://twitter.com/who_to_follow/interests) and their retweeted users. Tech-Lite has 352,328 tweets and 1,402 users, and its streaming rate is about 11,700 tweets/day. Based on the intuition that the followees of users in the Technology category are most likely to be in the same domain, we obtained a larger technology social stream called Tech-Full by collecting all the timelines followed by users in the Technology category. Tech-Full has 5,196,086 tweets, created by 224,242 users, with a streaming rate of about 7,216 tweets/hour. Both Tech-Lite and Tech-Full include retweets and span Jan. 1 to Feb. 1, 2012. Since each tweet corresponds to a node in the post network, both Tech-Lite and Tech-Full produce highly dynamic networks. Notice that the performance of our single-pass incremental approach is mainly affected by the streaming rate, rather than by the dataset size.

6.7.1 Tuning the Skeletal Graph

Post Preprocessing. As described in Section 6.4, we extract entities from posts with a POS tagger. One alternative approach to entity extraction is to use hashtags. However, only 11% of the tweets in our dataset have hashtags, which leaves many post pairs in the dataset with no similarity score between them. Another approach is to simply tokenize tweets into unigrams and treat the unigrams as entities; we call this the "Unigrams" approach, as discussed in [55]. Table 6.2(a) compares the three entity extraction approaches in the first time window of the Tech-Full social stream. With "Unigrams", the number of entities is obviously larger than in the other two approaches, but the number of edges between posts tends to be smaller, because tweets written by different users usually share very few common words even when they talk about the same event. The "Hashtags" approach also produces a smaller number of edges, core posts and events, since it generates a much sparser post network. Overall, the "POS-Tagger" approach discovers more similarity relationships between posts and produces more core posts and events for the same social stream and parameter setting.

Density Parameters. The density parameters (ε, δ) control the construction of the skeletal graph. Clearly, the higher the density parameters, the smaller and sparser the skeletal graph. Figure 6.6 shows the number of core posts, core edges and events as a percentage of the numbers for ε = 0.3, as δ increases from 0.3 to 0.8. Results are obtained from the first time window of the Tech-Full social stream.

Figure 6.6: The trends of the number of core posts, core edges and events when increasing δ from 0.3 to 0.8. We set δ = ε = 0.3 as the 100% basis.

We can see that the rate of decrease of #events is higher than the rates for #core posts and #core edges once δ > 0.4, because events are less likely to form in sparser skeletal graphs. More small events can be detected with lower density parameters, but the computational cost increases because of larger and denser skeletal graphs. Big events, however, are not very sensitive to these density parameters. We set ε = 0.3, δ = 0.5 as a trade-off between the size and number of events on one hand and processing efficiency on the other.

6.7.2 Cluster Evolution Tracking

Ground truth. To generate the ground truth, we crawled news articles from January 2012 from well-known technology websites such as TechCrunch, Wired and CNET, without looking at tweets. We then treated the titles of these news articles as posts and applied our event tracking algorithm to extract event evolution patterns. Finally, a total of 20 major events with life cycles were identified as the ground truth. Typical events include "happy new year", "CES 2012", "sopa wikipedia blackout", etc. To find smaller and less noticeable events, we use Google Trends for Search (http://www.google.com/trends/), which shows the traffic trends of keywords appearing in Google Search along the time dimension. If an event-indicating phrase has a volume peak in Google Trends at a specific time, we say this event is sufficiently validated by the real world.
We validate the correctness of an event Ci by the following process: we pick the top 3 entities of Ci ranked by frequency and search for them in Google Trends; if the traffic trend of these top entities has a distinct peak at a time near Ci, we consider that Ci corresponds to a real-world event widely witnessed by the public. Four examples of Google Trends peaks are shown in Figure 6.7. It is not surprising to find that the birth of an event in social streams is usually earlier than its appearance in Google Trends.

Figure 6.7: Examples of Google Trends peaks in January 2012 ("SOPA Wikipedia", "SOPA Facebook", "SOPA Megaupload" and "Apple iBooks", between Jan 15 and Jan 29). We validate the events generated by eTrack by checking the existence of volume peaks at a nearby time moment in Google Trends. Although these peaks can detect bursty events, Google Trends cannot discover merging/splitting patterns.

Cluster Annotation. Considering the huge volume of posts in a cluster, it is important to summarize and present a post cluster as a conceptual event to aid human perception. In related work, Twitinfo [55] represents an event it discovers from Twitter by a timeline of tweets, showing tweet activity by volume over time. However, it is tedious for users to read tweets one by one to figure out the details of an event. In this paper, we summarize a snapshot of a cluster by a word cloud [29], in which the font size of a word indicates its popularity. Compared with Twitinfo, a word cloud provides a summary of the cluster at a glance and is much easier for a human to read.

Comparison with DynDense [5]. DynDense works on entity graphs, ignoring the majority of tweets, since they have fewer than 2 entities. DynDense models events as dense subgraphs with size smaller than Nmax, usually set to 5. We observed that, with fewer than Nmax entities, many DynDense results are difficult to interpret as specific real events, e.g., a subgraph with the 3 nodes {"HP", "Microsoft", "Google"}. In contrast, our approach models events as large dense post clusters, which can be validated as real events by checking the highly correlated core posts in the cluster.

Baseline 1: Peak-Detection. In recent works [48, 55, 66, 67], events are generally detected as volume peaks of phrases over time in social streams. These approaches share the same spirit: aggregate the frequency of event-indicating phrases at each moment to build a histogram, and generate events by detecting volume peaks in the histogram. We design two variants of Peak-Detection to capture the major techniques used by these state-of-the-art approaches (a sketch of the shared scheme follows the list):
• Baseline 1a: HashtagPeaks, which aggregates hashtags;
• Baseline 1b: UnigramPeaks, which aggregates unigrams.
Notice that both baselines above are for event detection only.
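The shared histogram-and-peaks scheme can be sketched as follows. This is a simplified reconstruction, not the exact method of [48, 55, 66, 67]; `phrase_of` extracts hashtags (Baseline 1a) or unigrams (Baseline 1b) from a post, and the test against a multiple of the phrase's mean volume is an assumed placeholder for a peak detector.

from collections import Counter

def detect_peaks(stream, phrase_of, threshold=2.0):
    # `stream` yields (moment, post) pairs; build one volume histogram
    # per phrase, then report (moment, phrase) pairs whose volume rises
    # well above that phrase's mean volume.
    hist = {}
    for moment, post in stream:
        for phrase in phrase_of(post):
            hist.setdefault(phrase, Counter())[moment] += 1
    events = []
    for phrase, counts in hist.items():
        mean = sum(counts.values()) / len(counts)
        for moment, freq in counts.items():
            if freq > threshold * mean:      # a volume peak: emit an event
                events.append((moment, phrase, freq))
    return sorted(events)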
Lists of the top 10 events detected by HashtagPeaks and UnigramPeaks are presented in Figure 6.8.

Figure 6.8: Lists of the top 10 events detected from Twitter Technology streams in January 2012 by the baselines HashtagPeaks, UnigramPeaks and Louvain, and by our incremental tracking approach eTrack:

HashtagPeaks     UnigramPeaks   Louvain                  eTrack
CES              google         Apple iphone ipad        CES conference
SOPA             ces            CES ultrabook tablet     SOPA PIPA
EngadgetCES      apple          Google search privacy    Hug new year
opengov          video          Week's Android games     RIM new CEO
gov20            sopa           Kindle Netflix app       Yahoo jerry yang
CES2012          twitter        Internet people time     Samsung Galaxy Nexus
PIPA             year           Hope weekend             Apple iBooks
opendata         facebook       SOPA Megaupload          Facebook IPO News
StartupAmerica   app            SOPA PIPA Wikipedia      Martin Luther King
win7tech         iphone         Facebook IPO             Tim Cook Apple stock

Some highly frequent hashtags like "#opengov" and "#opendata" are not designed for event indication, which hurts precision. UnigramPeaks uses the unigrams extracted in the social stream preprocessing stage and has better quality than HashtagPeaks. However, both are limited in their representation of events, because the internal structure of events is missing. Besides, although these peaks can detect bursty words, they cannot discover cluster evolution patterns such as merging/splitting. For example, in Figure 6.7, there is no way to know that "Apple announced iBooks" is a split from the earlier big event "SOPA", as illustrated in detail in Figure 6.9.

Baseline 2: Community Detection. A community in a large network refers to a subgraph with dense internal connections and sparse connections to other communities. It is possible to define an event as a community of posts. The Louvain method [14], based on modularity optimization, is the state-of-the-art community detection method in terms of performance. We design a baseline called "Louvain" that detects events defined as post communities. The top 10 events generated by Louvain are shown in Figure 6.8. As we can see, not every result detected by the Louvain method is meaningful. For example, "Apple iphone ipad" and "Internet people time" are too vague to correspond to any concrete real event. The reason is that, although the Louvain method ensures every community has relatively dense internal and sparse external connections, it cannot guarantee that every node in a community is important and has sufficiently high connectivity with the other nodes in the same community. It is quite possible for a low-degree node to belong to a community only because it has zero connectivity with other communities. Furthermore, noise posts are quite prevalent in Twitter, and they negatively impact the Louvain method.

Baseline 3: Pattern-Matching. We design a baseline to track the evolution patterns of clusters between snapshots. In graph mining, the "divide-and-conquer" approach of decomposing the evolving graph into a series of snapshot graphs, one per moment, is a traditional way to tackle evolving-graph problems (e.g., [38]). As an example, Kim et al. [38] first cluster individual snapshots into quasi-cliques and then map them across adjacent snapshots over time. Inspired by this approach, we design a baseline for cluster evolution tracking, which characterizes the cluster evolution at consecutive moments by identifying the following heuristic patterns (a sketch is given below):
• If |Ct ∩ Ct+1| / |Ct ∪ Ct+1| ≥ κ and |Ct| ≤ |Ct+1|, then Ct+1 = ↑Ct;
• If |Ct ∩ Ct+1| / |Ct ∪ Ct+1| ≥ κ and |Ct| > |Ct+1|, then Ct+1 = ↓Ct;
where Ct and Ct+1 are any two clusters detected at moments t and t + 1 respectively, and κ is the minimal commonality required to say that Ct and Ct+1 are snapshots of the same cluster. A higher κ results in higher precision but lower recall of the evolution tracking. Empirically, we set κ = 90% to guarantee quality. It is worth noting that this baseline generates the same clusters as the eTrack algorithm, but with a non-incremental evolution tracking approach.
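The matching rule of Baseline 3 is a straightforward Jaccard test between snapshot clusters; a sketch:

def match_snapshots(clusters_t, clusters_t1, kappa=0.9):
    # Pair clusters (as post sets) of two consecutive snapshots whose
    # Jaccard similarity reaches `kappa`, labelling each pair as growth
    # (↑) or shrinkage (↓) by comparing sizes.
    matches = []
    for ct in clusters_t:
        for ct1 in clusters_t1:
            jaccard = len(ct & ct1) / len(ct | ct1)
            if jaccard >= kappa:
                op = 'grow' if len(ct) <= len(ct1) else 'shrink'
                matches.append((op, ct, ct1))
    return matches

# e.g. match_snapshots([{1, 2, 3, 4, 5, 6, 7, 8, 9}],
#                      [{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}]) yields one 'grow' pair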
Precision and Recall. To measure the quality of event detection, we use HashtagPeaks, UnigramPeaks and Louvain as baselines to compare with our algorithm eTrack. It is worth noting that Baseline 3 is designed for tracking event evolution patterns between moments, so we omit it here. We compare the precision and recall of the top 20 events generated by the baselines and eTrack and show the results in Table 6.2(b). Compared with the ground truth, HashtagPeaks and UnigramPeaks have rather low precision and recall scores, because of their poor ability to capture event bursts. Notice that multiple extracted events may correspond to the same ground-truth event. eTrack outperforms the baselines in both precision and recall. Since many events are discussed in social media but are not very noticeable on news websites, we also validate the precision of the generated events using Google Trends. As we can see, HashtagPeaks and UnigramPeaks perform poorly under Trends validation, since the words they generate are less informative and not very event-indicating. eTrack attains a precision of 95% under Google Trends; its only failed result is "Samsung galaxy nexus", whose volume is steadily high without obvious peaks in Google Trends, perhaps because the social stream is very dynamic. Louvain is worse than eTrack. The results show that eTrack is significantly better than the baselines in quality.

Life Cycle of Cluster Evolution. Our approach is capable of tracking the whole life cycle of a cluster, from birth to death. We explain this using the example of "CES 2012", a major consumer electronics show held in Las Vegas from January 10 to 13. As early as Jan 6, our approach had already detected some discussions about CES and generated an event about CES. On Jan 8, most people talked about "CES prediction", and on Jan 9 the highlighted topic was "CES tomorrow", along with some hearsay about the "ultrabook" that would be shown at CES. After the actual event happened on Jan 10, the event grew distinctly bigger, with lots of products, news and messages spreading over the social network; this situation continued until Jan 13, the last day of CES. Afterwards, the discussions became weaker and continued until Jan 14, when "CES" was no longer the biggest mention of the day but still appeared in some discussions. Compared with our approach, Baselines 1 and 2 can detect the emergence of "CES" with a frequency count at each moment, but no trajectory is generated. Baseline 3 can track only a very coarse trajectory of this event, i.e., from Jan 10 to Jan 12. The reason is that if an event changes rapidly and many posts at consecutive moments cannot be associated with each other, Baseline 3 fails to track the evolution. Since posts in social streams usually surge quickly, our approach is superior to the baselines. An illustration of the "CES" evolution trajectory and extended discussion can be found in [46].
Cluster Merging & Splitting. Figure 6.9 illustrates an example of cluster merging and splitting generated by the algorithm eTrack. eTrack detected the event of SOPA (Stop Online Piracy Act) and Wikipedia on Jan 16, because on that day Wikipedia announced the blackout on Wednesday (Jan 18) to protest SOPA.

Figure 6.9: The merging and splitting of "SOPA" and "Apple" (Jan 16 to Jan 21, 2012). At each moment, an event is annotated by a word cloud. Baselines 1 and 2 only work for the detection of newly emerging events and are not applicable to the tracking of merging and splitting dynamics. The evolution trajectories of eTrack and Baseline 3 are depicted by solid and hollow arrows, respectively.

This event grew distinctly on Jan 17 and Jan 18, inducing more people in the social network to discuss the topic. At the same time, another event was detected on Jan 18, discussing Apple's products. On Jan 19, the SOPA event and the Apple event actually merged, because Apple joined the SOPA protest and many Apple products, such as iBooks in education, are directly related to SOPA. This event evolved on Jan 20 by adding more discussions about iBooks 2. Apple iBooks 2 was actually unveiled on Jan 21; while this new product gained lots of attention, people who talked about iBooks no longer talked about SOPA. Thus, on Jan 21, the SOPA-Apple event was split into two events, which evolved independently afterwards. Unfortunately, this merging and splitting process cannot be tracked by any of the baselines, which output only independent events.

6.7.3 Running Time of Evolution Tracking

Recall that Baselines 1 and 2 are for event identification in a fixed time window. For evolution tracking, we measure how Baseline 3 and eTrack scale w.r.t. both the time window length and the step length. We use both the Tech-Lite and Tech-Full streams, and set the time step interval ∆t = 1 day for Tech-Lite and ∆t = 1 hour for Tech-Full, to track events at different time granularities. The streaming post rates for Tech-Lite and Tech-Full are 11,700/day and 7,126/hour respectively.

Figure 6.10: The running time on the two datasets as the time window length and step length are adjusted: (a) varying the time window length; (b) varying the step length.

Figure 6.10(a) shows the running time of eTrack as we increase the time window length: for a time window of 10∆t in Tech-Full, our approach finishes post preprocessing, post network construction and event tracking in just 3 minutes. A key observation is that the running time of eTrack does not depend on the overall size of the dataset; rather, it depends on the streaming speed of posts within ∆t. Thus, Tech-Lite takes more time than Tech-Full, since it streams more posts within ∆t. Figure 6.10(b) shows that if we fix the time window length at 10∆t and increase the step length of the sliding time window, the running time of eTrack grows nearly linearly. Compared with our incremental computation, Baseline 3 has to process the posts in the whole time window from scratch at each moment, so its running time stays steadily high. If the step length is larger than 4∆t in Tech-Full, eTrack has no running time advantage over Baseline 3, because a large part of the post network is updated at each moment. However, this extreme case is rare: in real scenarios, the step length is much smaller than the time window length, so our approach is much more efficient than the baseline.

6.8 Discussion and Conclusion

Our main goal is to track the evolution of events over social streams such as Twitter.
To that end, we extract meaningful information from noisy post streams and organize it into an evolving network of posts under a sliding time window. We model events as sufficiently large clusters of posts sharing the same topics, and propose a framework that describes event evolution behaviors using a set of primitive operations. Unlike previous approaches, our evolution tracking algorithm performs incremental updates and efficiently tracks event evolution patterns in real time. We experimentally demonstrate the performance and quality of our algorithm on two real datasets crawled from Twitter. As a natural progression, it would be interesting in the future to investigate tracking the evolution of social emotions about products, with its obvious application to business intelligence.

Algorithm 8: ICM: Incremental Cluster Maintenance
Input: St, Sold, Snew, S−, S+, S◦
Output: St+1
1   St+1 = St;
    // Delete Sold ∪ S− ∪ S◦
2   for each cluster C in Sold ∪ S− ∪ S◦ do
3     Ċ = Ske(C);
4     Nc(C) = ∪p∈Ċ Nc(p);
5     if |Nc(C)| = 0 then
6       remove cluster C from St+1;
7     else if |Nc(C)| = 1 then
8       delete C from cluster C′, where C′ ∈ Nc(C);
9     else
10      remove the cluster that C belongs to from St+1;
11      for each cluster C′ ∈ Nc(C) do
12        assign a new cluster id to C′;
13        add C′ into St+1;
    // Add Snew ∪ S+
14  for each cluster C in Snew ∪ S+ do
15    Ċ = Ske(C);
16    Nc(C) = ∪p∈Ċ Nc(p);
17    if |Nc(C)| = 0 then
18      assign a new cluster id to C and add C to St+1;
19    else if |Nc(C)| = 1 then
20      add C into cluster C′, where C′ ∈ Nc(C);
21    else
22      assign a new cluster id to C;
23      for each cluster C′ ∈ Nc(C) do
24        C = C ∪ C′;
25        remove C′ from St+1;
26      add C into St+1;
27  return St+1;

Algorithm 9: eTrack: Cluster Evolution Tracking
Input: G = {Gi, Gi+1, · · · , Gj}, Si
Output: Primitive cluster evolution operations
1   for t from i to j do
2     obtain Sold, Snew, S−, S+, S◦ from Gt+1 − Gt;
3     St+1 = ICM(St, Sold, Snew, S−, S+, S◦);
4     for each cluster C ∈ St+1 do
5       id = ClusterId(C);
6       if St(id) ≠ ∅ then
7         output ↓(C, St(id) − St+1(id));
8         output ↑(C, St+1(id) − St(id));
9       else +C;
10    for each cluster C ∈ St do
11      id = ClusterId(C);
12      if St+1(id) = ∅ then −C;

(a) Results of different entity extraction approaches:
Methods      #edges    #core posts   #core edges   #events
Hashtags     182905    6232          28964         196
Unigrams     142468    15070         46783         430
POS-Tagger   357132    21509         47808         470

(b) Precision and recall of top 50 events:
Methods        Precision (major events)   Recall (major events)   Precision (G-Trends)
HashtagPeaks   0.40                       0.30                    0.25
UnigramPeaks   0.45                       0.40                    0.20
Louvain        0.60                       0.55                    0.75
eTrack         0.80                       0.80                    0.95

Table 6.2: Tuning the post network.

Chapter 7
Crowdsourcing-Based User Study

A user study is an effective way to determine ground truth or to evaluate the correctness of a hypothesis. Traditionally, user studies are conducted in an "offline" mode; e.g., domain experts are employed to mark the truth, or students in a lab are invited to annotate images. With the rise of the Internet, crowdsourcing has become a popular mechanism behind user studies. Crowdsourcing is the process of obtaining needed services or content by soliciting contributions from a large group of online users. Compared with traditional offline evaluation, crowdsourcing has the clear advantage of engaging a large number of users who are ready to work at flexible times for a fairly low price. However, crowdsourcing easily runs into the low-quality user problem, since these users are typically not experts in the crowdsourcing tasks, and they are motivated by earning money.
The quality evaluation of users thus becomes a crucial problem in crowdsourcing-based user studies. In this chapter, we analyze the problem and propose effective techniques, such as Expectation-Maximization with Qualification (EMQ), to conquer the challenges.

7.1 Introduction

In this thesis, we studied the cohesion, context and evolution problems of stories and events in unstructured social streams. In Chapter 1, we mentioned that the approaches proposed in this thesis are based on two hypotheses: (1) when modeling social streams, users prefer correlated posts to individual posts; (2) to model stories/events in social streams, a structural approach is better than frequency-based and LDA-based approaches. These two fundamental hypotheses determine that we model social streams as post networks and use graph mining approaches to mine stories and events. In this chapter, we conduct a crowdsourcing-based user study to verify these two important hypotheses in social stream mining. All crowdsourcing tasks are implemented on Amazon Mechanical Turk (MTurk). Given that a large number of workers on MTurk are bots and spammers, the most critical challenge is worker quality control. In the related work section, we summarize existing techniques for worker quality control, including majority voting, minimum time constraints, qualification tests, etc. However, none of them solves the "Smart Spammer" problem, in which workers pass the qualification test but then perform like spammers to get the reward with minimal work. To deal with this challenge, we propose Expectation-Maximization with Qualification (EMQ), which is capable of measuring a worker's quality in crowdsourcing and detecting Smart Spammers among all qualified workers. As an iterative process, EMQ recursively updates the probability of a worker being a valuable worker until convergence, and we call this probability the quality score of the worker. Finally, for a crowdsourcing task composed of a question and several options, the answer to the question is the option obtaining the highest votes, where each vote is weighted by the quality score of its worker.

We organize this chapter as follows. In Section 7.2, we introduce existing techniques for worker quality control in crowdsourcing and review their pros and cons. In Section 7.3, we introduce the two hypotheses that support the modeling of social streams and stories/events in this thesis. Section 7.4 introduces the quality control workflow used in the crowdsourcing tasks of this thesis; in particular, we introduce Expectation-Maximization with Qualification (EMQ), an advanced approach to evaluating the quality of workers. We show experimental results in Section 7.5.

7.2 Related Work

User (or worker) quality control is crucially important for guaranteeing the quality of submitted work in crowdsourcing. As the truth of each crowdsourcing task is either unavailable or very time-consuming to obtain, and workers are primarily motivated by the reward, worker quality control for crowdsourcing tasks is very challenging. For example, one might expect that a higher reward leads to higher-quality answers. However, as pointed out by [24], there is no clear correlation between the reward and the final quality. The reason is that increasing the price is believed to attract spammers (i.e., Turkers who cheat, not really performing the job, but using robots or answering randomly).
In this section, the state-of-the-art techniques for quality control in crowdsourcing are summarized.

Majority Voting. The traditional approach to improving the answer quality of a task is to assign the same task to a large number of workers and then take a majority vote [60]. However, this approach is costly, as each worker needs to be paid.

Minimum Time Constraint. This technique sets a minimum time cost for a task, which forces workers to read and think about the task for a time span above the minimum, e.g., 20 seconds, before making their decision. This mechanism is supported by Amazon MTurk and has proved effective at blocking bots and spammers [12]: bots typically submit the work faster than any human, e.g., in less than 1 second, and spammers usually do not read the instructions carefully and tend to make decisions faster than normal workers.

Approval Rate Constraint. As mentioned in [16], the Requester can require that all workers meet a particular qualification, such as sufficient accuracy on a small test set or a minimum percentage of previously accepted submissions. In [72], researchers only released HITs (Human Intelligence Tasks) to two groups of Turkers: those with (1) the Master's Qualification (a qualification awarded by Amazon) and (2) a default custom qualification requiring the Turkers to have completed at least 1,000 HITs with a 95% approval rating. They reported that Turkers with high approval ratings can achieve quality as high as that of a skilled crowd, namely a group of well-trained graduate students. The disadvantage of this approach is that it requires workers to have long historical submission records for the approval rate to be computed meaningfully.

Check and Reject. This approach manually checks the answers provided by the workers and rejects the work if the Requester feels the submission is of low quality. The Requester also has the option of rejecting the work of individual workers, in which case these workers are not paid. However, manually checking all answers is time-consuming and requires a lot of human labor.

Qualification Test. The qualification test is a widely used technique for worker quality control in crowdsourcing. For instance, Yashar [58] designed gold units as a qualification test for bilingual translation: they provide an English sentence and a Spanish sentence, and then ask Turkers yes/no whether one is a translation of the other. The downside of this approach is that workers need to spend a considerable amount of time completing the qualification test before the actual work.

Expectation-Maximization. Aditya et al. [60] mention that the Expectation-Maximization algorithm can be used to estimate worker quality. These algorithms collect annotations from humans and perform disagreement-based analysis to deduce the true answers. Ipeirotis et al. [36] discuss quality management on Amazon Mechanical Turk. They propose a solution based on Expectation-Maximization: the algorithm iterates until convergence, following two steps: (1) estimate the correct answer for each task, using labels assigned by multiple workers and accounting for the quality of each worker; and (2) estimate the quality of the workers by comparing their submitted answers to the inferred correct answers.
However, the output of the EM method changes with the selection of initial seeds, and a bad selection of seeds may produce low-quality results.

Hybrid Approach. Working with Google, Ipeirotis et al. [35] describe Quizz, a gamified crowdsourcing system that simultaneously assesses the knowledge of users and acquires new knowledge from them. Quizz operates by asking users to complete short quizzes on specific topics; as a user answers the quiz questions, Quizz estimates the user's competence. To acquire new knowledge, Quizz also incorporates questions for which no known answer exists; the answers given by competent users provide useful signals for selecting the correct answers to these questions. Their experiments involve over ten thousand users and confirm that Quizz can automatically identify users with the desired expertise and interest in a given topic, at a cost below that of hiring workers through paid crowdsourcing platforms. The downside of Quizz is that it may fall into the "smart spammer" problem, where workers perform competently initially but then behave like spammers by giving random answers, because they want to get the reward as quickly as possible.

7.3 Hypotheses in Social Stream Mining

As defined in Chapter 1, in social streams an event is a set of related stories, and a story is a set of similar posts with high cohesion. The goal of social stream mining is to detect and track stories and events from social streams. To help achieve this goal, we rely on two fundamental hypotheses for social stream mining in this thesis.

Hypothesis 1 When modeling social streams, users prefer models in the form of correlated posts to models in the form of individual posts.

Figure 7.1: Crowdsourcing task for Hypothesis 1.

With Hypothesis 1, we can model a social stream as a post network based on post correlations. Hypothesis 1 is fundamental to the modeling of unstructured social streams in this thesis, since correlated posts provide a better user experience than individual posts. All the data mining techniques proposed in this thesis essentially build on Hypothesis 1, since these techniques are graph mining algorithms that model a collection of posts as a post network.

In Figure 7.1, we show our proposed user study for Hypothesis 1. Supposing the search is for "MH370", we show Result 1 and Result 2 to Amazon workers. Result 1 lists tweets containing MH370 one by one, ranked by freshness. Result 2 lists grouped tweets, e.g., "MH370 Search Updates", "MH370 Causes of Disappearance", etc., with each group telling one event related to MH370. We then provide the following five options:
• Option A: Result 1 is much better than Result 2;
• Option B: Result 1 is slightly better than Result 2;
• Option C: They have no difference;
• Option D: Result 2 is slightly better than Result 1;
• Option E: Result 2 is much better than Result 1.

The crowdsourcing task for testing Hypothesis 1 asks Amazon workers to answer the question by choosing an option. To be fair, we do not inform Amazon workers which kinds of algorithms we use to generate the individual and correlated results.
Hypothesis 2 To model stories/events in social streams, a structural approach is better than frequency-based and LDA-based approaches.

Hypothesis 2 is fundamental to the cohesion, context and evolution problems studied in this thesis. If stories/events were not defined in a structural way, it would be extremely hard to define notions like story cohesion, story context and event evolution. With Hypothesis 2, we can model a story/event as a dense subgraph inside the post network, and subsequently the cohesion, context and evolution problems can be defined and studied from the graph mining perspective.

In Figure 7.2, we show our crowdsourcing task for Hypothesis 2. We provide five options, which correspond to results generated by four different story detection approaches:
• Option A: (frequency-based) top hashtags;
• Option B: (frequency-based) top entities;
• Option C: (LDA-based) LDA approaches;
• Option D: (structural) cluster method;
• Option E: They have no difference.

Figure 7.2: Crowdsourcing task for Hypothesis 2. Four results are shown, and workers are asked: "The following options try to illustrate the top events detected from tweets in a short time span. Which option do you consider illustrates the top events in the best way?"

We then ask for Amazon workers' preference for the best result, by selecting one of the options. Notice that, for the sake of fairness, we do not show the actual approach names in the crowdsourcing task, so a worker who votes for an option has no idea of the algorithmic principle behind that option.

7.4 Quality Control for Crowdsourcing

Distinguishing Workers. Amazon Mechanical Turk (MTurk) is the most popular crowdsourcing Internet marketplace. As reported in January 2011, it has over 500,000 workers from over 190 countries. Besides normal MTurk workers, it is well known that a large number of Amazon workers are actually spammers and bots. In this user study, we distinguish the following four categories of workers:
• Bots. A bot is an automatic program that attempts to answer the task. Bots are designed to get the reward, and a bot usually returns the answers within a very short time, typically not realistic for an average human.
• Spammers. A spammer is a real worker who provides nonsensical answers. They are real persons whose sole target is to get the reward, without reading the instructions carefully or doing the necessary thinking or work.
• Unqualified Workers. An unqualified worker is a real worker who cannot pass the qualification test. In the case of social stream mining, unqualified workers include workers without the necessary comprehension ability: they either cannot understand the instructions very well, or fail to make a rational decision. It is worth noting that an unqualified worker for task A may be a qualified worker for task B.
• Qualified Workers. A qualified worker is a real worker who passes the qualification test. Depending on the rules of the qualification test, a qualified worker may fail some qualification questions, as long as the portion of correctly answered questions is larger than a predefined threshold, e.g., 0.6. Worker quality management in crowdsourcing tasks aims to find a sufficiently large number of qualified workers.

Typically, the minimum time constraint can distinguish a bot from a real person, because a bot usually returns the answers within a very short time, which is unrealistic for a human. However, it is possible that a bot is designed to wait a minimum elapsed time before it responds to a task.
In this case, the historical approval rate constraint can easily filter out bots and spammers, since they usually answer randomly and have a very low approval rate. Clearly, bots and spammers cannot pass the qualification test. A qualification test is designed to distinguish qualified workers from unqualified workers. Depending on the effectiveness of the qualification test, all unqualified workers are rejected and are not allowed to submit work for the designed crowdsourcing tasks.

Smart Spammers. In this user study, we use the term "Smart Spammers" to refer to a category of qualified workers who pass the qualification test, but make nonsensical submissions to a part of the subsequent crowdsourcing tasks. Smart spammers prove their ability on the qualifying task, and thereafter their behavior resembles that of a spammer who focuses simply on getting the reward with very little actual work. Notice that a smart spammer may perform like a spammer in only a part of the crowdsourcing tasks. Thus, a qualified worker can be either a smart spammer or a "valuable worker" who really contributes to the crowdsourcing tasks. The detection of smart spammers is a challenging problem. In this user study, we propose Expectation-Maximization with Qualification (EMQ), an iterative approach to measuring worker quality by combining workers' performance in the qualification test and in the subsequent crowdsourcing tasks. EMQ is designed to detect abuse of the system by assigning low quality scores to workers with a high probability of being smart spammers.

Quality Control Workflow. To guarantee maximum effectiveness, we use multiple techniques to control the quality of workers in this user study. The workflow of the quality control is shown in Figure 7.3. In the beginning, we apply the Approval Rate Constraint with a sufficiently high threshold to filter out MTurk workers who have a very low historical approval rate. For the remaining workers, we administer the qualification test, a series of questions with known answers. These qualification questions are treated as the gold standard, and workers' qualification is measured by the ratio of questions answered correctly. If the ratio is higher than a predefined threshold, the worker is treated as a qualified worker. After this step, nearly all bots, spammers and unqualified workers are blocked before the start of the real crowdsourcing tasks. All qualified workers then submit their work on the real crowdsourcing tasks. Since there are no predefined answers for the crowdsourcing tasks, worker quality is measured by cross-comparison with peers in an iterative way, captured by Expectation-Maximization with Qualification (EMQ). For smart spammers who passed the qualification test but submitted random answers to the crowdsourcing tasks, quality scores are lowered iteratively, because their random answers deviate distinctly from the majority vote. Thus, EMQ is capable of measuring worker quality in crowdsourcing and punishing smart spammers among all qualified workers, by assigning them low quality scores.

Figure 7.3: The workflow of the quality control steps: the Approval Rate Constraint removes bots and spammers, the Qualification Test removes bots, spammers and unqualified workers, and Expectation-Maximization with Qualification punishes smart spammers, leaving the valuable workers.
We discuss the EMQ approach in detail below.

Expectation-Maximization with Qualification (EMQ). The EMQ approach measures worker quality in crowdsourcing iteratively. To formalize EMQ, let q denote the quality vector, where qi is the quality score of worker i at each iteration. Assuming there are n workers, we have

q = [q1  q2  q3  · · ·  qn]    (7.1)

Next, let V be the voting matrix between workers and questions, where each element Vij is a 0/1 vector describing worker i's vote on the options of question j. As an example, V12 = [0 0 1 0 0] means worker 1 chooses the 3rd option of question 2. Assuming there are m questions, we get

V = [ V11  V12  · · ·  V1m
      V21  V22  · · ·  V2m
       · · ·  · · ·  · · ·
      Vn1  Vn2  · · ·  Vnm ]    (7.2)

We also let D denote the distribution of weights over the options of each question, where each question has l options. Thus, D is an m × l matrix, and the elements of each row Dj∗ sum to 1, i.e., Σk Djk = 1.

During the EMQ process, V is fixed, while q and D are updated at each iteration. To initialize q, since EMQ uses a worker's score in the qualification test as the initial quality, we set qi to the percentage of questions answered correctly by worker i in the qualification test, and then normalize q by q = q/‖q‖1. At each iteration, EMQ works as follows:
• Step 1:
  - Computation: Dj∗ = Σ_{i=1..n} qi Vij;
  - Normalization: Dj∗ = Dj∗ / ‖Dj∗‖1;
• Step 2:
  - Computation: qi = Σ_{j=1..m} Dj∗ V^T_{ij};
  - Normalization: q = q/‖q‖1.

To explain: in Step 1, we treat qi as the weight of worker i with which to grade each answer provided by that worker in the crowdsourcing tasks, so that each question obtains a distribution of weighted votes over its options, denoted Dj∗. In Step 2, we re-calculate the weight of each worker by cross-comparison between the worker's vote on question j and the current distribution of votes over the options of question j. In each iteration, we normalize the option weight distribution vectors and the quality vector, where ‖v‖1 denotes the l1-norm, i.e., ‖v‖1 = Σi |vi|.

The major steps of EMQ are shown in Algorithm 10. Recall that while the correct answer to each qualification question is predefined, we do not know the correct answers for the crowdsourcing tasks. Basically, EMQ computes the distribution of answers for each question based on the workers' quality vector and the voting matrix, and the workers' quality vector is then re-evaluated based on the distribution matrix between questions and answers. This process continues until the quality vector q converges, i.e., the l1-norm of (qo − q) is less than a small given threshold δ, or the maximum number of iterations K is reached. We empirically set δ = 0.0001.

Algorithm 10: Expectation-Maximization with Qualification (EMQ)
Input: Users' answers in the qualification test, users' voting matrix V in the crowdsourcing tasks, qualification threshold τ, convergence threshold δ, max iterations K
Output: User quality vector q
1   for each worker i do
2     compute qi, the percentage of questions answered correctly in the qualification test;
3     if qi < τ then
4       mark worker i as unqualified and remove;
5   set q as the qualification score vector of the qualified workers;
6   set k = 0;
7   q = q/‖q‖1;
8   while k < K do
9     qo = q;
10    Dj∗ = Σ_{i=1..n} qi Vij;
11    Dj∗ = Dj∗ / ‖Dj∗‖1;
12    qi = Σ_{j=1..m} Dj∗ V^T_{ij};
13    q = q/‖q‖1;
14    if ‖qo − q‖1 < δ then
15      return q;
16    k = k + 1;
17  return q;
Suppose that there is a voting example with 3 workers and 3 ques-tions, where each question has two options A and B, as shown in Figure7.4(a). Figure 7.4(b) assumes that workers’ initial quality score vector ob-tained from the qualification test is [0.6, 0.8, 1.0], which is [0.2500 0.33330.4167] after the normalization in initialization. In the first iteration, wecompute the distribution of aggregated quality scores between options foreach question, as shown in Figure 7.4(c). In detail, for question Q1, wecompute 0.2500+0.4167=0.6667 for option A and 0.3333 for option B, andsimilarly for question Q2 and Q3. In turn, to update the quality scores, wecompare the weight distribution of options for each question and worker’svote for that question. For example, worker W1 answered A for Q1 andQ2, B for Q3, so W1 will get a total score of 0.6667+0.5833+0.5833=1.8333.Analogously, workers W2 and W3 get 1.4999 and 1.5001 respectively. Afterthe normalization, we get the quality score vector [0.3793 0.3103 0.3103] initeration 1, which will be used as the input for iteration 2. This iterative1267.4. Quality Control for CrowdsourcingFigure 7.4: (a) A voting example with 3 workers and 3 questions, whereeach question has two options: A and B. (b) The computation of normalizedquality vector q on each iteration, where quality scores on iteration 0 areobtained from the qualification test. (c) and (d): The distributions of qualityweights among options for each question on iteration 1 and 15, respectively.process continues and reaches the convergence on the 15th iteration. As wecan see in Figure 7.4(b), while W3 has the highest initial quality score, thefinal quality score of W3 is low, indicating that W3 may be a smart spam-mer who did well in the qualification test, but provided random answersfor crowdsourcing tasks to get the reward. Thus, EMQ is capable to catchsmart spammers because their quality scores become very low after manyiterations of punishments, even though these smart spammers may get veryhigh initial quality scores in the qualification test.It is well-known that the Expectation-Maximization (EM) algorithm is ahill-climbing approach, and it can only be guaranteed to reach a local maxi-mum. When there are multiple maximas, whether we will actually reach theglobal maximum depends on where we start. Clearly, the selection of startingseeds impacts the final user quality upon convergence. Existing user studies[36, 60] based on EM typically set the seed by a uniform distribution, i.e.,elements in Q1 are equal, or by a unreliable random guess. In contrast, weconsider that worker’s performance in the qualification test serves as a goodstarting point for EM. EMQ uses the scores obtained from the qualificationtest as the prior information, which is an ideal seed to measure worker’stask-specific quality in crowdsourcing.1277.5. Experiments7.5 ExperimentsIn this section, we perform crowdsourcing-based user studies on AmazonMTurk. We employed a total of 945 workers with historical approval ratehigher than 85%, out of which 890 workers passed the qualification test.Each qualified worker will work on two crowdsourcing tasks, as shown inFigure 7.1 and 7.2 respectively.Qualification Test Design. Our tasks expect that the users have somecomprehension ability, i.e., given a set of tweets, they can figure out whatthese tweets are talking about. 
7.5 Experiments

In this section, we perform crowdsourcing-based user studies on Amazon MTurk. We employed a total of 945 workers with historical approval rates higher than 85%, of whom 890 passed the qualification test. Each qualified worker works on two crowdsourcing tasks, as shown in Figures 7.1 and 7.2, respectively.

Qualification Test Design. Our tasks expect the users to have some comprehension ability; i.e., given a set of tweets, they can figure out what the tweets are talking about. Thus, a qualification test can be built by providing multiple summarization phrases as options and asking Turkers to select the best summarization phrase capturing the given tweets. There is only one correct answer to each qualification question. To avoid making the qualification question trivial, we ensure that the summarization phrase itself does not occur in the tweets. For example, given 5 tweets talking about an event "oil price drop" but not necessarily mentioning the phrase "oil price drop", we provide users a list of 3 options, e.g., "iphone 6 price", "oil filter change" and "oil price drop". If a worker does not select "oil price drop", she fails this question; very likely, she is a bot or a worker with very low comprehension ability. We assess each worker's performance in the qualification test by a score, computed as the portion of qualification questions answered correctly; e.g., a worker who answers four out of five qualification questions correctly gets a qualification score of 0.8. In the experiments, we set the qualification threshold τ to 0.5 (see Algorithm 10), with the intuition that a qualified worker should answer at least half of the qualification questions correctly. Workers with qualification scores lower than 0.5 are considered unqualified and are rejected from participating in the subsequent crowdsourcing tasks.

We show our qualification test sample, used before the real crowdsourcing tasks, in Figure 7.5. There are a total of five questions, and we provide three options for each question. Workers who answer at least three questions correctly are selected as qualified workers. In our experiment, a total of 945 Amazon workers participated in the qualification test, out of which 890 passed with qualification scores higher than 0.5. Thus, the pass rate is 94.2%, and these 890 qualified workers proceeded to the real crowdsourcing tasks.

Figure 7.5: Qualification test sample used before the real crowdsourcing tasks. Five sets of tweets are shown; for each, workers select the best of three candidate summarization phrases (e.g., "2015 moments", "Photoshop" and "Powerful women" for the first set).

EMQ Implementation. All qualified workers are eligible to participate in the subsequent crowdsourcing tasks, where the true answer is unknown. We use EMQ to measure the credibility of each qualified worker's answers to these tasks. To implement EMQ, we set the initial seed of the user quality scores to the qualification scores. Following the steps in Algorithm 10, EMQ iterates and generates a convergent quality score for each qualified user, which in turn yields a convergent quality-weighted voting score for each crowdsourcing task.

Hypothesis Verification Using EMQ. Recall that each qualified worker works on the two crowdsourcing tasks of Figures 7.1 and 7.2, which correspond to Hypotheses 1 and 2 (denoted H1 and H2 for short), respectively. To claim Hypotheses 1 and 2 to be true in a neutral manner, we require that the majority of qualified workers vote for options D and E in H1 and for option D in H2, where each vote is weighted by the quality score. Recall from Section 7.3 that options D and E in H1 indicate that users prefer correlated posts to individual posts in social stream search, and option D in H2 means the structural approach is the best for social stream modeling. We empirically set 60% as the threshold for claiming "majority voting"; e.g., if more than 60% of the votes choose option D in H2, we say Hypothesis 2 is true.

Based on the results returned by EMQ, we compute a weighted voting sum (WVS) score for each option of each hypothesis, where the WVS of an option is the sum of the quality scores of the workers who voted for it. Both Hypotheses 1 and 2 are validated by these WVS scores, computed as sketched below.
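Computing the WVS scores from the converged quality vector is a one-pass weighted tally; a minimal sketch, where `votes[i]` is the option index chosen by worker i:

def weighted_voting_sum(q, votes, n_options):
    # WVS per option: the sum of the quality scores of the workers who
    # voted for that option, reported as each option's share of the total.
    wvs = [0.0] * n_options
    for qi, v in zip(q, votes):
        wvs[v] += qi
    total = sum(wvs)
    return [w / total for w in wvs]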
EMQ Implementation. All qualified workers are eligible to participate in the subsequent crowdsourcing tasks, where the true answer is unknown. We use EMQ to measure the credibility of each qualified worker's answers to these crowdsourcing tasks. To implement EMQ, we set the initial seed of user quality scores to the qualification scores. Following the steps in Algorithm 10, EMQ iterates until it generates a convergent quality score for each qualified worker, which in turn yields, for each crowdsourcing task, a convergent voting score weighted by user quality scores.

Hypothesis Verification Using EMQ. Each qualified worker works on two crowdsourcing tasks, shown in Figures 7.1 and 7.2, which correspond to Hypothesis 1 and Hypothesis 2 (denoted H1 and H2 for short), respectively. To judge the hypotheses in a neutral manner, we claim Hypothesis 1 or 2 to be true only if the majority of qualified workers vote for options D and E in H1 and for option D in H2, where each vote is weighted by the worker's quality score. Recall from Section 7.3 that options D and E in H1 indicate that users prefer correlated posts to individual posts in social stream search, and option D in H2 means that the structural approach is the best in social stream modeling. We empirically set 60% as the threshold for claiming "majority voting"; e.g., if more than 60% of the weighted votes choose option D in H2, then we say Hypothesis 2 is true.

Based on the results returned by EMQ, we compute a weighted voting sum (WVS) score for each option in each hypothesis, where the WVS of an option is the sum of the quality scores of the workers who voted for it. Both hypotheses are validated by these WVS scores. In Hypothesis 1, options D and E correspond to the situation that Result 2 is better than Result 1. In Hypothesis 2, if option D has a higher WVS score than each of the other options, we say structural approaches are better than frequency-based and LDA-based approaches, and the hypothesis is true.

We show the results of running EMQ to verify Hypotheses 1 and 2 in Figure 7.6. Workers' original answers are shown in Figure 7.6(a). A total of 890 workers participated in these crowdsourcing tasks, and EMQ takes 9 iterations to converge. H1 has five options: A, B, C, D and E. In Figure 7.6(b), we show the initial distribution of WVS scores before running EMQ and the final distribution after running EMQ. As we can see, the final distribution is more skewed than the initial one, which indicates that EMQ iteratively strengthens the WVS scores of options voted for by high-quality workers and weakens the WVS scores of options voted for by low-quality workers. In the end, options D and E collect 32.9% and 50.5% of the weighted votes, respectively. Recalling that option D means "result 2 is slightly better than result 1" and option E means "result 2 is much better than result 1", we conclude that 83.4% of the weighted votes agree that result 2 (in the form of correlated posts) is better than result 1 (in the form of individual posts).

Figure 7.6: Running EMQ to verify Hypotheses 1 and 2. (a) Workers' voting table on Hypotheses 1 and 2. (b) EMQ for Hypothesis 1: the percentage of weighted votes on options D/E increases from 74.4% (before EMQ) to 83.4% (after EMQ). (c) EMQ for Hypothesis 2: the percentage of weighted votes on option D increases from 56.7% (before EMQ) to 75.4% (after EMQ).

Figure 7.6(c) shows the initial and final distributions of WVS scores for options A, B, C, D and E in Hypothesis 2. As we can see, the EMQ procedure lowers the WVS scores for options A, B and C, while the WVS score for option D improves. Recall that options A and B are frequency-based approaches, option C is an LDA-based approach, and option D is the structural approach. Since option D finally collects 75.4% of the weighted votes, we conclude that the structural approach is better than the other approaches.

In conclusion, Figures 7.6(b) and 7.6(c) show that the majority of high-quality workers agree that correlated posts are better than individual posts in social stream modeling, and that the structural approach is better than the other approaches for story/event modeling in social streams.
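The WVS scores and the 60% majority rule used above can be stated compactly. The sketch below makes the computation explicit; the function names are ours, and quality holds the convergent EMQ scores.

# Sketch of the weighted voting sum (WVS) and majority check described above.
MAJORITY = 0.60  # empirical threshold for "majority voting"

def wvs(task_votes, quality):
    # task_votes: {worker: option}; quality: {worker: convergent EMQ score}.
    scores = {}
    for w, opt in task_votes.items():
        scores[opt] = scores.get(opt, 0.0) + quality[w]
    return scores  # WVS of an option = sum of quality scores of its voters

def hypothesis_holds(task_votes, quality, supporting_options):
    # supporting_options = {"D", "E"} for Hypothesis 1, {"D"} for Hypothesis 2.
    scores = wvs(task_votes, quality)
    support = sum(scores.get(o, 0.0) for o in supporting_options)
    return support / sum(scores.values()) > MAJORITY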
Detection of Smart Spammers by EMQ. Smart spammers are workers who passed the qualification test but performed like spammers in the real crowdsourcing tasks; they are very difficult to detect with existing techniques. The EMQ method is capable of detecting smart spammers because their quality scores become very low after many iterations of punishment. It is normal for some workers to have final quality scores somewhat lower than their initial quality scores, but workers whose final quality scores become extremely low could be smart spammers. We empirically define smart spammers as workers whose initial quality scores are at least five times higher than their convergent EMQ quality scores. Following this definition, we find 44 smart spammers. In other words, 4.94% of the qualified workers are actually smart spammers, a fraction that cannot be easily ignored in the quality evaluation. In contrast, existing methods cannot detect smart spammers effectively.

7.6 Discussion and Conclusion

The crowdsourcing-based hypothesis verification proposed in this chapter has several limitations:

• Crowdsourcing tasks are designed by a few domain experts, which may limit generality and introduce bias, since the tasks may not be ideally designed, in content or in form, to verify the hypotheses. For example, the current crowdsourcing task uses the "MH370" event as an example, but this event may not be well known to every worker; and even for workers who know about the MH370 event, their responses are likely to be biased by how the event is presented in the crowdsourcing tasks.

• The visualization in crowdsourcing tasks may introduce new bias. For example, to verify Hypothesis 2, we use circles with texts and links to visualize structural approaches, but the visualization itself may impose different comprehension overheads on workers with different educational levels.

• Due to the diversity of human behaviors in crowdsourcing, the smart spammer detection algorithm based on EMQ may not be effective in all situations. For example, some qualified workers may work normally on the first half of the tasks but later behave like spammers, e.g., because they become tired or want to collect the reward quickly.

There is room for improvement. First, we can employ a larger number of domain experts to design more tasks with different forms and topics, which will reduce the bias introduced by the tasks themselves. Second, better algorithms may be developed to combat smart spammers. Third, we can randomly mix the qualification questions and the actual tasks, making it hard for workers to distinguish them, so that the qualification questions and the actual tasks have equal chances of being spammed. This will make it more challenging for smart spammers to pass the qualification test.

Chapter 8

Summary and Future Research

8.1 Summary

In the current social web age, mining unstructured social streams has become a fundamental requirement for satisfying people's information-seeking needs. The existing user experience on social stream applications like Twitter Search can easily lead to "information anxiety": users input several keywords, and the output is a long list of tweets or posts containing those keywords, ranked by recency. Since a post such as a tweet or a Facebook update only carries a small piece of information, users are required to digest a long list of search results, which is time-consuming and painful.
The noisy and redundant nature of social streams degrades the user experience further. The social stream mining that we propose aims at providing users with an organized and summarized view of what's happening in their social world. In this thesis, we examined several important applications in social stream mining and proposed efficient solutions based on graph mining techniques.

Chapter 3 discussed the modeling of unstructured social streams. In detail, we model a social stream as an evolving network of posts. Stories and events are modeled as two special kinds of subgraphs: a story is a dense subgraph of posts with high cohesion in each snapshot of a post network, while an event is a cluster of posts across consecutive snapshots as the post network evolves.

In Chapter 4, we instantiate a story as a quasi-clique: given a query node set S in a graph G(V,E), we solve the maximum quasi-clique search problem, formalized as finding the largest λ-quasi-clique containing S. To quickly locate an initial solution, we propose the k-Core tree, built by recursively organizing dense subgraphs in G. Three maximization operations are introduced to optimize the solution: Add, Remove and Swap. We then propose two iterative maximization algorithms, DIM and SUM, to approach the maximum quasi-clique containing the given query node set S by deterministic and stochastic means, respectively.

In Chapter 5, we focus on two problems: (1) efficiently identifying transient stories from fast streaming social content; (2) performing iceberg queries to build the structural context between stories. To solve the first problem, we transform the social stream in a time window into a capillary network, and model transient stories as (k, d)-Cores in the capillary network. Two efficient algorithms are proposed to extract maximal (k, d)-Cores. For the second problem, we propose deterministic context search and randomized context search to support the iceberg query, which allows context search to be performed without pairwise comparisons. We conducted a detailed experimental study on real Twitter streams, and the results demonstrate the effectiveness and value of our proposed context-aware story-teller CAST.

Our main goal in Chapter 6 is to track event evolution patterns in highly dynamic post networks. To that end, we summarize the network by a skeletal graph and monitor the updates to the post network by means of a sliding time window. We then design a set of primitive operations and express the cluster evolution patterns using these operations. Unlike previous approaches, our evolution tracking algorithm eTrack performs incremental bulk updates in real time. We deploy our approach on the event evolution tracking task in social streams, and experimentally demonstrate its performance and quality on two real data sets crawled from Twitter.

Finally, Chapter 7 performed crowdsourcing-based user studies to validate two important hypotheses. These two hypotheses are fundamental to our mining tasks on social streams. The key problem in the crowdsourcing-based user studies is the quality evaluation of workers. We distinguish Amazon MTurk workers into four types: bots, spammers, unqualified workers and qualified workers. While bots and spammers can be detected by traditional techniques such as a minimum time constraint and an approval rate constraint, a carefully designed qualification test is required to distinguish unqualified from qualified workers.
We discussed the detailed qualification test design for crowdsourcing tasks in the preceding sections. As a special kind of qualified worker, smart spammers are workers who passed the qualification test but sometimes perform like spammers in the subsequent crowdsourcing tasks. The detection of smart spammers is a challenging and unsolved problem. We used Expectation-Maximization with Qualification (EMQ), which is capable of measuring a user's quality in crowdsourcing and detecting smart spammers from among all qualified workers.

8.2 Future Research

This thesis has made substantial progress in the study of cohesion, context and evolution problems in unstructured social stream mining. In future research, there are still several open opportunities and challenges in social stream mining, as listed below.

• It would be interesting to investigate the incremental evolution of social emotions and sentiments for business intelligence, associated with the evolution of events.

• We look forward to performing advanced analytics on stories and events, such as the spread paths of rumors on social networks, personalized recommendation of new events with GPS signals, etc.

• We are interested in various novel visualization techniques to present transient stories and incrementally updating events to end users in a friendly way.

Bibliography

[1] James Abello, Mauricio G. C. Resende, and Sandra Sudarsky. Massive quasi-clique detection. In LATIN, pages 598–612, 2002.

[2] Manoj K. Agarwal, Krithi Ramamritham, and Manish Bhide. Real time discovery of dense clusters in highly dynamic graphs: Identifying real world events in highly dynamic environments. PVLDB, 5(10):980–991, 2012.

[3] Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. A framework for clustering evolving data streams. In VLDB, pages 81–92, 2003.

[4] James Allan, editor. Topic detection and tracking: event-based information organization. Kluwer Academic Publishers, 2002.

[5] Albert Angel, Nick Koudas, Nikos Sarkas, and Divesh Srivastava. Dense subgraph maintenance under streaming edge weight updates for real-time story identification. PVLDB, 5(6):574–585, 2012.

[6] Sanjeev Arora and Shmuel Safra. Probabilistic checking of proofs: A new characterization of NP. J. ACM, 45(1):70–122, 1998.

[7] Yuichi Asahiro, Refael Hassin, and Kazuo Iwama. Complexity of finding dense subgraphs. Discrete Applied Mathematics, 121(1-3):15–26, 2002.

[8] Balabhaskar Balasundaram, Sergiy Butenko, and Illya V. Hicks. Clique relaxations in social network analysis: The maximum k-plex problem. Operations Research, 59(1):133–142, 2011.

[9] Roberto Battiti and Marco Protasi. Reactive local search for the maximum clique problem. Algorithmica, 29(4):610–637, 2001.

[10] Hila Becker, Mor Naaman, and Luis Gravano. Learning similarity metrics for event identification in social media. In WSDM, pages 291–300, 2010.

[11] Pavel Berkhin. Survey: A survey on PageRank computing. Internet Mathematics, 2(1):73–120, 2005.

[12] Seth Bicknell. How to successfully use Amazon's Mechanical Turk. Retrieved in January 2015.

[13] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[14] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. J. Stat. Mech., 2008.

[15] Mauro Brunato, Holger H. Hoos, and Roberto Battiti. On effectively finding maximal quasi-cliques in graphs. In LION, pages 41–55, 2007.
[16] Chris Callison-Burch. Fast, cheap, and creative: Evaluating translation quality using Amazon's Mechanical Turk. In EMNLP, pages 286–295, 2009.

[17] Feng Cao, Martin Ester, Weining Qian, and Aoying Zhou. Density-based clustering over an evolving data stream with noise. In SDM, 2006.

[18] Yixin Chen and Li Tu. Density-based clustering for real-time stream data. In KDD, pages 133–142, 2007.

[19] Wanyun Cui, Yanghua Xiao, Haixun Wang, Yiqi Lu, and Wei Wang. Online search of overlapping communities. In SIGMOD, pages 277–288, 2013.

[20] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, pages 226–231, 1996.

[21] David Nadeau et al. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26, 2007.

[22] Li Wan et al. Density-based clustering of data streams at multiple resolutions. TKDD, 3(3), 2009.

[23] U. Feige, S. Goldwasser, L. Lovász, S. Safra, and M. Szegedy. Approximating clique is almost NP-complete. In Foundations of Computer Science, pages 2–12, 1991.

[24] Karën Fort, Gilles Adda, and K. Bretonnel Cohen. Amazon Mechanical Turk: Gold mine or coal mine? Computational Linguistics, 37(2):413–420, 2011.

[25] Santo Fortunato. Community detection in graphs. CoRR, abs/0906.0612, 2009.

[26] Gabriel Pui Cheong Fung, Jeffrey Xu Yu, Philip S. Yu, and Hongjun Lu. Parameter free bursty events detection in text streams. In VLDB, pages 181–192, 2005.

[27] Zekai Gao, Yangqiu Song, Shixia Liu, Haixun Wang, Hao Wei, Yang Chen, and Weiwei Cui. Tracking and connecting topics via incremental hierarchical Dirichlet processes. In ICDM, pages 1056–1061, 2011.

[28] Christos Giatsidis, Dimitrios M. Thilikos, and Michalis Vazirgiannis. Evaluating cooperation in communities with the k-core structure. In ASONAM, pages 87–93, 2011.

[29] Martin Halvey and Mark T. Keane. An assessment of tag presentation techniques. In WWW, pages 1313–1314, 2007.

[30] Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2011.

[31] J. Håstad. Clique is hard to approximate within n^{1-ε}. Acta Mathematica, 182:105–142, 1999.

[32] Yulan He, Chenghua Lin, Wei Gao, and Kam-Fai Wong. Tracking sentiment and topic dynamics from social media. In ICWSM, 2012.

[33] Holger H. Hoos and Thomas Stützle. Stochastic local search: Foundations & applications. Morgan Kaufmann, 2004.

[34] Xin Huang, Hong Cheng, Lu Qin, Wentao Tian, and Jeffrey Xu Yu. Querying k-truss community in large and dynamic graphs. In SIGMOD, pages 1311–1322, 2014.

[35] Panagiotis G. Ipeirotis and Evgeniy Gabrilovich. Quizz: targeted crowdsourcing with a billion (potential) users. In WWW, pages 143–154, 2014.

[36] Panagiotis G. Ipeirotis, Foster Provost, and Jing Wang. Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation (HCOMP), pages 64–67, 2010.
[37] Xin Jin, W. Scott Spangler, Rui Ma, and Jiawei Han. Topic initiator detection on the world wide web. In WWW, pages 481–490, 2010.

[38] Min-Soo Kim and Jiawei Han. A particle-and-density based evolutionary clustering method for dynamic networks. PVLDB, 2(1):622–633, 2009.

[39] Dan Klein and Christopher D. Manning. Accurate unlexicalized parsing. In ACL, pages 423–430, 2003.

[40] Bill Kovach and Tom Rosenstiel. Blur: How to Know What's True in the Age of Information Overload. Bloomsbury Publishing USA, 2010.

[41] Pei Lee and Laks V. S. Lakshmanan. Query-driven maximum quasi-clique search. In SDM, 2016.

[42] Pei Lee, Laks V. S. Lakshmanan, and Evangelos E. Milios. KeySee: supporting keyword search on evolving events in social streams. In KDD, pages 1478–1481, 2013.

[43] Pei Lee, Laks V. S. Lakshmanan, and Evangelos E. Milios. CAST: A context-aware story-teller for streaming social content. In CIKM, 2014.

[44] Pei Lee, Laks V. S. Lakshmanan, and Evangelos E. Milios. Incremental cluster evolution tracking from highly dynamic network data. In ICDE, pages 3–14, 2014.

[45] Pei Lee, Laks V. S. Lakshmanan, and Jeffrey Xu Yu. On top-k structural similarity search. In ICDE, pages 774–785, 2012.

[46] Pei Lee, Laks V. S. Lakshmanan, and Evangelos E. Milios. Event evolution tracking from streaming social posts. Technical report, 2013.

[47] Victor E. Lee, Ning Ruan, Ruoming Jin, and Charu C. Aggarwal. A survey of algorithms for dense subgraph discovery. In Managing and Mining Graph Data, pages 303–336. 2010.

[48] Jure Leskovec, Lars Backstrom, and Jon M. Kleinberg. Meme-tracking and the dynamics of the news cycle. In KDD, pages 497–506, 2009.

[49] Pei Li, Jeffrey Xu Yu, Hongyan Liu, Jun He, and Xiaoyong Du. Ranking individuals and groups by influence propagation. In PAKDD, pages 407–419, 2011.

[50] Guimei Liu and Limsoon Wong. Effective pruning techniques for mining quasi-cliques. In ECML/PKDD (2), pages 33–49, 2008.

[51] Ning Liu. Topic detection and tracking. In Encyclopedia of Database Systems, pages 3121–3124. 2009.

[52] Helena R. Lourenço, Olivier C. Martin, and Thomas Stützle. Iterated local search. In Handbook of Metaheuristics, pages 320–353. Springer US, 2003.

[53] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

[54] Adam Marcus, Michael S. Bernstein, Osama Badar, David R. Karger, Samuel Madden, and Robert C. Miller. Processing and visualizing the data in tweets. SIGMOD Record, 40(4):21–27, 2011.

[55] Adam Marcus, Michael S. Bernstein, Osama Badar, David R. Karger, Samuel Madden, and Robert C. Miller. Twitinfo: aggregating and visualizing microblogs for event exploration. In CHI, pages 227–236, 2011.

[56] David W. Matula and Leland L. Beck. Smallest-last ordering and clustering and graph coloring algorithms. Journal of the ACM, 30(3):417–427, 1983.

[57] Alberto Montresor, Francesco De Pellegrini, and Daniele Miorandi. Distributed k-core decomposition. IEEE Trans. Parallel Distrib. Syst., 24(2):288–300, 2013.
[58] Matteo Negri and Yashar Mehdad. Creating a bi-lingual entailment corpus through translations with Mechanical Turk: $100 for a 10-day rush. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 212–216, 2010.

[59] Tadashi Nomoto. Two-tier similarity model for story link detection. In CIKM, pages 789–798, 2010.

[60] Aditya G. Parameswaran, Stephen Boyd, Hector Garcia-Molina, Ashish Gupta, Neoklis Polyzotis, and Jennifer Widom. Optimal crowd-powered rating and filtering algorithms. PVLDB, 7(9):685–696, 2014.

[61] Panos M. Pardalos and Steffen Rebennack. Computational challenges with cliques, quasi-cliques and clique partitions in graphs. In SEA, pages 13–22, 2010.

[62] Jeffrey Pattillo, Alexander Veremyev, Sergiy Butenko, and Vladimir Boginski. On the maximum quasi-clique problem. Discrete Applied Mathematics, 161(1-2):244–257, 2013.

[63] Jian Pei, Daxin Jiang, and Aidong Zhang. Mining cross-graph quasi-cliques in gene expression and protein interaction data. In ICDE, pages 353–354, 2005.

[64] Mauricio G. C. Resende. Greedy randomized adaptive search procedures. In Encyclopedia of Optimization, pages 1460–1469. 2009.

[65] Kazumi Saito, Takeshi Yamada, and Kazuhiro Kazama. The k-dense method to extract communities from complex networks. In Mining Complex Data, pages 243–257. 2009.

[66] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. Earthquake shakes Twitter users: real-time event detection by social sensors. In WWW, pages 851–860, 2010.

[67] Anish Das Sarma, Alpa Jain, and Cong Yu. Dynamic relationship and event discovery. In WSDM, pages 207–216, 2011.

[68] Chirag Shah, W. Bruce Croft, and David Jensen. Representing documents with named entities for story link detection (SLD). In CIKM, pages 868–869, 2006.

[69] Dafna Shahaf, Jaewon Yang, Caroline Suen, Jeff Jacobs, Heidi Wang, and Jure Leskovec. Information cartography: creating zoomable, large-scale maps of information. In KDD, pages 1097–1105, 2013.

[70] David Sontag and Dan Roy. Complexity of inference in latent Dirichlet allocation. In NIPS, pages 1008–1016, 2011.

[71] Mauro Sozio and Aristides Gionis. The community search problem and how to plan a successful cocktail party. In KDD, pages 939–948, 2010.

[72] Matthew Staffelbach, Peter Sempolinski, David Hachen, Ahsan Kareem, Tracy Kijewski-Correa, Douglas Thain, Daniel Wei, and Greg Madey. Lessons learned from an experiment in crowdsourcing complex citizen engineering tasks with Amazon Mechanical Turk. CoRR, abs/1406.7588, 2014.

[73] Charalampos E. Tsourakakis, Francesco Bonchi, Aristides Gionis, Francesco Gullo, and Maria A. Tsiarli. Denser than the densest subgraph: extracting optimal quasi-cliques with quality guarantees. In KDD, pages 104–112, 2013.

[74] Takeaki Uno. An efficient algorithm for solving pseudo clique enumeration problem. Algorithmica, 56(1):3–16, 2010.

[75] Nan Wang, Srinivasan Parthasarathy, Kian-Lee Tan, and Anthony K. H. Tung. CSV: visualizing and mining cohesive subgraphs. In SIGMOD Conference, pages 445–458, 2008.

[76] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of 'small-world' networks. Nature, pages 440–442, 1998.

[77] Jianshu Weng and Bu-Sung Lee. Event detection in Twitter. In ICWSM, 2011.

[78] Yiming Yang, Tom Ault, Thomas Pierce, and Charles W. Lattimer. Improving text categorization methods for event tracking. In SIGIR, pages 65–72, 2000.

[79] Xiaohan Zhao, Alessandra Sala, Christo Wilson, Xiao Wang, Sabrina Gaito, Haitao Zheng, and Ben Y. Zhao. Multi-scale dynamics in a massive online social network. In IMC, pages 171–184, 2012.
