UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Query-driven event search in news information network Chen, Shanshan 2014

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


24-ubc_2014_september_chen_shanshan.pdf [ 917.77kB ]
JSON: 24-1.0165990.json
JSON-LD: 24-1.0165990-ld.json
RDF/XML (Pretty): 24-1.0165990-rdf.xml
RDF/JSON: 24-1.0165990-rdf.json
Turtle: 24-1.0165990-turtle.txt
N-Triples: 24-1.0165990-rdf-ntriples.txt
Original Record: 24-1.0165990-source.json
Full Text

Full Text

Query-driven Event Search in NewsInformation NetworkbyShanshan ChenB.Eng., Beijing University of Posts and Telecommunications, 2012A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCEinThe Faculty of Graduate and Postdoctoral Studies(Computer Science)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)August 2014c© Shanshan Chen 2014AbstractTraditional search focuses on keyword matching and document ranking, thususers will only get a overload of related news articles or videos, with little se-mantics aggregation, when they input keywords among which they are inter-ested in exploring potential connections. To capture these connections, onegood way is to model news articles into a complex network of events, wherean event itself is a complex network of interrelated actions (or assertions).Event search enables users to get a big picture of essential actions involvedwithout going through overwhelming documents. Further on, Query-drivenevent search, in which users can access key actions that occurred in vari-ous events with respect to input query and construct a new event capturingcorresponding actions and connections between them, is able to inherentlyrevolutionize search from its traditional keyword matching to a higher levelof abstraction. Thus, we propose a natural and useful paradigm, Query-driven event search in News Information Network, to allow users to searchthrough news articles and obtain events as answers. We construct the newsinformation network with nodes representing actions and edges indicating re-latedness between nodes. We define a good answer as a structurally denselylinked and semantically related subgraph, describing a concise and informa-tive event. Thus our objective function combines both the linkage structureof graph and semantic content relatedness. We then formally define ourproblem to build a query-driven event search engine that processes usersquery and generates a subgraph satisfying following conditions: 1) coveringall keywords but without noisy nodes 2) maximizes the objective function.We prove this problem is NP-complete and propose heuristics algorithms.Our experimental evaluation on real-life news datasets demonstrates algo-rithms efficiency and meaningful solutions we obtain.iiPrefaceThis dissertation, and the project it covers, is fully original and unpublished.I came up with the idea of building an event search engine that determinesa densely linked, semantically related and query-covering subgraph with re-spect to keywords users are interested in exploring with the help of ProfessorLaks V.S. Lakshmanan and Pei Li. I prepared all datasets needed, did ex-tensive experiments, analyzed results and wrote this dissertation by myself.Professor Laks V.S. Lakshmanan and Pei Li were involved in the wholeprocess of brainstorming the idea, formally defining the problem, proposingheuristics algorithms and evaluating algorithms’ performance and results’quality.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . viii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.2 Key contributions . . . . . . . . . . . . . . . . . . . . . . . . 41.3 Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.1 Graph-based text summarization . . . . . . . . . . . . . . . . 52.2 Connectivity subgraph . . . . . . . . . . . . . . . . . . . . . 52.3 Dense subgraph . . . . . . . . . . . . . . . . . . . . . . . . . 62.4 Keyword search in graph . . . . . . . . . . . . . . . . . . . . 72.5 Answering relationship queries . . . . . . . . . . . . . . . . . 72.6 Team formation . . . . . . . . . . . . . . . . . . . . . . . . . 83 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 93.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93.2 Anaphora resolution . . . . . . . . . . . . . . . . . . . . . . . 103.3 Relation extraction . . . . . . . . . . . . . . . . . . . . . . . 123.4 Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133.5 Similarity between triples . . . . . . . . . . . . . . . . . . . . 143.5.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . 143.5.2 Textual similarity . . . . . . . . . . . . . . . . . . . . 15ivTable of Contents3.5.3 Semantic similarity . . . . . . . . . . . . . . . . . . . 153.5.4 Similarity combination . . . . . . . . . . . . . . . . . 163.5.5 Computation . . . . . . . . . . . . . . . . . . . . . . . 173.6 Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183.6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . 183.6.2 Density-based partitioning . . . . . . . . . . . . . . . 193.7 Processing running time . . . . . . . . . . . . . . . . . . . . . 204 Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214.1 Node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214.2 Edge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224.3 Graph overview . . . . . . . . . . . . . . . . . . . . . . . . . 225 Problem Formalization and Algorithms . . . . . . . . . . . . 255.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 255.2 Problem formalization . . . . . . . . . . . . . . . . . . . . . . 265.2.1 Problem statement . . . . . . . . . . . . . . . . . . . 265.2.2 Discussion about objective function . . . . . . . . . . 265.3 Heuristic algorithms . . . . . . . . . . . . . . . . . . . . . . . 275.3.1 Rare . . . . . . . . . . . . . . . . . . . . . . . . . . . 275.3.2 Edge . . . . . . . . . . . . . . . . . . . . . . . . . . . 285.3.3 Edge star . . . . . . . . . . . . . . . . . . . . . . . . . 295.3.4 Node star . . . . . . . . . . . . . . . . . . . . . . . . . 315.3.5 RF* and EG* . . . . . . . . . . . . . . . . . . . . . . 316 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 356.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356.2 Performance evaluation . . . . . . . . . . . . . . . . . . . . . 366.2.1 Query set generation . . . . . . . . . . . . . . . . . . 376.2.2 Parameter changing . . . . . . . . . . . . . . . . . . . 376.2.3 Query set size changing . . . . . . . . . . . . . . . . . 406.2.4 Running time . . . . . . . . . . . . . . . . . . . . . . 416.3 Case study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437 Conclusion and Future work . . . . . . . . . . . . . . . . . . . 47Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48vList of Tables6.1 Worldwide dataset news distribution . . . . . . . . . . . . . . 366.2 Verge dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 366.3 Statistics for worldwide and verge dataset . . . . . . . . . . . 366.4 Running time(ms.) for worldwide dataset . . . . . . . . . . . 426.5 Running time(ms.) for verge dataset . . . . . . . . . . . . . . 436.6 Statistics of generated subgraphs . . . . . . . . . . . . . . . . 44viList of Figures1.1 A graph example for specified query . . . . . . . . . . . . . . 33.1 Dataset processing workflow . . . . . . . . . . . . . . . . . . . 94.1 Graph overview . . . . . . . . . . . . . . . . . . . . . . . . . . 235.1 An example for algorithm 5 . . . . . . . . . . . . . . . . . . . 316.1 Statistics of G′ when α ranges between [0,1] . . . . . . . . . . 376.2 Diameter of G′ when α ranges between [0,1] . . . . . . . . . . 386.3 Weightratio of G′ when α ranges between [0,1] . . . . . . . . 396.4 Objective function value of G′ when α ranges between [0,1] . 396.5 Average and maximum edge weight of G′ when α ranges be-tween [0,1] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406.6 Number of nodes in G′ when query size ranges between [4, 10] 416.7 Diameter of G′ when query size ranges between [4, 10] . . . . 416.8 Weightratio of G′ when query size ranges between [4, 10] . . . 426.9 Objective function value ofG′ when query size ranges between[4, 10] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436.10 Generated subgraph for “snowden” scenario . . . . . . . . . . 456.11 Generated subgraph for “ukraine” scenario . . . . . . . . . . 456.12 Generated subgraph for “mh370” scenario . . . . . . . . . . . 46viiAcknowledgementsI would like to thank all people who give me help on this thesis. I am mostgrateful to my supervisor, Professor Laks V.S. Lakshmanan, for his super-vision, inspiration, trust and support in the past two years. His passionin research, keen insight on solving problems and diligence set up a greatexample to me. I am also grateful to Professor George Tsiknis, the secondreader of my thesis, for his editorial and technical suggestions. I would alsolike to express my gratitude to Pei Li for spending generous time on helpingme brainstorming and giving insightful discussions in meetings. I also sin-cerely thank Min Xie for his valuable suggestions on problem definition andexperiments. Finally and above all, I want to thank my friends and familyfor their unconditional love and support.viiiChapter 1IntroductionIn the current social web page, various kinds of information from news ormicroblogs is pouring to us everyday, and thus information overload has beena serious problem. To mitigate this problem, traditional search focuses onkeyword matching and document ranking to help users find the best answereasier. However, traditional search has little semantics aggregation, thusthere are still quite a few cases where users have to read numerous documentsespecially when the query contains multiple keywords and users intend toexplore the connections between them. To capture these connections, onegood way is to model news articles into a complex network of events, wherean event itself is a complex network of interrelated actions (or assertions).Event search enables users to get a big picture of essential actions in-volved without going through overwhelming documents. State-of-the-artkeyword search in social stream can group related posts into events[25] toavoid reading duplicated information. Yan and Kong et al. [41] investigatethe problem of generating news timelines, consisting individual but corre-lated summaries on each date, from massive data on the Internet to provideusers a high-level summary of event. However, these existing approacheseither neglect the connections between input keywords within events or lackthe interface for users to retrieve information with respect to queries they areinterested in. We thus propose a natural and useful paradigm, Query-drivenevent search, in which users can access key actions that occurred in vari-ous events with respect to input query and construct a new event capturingcorresponding actions and connections between them. This new approach isable to inherently revolutionize search from its traditional keyword matchingto a higher level of abstraction.Query-driven event search in News Information Network allows users tosearch through news articles and obtain events as answers. For example,for query (Snowden, Julian Assange, Hong Kong, Russia), top ten resultsreturned by traditional search are news articles covering only part of queriesand “Edward Snowden” Wikipedia page, and users won’t be able to under-stand clearly the relationships between keywords without reading all results,while our event search engine is able to return an event shown in figure 1.11Chapter 1. Introductionin which all keywords are covered by the union of assertions (represented asrectangles) and connections between them are specified. In figure 1.1, eachnode contains assertion (action it represents) , timestamp this action hap-pened, and frequency of this assertion appearing in the news corpus; edgeweight measures the degree of relatedness between two nodes. Our resultis more interesting in following reasons: 1) it provides an concise and infor-mative event with respect to input query; 2) connections between actionsare specified; 3) query expansion is possible when two query nodes (nodecontaining any keyword in query) need intermediate nodes (not query node)to connect; and 4) it can access key actions from various static events toconstruct a new event.Keyword Search in Graph attracts increasing attention as many real lifedatasets can be represented by trees or graphs. Possible outputs are either aspanning tree [15, 17, 19] or a subgraph with various constraints [4, 7, 13, 20]covering all input keywords. For instance, it can be used to find connectionsbetween nodes in social network [13] to explore deeper relationships betweenpeople, and in a World Wide Web dataset, it detects relevance between webpages [14]; Hulgeri et al. [17] enables users to extract information from arelational database by just typing a few keywords, following hyper links,and interacting with controls on the displayed results, without writing anycomplex query language. Therefore, we adapt Keyword Search in Graphtechnique to build the query-driven event search engine in News informationNetwork.For Keyword Search in Graph problem, there are always two basic ques-tions to ask: first, what is the semantics of keyword search in the graph;second, what constitutes a good answer[1]. Some previous research [21, 38]seek to find a subgraph that contains query nodes, and is densely-connected.Yet, they rank candidate results by either degree or distance. Neither ofthem have combined both the linkage structure of subgraph and the seman-tic content relatedness between nodes. In our news information network,each node is an assertion representing one specific action (rectangle in figure1.1), links between them stating that those actions are related with eachother and edge weight denotes the importance (details in section 3). Thus,in our problem, both the size of graph and semantic relatedness betweennodes matter because the former feature guarantees conciseness and thelatter provides informativeness. To provide such event search engine whichcan generate group of densely-linked, semantic-related nodes which also sat-isfy particular keyword-based queries, we propose a new ranking techniquecombining both structure density of graph and semantic relatedness betweennodes to rank candidate results.2Chapter 1. IntroductionAs a densely-linked and closely-related subgraph can provide a goodsummary of query-covering actions, this problem can be applied to manyreal-life applications: 1) For columnist who wants to analyze historical orstreaming events, especially trying to explore connections between a few keypersons or organizations, our query-driven event search can output a sub-graph connecting input keywords in the format of actions. 2) For politicianwho has a campaign to launch, our framework can help analyze rival’s ac-tions towards certain policies or specific proposal, and prepare him or herbetter. 3) It enables Ordinary users to figure out how interested keywordsconstruct an event and the to explore connections between them in a shorttime without all related news articles.Therefore, we define our problem as follows: given a formally writtennews corpus, we build a query-driven event search engine which is able toprocess user’s query and generate actions covering input keywords and toprovide relationships between them. To achieve this, we first build an undi-rected and weighted news information network from input news corpus andthen seek to find a structurally dense, semantically related and connectedsubgraph which covers all input keywords. In this paper, we will detailedlydescribe how to build a news information network from raw news data, for-mally define the problem, propose a few heuristics algorithms to solve it,and experiment on three real-life datasets to evaluate our approaches.t o_exampleWikileaks founder Julian Assange attempts to broker a deal to grant Mr. Snowden asylum in Iceland(2013/06/20)-11Mr. Snowden applies to Russia for political asylum(2013/07/01)-18Mr. Snowden flies from Hong Kong to Moscow as extradition pressure builds(2013/06/22)-13108Figure 1.1: A graph example for specified query31.1. Challenges1.1 ChallengesThere are several challenges for this work.• how to build a news information network from raw news data, specif-ically what qualifies as a node (semantics content unit) and a connec-tion between nodes.• how to define a good answer for input queries, and design efficientalgorithms to generate it• how to evaluate algorithms by comparing with existing work1.2 Key contributionsIn a summary, our contributions are as follows:• To our best knowledge, we are the first to build a news informationnetwork from raw news article, and study the problem of query-drivenevent search on this graph• We formally define our problem and prove it to be NP-complete• We propose a few heuristics algorithms to solve the problem, andexperiment with real-life dataset showing our algorithm can outputmeaningful answers1.3 RoadmapThe outline of this thesis is arranged as follows: The next chapter will listthis problem’s related work. In the third chapter, we introduce the method-ology to process raw news articles to news information network. Then weexplain details in our data model in the fourth chapter. The following chap-ter formally defines our problem, proves the hardness, and proposes a fewheuristic algorithms. Then extensive experiments on real-life datasets andevaluations are presented. In the end, we conclude our work and discussavailable future work.4Chapter 2Related Work2.1 Graph-based text summarizationText summarization generally refers to the process of automatically gener-ating a compressed, yet with preserved important information, version ofthe given text[9]. It broadly consists of two approaches: extractive andabstractive summarization. Graph-based methods belong to extractive ap-proach, and are proposed to rank units based on “votes” between each other.For example, Erkan et al. [11] introduces a stochastic graph-based method,similar to PageRank and HITS, to compute sentence’s importance basedon the centrality concept in graphs. Mani et al. [28] uses a graph, with asingle word as node and several kinds of links, to represent one document,aiming to find similarities and dissimilarities in pairs of documents. Thiswork is related to our work in terms of using a graph to process raw text(news articles), building connections between text units (nodes) and adopt-ing graph attributes to measure nodes’ prestige. The difference is that allthe above papers focus on selecting exact words or sentences in text to forma compressed and coherent summary of the input text, while our work treatsassertion as text unit, introduces the notion of action, and seeks to find anevent composed by actions with respect to our query keywords.2.2 Connectivity subgraphFaloutsos and McCurley[13] addresses the problem of extracting a connec-tion subgraph to best capture the relationship between two nodes in a socialnetwork based on the measure of electrical-current flows. Tong et al. [39]uses the related notion of random walks to find nodes and subgraph thathave strong connections to all or most of input query nodes. They sharethe task to extract subgraph that is not only near to query nodes based ontheir proximity measure, but also connects query nodes. Cheng et al. [8]then put the object connection discovery problem in a context-aware situ-ation , where object has a context. They fisrt partition the original graph52.3. Dense subgraphinto a set of communities based on the concept of modularity, and studyobject connection in intra-community and inter-community. Kasneci andElbassuoni [20] set this in an entity-relationship diagram, and focuses onextracting informative subgraphs with respect to query nodes.Our approach differs from above research in following reasons: 1) we aremore interested to extract the best connection graph that connects our querynodes by adopting graph diameter and edge weight features; 2) The fun-damental difference between our news information network structure(edgeweight semantic meaning etc.) and theirs allows us to formalize our problemin a different way.2.3 Dense subgraphAngel et al. [3] maintains dense subgraph under streaming edge weightupdates and outputs dense subgraph periodically. This relates to our workin the way that they output groups of tightly-coupled entities as a story,while we treat a group of actions as an event. Dourisboure and Geraci [10]focuses on extracting a dense subgraph as cyber-communities(sets of citesand pages sharing a common interest) from World Wide Web (WWW).They also come up with new heuristic algorithms which demonstrate highscalability and effectiveness. [14] presents a new algorithm, based on arecursive application of fingerprinting via shingles, to charaterize large anddense subgraphs of a graph showing connections between hosts on WWW.Their algorithm is extremely efficient, capable of handling graphs with tensof billions of edges on a single machine with modests resources. Quite afew researches put constraint on subgraphs. Asahiro et al. [4] aims tofind a k− vertex (k ≥ 3) subgraph with maximum weight and derives tightbounds for k in different ranges. Charikar [7] studies the problems of findingsubgraphs maximizing notions of density for undirected and directed graphs.Andersen et al. [2] specified lower or upper bounds on the number of verticesin the subgrah, gave a 13 -approximation algorithm for the former problem,and proved there is no good approximation algorithm for the latter one.Our work also outputs a dense subgraph which, however, needs to coverall input query keywords. Moreover, we define our own notion of density,which combines structural density and semantic relatedness between nodes,instead of only using average node degree [7], diameter [21] or maximumweight [4].62.4. Keyword search in graph2.4 Keyword search in graphUnlike traditional graph search focusing more on graph’s structure ratherthan the semantic content, keyword search aims to extract a group of denselylinked nodes in the graph which satisfy particular keyword-based query[1].Hulgeri et al. [17] enables keyword-based search on relational databases, andoutputs rooted trees connecting tuples that match individual keywords in thequery as answers. Hao et al. [15] focuses on implementing efficient rankedkeyword searches on schemaless node-labeled graphs, and offers order-of-magnitude performance improvement over existing approaches. Our projectalso enables users to extract information without any knowledge of schema,yet we output compact subgraphs instead of rooted trees as answers. Sozioand Gionis [38] seek to find a subgraph that contains all query nodes andis densely connected. They measure density based on minimum degree anddistance constraints, and propose corresponding heuristic algorithms. Ourwork not only constrains the structure density of generated graph, but alsoconsiders the semantic relatedness between nodes. In addition, there aremultiple nodes corresponding to one query in our work, which is a challengein searching efficiency.2.5 Answering relationship queriesLuo and Tang [27] seeks to find relationships between two entities(people,places, institutes or documents etc.) on the Web. To filter massive noise inthe web pages without losing useful information, they identify potential con-necting terms capturing relationships between entities by assigning weightsto terms. It presents users with top ranked web page pairs based on the like-lihood of two web pages mentioning the relationship between entities andhighlights connecting terms to clarify relationships between query entities.Our work relates to this paper in the motivation of connecting relationshipqueries, but is different than this paper in the following aspects: first, wedecompose raw documents into triples(assertions), build a news informa-tion network to model connections(relatedness) between them, and outputsubgraph to connect input queries; secondly, this paper only deals with twoqueries, while our work has no upper limit on the number of input queries.72.6. Team formation2.6 Team formationLappas et al. [21] solves the problem of team formation: given an undirectededge weighted graph where each node is labeled with a set of skills (text)and edges represent communication cost between nodes they connect, thetask is to find a subgraph which covers all skills required and minimizes thecommunication cost at the same time. Our problem is most related to thispaper, and we actually prove the hardness of our problem(see Section 5.1)by reducing to their problem. Yet, our edge weight has the exact oppo-site semantics, and we aim to combine the structural density and semanticrelatedness between nodes, which leads to a clearly different problem thantheirs.8Chapter 3Data Preprocessing3.1 OverviewThis chapter will introduce approaches we take to process raw news text intoa new information network. The general work flow is shown in figure 3.1,and each part’s motivation and methodology will be introduced separatelyand detailedly in each section. Before getting into processing details, wedefine our worked-on text unit triple as follows:workflowNews CorpusNews Corpus with pronouns replaced by name entitiesClean assertionsAssertions with  article info  and timestampGroups of merged assertionsAssertion GraphAnaphora ResolutionRelation ExtractionSet each group as node, co-occurrence of assertions in same article as edgeClusteringNoisy assertion filteringFigure 3.1: Dataset processing workflowDefinition 1. (triple) A triple Γ consists of three parts of meta data: subjects, predicate p, object o, denotes as:Γ = (s; p; o)s and o are entities, and p denotes the relationship between them. Eachtriple Γ represents an action. Each triple Γ has following features:• (news id) Γid means the id of news from which this triple is generated• (timestamp) Γt means the timestamp of this triple(or news).• (description) Γd describes the fact this triple represents.93.2. Anaphora resolutionPrevious work mostly refer to sentence [11, 35, 41] as text unit to workon , others use entities([3, 27]) and relationships between them to exploreinteresting connections from the web. We extract triple from text for thefollowing reasons:• we aim to provide a subgraph(an event) with multiple actions, whichare able to tell users what happened between those queries speci-fied. Sentences in news articles are too long, while entity withoutrelationships convey little information. Triple instead can present afact clearly in the shortest length.• with text volume increasing, number of sentences will increase linearly,while triple volume will increase in the beginning and gets stable atsome point because of the fact that a triple transmits over web moreeasily than a sentence and later-detected triple has been found outalready.Thanks to above reasoning, we refer to triple as text unit. However, notall triples convey enough information to represent an action. For instance,triple (she; ate; one apple) provides little information to us as we won’t beable to know who she is without context. Thus, we define wise triple asbelow:Definition 2. (wise triple) A wise triple Γ needs to satisfy following condi-tions:• both subject s and object o are not pronounsTo get more wise triples from raw text, we first adopt Anaphora Res-olution to replace pronouns to their equal entities, and then use RelationExtraction to extract triples.3.2 Anaphora resolutionHalliday & Hasan 1976 defines anaphora as cohesion which points back toprevious item. It refers to the interpretation of one expression depends onits antecedent, the entity being pointed back to. The process of identifyingan anaphora’s antecedent is then called anaphora resolution[33](aka. coref-erence resolution). For example:Russia says it will give Ukraine a new tranche of a $15bn aid package11All examples in the thesis are from real-life news103.2. Anaphora resolutionIn this example, it is the anaphora, and Russia is the antecedent.In formal written news articles, it is quite common to have anaphorasituations, and the antecedent-search scope is usually located at the currentand preceding sentence. The reason we need anaphor in processing thenews article is because quite a few tuples we get from Relation Extraction3.3 starts with pronouns like it, she, he, and it is impossible to get whatthese pronouns refer to without context. For example:Russia had hinted it would freeze the aid until a new government is formedafter Ukraine’s PM quit last month.From relation extraction introduced in section 3.3, generated triples are Γ1= (it; would freeze; the aid[enabler=until a new government is formed afterUkraines PM quit last month]) and Γ2 = (Ukraines PM; quit in; last month).Without reading the whole sentence, we would have no idea who did theaction of freezing the aid, and Γ1 will be tossed later because it technicallyprovides no information. To avoid getting rid of valuable triples, we applyanaphora resolution to news articles so that all pronouns are replaced withtheir antecedents before relation extraction. Then the previous sentencebecomes the following:Russia had hinted Russia would freeze the aid until a new government isformed after Ukraines PM quit last month.Then generated triples become Γ1 =(Russia; would freeze; the aid[enabler=untila new government is formed after Ukraines PM quit last month]) and Γ2 =(Ukraines PM; quit in; last month), both of which are wise triples.In our project, we apply Standford Deterministic Coreference ResolutionSystem[23], which automatically identifies coreferring entities in text, as toolto solve this problem. Its system incorporates lexical, syntactic, semanticand discourse information in coreference resolution models[24], and rankedfirst at the CoNLL-2011 shared task2. It first identifies all mentions ofentities in the text, and then cluster them into equivalence classes. Basedon generated equivalence classes, pronouns among them can be replaced byname entities, which provides more information, as introduced before. Takethe following text for example:President Barack Obama and his Democrats must agree to delayimplementation of Obama’s health care law.2http://conll.cemantix.org/2011/113.3. Relation extractionThis system generates equaivalence classes as follows:1 CHAIN1-[“Barack Obama” in sentence 1, “his” in sentence 1, “Obama ’s”in sentence 1]3 CHAIN3-[“President Barack Obama and his Democrats” in sentence 1]4 CHAIN4-[“President Barack Obama” in sentence 1]5 CHAIN5-[“his Democrats” in sentence 1]7 CHAIN7-[“implementation of Obama ’s health care law” in sentence 1]8 CHAIN8-[“Obama ’s health care law” in sentence 1]From CHAIN1, where all entities belong to the same class, we are ableto replace pronoun “his” as “Obama ’s”. Therefore, with each news articleas input, we can get all equivalence classes containing multiple entities, andreplace pronouns with corresponding name entity.3.3 Relation extractionRelation Extraction is the task of extracting semantic relationships betweena set of entities from text3. In this project, triples in the format of (s; p; o)are needed to extracted from news articles. Generally, two kinds of Informa-tion Extraction(IE) systems are available for this; a traditional IE systemneeds relations of interest to be specified in advance, while open IE onescan generate relations between entities without providing domain-specificknowledge or target of relations.We refer to Ollie [30], a state-of-art open IE system, to generate binaryrelationships from news text. There are other advanced open IE systemsavailable, such as SONEX [31], TreeKernel [40]and EXEMPLAR[32]. Thereasons that we choose Ollie over these toos are as following:• EXEMPLAR and SONEX only extract relationships between nameentities(people, organization etc.), thus fewer than five tuples can begenerated from one news article, while Ollie can extract more thanhundred tuples.• TreeKernel is supervised approach and the lacking of training examplesmakes it hard to use for our work.• Ollie has extra features compared with other tools, which are:– it calculates confidence score for each triple it generates, whichshows how correct this triple could be.3http://en.wikipedia.org/wiki/Relationship extraction123.4. Filtering– it captures enabling condition, a condition that needs to be metso the tuple is true4. For example, for sentenceShe will move out if rent keeps risingTriple (She; will; move out) will be true when condition if rentkeeps rising is present.– it captures attribution clause which specifies an entity that as-serted an extraction and a verb that specifies the expression [30].A sentence like below:Russia has miscalculated over Crimea, says Hagueget result: (Russia; has miscalculated over; Crimea)[attrib=Haguesays]– When some relations are expressed without verb, like:Edward Snowden is the leaker of NSAOllie is able to get triple (Edward Snowden; be the leaker of;NSA).3.4 FilteringEach news article can averagely generate one hundred triples; however, notevery triple is wise triple. Take following sentence as example:Mr Cameron urged Mr Putin to support the formation of a contact groupthat could lead to direct talks between the governments of Russia andUkraine.Generated triples are:• Γ1 = (Mr Cameron; urged Mr Putin to support; the formation of acontact group); confidence score: 0.8• Γ2 = (Mr Cameron; urged; Mr Putin); confidence score: 0.75• Γ3 = (direct; talks between; the governments of Russia); confidencescore:0.2624https://github.com/knowitall/ollie133.5. Similarity between triplesΓ1 and Γ2 are able to deliver enough information without other context,while Γ3 can’t. To ensure high quality of the news information network,here are a few filtering criteria:• Confidence Score: Ollie[30] provides confidence score for each triple itgenerates. We observe that any triple with score under 0.7 is of lowquality, so we get rid of those triples.• Predicate filtering: Triple with “said in”, “said at” only contains dateor location information, which is not enough to describe one fact• Not a wise triple3.5 Similarity between triplesAfter generating triples from news articles, we observe that different newsarticles may contain the exact same or very similar triples, such as (EdwardSnowden; be the leaker of; NSA) and (Snowden; be the leaker of; NationalSecurity Agency). These two triples are presenting the same fact and thusneed to be merged as one triple. Similarity in this thesis has two flavors,one is textual similarity, which measures the ratio of exact common wordsbetween triples, and the other one is semantic similarity with the idea ofcomparing the likeness of triples’ meaning5.3.5.1 PreprocessingA triple averagely consists of ten words, thus noisy words, if any, will bringhuge bias on similarity score. Common words, like “the”, “so”, “and”,are the most frequent words in a document[36]. Even though they helpconvey the meaning, they don’t have much importance compared with otherrepresentative words. Treating the whole triple as a bag of words [29] willblur this triple’s focus and get lower accuracy. To get accurate result, werefer to [34] to focus only on entity words. We obtain entities from a tripleusing a Part-Of-Speech Tagger 6, and we change the noun to single formwhen it is plural(POS tag “NNS” or “NNPS”). Also, if there are duplicateentities in the tuple text(very rare), just count one. In the news corpusdataset we used in experiments (see chapter 6), each triple contains on theaverage 6.2 entities. For further reference, ΓL will denote as the list ofentities triple Γ’s text contains.5http://en.wikipedia.org/wiki/Semantic similarity6POS Tagger, http://nlp.stanford.edu/software/tagger.shtml143.5. Similarity between triples3.5.2 Textual similarityFor triple Γ1 and Γ2, Jaccard Similarity measures similarity between thesetwo finite sets (Sim(Γ1,Γ2)) by the ratio of the size of intersection of Γ1 andΓ2 to the size of their union[36], that is:Sim(Γ1,Γ2) =|ΓL1 ∩ ΓL2 ||ΓL1 ∪ ΓL2 |(3.1)Local ComparisonEach triple consists of Subject, Predicate, Object. Local Comparison is theaverage of the similarities of three parts of the triple. Then:Sim(Γ1,Γ2)local =Sims + Simp + Simo3(3.2)Global ComparisonGlobal Comparison is to regard this triple as a whole sentence rather thanthree parts and then compare. The reason behind this is because there arequite a few passive-format triples, like “the agreement; is signed by ; Putinand Obama”. If compared with “Putin and Obama; signed; this agreement”,local comparsion will conclude that they are not similar at all, while in factthey are talking about the same subevent. To prevent miscalculating thesetuples, we choose global similarity over local one as Jaccard Similarityscore.3.5.3 Semantic similarityTo measure the degree of semantic similarity, or more generally relatedness,between two lexically expressed concepts is a problem involved in many com-putational linguistics tasks [6]. Semantic relatedness covers more aspectsthan semantic similarity in the sense that the latter often only measuresthe likeness between entities, while the former regards antonymy (cellphone-battery), antonymy (love-hate) and some other relationships as relatednessmetrics. As we are only interested in clustering similar triples when theyrepresent the same fact, semantic similarity provides more accurate results.We refer to WordNet 7 as knowledge source to calculate semantic sim-ilarity score between triples. A few measures are available for this, such7http://wordnet.princeton.edu153.5. Similarity between triplesas HirstSt-Onge [16], LeacockChodorow[22], Resnik[37], JiangConrath[18],Lin[26]. Among these five, Hirst-St-Onge measures semantic relatednessas it uses all relationships in WordNet, and remaining four ways calculatesemantic similarity in that they only use the hyponymy relation. [6] exper-imentally compared these five measures, and conclude that JiangConrathperforms the best overall, followed by Lin, Leacok and Chodorow; Resnikis seriously under-related, and Hirst-St-Onge seriouly over-related. Lin usesthe same elements as Jiang-Conrath, which calculates semantic distance in-stead of semantic similarity. As Lin provides more promising score thanJiang-Conrath in our dataset8, we choose Lin for simpler combination withtextual similarity.Lin [26] defines similarity function as follows:sim(x1, x2) =2× logP (C0)logP (C1) + logP (C2)(3.3)where C is a synset, word x1 ∈ C1, x2 ∈ C2, C0 is the most specific classthat subsumes both C1 and C2, and P (C) is the probability of encounteringan instance of a synset C in specific corpus.3.5.4 Similarity combinationTo combine textual and semantic similarity, we propose the following simi-larity function to calculate the final similarity score of triple Γ1 and Γ2:S(Γ1,Γ2) =∑w2∈ΓL2maxw1∈ΓL1sim(w1, w2)|ΓL1 |+ |ΓL2 | − |ΓL1 ∪ ΓL2 |(3.4)where ΓL denotes the entity list of triple Γ, |ΓL| is the length of thelist, sim(w1, w2) is the semantic similarity score between words w1 and w2(When w1 = w2, sim(w1, w2) = 1), and |Γ2| ≤ |Γ1|.Intuition behind this formula is a combination of extractive and abstrac-tive approaches in text summarization. Extractive approaches focuses onextracting exact words, phrases, sentences or even paragraphs from text,and abstractive approach needs to rephrase the text using NLP techniques.Wordnet takes semantic co-relations between words into consideration, whichlifts the similarity score between triples which are telling the same fact whiletelling it in different ways.8Check and compare manually163.5. Similarity between triples3.5.5 ComputationWe generate more than around four million triples from news corpus usingexisted NLP packages(see section 3.3) 9. Pair-wise computation complexityis O(n2), which is extremely expensive in this scale. Also, it does not makemuch sense to compare triple Γ1 from event A with triple Γ2 from event B.However, we are not able to differentiate them before the news informationnetwork is created, then using existing techniques is impossible to locatewhich triples to compare based on the fact that they belong to one event.As explained in section 3.5, each triple is represented by a set of en-tities. We then borrow the Linkage Search idea from [34] where the unitfor computation can also be represented with entities. We first constructa triple-entity bipartite graph and then perform a two-step walk to locatewhich triples to compute. Take one uncomputed triple Γ for example, thefirst step is to choose an entity e1 in ΓL, and then find all triples connectedto e1, whose count ranges from 1 to 87228. The union set of neighbors forall entities in Γ form the triple group Γ should be compared with, averagelynumbering 68599, which is still very expensive.To reduce computation cost, we propose the following algorithm(line 4 toline 9 in algorithm 1): suppose the similarity score threshold is τ , |ΓL1 | = l1,|ΓL2 | = l2, and denote |ΓL1 ∩ ΓL2 | = |com|. Then we ideally want that if Γ1and Γ2 are similar to satisfy the inequality::|com||ΓL1 |+ |ΓL2 | − |com|≥ τ (3.5)Suppose we fix Γ1, we are not able to know |ΓL2 | as we are trying to avoidcheck every triple |ΓL1 | connects. Thus we can tighten equation 3.5.5 basedon (|ΓL1 |+ |ΓL2 | − |com|) ≥ |ΓL1 |, and we get equation:|com||ΓL1 |≥ τ (3.6)Therefore, |com| ≥ τ |ΓL1 | indicates that at least τ |ΓL1 | entities in ΓL1has to be contained by ΓL1 ∩ ΓL2 . In this case, when checking if S(Γ1,Γ2) isat least τ , instead of picking dτ |ΓL1 |e entities in Γ1 to see if they also existin Γ2, we can do the opposite by checking the remaining |ΓL1 | − dτ |ΓL1 |e+ 1entities. If these entities are not in |Γ2|, then S(Γ1,Γ2) will definitely be less9Some news articles generate more than one hundred triples, and also the headline willbe counted three times173.6. Clusterthan τ . Based on this theorem, we pick |ΓL1 | − dτ |ΓL1 |e+ 1 entities with theleast size of support set(a set of triples connecting to this entity) from ΓL1 .The union of these entities’ support set is Γ1’s candidate set. The averagesize for each triple is 2553.Algorithm 1: Similarity Score Computationinput : {ΓL}, τoutput: Each Γ’s similarity score with potential similar triples1 Build one triple-entity bipartite graph Bi;2 for each ΓL do3 // Generate candidate triples;4 for each entity e do5 S(e)← triples connected to e in graph Bi6 rank S(e) based on set increasing size ;7 Ψ(Γ)← φ ;8 for each first |ΓL| − dτ |ΓL1 |e+ 1 entities e do9 Ψ← S(e)10 // Calculate;11 for each triple Γ′ in Ψ do12 use equation 3.5.4 to calculate S(Γ,Γ′)3.6 Cluster3.6.1 IntroductionGenerated triples represent various actions . As the same action is referredto multiple times in one news article or different ones, such as Edward Snow-den is the leaker of NSA has been brought up for more than 1,000 times inour dataset, clustering similar instances into groups is necessary.To fix context and aid in understanding, we assume each triple is a datapoint belonging to the same space, and we examine this collection of points,then group them into clusters based on the similarity score (see section3.5) between each other. What we want to achieve from clustering is thattriples in each group have high similarity between each other and are able torepresent the same action, and different groups are talking about different183.6. Clusteractions.Clustering techniques can be generally divided into hierarchical and par-titioning way. Hierarchical algorithms, as its name suggests, seeks to builda hierarchy of clusters. It consists of “bottom up” approach, starting withits own cluster and paring up clusters as one to move up the hierarchy;and “Divisive” approach, which starts from one cluster and moves down thehierarchy as recursively splitting clusters10. This process stops until somecriterion is made, such as the number of clusters requested. Partition algo-rithms focuses on discovering dense connected components of data, so theyare less sensitive to outliers and quite flexible in terms of clusters’ shape.As hierarchical technique requires some termination criteria which aredifficult to establish in our domain, we choose cluster tuples with partitionway by which we don’t need to specify the number of clusters beforehand.3.6.2 Density-based partitioningThe intuition behind Density-Based Partitioning lies in the fact that anopen set in the Euclidean space can be divided into multiple connectedcomponents[5]. One connected component is defined as one cluster, and cangrow in any direction the density leads. Thus, Density-Based algorithms areable to discover clusters with arbitrary shapes, and also the stopping criteriais quite clear that as long as dense components are properly connected.The algorithm DBSCAN (Density Based Spatial Clustering of Applica-tions with Noise) [12] is one of the most common Density-Based algorithms.It requires two parameters, a given distance  and the minimum pointsneeded to form a dense region(minPts). It arbitrarily starts with an unvis-ited point, retrives its -neighhood points, and starts a cluster if minPts issatisfied. All points’ -neighhood belong to the same cluster, and this clustergrows until all density-connected points are found. Then start with a newunvisited point, repeat the process until all points are visited. DBSCAN’srunning time complexity is O(n2) without using an accelerating index struc-ture, and it needs O(n2) memory. Also DBSCAN’s clustering quality reliesmuch on the minPt- combination, which may be not appropriate for allclusters.To reduce parameter’s overall control and algorithm’s complexity, we ap-ply a more straightforward density-based algorithm to cluster. As we mea-sure similarity between triples(introduced in 3.5), we set similarity thresh-old τ instead of  between points as Connectivity parameter. Each point,10http://en.wikipedia.org/wiki/Hierarchical clustering193.7. Processing running timea triple, is isolated in the Euclidean space first, and then an edge is addedbetween two points if their similarity score is greater than τ . After checkingall edges, connected components are clusters. Density parameter minPts inour algorithm is not used to set the minimum number of nodes in a neigh-borhood to grow forward, but the number of nodes to present as a clusterat the end. We only store point pairs with similarity score greater than τin memory, thus it costs O(|E|) memory, and O(|E|) running time(|E| isthe number of edges of this graph). Pseudo Code of this algorithm is inAlgorithm 2 line 1 to line 5(see section 4.3):Line 6 in Algorithm 2 removes a triple with only one degree. This stepgets rid of noisy nodes that connects two not-that-relevant clusters together.To aid explain how to build news information network, we define triplecluster as follows:Definition 3. (triple cluster) A triple cluster ∆ is a group of similar triplesrepresenting the same action. ∆ = {Γ1,Γ2,Γ3....Γm}.3.7 Processing running timeMost of the dataset processing time is determined by Anaphora Resolution insection 3.2 and Relation Extraction in section 3.3. Even though we adoptedmulti-threading processing, it still took around seven hours to fully processWordWide dataset(210,206 news articles). As we rely on external packagesin this step, there is little we can do to speed up the process. This processis regarded as off-line preparation for the assertion graph on which we is-sue query keywords and determine densely linked and semantically relatedsubgraphs.20Chapter 4Data ModelDefinition 4. (Assertion Graph) Given n triple clusters with each triplewithin cluster labeled with its news id, timestamp and description, co-occurrencethreshold τc, the graph created based on the following rules is called an As-sertion Graph:• Each triple cluster is regarded as one node.• If node µ and ν share more than τc triples labeled with the samenews Id, add one edge between node µ and ν.4.1 NodeNode-related features are listed as follows:• Node/vertices: For each triple cluster ∆ we generated from clusteringin section 3.6, the graph has a corresponding node ν∆. We also referto it as an action. Each node has a specific id.• Node weights: We assign a weight N(ν∆) to each node ν∆ in the graph.It shows the popularity or prestige of this node. The value is equalto the frequency this triple has appeared among all raw text, whichis also the size of this triple cluster ∆. The higher the weight is, themore popular or important this node is.• Node description: We choose the best description D(ν∆) for node ν∆.One cluster contains various similar descriptions. We pick the onewith the highest frequency as this node’s description.• Node timestamp: A triple cluster contains multiple triples, and somultiple timestamps. We assume that the earliest one among themis the time this fact happens, and others are just referring to it infollowing news articles. We denote νt to specify the timestamp.214.2. Edge4.2 EdgeEdge-related features are listed in the following:• Edge: For each pair of nodes µ and ν, if they share more than τc11triples labeled with same news Ids(the number of co-occurrences innews), we create an edge between them.• Edge Weight: We use ωuv to denote the edge weight between node µand ν. ωµν is equal to how many triples with same news Id are sharedbetween µ and ν. We assume that when two facts are referred in thesame news, they are semantically related to each other. The greaterthe edge weight is, the more semantically related these nodes are.• Time-sensed Edge Weight: The time difference between node µ and νrepresents how much time lapses between two facts. If fact µ happensa few months later than ν, we say they are not as close as the pairin which one fact happens right after another one. Therefore, theless time-lapse between two nodes, the closer they are. To combinetime feature into Edge Weight, we define Time-sensed Edge Weightfor nodes µ and ν as follows:ωt(µ, ν) ={ω(µ, ν) |µt − νt| ≤ 2ω(µ,ν)log2(|µt−νt|)otherwise(4.1)4.3 Graph overviewWe build the assertion graph following Algorithm 2. Line 1 to 5 explainsthe way to cluster single triple into clusters. Line 6 removes noisy triples.Then we treat each triple cluster as one node, and connect edges betweennodes if they co-occure more than τc times(line 8 to 14). Figure 4.1 is asubgraph extracted from the assertion graph we build from dataset 1 (seetable 6.1). As explained, each rectangle represents one node with featuredescription, timestamp and frequency. The number on the edge shows thefrequency they co-occur in one article.11τc is determined experimentally224.3. Graph overviewgraph_overviewThe peninsula defending soldiers inside said Russian troops forced their way into a Ukrainian marine base(2014/03/24)-12Lavrov to hold a first meeting(2014/01/13)19Moscow had massed 20,000 soldiers(2014/03/24)-8G7 leaders met on the sidelines of the Nuclear Safety Summit(2013-03/23)-14Kiev circulated a draft resolution(2014/03/24)-8Kiev and the West denounced the annexation as illegal(2014/03/24)-12Leaders of the Group of Seven nations agreed to hold their own summit(2013/03/25)-38810488124Figure 4.1: Graph overview234.3. Graph overviewAlgorithm 2: Building assertion Graphinput : {Γ} with corresponding similarity scores, τs, τcoutput: Assertion Graph G1 Initiate a graph Gc with isolated Γ;2 for each Γ in {Γ} do3 for Γ′ in Γ’s candidates do4 if S(Γ,Γ′) ≥ τs then5 connects one edge between Γ and Γ′6 remove Γ with only fewer than three degree;7 Each connected subgraph is one group with multiple similar triples,denoted as ∆(triple cluster)8 Initiate a graph G with each ∆ as one node;9 for each node µ in G do10 for each node ν 6= µ do11 c =∑Γ1∈µ,Γ2∈ν(Γ1.newsId = Γ2.newsId);12 if c ≥ τc then13 add add edge between µ and ν;14 set ωt(µ, ν) based on equation 4.1;15 Output G as the Assertion Graph;24Chapter 5Problem Formalization andAlgorithms5.1 PreliminariesAs we discussed in section 4.1, each node is associated with an assertion,which can be divided into multiple keywords. We assume assertion graphconsists of n nodes V = {ν1, ν2, ..., νn}, and let K = {κ1, κ2, ..., κm} be auniverse of m keywords. Each node νi is associated with a set of keywordsAi ⊆ K. If keyword κj ∈ Ai, we say that node νi contains keyword κj ;otherwise node νi does not contain keyword κj ; We say that a subset ofnodesV ′ ⊆ V contain keyword κj if there exists at least one node in V ′ thatcontains keyword κj .A query Q is a subset of keywords specified by users to find connectionsbetween them in the assertion graph. We assume that Q ⊆ K as we won’tbe able to cover any keyword which is not in K. Followed [21], we denotethe cover of a set of nodes V ′ with respect to query set Q as C(V ′, Q), andC(V ′, Q) = Q∩ (∪i∈V ′Ai). This specifies the intersection of query set Q andthe union of keywords associated with node subset V ′.We denote any subgraph as G′ = (V ′, E′), where V ′ ⊆ V and E′ ⊆ E.d is the diameter (the longest shortest path between any two nodes in thegraph) of G′. W ′ =∑e∈E′ we, representing the sum of weights of all edgesE′ in subgraph G′. wmax is the maximum edge weight in G′, and |E′| is thenumber of edges. To differentiate different kinds of nodes in the subgraph,we define query node and intermediate node as follows:Definition 5. (Query node and Intermediate node) Query node is the nodewhose set of keywords contains at least one keyword in query Q. Intermedi-ate node is the node whose set of keywords does not contain any keyword inquery Q.255.2. Problem formalization5.2 Problem formalization5.2.1 Problem statementProblem 1. (Query-driven Event Search)Given a news corpus, a set ofkeywords Q, an objective function f , we seek to build an undirected, weightedgraph G = (V,E) first and find a connected subgraph G′ = (V ′, E′) such that:1. C(V ′, Q) = Q2. all intermediate nodes have to be on a path connecting query nodes3. f(G′) is maximized among all candidates for G′The objective function is defined as follows:f(G′) = α1d+ (1− α)W ′|E′|wmax(5.1)where α is the parameter trading off two parts’ importance, d is thediameter, W ′ is the total edge weight, |E′| is the number of edges, and wmaxis the maximum single edge weight of generated graph G′.Theorem 1. Problem 1 is NP-complete.Proof. We prove this by a reduction from the Team Formation (TF)problem.An instance of the TF problem contains a set of n individuals X = 1, ..., n,a graph(X,E), and task T, find X ′ ⊆ X, so that C(X ′, T ) = T , and thediameter of X ′ is minimized. It is easy to see that TF is a special case ofProblem 1 by setting α = 1, in which case, f is maximized exactly whenthe diameter is minimized. Then there is a solution for TF problem only ifthere is a solution for our problem, and vice versa.5.2.2 Discussion about objective functionThe objective function f makes a combination of structure connectivity andsemantic relatedness of a graph. To relatively explain these two parts: 1)Diameter 1d limits the distance between pairs of nodes in generated graphG′. The smaller d is, the more connected G′ is. As d is normally in therange of [1, 10], 1d is naturally in the range of (0,1].1d can also be regardedas dmind where dmin is the minimum diameter of a generated graph (exceptfor single isolated nodes). 2) In the latter part, W′|E′| represents the average265.3. Heuristic algorithmsedge weight of G′, W′|E′|wmaxthus is the ratio of average edge weight andmax edge weight of G′. As introduced before, wµν between nodes µ andν measures the degree of their semantic relatedness, which is a function ofnumber of co-occurences and their time differences as in function 4.1. It maybe unfair to compare the total edge weight of two different subgraphs as thefrequency of their nodes’ text being referred in the dataset influences edgeweight. However, measuring the ratio of average edge weight and maximumedge weight is independent on other features, thus makes it the good way toconstrain the semantic relatedness of a graph. Also W′|E′|wmaxis in the rangeof (0,1], which avoids the possibility that any part dominating the result oftheir linear combination.5.3 Heuristic algorithmsIn this section, we propose three heuristic algorithms for query-driven eventsearch problem:EdgeGreedy (Edge), EdgeGreedy star(EG*)and RarestFirst star(RF*).Another fourth algorithm RarestFirst(Rare) is baseline from [21]. Edge dif-fers from Rare in the way in which the next node is added to the graph.The change between Edge and EG* lies in two process called edge star andnode star. Details of all algorithms are explained in this section.5.3.1 RareAlgorithm 3 (Rare) is a variation of the algorithm presented in [21], weregard it as a baseline of our problem. First, for each keyword query q ∈ Q,we compute S(q), which is the set of nodes that contain q, defined as supportset. Then, we rank all support sets by size and choose one node from theminimum size support set to start building the solution subgraph. As V ′ inoutput graph G′(V ′, E′) must contain at least one node in each set S(q), weassume smaller size set has more priority. p(ν, S(q)) refers to the shortestpath between node ν and every node in set S(q), that is:p(ν, S(q)) = minµ∈S(q)p(ν, µ) (5.2)and we call the node picked in set S(q) as νcan . After fixing the node ν inS(1)(query q1), we then pick nodes with p(ν, S(q)) in all other set S(q)(q 6=q1). When adding nodes to Snode, we also add all intermediate nodes thatare along the path p(ν, νcan). After scanning all remaining support sets, alladded nodes and paths form a connected subgraph G′can; line 13 and line 14275.3. Heuristic algorithmsin algorithm 3 calculate G′can’s diameter and edge weight sum. After cal-culating all possible solutions, algorithm 3 chooses the candidate subgraphwith maximum f value. Recall that we pre-computed all pairs shortest pathand use hash tables to store specific attributes(hops, edge weight sum etc.).The running time of algorithm 3 is O(c|S(1)| × |Q| + |S(1)| × |S(node)|2)where c is the average size(around 213) of support set, |S(1)| is the size ofrarest support set, averagely numbering 11. |Q| is the size of query set, rang-ing between [4,10], |S(node)| is the average number of nodes in generatedgraph, fewer than 12. |S(1)| × |S(node)|2 is the cost for calculating diame-ter (line 14) for each candidate subgraph. Algorithm 4 calculates a graph’sdiameter by exhaustively expanding each node outwards and recording thefurtherest step.Also, we pre-compute the shortest path(minimum hops) between nodes,and the total weight of this path.There may exist multiple shortest paths(samehops) between two nodes but with different total weight, we always choosethe one with maximum total weight.5.3.2 EdgeAlgorithm 5 (Edge) ranks S(q)(q ∈ Q) based on increasing size, starts withS(1)(support set with minimum size), and keeps adding edges from othersets following increasing size sequence. When picking edges to add to solu-tion set, instead of fixing one node in algorithm 3, Edge scans all nodes inthe solution set and choose the edge with the endpoint node µ that maxi-mizes the marginal gain of f . The example below explains how algorithm 5chooses the next node and how it benefits the result.Example 1. (Edge example) Assume edges have unit edge weight. Existedsubgraph S contains nodes a, b, c and edge (a, b) and (b, c). Candidate nodesare d, e. Figure 5.1(a) shows how node d connects to S, and figure 5.1(b)shows how node e connects to S. d and e are both only one hop away fromS. Yet, f(S ∪ e ∪ (b, e)) - f(S) = 0; f(S ∪ d ∪ (a, d)) - f(S) = -0.25(α =0.5). Even though solution 1 and 2 have the same influence on total edgeweight(both are three), the ratio of average weight and maximum weight (bothare one), solution 1 increases diameter by 1, and solution 2 still stays one.Thus, Edge will add e to existed subgraph S.As each diameter calculating costs |S(node)2|, the total complexity foralgorithm 5 is O(c|S(1)| × |S(node)|2) where c is the average size of supportset, |S(1)| is the minimum size of all support sets, and |S(node)| is theaverage number of nodes in generated solution.285.3. Heuristic algorithmsAlgorithm 3: The Rare algorithminput : Graph G(V, E); Query set Q; αoutput: A subgraph G(V’, E’)1 find support set for each query in Q ;2 for each query q ∈ Q do3 S(q) = {νi|q ∈ κi}4 rank all sets based on the size, and rename them to S(1) to S(|Q|);5 initiate node solution set Snode, path solution set Spath ;6 Snode ← φ;7 Spath ← φ;8 for each ν ∈ S(1) do9 Snode ← Snode ∪ ν;10 for 1 < j ≤ |Q| do11 νcan ← arg p(ν, S(j));12 Snode ← νcan ∪ {νpath(ν,νcan)} ∪ Snode;13 Spath ← path(ν, νcan) ∪ Spath;14 d← diameter(Snode, Spath);15 W ←∑e e ∈ Spath;16 f = α 1d + (1− α)W ;17 Snode ← φ; Spath ← φ;18 ν ′ ← argmaxf ;19 G′ = ν ′ ∪ S′node ∪ S′path5.3.3 Edge starDefinition 6. (Edge star) Edge star is a process of adding edges incidenton the existing nodes which will not decrease the value of objective functionf .Edge star is the first step of post-processing generated graph. Algorithm5 can generate a subgraph G′ = (V ′, E′) covering all query keywords andmaximizes f among all candidates. However, each addition in algorithm 5is one query node and one path connecting two query nodes(along with allintermediate nodes). Thus not all edges between pairs of nodes in V ′ arecovered by E′. To make G′ more compact, we propose Edge star(Algorithm6) to add back missing edges under the condition that this edge addition willnot decrease f value. As we don’t introduce new nodes to G′, G′’s diameter295.3. Heuristic algorithmsAlgorithm 4: The diameter calculating algorithminput : Subgraph G1(V1, E1)output: Diameter of G11 /* calculate diameter of G1; record the longest hop for each node */ ;2 map← φ ; d← −1 ;3 for each node ν ∈ V1 do4 νstep = −15 for each node ν ∈ V1 do6 mapν = 0; νstep = 0;7 queue← φ; queue← queue ∪ ν;8 while !queue.IsEmpty() do9 µ← queue.poll();10 if d < µstep then11 d← µstep12 if mapν < µstep then13 mapν ← µstep14 if Nν .size() > 0 then15 for ω ∈ Nν and ωstep = −1 do16 ωstep = µstep + 1;17 queue← queue ∪ ω;18 return d as the diamter;will not increase. The only possible way to decrease f is that added edgelowers down the ratio of average edge weight and maximum edge weight inG′, and does not decrease the diameter to compensate.To add back missing edges, we first get each node’s(ν) original neigh-bor set S1(line 2) and current neighbor set S2(line 3), and then any node(µ)which is in their difference set ({S1 − S2}) and also in V ′ can be ν’scandidate neighbor. If this addition does not decrease f , algorithm 6 willadd it to G′. Let c be the average candidate size for each node in V ′(whichis less than 1), and the complexity of f value check is |V ′|2 (all for diametercalculating), then the total complexity is c|V ′|2. G′’s average |V ′| size is6(see details in section 6).305.3. Heuristic algorithmsdegree_graph1adcb(a) Candidate solution 1degree_graph2aecb(b) Candidate solution 2Figure 5.1: An example for algorithm 55.3.4 Node starDefinition 7. (Node star) Node star is a process of adding node µ and edge(µ, ν) that connects µ to existing node ν if the addition will not decrease thevalue of objective function f .Node star aims to add more information to existed subgraph G1 =(V1, E1) without ruining the structure connectivity and semantic related-ness, in other words, decreasing the value of f . As introduced in section4.3, there are two types of nodes: query and intermediate nodes. We onlyadd intermediate nodes when they are on the path of connecting two querynodes. Therefore, if node top is intermediate node, Algorithm 7 doesn’t addit first, instead it puts top’s neighbors in the queue(see line 26, 27) that cre-ated in the beginning and see if it can connect two query nodes afterwards.On the other hand, when it is a query node, Algorithm 7 checks the pathconnecting top and G1 first as top may not be the direct neighbor of V1(seeline 13-17), and then calculate if f is decreased. It is possible that querynodes are quite a few steps away and that it will be expensive to exploretoo far. Therefore Algorithm 7 breaks when as many as d steps have beenexplored.5.3.5 RF* and EG*We combine Edge star and Node star as post-process of generated graphs.Normally Edge star and Node star contribute to each other and iterativelygrow the subgraph under the condition of not decreasing objective functionvalue. We stop the process when no new edge nor node can be added. RF*is algorithm Rare plus the iterative post-process, and EG* is Edge plus thepost-process.315.3. Heuristic algorithmsAlgorithm 5: The Edge algorithm for Problem 1input : Graph G(V, E); Query set Q; α(> 0)output: A subgraph H(V’, E’)1 /* find support set for each query in Q */ ;2 for each query q ∈ Q do3 S(q) = {νi|q ∈ κi}4 // rank all sets based on the size, and rename them to S(1) to S(|Q|);5 Snode ← φ;6 Spath ← φ;7 for each ν ∈ S(1) do8 Snode ← Snode ∪ ν;9 for 1 < j ≤ |Q| do10 for each µ ∈ S(node) do11 δ(µ) = f(Snode ∪ Spath ∪ µ ∪ e(µ, Snode))− f(Snode ∪ Spath)12 νcan ← argmin δν′∈S(j)(µ);13 Snode ← Snode ∪ {νpath(ν,νcan)} ∪ νcan;14 Spath ← path(ν, νcan) ∪ Spath;15 d← diameter(Snode, Spath);16 W ←∑e e ∈ Spath;17 f = αd+ (1− α)W ;18 Snode ← φ; Spath ← φ;19 ν ′ ← argmaxf ;20 G′ = ν ′ ∪ S′node ∪ S′path21 Output subgraph G′;325.3. Heuristic algorithmsAlgorithm 6: The Edge star algorithminput : Subgraph G’(V’, E’); neighbor informationoutput: Subgraph G1(V1, E1)1 for each node ν ∈ V’ do2 S1 ← Nν ;3 S2 ← N ′ν ;4 for each node µ ∈ {S1 − S2} do5 if µ ∈ V ′ then6 if adding µ will not decrease f then7 E′ ← E′ ∪ e(ν, µ);8 else9 continue10 V1 ← V ′;E1 ← E′;335.3. Heuristic algorithmsAlgorithm 7: The Node star algorithminput : Subgraph G1(V1, E1)output: Subgraph G′(V ′, E′)1 // add all neighbors to a queue;2 d← diameter(G1); step ← 0 ;3 q ← an empty queue ;4 for each node ν ∈ V1 do5 S1 ← Nν ; S2 ← N ′ν ;6 for each node µ ∈ {S1 − S2} do7 q.offer(µ)8 while !q.IsEmpty() do9 top← q.poll();10 prev ← top.prev;11 L← φ; P ← φ;12 if top is query node then13 while ! prev ∈ V1 do14 L← L ∪ top;15 P ← P ∪ (top, prev);16 top← prev;17 prev ← prev.prev;18 δ ← f(G1 ∪ L ∪ P ) - f(G1);19 if δ ≥ 0 then20 G1 ← G1 ∪ L ∪ P ;21 else if top is intermediate node then22 step← step+ 1;23 if step == d then24 break;25 Stop ← Ntop;26 for each node µ ∈ Stop do27 q.offer(µ);28 V1 ← V ′;E1 ← E′;34Chapter 6Experimental Results6.1 DatasetsThere are news datasets available online, such as Reuters-21578 12, andNew York Times Annotated Corpus13. However, these datasets are mostlyfrom only one source, and are processed(tagged) for specific purposes suchas text categorization and summarization. Also, we aim to find interestingevents with respect to user-input queries, so it is better that all news inthe dataset are from breaking news category instead of some regular newsreport. Because of these reasons and also the intention of getting morecurrent results, we subscribed to RSS feeds to collect three datasets fromdifferent categories.We collect Worldwide news from sixteen news websites(see table 6.1)from June 1st, 2013 until May 1st, 2014. Each feed provides different num-ber of news every day, approximately ten to thirty breaking news. Wecrawled news’ headlines, introduction, main text and corresponding times-tamps. The whole dataset is in English. As we only analyze text in ourproject, we get rid of all top news contains only slideshows, gallery, picturesor video. Distribution of news articles from different web sources is indicatedin the following table 6.1. As shown in table 6.1, three of news sources arein UK, one from Canada, one from Australia, and the rest are from US. Asit turned out most breaking news we crawled falls into the political cate-gory, we then crawled another dataset from Verge to technology category(see table 6.2).The Onion is unveiled, by The Boston Globe, to be a not a legitimatenews source but a website full of satire and fake news 14. As there are only1162 out of 210,206 news articles, we think it won’t have much influence onthe assertion graph we created and following event search on it.We created an assertion graph following all steps introduced in chap-12http://www.daviddlewis.com/resources/testcollections/reuters21578/13https://catalog.ldc.upenn.edu/LDC2008T1914http://www.boston.com/culturedesk/2013/03/06/the-onion-fake-globe-uncovers/t7S8gjs3Wsmzr2L8gC6ZOO/story.html356.2. Performance evaluationNews Sources Nation Number News Sources Nation NumberBBC UK 32430 CBC Canada 9273CBS US 14733 CNN US 17729Denver Post US 3416 Fox News US 5182Guardian UK 19130 Huffington Post US 18912NBC US 10968 News.com.au Australia 30431Reuters UK 11295 TheOnion US 1162Time US 11214 USA US 18869WorldNews US 30022 Washington Post US 3259Table 6.1: Worldwide dataset news distributionType News Sources Nation NumberTech Verge US 12187Table 6.2: Verge datasetter 3 and 4 for each dataset. The statistics for each dataset’s generatedtriples and corresponding assertion graph are presented in table 6.3. Itemsin “Connected” column specifies how many dense connected subgraphs thisassertion graph contains.Type News Assertions Nodes Edges ConnectedWorldwide 210,206 3,847,640 120,507 301,696 142Tech 12,187 241,217 15,143 23,062 24Table 6.3: Statistics for worldwide and verge dataset6.2 Performance evaluationThis section evaluates four algorithms Rare, RF*, Edge, and EG* on thestructural density and semantic relatedness of generated subgraphs. We re-port sugraph’s multiple features such as number of nodes, number of edgesand so on. Yet, we evaluate performance mostly based on diameter, weigh-tratio and objective function value. All algorithms are implemented in Javaand were run on a Linux server with Intel Xeon CPU X5570 (2.93GHz) and128GB RAM.366.2. Performance evaluation6.2.1 Query set generationEach query set consists of multiple keywords between which users are in-terested in exploring connections. To acquire more meaningful events withrespect to query sets, we constrain keywords in query set satisfying the fol-lowing conditions: 1) Q = {q1, q2, q3...}; K is the universal set of keywordsof total assertion graph G. Q ⊆ K. 2) Keywords in Q are related with eachother, meaning that for each pair of keywords (qi, qj), there is at least onepath between (ν, µ) where ν ∈ Si and µ ∈ Sj . Si and Sj are respectivelysupport sets for query qi and qj (see details in section 5.1).To generate legal query sets, we collect one keyword file for each denseconnected subgraph in assertion graph G, and rank those keywords basedon frequency. We then pick top three as fixed keywords and randomly pick afew(depending on how many keywords needed) from the remaining to forma query set.6.2.2 Parameter changingFigures 6.1(a) to 6.4(a) shows how generated graph G′ changes with α in-creasing from [0,1]. We input 142 query sets(each one with six key words)for Worldwide dataset and twenty-four query sets for Verge dataset, andcalculated the average value for every feature. Horizontally comparing allfeatures, figure 6.1(a), 6.1(b) and 6.1(c) have the similar trend when α = 0or α = 1 they have greater values than those when α in between. The reasonbehind this is because only weightratio or diameter is constraining the linearfunction when α = 0 or 1, thus more nodes and edges are added. And, thetotal edge weight and number of edges of G′ are dependent on the numberof nodes.0 0.3 0.5 0.7 1.0alpha02468101214number of nodesRareEdge RF*EG*(a) Number of nodes0 0.3 0.5 0.7 1.0alpha01020304050number of edgesRareEdgeRF*EG*(b) Number of edges0 0.3 0.5 0.7 1.0alpha0100200300400500total edge weightRareEdgeRF*EG*(c) Total edge weightFigure 6.1: Statistics of G′ when α ranges between [0,1]Diameter d, weightratio r and f value are three metrics we use to measure376.2. Performance evaluationthe algorithms’ performance recall that we defined f = α 1d + (1 − α) r.Figure 6.2(a) and figure 6.2(b) plot the diameter as a function of α. Withα increasing(d is getting more power over f), every algorithm except forRare outputs denser subgraph. As we can see, in Worldwide dataset(figure6.2(a)), EG* yields the smallest diameter for each α value (excluding α = 0),leading baseline Rare with about 12% to 26% decrease. As expected, RF*outperforms Rare (about 6% to 25% decrease), and EG* performs better thanEdge (about 6% to 14% decrease) as post process node star and edge starhelp adjust generated graph. Edge outperforms Rare by about 6% to 15% as the former takes f value change into consideration when adding nextnode to existed subgraph. However, for Verge Dataset(figure 6.2(b)), eventhough EG* still performs best, the gap between approaches is smaller thanthat of Worldwide dataset, even the largest gap is 6%. We will explain thereason in the next paragraph combining weightratio and f .0 0.3 0.5 0.7 1.0alpha0. RF*EG*(a) Worldwide0 0.3 0.5 0.7 1.0alpha01234diameterRareEdge RF*EG*(b) VergeFigure 6.2: Diameter of G′ when α ranges between [0,1]Figure 6.3(a) and figure 6.3(b) shows how weightratio r changes withincreasing α. Rare still has little change as it builds the subgraph by onlymeasuring shortest path between fixed node and candidate nodes. With nosurprise, EG* still outperforms other approaches by leading runner-up Edgeby around 3%, RF* by 10%, and Rare 20% in Worldwide Dataset. Combinedwith diameter’s influence, Edge also leads in the objective function value,separately leading Edge, RF* and Rare 8%, 9 % and 25% in figure 6.4(a).In comparison, Verge Dataset has bigger gap between algorithms in weigh-tratio (see figure 6.3(b)), which are 4%, 10%, and 36% between first threeapproaches and EG*, and yet smaller gap in objective function value(3%, 6%,20%). The reason behind this is that Verge Dataset is sparser than World-386.2. Performance evaluationwide, so when running node star and edge star, Verge Dataset tends toexpand to neighboring nodes so that the weightratio can compensate theloss of diameter as there are not available edges to add to existed subgraph.0 0.3 0.5 0.7 1.0alpha0. RF*EG*(a) Worldwide0 0.3 0.5 0.7 1.0alpha0. RF*EG*(b) VergeFigure 6.3: Weightratio of G′ when α ranges between [0,1]0 0.3 0.5 0.7 1.0alpha0. valueRareEdge RF*EG*(a) Worldwide0 0.3 0.5 0.7 1.0alpha0. valueRareEdge RF*EG*(b) VergeFigure 6.4: Objective function value of G′ when α ranges between [0,1]Figure 6.5(a) and figure 6.5(b) plots the average edge weight and max-imum edge weight trend with increasing α. We don’t use these features tomeasure algorithms’ performance, yet we can observe from the trend that αdoesn’t impact these features that much(except for maximum edge weightwhen α = 1). What causes this may be no matter which part is dominatingf , no noisy edge can be added as either d or r can constrain it.396.2. Performance evaluation0 0.3 0.5 0.7 1.0alpha01234567average edge weightRareEdge RF*EG*(a) Worldwide0 0.3 0.5 0.7 1.0alpha024681012maximum edge weightRareEdgeRF*EG*(b) VergeFigure 6.5: Average and maximum edge weight of G′ when α ranges between[0,1]6.2.3 Query set size changingTo evaluate how the size of query set influences results, we conduct exper-iments for all algorithms by varying query set size from four to ten. Thenumber of nodes, number of edges and total edge weight increase as ex-pected. Yet, the increase rate decreases as query size increases, such asthe number of nodes (figure 6.6(a)) when query size almost doubles whenquery size was four, while the rate drops to 25%, and 17% when query sizechanging from six to eight and eight to ten. What’s interesting that at mosteleven nodes are among the four approaches, which makes it still readablefor users. The trend still applies to the number of edges and total edgeweight(we omit them as they have similar trends as figure 6.6(a)).Figure 6.7(a), 6.8(a) and 6.9(a) present feature diameter, weighRatioand f value of G′, which are three metrics we use to measure algorithms’performance. EG* outperforms others in each plot, having the smallest di-ameter, largest weightratio(except for query size = 4), and the most f valuein both datasets. However, the gap between diameter in Worldwide is largerthan that of Verge dataset, and the gap between weightratio is the otherway around. Same reasoning to that of 6.2.2 applies here: when more key-words are input, Verge tends to add more nodes instead of adding moreedges between existed nodes. What’s also worth mentioning is that Edgeperforms better than RF* in Verge Dataset(smaller diameter, greater weigh-tratio, greater f value) because node start and edge star do not have asmuch influence as those in Worldwide dataset.Horizontally comparing, diameter increases with query size expanding.406.2. Performance evaluation4 6 8 10query size024681012number of nodesRareEdgeRF*EG*(a) Worldwide4 6 8 10query size024681012number of nodesRareEdgeRF*EG*(b) VergeFigure 6.6: Number of nodes in G′ when query size ranges between [4, 10]4 6 8 10query size01234diameterRareEdge RF*EG*(a) Worldwide4 6 8 10query size0123456diameterRareEdge RF*EG*(b) VergeFigure 6.7: Diameter of G′ when query size ranges between [4, 10]Yet, similar to trend in figure 6.6(a), the increase rate becomes smaller andalmost flattens out when query number changes from eight to ten. Weigh-tratio is quite steady during the change. Thus f values experiences a dropfrom four to six, and stays steady afterwards. These plots illustrate that ouralgorithm is robust enough to deal with large query set sizes, still being ableto output structurally dense subgraph and semantically related subgrapheven though quite a few keywords are input.6.2.4 Running timeTable 6.4 and 6.5 report the running time and steps of node star andedge star in EG* and RF* separately for Worldwide and Verge Dataset.416.2. Performance evaluation4 6 8 10query size0. RF*EG*(a) Worldwide4 6 8 10query size0. RF*EG*(b) VergeFigure 6.8: Weightratio of G′ when query size ranges between [4, 10]Still, the running time for each approach is the average time of separatelyrunning 142 and twenty-three query sets for each dataset, and α is set to0.5. With no surprise, Rare runs fastest among all approaches as it picks thenext node only depending on shortest path while Edge needs to calculatemarginal f gain for all candidates. Items in e*(E) and n*(E) respectivelyrepresent the count of steps in which Edge star and Node star actuallyadd edges and nodes to existed subgraph in algorithm EG*, and e*(R) andn*(R) are for RF*. We can observe from both tables that EG* has muchmore edge and node addition than RF*, and when query size is four, six andeight, Verge has more node addition and less edge addition than Worldwide,which verifies the explanation in section 6.2.3 that Verge tends to add morenodes to compensate for less edge addition.size Edge EG* e*(E) n*(E) Rare RF* e*(R) n*(R)4 102 110 1.90 1.67 11 13 0.58 0.226 103 109 2.57 2.18 7 12 0.99 0.418 110 130 2.86 2.29 3 15 1.11 0.4310 152 166 3.06 2.43 4 16 1.19 0.47Table 6.4: Running time(ms.) for worldwide datasetIn summary, our extensive experiments on news datasets show that ourapproaches are effective and efficient, generating much preferred results thanthe baseline. In particular, EG* outperforms all other algorithms.426.3. Case study4 6 8 10query size0. valueRareEdgeRF*EG*(a) Worldwide4 6 8 10query size0. valueRareEdgeRF*EG*(b) VergeFigure 6.9: Objective function value of G′ when query size ranges between[4, 10]size Edge EG* e*(E) n*(E) Rare RF* e*(R) n*(R)4 47 50 1.87 2 10 14 0.42 0.266 49 51 2.47 2.37 4 8 0.76 0.758 45 47 2.47 2.40 3 6 0.93 0.8710 139 145 2.07 2 7 28 1.23 0.76Table 6.5: Running time(ms.) for verge dataset6.3 Case studyThese experiments aim to show that our problem definition and correspond-ing algorithms can produce meaningful and reasonable results. As it is diffi-cult to evaluate the quality of results in a way that does not involve graph’sinternal features but semantics, we conducted a case study in which we pro-vide results of algorithm EG* for a few breaking-news-related query sets. Tobe more convincing, we pick three breaking news from Worldwide news, andone from Technology dataset. Chosen breaking news are 1) Snowden leak-ing classified documents15; 2) 2014 Ukrainian revolution16 and 3) MalaysiaAirlines Flight 370 disappeared 17. In this case study, we seek to find ifgenerated subgraph conveys related actions and is readable for users. Wepick query keywords we are interested in exploring, and the total query sets15http://en.wikipedia.org/wiki/Edward Snowden16http://en.wikipedia.org/wiki/2014 Ukrainian revolution17http://en.wikipedia.org/wiki/Malaysia Airlines Flight 370436.3. Case studyare as follows:Q1 = { Snowden, NSA, Russia, Chinese, United States, asylum}Q2 = { Crimea, Russia, Putin, Kiev, moscow, Obama, yanukovych}Q3 = { mh370, Malaysia, search, route, Chinese, Australia, Indian}We search Q1, Q2 and Q3 in Worldwide dataset, and Q4 in Vergedataset. The generated subgraphs are shown separately in figure 6.10, 6.11and 6.12. Resulting subgraphs have small diameter, high weightratio and fvalue as presented in table 6.6. The structure density and semantic related-ness of generated graph are very much dependent on the original assertiongraph and also input queries. Thus, results may demonstrate different pat-terns. For instance, nodes in figure 6.10 have multiple paths between eachother, while in figure 6.11 nodes are connected by a key node in the middleand there is only one path between pair of nodes.Case # of nodes # of edges diameter weightratio f valueQ1 12 16 4 0.89 0.57Q2 12 11 2 0.99 0.728Q3 6 10 2 0.98 0.727Table 6.6: Statistics of generated subgraphs446.3. Case studySnowdenSnowden is wanted for felory charges(2013/06/23)-32documents be shared by former NSA contractor Edward Snowden(2014/06/02)19Russia gives Snowden asylum in doubt(2013/08/01)-10The United States charged Snowden to an authorized person(2013-06/05)-5Hong Kong is part of China(2014/06/01)-13Edward Snowden hiding in Hong Kong has sought legal representation as he prepares to fight attempts to force him home for trial(2014/06/02)-11Snowden is charged for unauthorized communication of national defense information to an authorized person (2013/06/2)-28the NSA program had hacked major Chinese telecoms companies to access text messages and targeted China top Tsinghua University(2014/06/02)-5the United States to seek Snowden extradition(2013-06/01)-13The United States and Hong Kong signed an extradition treaty in 1998 under which scores of Americans have been sent back home to face trial(2013-06/02)-8Snowden was quoted by the Post(2013-06/02)-6Figure 6.10: Generated subgraph for “snowden” scenarioUkrainea culmination of his years of grievances are a knee-jerk reaction must by Putin(2014/03/05)-24Yanukovych decision prompted the protests which began in November(2014/02/24)9loyalty have inspired protests against the Kiev fascists(2013/03/05)-27The West has also offered financial support in Kiev(2013-03/08)-6Kerry and Lavrov spoke on Thursday(2014/04/02)-93Obama confers with key European allies(2014/03/08)-10Crimea airbase Ukrainian base be stormed by Russian troops(2013/03/05)-24wealthy Putin allies Bank Rossiya favoured bank to be frozen out of the dollar Crimea crisis(2014/03/05)-24Russia agrees to OSCE monitoring mission in Ukraine(2013-03/21)-29Russian troops also stormed a Crimean border control point that the tense standoff of the past week is growing more volatile(2013-03/08)-6Moscow has so far refused following the ousting of pro-Russian President Viktor Yanukovych(2013-04/08)-5The United States and Hong Kong signed an extradition treaty in 1998 under which scores of Americans have been sent back home to face trial(2013-06/02)-81058531281175Figure 6.11: Generated subgraph for “ukraine” scenario456.3. Case studySearchA spokesman said AMSA has not received reliable information indicating that Malaysian Airlines ' flight MH370 may have approached Australia or entered the Australian search and rescue region(2014/03/15)9The flight 's 153 Chinese passengers comprised two-thirds of the 239(2014/03/16)-8Malaysia 's Prime Minister has confirmed "deliberate actions " are behind its disappearance(2014/03/16)-32a source cited by Bloomberg news agency said the last satellite transmission has been traced somewhere to the west Perth to the Indian Ocean(2014/03/15)-9The hunt initially focused along the jet 's intended route(2013-03/15)-9Malaysia 's Defence Minister said the search zone continues to be in two areas(2013-03/16)-964475876Figure 6.12: Generated subgraph for “mh370” scenario46Chapter 7Conclusion and Future workIn this paper, we studied the problem of building a query-driven event searchengine which determines a densely-linked, semantically related and query-covering subgraph from a news information network built from raw newscorpus. We first introduced the methodology to build a news informationnetwork, an undirected and weighted graph, from raw news articles, andformally defined the Query-driven event search problem and developed a newobjective function with which rank candidate subgraphs. We proved thisproblem is NP-complete and proposed a few heuristics algorithms to solveit. In experiment with real-life news dataset, we evaluate our algorithms’performance by measuring output graph’s related features such as diameter,weight, number of edges etc. We also did a few case studies by inputing a fewevent-specific keywords, and examined if generated subgraphs make sense.The experiment shows that our problem definition has practical meaning bydemonstrating users the best option of how those queries are related witheach other and form an event with multiple actions. Performance studyalso indicates that our algorithms can generate answers which combines thestructure density and semantic relatedness.In future work, we want to extend our framework to apply to othertypes of graphs to explore more applications. For instance, if applied inDBLP dataset, it can be used for beginner graduate student to narrowdown k related papers covering areas (captured by keywords) he or she isinterested in. However, processing such queries may call for adaption toour current techniques. We need to carefully think about the data model,particularly the meaning of edges and edge weights. We also want to upgradeour framework to adjust on-line update, which means adding most currentnews article to existing news information network incrementally and enablesusers to get the most updated event with respect to queries.47Bibliography[1] Charu C. Aggarwal and Haixun Wang. Managing and Mining GraphData. Springer Publishing Company, Incorporated, 1st edition, 2010.[2] Reid Andersen and Kumar Chellapilla. Finding dense subgraphs withsize bounds. In WAW, pages 25–37, 2009.[3] Albert Angel, Nick Koudas, Nikos Sarkas, and Divesh Srivastava. Densesubgraph maintenance under streaming edge weight updates for real-time story identification. CoRR, abs/1203.0060, 2012.[4] Yuichi Asahiro, Kazuo Iwama, Hisao Tamaki, and Takeshi Tokuyama.Greedily finding a dense subgraph. J. Algorithms, 34(2):203–221, Febru-ary 2000.[5] P. Berkhin. A survey of clustering data mining techniques. GroupingMultidimensional Data, pages 25–71, 2006.[6] Alexander Budanitsky and Graeme Hirst. Semantic distance in word-net: An experimental, application-oriented evaluation of five measures.Workshop on WordNet and Other Lexical Resources, Second meetingof the North American Chapter of the Association for ComputationalLinguistics, Pittsburgh, USA, 2001.[7] Moses Charikar. Greedy approximation algorithms for finding densecomponents in a graph. In Proceedings of the Third International Work-shop on Approximation Algorithms for Combinatorial Optimization,APPROX ’00, pages 84–95, London, UK, UK, 2000. Springer-Verlag.[8] James Cheng, Yiping Ke, Wilfred Ng, and Jeffrey Xu Yu. Context-aware object connection discovery in large graphs. In Yannis E. Ioanni-dis, Dik Lun Lee, and Raymond T. Ng, editors, ICDE, pages 856–867.IEEE, 2009.[9] Dipanjan Das and Andr F. T. Martins. A survey on automatic textsummarization, 2007.48Bibliography[10] Yon Dourisboure, Filippo Geraci, and Marco Pellegrini. Extraction andclassification of dense communities in the web. In Proceedings of the16th International Conference on World Wide Web, WWW ’07, pages461–470, New York, NY, USA, 2007. ACM.[11] Gu¨nes Erkan and Dragomir R. Radev. Lexrank: Graph-based lexi-cal centrality as salience in text summarization. J. Artif. Int. Res.,22(1):457–479, December 2004.[12] Martin Ester, Hans peter Kriegel, Jrg S, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases withnoise. pages 226–231. AAAI Press, 1996.[13] Christos Faloutsos, Kevin S. McCurley, and Andrew Tomkins. Fastdiscovery of connection subgraphs. In Proceedings of the Tenth ACMSIGKDD International Conference on Knowledge Discovery and DataMining, KDD ’04, pages 118–127, New York, NY, USA, 2004. ACM.[14] David Gibson, Ravi Kumar, and Andrew Tomkins. Discovering largedense subgraphs in massive graphs. In Proceedings of the 31st Interna-tional Conference on Very Large Data Bases, VLDB ’05, pages 721–732.VLDB Endowment, 2005.[15] Hao He, Haixun Wang, Jun Yang, and Philip S. Yu. Blinks: Rankedkeyword searches on graphs. In Proceedings of the 2007 ACM SIGMODInternational Conference on Management of Data, SIGMOD ’07, pages305–316, New York, NY, USA, 2007. ACM.[16] Graeme Hirst and David St-Onge. Lexical chains as representations ofcontext for the detection and correction of malapropisms. In ChristianeFellbaum, editor, WordNet: An Electronic Lexical Database, pages 305–332. MIT Press, 1998.[17] Arvind Hulgeri and Charuta Nakhe. Keyword searching and browsingin databases using banks. In Proceedings of the 18th International Con-ference on Data Engineering, ICDE ’02, pages 431–, Washington, DC,USA, 2002. IEEE Computer Society.[18] J.J. Jiang and D.W. Conrath. Semantic similarity based on corpusstatistics and lexical taxonomy. In Proc. of the Int’l. Conf. on Researchin Computational Linguistics, pages 19–33, 1997.49Bibliography[19] Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudar-shan, Rushi Desai, and Hrishikesh Karambelkar. Bidirectional expan-sion for keyword search on graph databases. In Klemens Bhm, Chris-tian S. Jensen, Laura M. Haas, Martin L. Kersten, Per-ke Larson, andBeng Chin Ooi, editors, VLDB, pages 505–516. ACM, 2005.[20] Gjergji Kasneci, Shady Elbassuoni, and Gerhard Weikum. Ming: Min-ing informative entity relationship subgraphs. In Proceedings of the 18thACM Conference on Information and Knowledge Management, CIKM’09, pages 1653–1656, New York, NY, USA, 2009. ACM.[21] Theodoros Lappas, Kun Liu, and Evimaria Terzi. Finding a team ofexperts in social networks. In Proceedings of the 15th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining,KDD ’09, pages 467–476, New York, NY, USA, 2009. ACM.[22] C. Leacock and M. Chodorow. Combining local context and wordnetsimilarity for word sense identification. In Christiane Fellfaum, editor,MIT Press, pages 265–283, Cambridge, Massachusetts, 1998.[23] Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael Chambers,Mihai Surdeanu, and Dan Jurafsky. Deterministic coreference resolu-tion based on entity-centric, precision-ranked rules. Comput. Linguist.,39(4):885–916, December 2013.[24] Heeyoung Lee, Yves Peirsman, Angel Chang, Nathanael Chambers,Mihai Surdeanu, and Dan Jurafsky. Stanford’s multi-pass sieve coref-erence resolution system at the conll-2011 shared task. In Proceedingsof the Fifteenth Conference on Computational Natural Language Learn-ing: Shared Task, CONLL Shared Task ’11, pages 28–34, Stroudsburg,PA, USA, 2011. Association for Computational Linguistics.[25] Pei Lee, Laks V. S. Lakshmanan, and Evangelos E. Milios. Keysee:supporting keyword search on evolving events in social streams. InKDD, pages 1478–1481, 2013.[26] Dekang Lin. An information-theoretic definition of similarity. In InProceedings of the 15th International Conference on Machine Learning,pages 296–304. Morgan Kaufmann, 1998.[27] Gang Luo, Chunqiang Tang, and Ying li Tian. Answering relationshipqueries on the web. In Carey L. Williamson, Mary Ellen Zurko, Peter F.50BibliographyPatel-Schneider, and Prashant J. Shenoy, editors, WWW, pages 561–570. ACM, 2007.[28] Inderjeet Mani. Multi-document summarization by graph search andmatching. In In Proceedings of the Fifteenth National Conference onArtificial Intelligence (AAAI-97, pages 622–628. AAAI, 1997.[29] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schu¨tze.Introduction to Information Retrieval. Cambridge University Press,New York, NY, USA, 2008.[30] Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and OrenEtzioni. Open language learning for information extraction. In Proceed-ings of Conference on Empirical Methods in Natural Language Process-ing and Computational Natural Language Learning (EMNLP-CONLL),2012.[31] Yuval Merhav, Filipe Mesquita, Denilson Barbosa, Wai Gen Yee, andOphir Frieder. Extracting information networks from the blogosphere.ACM Trans. Web, 6(3):11:1–11:33, October 2012.[32] Filipe Mesquita, Jordan Schmidek, and Denilson Barbosa. Effective-ness and efficiency of open relation extraction. In Proceedings of the2013 Conference on Empirical Methods in Natural Language Process-ing, pages 447–457. Association for Computational Linguistics, October2013.[33] Ruslan Mitkov and Wolverhampton Wv Sb. Anaphora resolution: Thestate of the art. Technical report, 1999.[34] Evangelos E. Milios Pei Lee, Laks V.S. Lakshmanan. Incrementalcluster evolution tracking from highly dynamic network data. ICDE,abs/1309.7313, 2014.[35] Kira Radinsky and Eric Horvitz. Mining the web to predict futureevents. In WSDM, pages 255–264, 2013.[36] Anand Rajaraman and Jeffrey David Ullman. Mining of massivedatasets. Cambridge University Press, Cambridge, 2012.[37] Philip Resnik. Using information content to evaluate semantic similar-ity in a taxonomy. In In Proceedings of the 14th International JointConference on Artificial Intelligence, pages 448–453, 1995.51Bibliography[38] Mauro Sozio and Aristides Gionis. The community-search problem andhow to plan a successful cocktail party. In Proceedings of the 16th ACMSIGKDD International Conference on Knowledge Discovery and DataMining, KDD ’10, pages 939–948, New York, NY, USA, 2010. ACM.[39] Hanghang Tong and Christos Faloutsos. Center-piece subgraphs: Prob-lem definition and fast solutions. In Proceedings of the 12th ACMSIGKDD International Conference on Knowledge Discovery and DataMining, KDD ’06, pages 404–413, New York, NY, USA, 2006. ACM.[40] Ying Xu, Mi-Young Kim, Kevin Quinn, Randy Goebel, and Denil-son Barbosa. Open information extraction with tree kernels. In HLT-NAACL, pages 868–877, 2013.[41] Rui Yan, Liang Kong, Congrui Huang, Xiaojun Wan, Xiaoming Li, andYan Zhang. Timeline generation through evolutionary trans-temporalsummarization. In EMNLP, pages 433–443. ACL, 2011.52


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items