Discourse Analysis of Asynchronous Conversations

by

Shafiq Rayhan Joty

B.Sc., Islamic University of Technology, 2005
M.Sc., University of Lethbridge, 2008

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Doctor of Philosophy
in
THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

December 2013

© Shafiq Rayhan Joty, 2013

Abstract

A well-written text is not merely a sequence of independent and isolated sentences, but instead a sequence of structured and related sentences. It addresses a particular topic, often covering multiple subtopics, and is organized in a coherent way that enables the reader to process the information. Discourse analysis seeks to uncover such underlying structures, which can support many applications including text summarization and information extraction.

This thesis focuses on building novel computational models of different discourse analysis tasks in asynchronous conversations; i.e., conversations where participants communicate with each other at different times (e.g., emails, blogs). Effective processing of these conversations can be of great strategic value for both organizations and individuals. We propose novel computational models for topic segmentation and labeling, rhetorical parsing and dialog act recognition in asynchronous conversation. Our approaches rely on two related computational methodologies: graph theory and probabilistic graphical models.

The topic segmentation and labeling models find the high-level discourse structure; i.e., the global topical structure of an asynchronous conversation. Our graph-based approach extends state-of-the-art methods by integrating a fine-grained conversational structure with other conversational features.

On the other hand, the rhetorical parser captures the coherence structure, a finer discourse structure, by identifying coherence relations between the discourse units within each comment of the conversation. Our parser applies an optimal parsing algorithm to probabilities inferred from a discriminative graphical model which allows us to represent the structure and the label of a discourse tree constituent jointly, and to capture the sequential and hierarchical dependencies between the constituents.

Finally, the dialog act model allows us to uncover the underlying dialog structure of the conversation. We present unsupervised probabilistic graphical models that capture the sequential dependencies between the acts, and show how these models can be trained more effectively based on the fine-grained conversational structure.

Together, these structures provide a deep understanding of an asynchronous conversation that can be exploited in the above-mentioned applications. For each discourse processing task, we evaluate our approach on different datasets, and show that our models consistently outperform the state-of-the-art by a wide margin. Often our results are highly correlated with human annotations.

Preface

Portions of this thesis are based on prior peer-reviewed publications by me (under the name Shafiq Joty) and my collaborators. In the following, I describe the publications and the relative contributions of all contributors.

Chapter 2 is based on the article Topic Segmentation and Labeling in Asynchronous Conversations by Shafiq Joty, Giuseppe Carenini and Raymond Ng, published in the Journal of Artificial Intelligence Research (JAIR), volume 47, June 2013.
Portions of this work were previously published in two conference proceedings: (a) Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails by Shafiq Joty, Giuseppe Carenini, Gabriel Murray and Raymond Ng, in the proceedings of EMNLP, 2010; (b) Supervised Topic Segmentation of Email Conversations by Shafiq Joty, Giuseppe Carenini, Gabriel Murray and Raymond Ng, in the proceedings of ICWSM, 2011. My main contributions include: 1) designing and performing a user study to create two new corpora of asynchronous conversations (i.e., email and blog) annotated with topic information; 2) proposing and implementing novel topic segmentation and labeling models for asynchronous conversations; 3) implementing metrics to measure the performance of the computational models and inter-annotator agreement on the topic segmentation and labeling tasks; 4) carrying out the experiments on the two developed corpora; 5) preparing the manuscripts. Other collaborators played supervisory roles.

Chapter 3 is based on two conference papers: (a) A Novel Discriminative Framework for Sentence-Level Discourse Analysis by Shafiq Joty, Giuseppe Carenini and Raymond Ng, published in the proceedings of EMNLP-CoNLL, 2012; (b) Combining Intra- and Multi-Sentential Rhetorical Parsing for Document-Level Discourse Analysis by Shafiq Joty, Giuseppe Carenini, Raymond Ng and Yashar Mehdad, published in the proceedings of ACL, 2013. My main contributions include: 1) preparing the datasets (i.e., RST-DT and Instructional) for experiments on discourse segmentation, sentence-level discourse parsing and document-level discourse parsing; 2) proposing and implementing the computational models for discourse segmentation, sentence-level discourse parsing and document-level discourse parsing; 3) implementing the metrics used to measure the performance of the computational models and inter-annotator agreement on the discourse segmentation and parsing tasks; 4) carrying out the experiments; 5) preparing the manuscripts. Other collaborators played supervisory roles throughout the project.

Chapter 4 is based on the conference paper Unsupervised Modeling of Dialog Acts in Asynchronous Conversations by Shafiq Joty, Giuseppe Carenini and Chin-Yew Lin, published in the proceedings of IJCAI, 2011. My main contributions include: 1) preparing the datasets (i.e., W3C email and TripAdvisor forum) by removing noise (e.g., unnecessary system messages); 2) proposing and implementing the computational models for dialog act recognition; 3) carrying out the experiments; 4) preparing the manuscript. Other collaborators advised me.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
  1.1 Discourse and Its Structures
    1.1.1 Topical Structure
    1.1.2 Rhetorical Structure
    1.1.3 Dialog Acts and Conversational Structure
  1.2 Computational Methodologies
    1.2.1 Graph-based Methods for NLP
    1.2.2 Probabilistic Graphical Models
  1.3 Our Contributions
    1.3.1 Topic Segmentation and Labeling
    1.3.2 Rhetorical Analysis
    1.3.3 Dialog Act Modeling
2 Topic Segmentation and Labeling
  2.1 Introduction
  2.2 Related Work
    2.2.1 Topic Segmentation
    2.2.2 Topic Labeling
    2.2.3 Conversational Structure Extraction
  2.3 Topic Models for Asynchronous Conversations
    2.3.1 Topic Segmentation Models
    2.3.2 Topic Labeling Models
  2.4 Corpora and Metrics
    2.4.1 Data Collection
    2.4.2 Topic Annotation
    2.4.3 Evaluation (and Agreement) Metrics
  2.5 Experiments
    2.5.1 Topic Segmentation Evaluation
    2.5.2 Topic Labeling Evaluation
    2.5.3 Full System Evaluation
  2.6 Conclusion and Future Directions
3 Rhetorical Parsing
  3.1 Introduction
  3.2 Related Work
    3.2.1 Unsupervised Approaches
    3.2.2 Supervised Approaches
  3.3 Our Rhetorical Analysis Framework
  3.4 The Discourse Parser
    3.4.1 Parsing Models
    3.4.2 Parsing Algorithm
    3.4.3 Document-level Parsing Approaches
  3.5 The Discourse Segmenter
    3.5.1 Segmentation Model
    3.5.2 Features Used in the Segmentation Model
  3.6 Experiments
    3.6.1 Corpora
    3.6.2 Evaluation (and Agreement) Metrics
    3.6.3 Discourse Segmentation Evaluation
    3.6.4 Discourse Parsing Evaluation
  3.7 Conclusion
4 Dialog Act Recognition
  4.1 Introduction
  4.2 Related Work
  4.3 Corpora
    4.3.1 Dataset Selection and Clean Up
    4.3.2 Dealing with Conversational Structure
  4.4 Graph-theoretic Framework
    4.4.1 Algorithm Description
    4.4.2 Evaluation of the Graph-theoretic Model
  4.5 Probabilistic Conversational Models
    4.5.1 HMM Conversational Model
    4.5.2 HMM+Mix Conversational Model
    4.5.3 Initialization in EM
    4.5.4 Applying Conversational Models
  4.6 Experiments
  4.7 Conclusion and Future Work
5 Conclusion
  5.1 Prospective Applications
    5.1.1 Conversation Summarization
    5.1.2 Sentiment Analysis
    5.1.3 Information Extraction and Visualization
    5.1.4 Misc.
  5.2 Future Directions
Bibliography
A Supporting Materials
  A.1 Metrics for Topic Segmentation
    A.1.1 One-to-One Metric
    A.1.2 Loc-3 Metric
  A.2 Annotation Manual for Topic Segmentation and Labeling
    A.2.1 Instructions for Finding Topics in Emails
    A.2.2 Instructions for Finding Topics in Blogs
  A.3 EM for HMM+Mix model
    A.3.1 E step
    A.3.2 M step

List of Tables

Table 1.1  Examples of different forms of discourse.
Table 1.2  Discourse analysis tasks in different forms of discourse.
Table 1.3  Summary of existing work on topic segmentation.
Table 1.4  Computational methods used for different discourse analysis tasks. N-cut stands for Normalized cut, UGM stands for Undirected Graphical Model, and DGM stands for Directed Graphical Model.
Table 1.5  Graph-based methods applied to NLP tasks. WSD stands for Word Sense Disambiguation and MT stands for Machine Translation.
Table 1.6  Families of probabilistic graphical models.
Table 1.7  Examples of graphical models used in discourse analysis. PDTB stands for Penn Discourse Treebank [164].
Table 2.1  Performance of the classifiers using the full feature set (Table 2.2). For each training set, regularizer strength λ (or C in SVMs) was learned by 10-fold cross validation.
Table 2.2  Features with performance on test sets (using leave-one-out).
Table 2.3  Statistics on three human annotations per conversation.
Table 2.4  Annotator agreement in one-to-one and loc3 on the two corpora.
Table 2.5  Annotator agreement in many-to-one on the two corpora.
Table 2.6  Annotator agreement in w-m-o and w-s-m-o on the two corpora.
Table 2.7  Mean statistics of different model's annotation.
Table 2.8  Segmentation performance of the two best Baselines, Human and Models. In the Blocks of k column, k = 5 for email and k = 20 for blog.
Table 2.9  Mean weighted-mutual-overlap (w-m-o) scores for different values of k on two corpora.
Table 2.10  Mean weighted-semantic-mutual-overlap scores for different values of k on two corpora.
Table 2.11  Mean weighted-mutual-overlap (w-m-o) scores when the best of k labels is considered.
Table 2.12  Examples of Human-authored and System-generated labels.
Table 2.13  Examples of System-generated labels that are reasonable but get low scores.
Table 2.14  Performance of the end-to-end system and human agreement.
Table 3.1  Features used in our intra- and multi-sentential parsing models.
Table 3.2  Measuring parsing accuracy (P = Precision, R = Recall).
Table 3.3  Segmentation results of different models on the two corpora. Performances significantly superior to SPADE are denoted by *.
Table 3.4  Intra-sentential parsing results based on manual segmentation. Performances significantly superior to SPADE are denoted by *.
Table 3.5  Parsing results using automatic segmentation. Performances significantly superior to SPADE are denoted by *.
Table 3.6  Parsing results of different document-level parsers using manual (gold) segmentation. Performances significantly superior to HILDA (with p<7.1e-05) are denoted by *. Significant differences between TSP 1-1 and TSP SW (with p<0.01) are denoted by †.
Table 3.7  Parsing results using different subsets of features on the RST-DT test set. Feature subsets for intra-sentential parsing: I1 = {Dominance set}, I2 = {Dominance set, Organizational}, I3 = {Dominance set, Organizational, N-gram}, I4 = {Dominance set, Organizational, N-gram, Contextual}, I5 (all) = {Dominance set, Organizational, N-gram, Contextual, Sub-structural}. Feature subsets for multi-sentential parsing: M1 = {Organizational, Text structural}, M2 = {Organizational, Text structural, N-gram}, M3 = {Organizational, Text structural, N-gram, Lexical chain}, M4 = {Organizational, Text structural, N-gram, Lexical chain, Contextual}, M5 (all) = {Organizational, Text structural, N-gram, Lexical chain, Contextual, Sub-structural}.
Table 4.1  Dialog act tags and their relative frequencies in the two corpora.
Table 4.2  One-to-one accuracy for different similarity metrics in the graph-theoretic framework.
Table 4.3  Mean one-to-one accuracy for various models on the two corpora.

List of Figures

Figure 1.1  Sample email conversation from the BC3 email corpus [213].
Figure 1.2  The email conversation of Figure 1.1 with topic annotations.
Figure 1.3  Discourse tree for two sentences. Each sentence contains three EDUs. Horizontal lines indicate text segments; satellites are connected to their nuclei by curved arrows.
Figure 1.4  Discourse tree for the first sentence of the fifth email in Figure 1.1.
Figure 1.5  The email conversation of Figure 1.1, now annotated with dialog acts (DAs). The right most (DA) column specifies the dialog acts for the sentences: S stands for Statement, QY stands for Yes-no question, A stands for Accept response, and AM stands for Action motivator. The Frag. column specifies the fragments in the Fragment Quotation Graph (FQG) described in Section 1.1.3.
Figure 1.6  (a) The email conversation of Figure 1.5 with its fragments. The real contents are abbreviated as a sequence of symbols. Arrows indicate reply-to links. (b) The corresponding Fragment Quotation Graph.
Figure 2.1  Sample truncated email conversation from our email corpus. Each color indicates a different topic. The right most column specifies the topic assignments for the sentences.
Figure 2.2  Sample truncated blog conversation from our blog corpus. Each color indicates a different topic. The right most column (Topic) specifies the topic assignments for the sentences. The Fragment column specifies the fragments in the fragment quotation graph (see Section 2.3.1).
Figure 2.3  Graphical model for LDA in plate notation.
Figure 2.4  (a) The main Article and the Comments with the fragments for the example in Figure 2.2. Arrows indicate "reply-to" relations. (b) The corresponding Fragment Quotation Graph (FQG).
Figure 2.5  (a) Sample word network. (b) A Dirichlet Tree (DT) built from such word network.
Figure 2.6  Relative importance of the features averaged over leave-one-out.
Figure 2.7  Error rate vs. number of training conversations.
Figure 2.8  Topic labeling framework for asynchronous conversation.
Figure 2.9  Percentage of words in the human-authored labels appearing in leading sentences of the topical segments.
Figure 2.10  Three sub-graphs used for co-ranking: the fragment quotation graph GF, the word co-occurrence graph GW, and the bipartite graph GFW that ties the two together. Blue nodes represent fragments, red nodes represent words.
Figure 3.1  Discourse tree for two sentences in RST-DT. Each of the sentences contains three EDUs. The second sentence has a well-formed discourse tree, but the first sentence does not have one.
Figure 3.2  Rhetorical analysis framework.
Figure 3.3  Distributions of six most frequent relations in intra-sentential and multi-sentential parsing scenarios.
Figure 3.4  A chain-structured DCRF as our intra-sentential parsing model.
Figure 3.5  Our parsing model applied to the sequences at different levels of a sentence-level discourse tree. (a) Only possible sequence at the first level, (b) Three possible sequences at the second level, (c) Three possible sequences at the third level.
Figure 3.6  A CRF as a multi-sentential parsing model.
Figure 3.7  Dominance set features for intra-sentential discourse parsing.
Figure 3.8  Correlation between lexical chains and discourse structure. (a) Lexical chains spanning paragraphs. (b) Two possible DT structures.
Figure 3.9  Extracting lexical chains. (a) A Lexical Semantic Relatedness Graph (LSRG) for five noun-tokens. (b) Resultant graph after performing WSD. The box at the bottom shows the lexical chains.
Figure 3.10  The S and R dynamic programming tables (left), and the corresponding discourse tree (right).
Figure 3.11  Two possible DTs for three sentences.
Figure 3.12  Extracting sub-trees for S2.
Figure 3.13  A hypothetical system-generated discourse tree for the two sentences in Figure 3.1.
Figure 3.14  Measuring the accuracy of a rhetorical parser. (a) The human-annotated discourse tree. (b) The system-generated discourse tree.
Figure 3.15  Confusion matrix for relation labels on the RST-DT test set. Y-axis represents true and X-axis represents predicted relations. The relations are Topic-Change (T-C), Topic-Comment (T-CM), TextualOrganization (T-O), Manner-Means (M-M), Comparison (CMP), Evaluation (EV), Summary (SU), Condition (CND), Enablement (EN), Cause (CA), Temporal (TE), Explanation (EX), Background (BA), Contrast (CO), Joint (JO), Same-Unit (S-U), Attribution (AT) and Elaboration (EL).
Figure 4.1  Sample truncated email conversation from our BC3 corpus. The right most column (i.e., DA) specifies the dialog act assignments for the sentences. The DA tags are defined in Table 4.1. The Fragment column specifies the fragments in the Fragment Quotation Graph (FQG) (see Figure 4.2).
Figure 4.2  (a) The email conversation of Figure 4.1 with the fragments. Arrows indicate "reply-to" relations. (b) The corresponding FQG.
Figure 4.3  HMM conversational model.
Figure 4.4  HMM+Mix conversational model.
Figure A.1  Computing one-to-one accuracy.
Figure A.2  Computing loc3 accuracy.

Acknowledgments

First and foremost, I thank Allah, the almighty, for giving me the strength to carry on my post graduate studies and for blessing me with many great people who have been my greatest support in both my professional and personal life.

This thesis is the culmination of an exciting five-year adventure. For that I'd like to thank my advisor Dr. Giuseppe Carenini, whose door was always open for me. I'm eternally grateful to him for being a great teacher and mentor, for inspiring me to explore new ideas and for helping me refine my incoherent ideas. My co-advisor, Dr. Raymond Ng, shared his wealth of experience and guided me to be better at all aspects of being a researcher. I really enjoyed my Ph.D. experience at UBC.

I'd also like to thank Dr. Nando De Freitas for serving on my supervisory committee, my university examiners Dr. Rachel Pottinger and Dr. Carson Woo, and my external examiners Dr. Amanda Stent and Dr. Carolyn Rose for their insightful comments on my thesis.

I'm indebted to my friend Jackie Cheung from the University of Toronto and to the fellow members of the NLP group at UBC, Gabriel Murray, Yashar Mehdad, Shima Gerani and Shama Rashid. The conversations, ideas, feedback and support from them have been invaluable over the years.

I'm also very grateful to the NSERC Canada Graduate Scholarship (CGS-D) and the NSERC BIN Strategic Network for their funding support during my Ph.D. studies.

Finally, thanks to my parents Mr. and Mrs. Zaman and my sister Sonia Arju for their unconditional love and support.
Many thanks to Sumaiya Sabira, my beloved wife, for always being there with me.

Chapter 1

Introduction

With the ever increasing popularity of Internet technologies, it is very common nowadays for people to discuss events, issues, tasks and personal experiences by writing in a growing number of social media including emails, blogs, Facebook and Twitter [16, 216]. These are examples of written asynchronous conversations where participants communicate with each other at different times.

The huge amount of textual data generated every day in these conversations calls for automated methods of conversational text analysis. Effective processing of these conversational texts can be of great strategic value for both organizations and individuals [36]. For instance, conversations in public blogging services like Twitter have become invaluable sources of information. During a natural disaster like Hurricane Sandy, affected people sent tweets to request food, shelter, medicine etc. In response to that, many people and organizations also tweeted to offer help. Humanitarian organizations (e.g., the United Nations, government officials) can mine these tweets to carry out aid activities more effectively [214].

In the very different scenario of business intelligence, corporate managers might find the information exchanged in their email conversations and company blogs to be extremely useful for decision auditing. If a decision turns out to be favorable, mining the relevant conversations may help in identifying effective communication patterns and sources. Similarly, conversations that led to ill-advised decisions could be mined to determine responsibility and accountability. To compete in markets, business users are now facing an unprecedented need to effectively process the online conversations taking place in various social media [217]. They need to uncover not just keywords, but vital consumer sentiment and insights about their company, products, competitors and more.

Conversations in public blogs (e.g., Twitter, Slashdot (http://slashdot.org/)) often get very large, containing overall thousands of comments. During major events such as a political uprising in the Arab world, relevant messages are posted by the thousands or millions. It is simply not feasible to read all messages relevant to such an event, and so mining and summarization technologies can help to provide an overview of what people are saying and what positive or negative opinions are being expressed. Mining social media can reveal a nation's perspectives on national events and issues (e.g., presidential elections, abortion rights). Mining and summarizing also improve indexing and searching [119].

On a more personal level, an informative summary of a conversation could greatly support a new participant to get up to speed and join an already existing conversation. It could also help someone to quickly prepare for a follow-up discussion of a conversation she was already part of, but which occurred too long ago for her to remember the details.

Although the number of applications targeting conversations in social media is growing, most of these applications are not currently using sophisticated Natural Language Processing (NLP) techniques. There could be two reasons for this. First, although NLP technologies like syntactic parsing and part-of-speech (POS) tagging have attained performances close to that of humans, many others (e.g., discourse parsing, word sense disambiguation) are still far below the human standard, and are not sufficiently accurate to support downstream applications.
Second, most of these technologies were originally developed for monologs (e.g., news articles) and are not as effective when applied directly to asynchronous conversations because the two types of genres are different in many aspects.

This thesis focuses on building novel computational models of several discourse analysis tasks in asynchronous conversations, which can support a wide range of NLP applications including conversation summarization, conversation visualization, sentiment analysis and information extraction [226]. In particular, we propose novel models for topic segmentation and labeling, discourse (rhetorical) parsing and dialog act recognition in asynchronous conversations. We do so by extending the monolog-based models to consider the conversation specific features of asynchronous conversations and by addressing the key limitations of the existing models to improve them further. Our approaches rely on two related computational methodologies: recent graph-theoretic methods for NLP [139] and advanced probabilistic graphical models [108].

Although discourse analysis has been well-studied in monolog and in synchronous dialog (e.g., meetings), a comprehensive study of discourse analysis in asynchronous conversation is still missing. For many discourse analysis tasks on asynchronous conversations (e.g., topic segmentation and labeling), there are no standard corpora, no annotation guidelines and no established evaluation or agreement metrics available. Also, since asynchronous conversations are quite different from monologs and synchronous dialogs, discourse analysis models that are successful in monologs or in synchronous dialogs may not be as effective when applied directly to asynchronous conversations. In this thesis, we overcome these limitations by developing new annotated corpora, and by proposing new computational models for discourse analysis in asynchronous conversation. Additionally, in the generic task of discourse parsing where the performance of existing systems is still far away from the human standard, we substantially reduce the performance gap by addressing their key limitations.

Like any other discourse, an asynchronous conversation discusses a common topic, often covering multiple subtopics. In addition, each message in a conversation locally constitutes a coherent monolog by connecting its sentences logically; i.e., the meaning of a sentence relates to the previous ones. Furthermore, being a conversation, asynchronous conversation exhibits a conversational structure.

The topic segmentation and labeling models find the high-level discourse structure; i.e., the global topical structure of an asynchronous conversation. Our unsupervised approach to topic segmentation extends state-of-the-art models by considering a fine-grained structure of the conversation. Our supervised topic segmentation model combines lexical, conversational and topic related features in a graph-theoretic framework. Our (unsupervised) topic labeling models capture conversation specific clues in a graph random walk framework.

On the other hand, the discourse parser captures the local coherence structure, a finer discourse structure, by identifying coherence relations between the discourse units within each comment.
Our parser applies an optimal parsing algorithm to probabilities inferred from a discriminative undirected graphical model, which represents the structure and the label of a discourse tree constituent jointly and captures the sequential and hierarchical dependencies between the constituents.

Finally, the dialog act model allows us to uncover the underlying dialog structure of the conversation. We present unsupervised generative graphical models that capture the sequential dependencies between the dialog acts, and show how these models can be trained more effectively based on the fine-grained conversational structure.

Together, these structures provide a deep understanding of asynchronous conversations that can be effectively exploited in the above-mentioned NLP applications. For each analysis task, we evaluate our approach on different datasets, and show that our models consistently outperform the state-of-the-art by a wide margin. Remarkably, our results are often highly correlated with human annotations.

In the rest of this introduction, in Section 1.1 we give an overview of the discourse analysis tasks in different forms of discourse, and identify the technical challenges in performing these tasks on asynchronous conversations. In Section 1.2 we discuss the computational methodologies used to tackle the technical challenges. Finally, in Section 1.3 we summarize our key contributions with an outline of the dissertation.

1.1 Discourse and Its Structures

Although many NLP tasks treat texts at the sentence level (e.g., syntactic or semantic parsing [100]), sentences rarely stand on their own in an actual discourse; it is simply not enough to gather some arbitrary sentences to obtain a discourse. Rather, the relationships between sentences carry important information which allows the discourse to express a meaning as a whole beyond the sum of its separate parts.

Two complementary aspects of discourse work together to make it interpretable as a whole. The first aspect is coherence, which logically binds the sentences together: the meaning of a sentence is connected to the meaning of the previous and the following ones. Without this, a text is just a sequence of non-sequiturs. For instance, consider the following two examples from [87]:

• John took a train from Paris to Istanbul. He has family there.
• John took a train from Paris to Istanbul. He likes spinach.

While it is easy to process the first text, most readers will have difficulties in understanding the second one. The reader will either reject the second text simply calling it "incoherent" or spend some time to construct an explanation of what liking spinach has to do with taking a train from Paris to Istanbul. By asking this, the reader is actually questioning the coherence of the text. The second aspect is cohesion, which is the usage of linguistic glue to link the textual units [79]. Cohesion is achieved by using word repetitions and semantically similar words, such as synonyms, hypernyms and hyponyms (known as lexical cohesion), as well as by using other linguistic devices including coreferences (see the examples above).

Table 1.1 shows examples of different forms of discourse we encounter in our daily life. The thesis you are reading is an example of a monolog where a writer (speaker) writes (speaks) something to be read (heard) by a reader (hearer).
The flow of information in monologs is unidirectional; i.e., from the writer to the reader. After you finish reading the thesis, you may have a face-to-face (or phone, email) conversation with your colleague about it. The conversation involves interchanging roles between being a speaker (writer) and hearer (reader). This type of discourse is called a dialog (or conversation). Unlike monologs, the communication flow in dialogs is bidirectional and participants perform different types of communicative acts: asking questions, giving answers, requesting something, and so forth. A dialog involving more than two participants is called a multi-party conversation.

Conversations can be further categorized into two groups: synchronous and asynchronous. Synchronous conversations are those where participants communicate with each other at the same time (e.g., phone conversations). Turns in synchronous conversations are usually short, containing only a few utterances. Before the Internet revolution, this type of conversation was the dominant form of human conversation. However, with the rise of the Internet and web technologies, people now converse by writing in a growing number of social media including emails, blogs, fora and so on. These are called asynchronous conversations because the communication happens at different times. In contrast to synchronous conversations, replies in asynchronous conversations can be made days later. The length of the replies varies from one to a few hundred sentences. For example, while tweets are limited to only 140 characters, comments in political and technology-related blogs (e.g., AMERICAblog (http://americablog.com), Slashdot) are much longer.

            Monolog                       Dialog (Conversation)
                                          Synchronous                      Asynchronous
  Written   articles, books               instant messaging,               emails, blogs,
                                          Internet relay chat              fora, Facebook
  Spoken    lectures, talks, speeches,    face-to-face dialog, meetings,   voicemail,
            news broadcast                phone conversations,             video logs
                                          human-computer dialog

Table 1.1: Examples of different forms of discourse.

What are the structures in a discourse? Speakers in a discourse discuss a common topic, often covering multiple subtopics. For example, an email conversation about arranging a conference may discuss conference schedule, organizing committee, accommodation, attractions, registration and so on. In other words, a discourse has a topical structure. We define topic more precisely in the next section. In a discourse, speakers refer to something they have talked about before; i.e., a discourse uses coreference (e.g., anaphora) to refer to the same thing. In addition, the sentences in a discourse are logically connected: the meaning of a sentence relates to that of the previous ones; that is, a discourse has a coherence structure. Furthermore, in a conversational discourse, participants interact with each other performing different dialog acts (e.g., asking or answering questions). A conversation exhibits a conversational structure, which comprises the dialog acts and the reply structure (i.e., who is talking to whom in multi-party dialogs).
Understanding a discourse implies uncovering these structures. Table 1.2 summarizes the analysis tasks for different forms of discourse. This thesis focuses on building computational models to uncover the topical structure, the rhetorical structure and the dialog act structure of an asynchronous conversation. In the following, we give a general overview of these three structures of discourse, the technical challenges and the previous work in uncovering these structures. Readers interested in knowing about co-reference resolution are encouraged to see [198].

                 Topic Seg.    Discourse   Reference    Dialog acts & Reply
                 & Labeling    parsing     resolution   structure extraction
  Monolog        X             X           X            N/A
  Sync. Conv.    X             X           X            X
  Async. Conv.   X             X           X            X

Table 1.2: Discourse analysis tasks in different forms of discourse.

1.1.1 Topical Structure

Topical structure is the high-level structure of a discourse. According to Webber et al. [226], each topic comprises a set of entities and things being said about them. A topic can be characterized by the subject matter it addresses. For example, consider the following text from BBC (www.bbc.co.uk/news/world-us-canada-23999066):

• Mr. Obama faces a tough week of trying to persuade Congress to authorise military action in response to Syria's alleged use of chemical weapons. He will also seek public support for the action in a White House address on Tuesday before the Congress finally votes.

This paragraph addresses Mr. Obama's attempts to authorize a military action in Syria. Here the entities consist of Mr. Obama, Congress, military action, Syria, chemical weapon and White House. The set of entities usually changes from topic to topic. For example, consider the following paragraph from the same document but which appears later in the text.

• Russia restated its opposition to any strike at the G20 summit, with Mr Putin warning that military intervention would destabilise the region. Both Russia and China, which have refused to agree to a UN Security Council resolution against Syria, insist any military action without the UN would be illegal.

This talks about the response from Russia and China on the army intervention. Here the entities are Russia, G20 summit, Mr. Putin, military intervention, China, UN Security Council resolution, Syria and UN.

Any discourse containing more than a few sentences is likely to cover multiple topics. (Considering that a discourse addresses a common topic, here "topics" actually means "subtopics".) For example, Hearst [83] reports that about 40% of the paragraphs in her corpus of expository texts start with a new topic. Automatically determining the topic structure involves two subtasks: topic segmentation and topic labeling. In general, topic segmentation is the task of grouping the sentences of a discourse into a set of topical segments (or clusters). If the discourse is a monolog as in the above example or a synchronous dialog (e.g., meetings), then this task can be rephrased as: separating a discourse into a sequence of topical segments. However, we will see that because of the asynchronous nature, topics in asynchronous conversations often do not change in a sequential way.

Once we have the topical segments, topic labeling is the task of assigning a short informative description to each of the topical segments to facilitate interpretations of the topics [36]. For instance, ideal topic labels for the two texts above could be Mr. Obama's attempts to authorize a military action in Syria and The response from Russia and China on the army intervention.

Topic segmentation is often considered as an essential preprocessing step for other finer-level discourse analysis [14] and has been utilized in many NLP applications including text summarization [56, 104] and information extraction [3]. Topic labels provide a high-level concise summary of the discourse. However, both topic segmentation and labeling are considered to be difficult and unsolved problems. As mentioned before, as the topics become more fine-grained, topic segmentation and labeling could be a hard task even for humans.
The difficulties also vary in different text domains. For example, dialogs pose a different set of challenges from monologs. Inside monologs, spoken monologs (e.g., talks) are more challenging than written edited monologs (e.g., articles), which typically come already organized in sections and paragraphs reflecting the topical structure [36].

Topic Segmentation

Previous work on topic segmentation targeted mainly monologs and synchronous dialogs [165]. In these two forms of discourse, topic segmentation refers to the task of separating the discourse into sequential topical segments. Several unsupervised and supervised methods have been proposed. Table 1.3 summarizes them. The unsupervised models exploit the strong correlation between topic and lexical usage. These models can be categorized into two broad classes based on their underlying intuitions: similarity-based models and probabilistic generative models.

  Category             Example models                         Type          Features
  Lexical similarity   TextTiling [83], C'01 [44],            Unsupervised  Lexical similarity
                       LCSeg [74], M&B [121]
  Generative models    LDAs [61, 167], HMMs [27, 233]         Unsupervised  Lexical distribution, cues, speaker
  Feature based        Exponential [21], Decision tree [74]   Supervised    Lexical, cues, speaker, silence, etc.

Table 1.3: Summary of existing work on topic segmentation.

The key intuition behind similarity-based segmentation models is that sentences in a segment are more lexically similar to each other than to sentences in the preceding or the following segment. The technical challenges addressed by this class of models are: (1) how to measure lexical similarity, and (2) how to use the similarity measures to perform topic segmentation.

A typical method to measure lexical similarity between two texts is to first represent each text as a vector, and then compute the cosine angle between the vectors. The existing approaches differ in how they represent the texts as vectors. For example, Hearst [83] uses term frequency (TF) in TextTiling, Malioutov and Barzilay [121] (M&B in Table 1.3) use TF and inverse document frequency (IDF) [178], Choi et al. [44] (C'01 in the table) use latent semantic analysis (LSA), and Galley et al. [74] use a metric computed from lexical chains [143] in LCSeg.

Different models use different algorithms to perform segmentation using the lexical similarity scores. For example, Hearst [83] and Galley et al. [74] use a cutoff threshold on the similarity valley (i.e., a plot of the lexical similarity scores), Choi et al. [44] use divisive clustering (also known as top-down clustering), and Malioutov and Barzilay [121] use minimum cut (graph-based) clustering.
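To make the similarity-valley intuition concrete, the following minimal Python sketch (our illustration, not the exact TextTiling or LCSeg procedure) represents the sentences around each candidate boundary as term-frequency vectors, scores every gap by the cosine similarity of the two windows, and places a boundary wherever the score dips more than one standard deviation below the mean; the window size and the cutoff rule are arbitrary choices made only for this example.

    from collections import Counter
    from math import sqrt

    def tf_vector(sentences):
        """Term-frequency vector for a window of whitespace-tokenized sentences."""
        return Counter(w.lower() for s in sentences for w in s.split())

    def cosine(v1, v2):
        dot = sum(v1[w] * v2[w] for w in v1 if w in v2)
        norm = sqrt(sum(c * c for c in v1.values())) * sqrt(sum(c * c for c in v2.values()))
        return dot / norm if norm else 0.0

    def segment(sentences, window=3):
        """Return the indices i such that a topic boundary is placed before sentence i."""
        gaps = []
        for i in range(1, len(sentences)):
            left = tf_vector(sentences[max(0, i - window):i])
            right = tf_vector(sentences[i:i + window])
            gaps.append(cosine(left, right))
        if not gaps:
            return []
        mean = sum(gaps) / len(gaps)
        std = sqrt(sum((g - mean) ** 2 for g in gaps) / len(gaps))
        # A valley noticeably deeper than average is taken as a topic shift.
        return [i for i, g in enumerate(gaps, start=1) if g < mean - std]

Real systems replace the raw term frequencies with TF.IDF weights, LSA vectors or lexical-chain scores, and the ad hoc cutoff with the smarter segmentation algorithms cited above.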
Note that the similarity-based segmentation models only exploit the lexical cohesion phenomena of a discourse. However, a discourse can have other features that are useful for topic segmentation. For example, cue phrases provide domain-specific clues for topic shifts in broadcast news [21] (e.g., coming up, joining us now) and in meetings [74] (e.g., anyway, so). Later, we will see that conversational structure can be a very useful feature for segmenting asynchronous conversations.

Probabilistic generative models form another class of unsupervised segmentation models, which is based on the intuition that a discourse is a hidden sequence of topics, each of which has its own characteristic word distribution. The distribution changes with the change of a topic. Topic segmentation in these models is the task of inferring the most likely sequence of topics given the observed words. Variants of Hidden Markov Models (HMMs) (e.g., [27, 233]) and Latent Dirichlet Allocations (LDAs) [26] (e.g., [61, 167]) have been proposed. Although earlier generative models [26, 27, 167] are based on only lexical distributions, recent models incorporate some domain-specific features. For example, Eisenstein and Barzilay [62] incorporate cue phrases, and Nguyen et al. [153] incorporate speaker identity for segmenting meetings. As new forms of discourse (e.g., asynchronous conversation) are created and become popular, we face new challenges of incorporating their distinctive features (e.g., conversational structure) into these probabilistic models.
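To illustrate the generative view with a toy example (again our sketch, not any of the cited models), the snippet below Viterbi-decodes the most likely hidden topic sequence under a simple HMM whose states are topics with unigram emission distributions and a single topic-switching probability; a change in the decoded topic between consecutive sentences is a segment boundary. The smoothing constant and the simplified transition model are assumptions made for brevity.

    import math

    def viterbi_segment(sentences, topics, switch_prob=0.1):
        """Most likely topic assignment per sentence under a simple topic HMM.

        topics: dict mapping topic id -> {word: probability} unigram model.
        Unknown words receive a small smoothing probability (an assumption of this sketch).
        """
        ids = list(topics)

        def emit(topic, sentence):
            return sum(math.log(topics[topic].get(w.lower(), 1e-6)) for w in sentence.split())

        stay = math.log(1.0 - switch_prob)
        switch = math.log(switch_prob)

        # scores[t] = best log-probability of any topic path ending in topic t.
        scores = {t: emit(t, sentences[0]) - math.log(len(ids)) for t in ids}
        back = [{t: None for t in ids}]
        for sent in sentences[1:]:
            new_scores, pointers = {}, {}
            for t in ids:
                prev = max(ids, key=lambda p: scores[p] + (stay if p == t else switch))
                new_scores[t] = scores[prev] + (stay if prev == t else switch) + emit(t, sent)
                pointers[t] = prev
            scores, back = new_scores, back + [pointers]

        # Backtrace from the best final state to recover the topic sequence.
        best = max(ids, key=scores.get)
        path = [best]
        for pointers in reversed(back[1:]):
            path.append(pointers[path[-1]])
        return list(reversed(path))

In the real models, the topic distributions, the number of topics and the transition structure are learned from data rather than supplied by hand.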
If we have enough labeled data for training (i.e., a corpus annotated with topic segments), then a supervised approach can be used to combine a large number of features and optimize their relative weights. One can use a binary classifier (e.g., Support Vector Machines (SVMs) [49]) or a sequence labeler (e.g., Conditional Random Fields (CRFs) [110]) to make a yes-no boundary decision between any two consecutive sentences. The feature set can include lexical similarity scores used in the unsupervised models, cue phrases and other domain-specific features. In the supervised framework, the challenge is to find the right set of features and the right way to model the problem for a particular discourse. For example, as described below, topic segmentation in asynchronous conversation cannot be reduced to a yes-no boundary decision and therefore requires a more sophisticated model.
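For monologs and synchronous dialogs, where the yes-no boundary formulation does apply, a minimal version of this supervised setup could look as follows. This sketch assumes scikit-learn is available and uses a deliberately tiny, made-up feature set and training sample; it is not the feature set or classifier configuration used later in this thesis.

    from sklearn.linear_model import LogisticRegression

    CUE_PHRASES = ("anyway", "so", "coming up", "joining us now")  # illustrative cue list only

    def gap_features(next_sentence, similarity_score):
        """Features describing the gap immediately before `next_sentence`."""
        lowered = next_sentence.lower()
        return [
            similarity_score,                                 # lexical similarity across the gap
            1.0 if lowered.startswith(CUE_PHRASES) else 0.0,  # a cue phrase opens the next sentence
        ]

    # One feature vector per candidate boundary; y = 1 where annotators placed a topic boundary.
    X_train = [[0.05, 1.0], [0.71, 0.0], [0.12, 0.0], [0.66, 0.0]]
    y_train = [1, 0, 1, 0]

    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    # A gap with low lexical similarity and an opening cue phrase should be flagged as a boundary.
    print(classifier.predict([gap_features("Anyway, about the agenda for next week", 0.08)]))

A CRF-style sequence labeler would make these per-gap decisions jointly over the whole discourse instead of independently.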
Topic Segmentation in Asynchronous Conversation

As noted by Purver [165], it can be hard to define what exactly we mean by a topic. In some sense, the definition of topic and its granularity depend on the application at hand. For example, for categorization of news articles, topics could represent high-level distinctions among politics, culture, business, sports, science and so on. For meetings, topics could represent the items on the agenda. However, often it is not easy to define the right granularity of a topic. For example, Gruenstein et al. [77] reported that annotators, when asked to identify topic shifts in the ICSI meetings [92], a collection of open-domain, less-structured meetings on subjects with which the annotators were not familiar, did not agree at all, as the notion of topic was too fine-grained. However, Galley et al. [74] found that annotators agreed reasonably well when topics are coarse-grained. Banerjee and Rudnicky [13] found that with more specific guidance (e.g., a list of candidate topics from which to choose), annotation agreement could be improved significantly.

One of our future goals is to automatically generate summaries of any asynchronous conversation (e.g., email, blog). Therefore, our notion of topic is similar to that of Galley et al. [74] for meeting summarization. In particular, we consider a topic to be something about which the participants discuss or argue or express their opinions. For example, a blog conversation in Slashdot (http://it.slashdot.org/story/09/05/28/1952214/hackers-breached-us-army-servers) that begins with a discussion on breaches of US army servers also covers the Iraq and Vietnam wars, hacker vs. cracker and many others. An email conversation about an upcoming meeting may discuss location and meeting agenda. Our annotators were asked to identify as many topics as they feel most natural to convey the overall content structure of the conversation. For example, in the email conversation shown in Figure 1.1, our annotators found two different topics. The conversation starts with asking for attendance via phone to a face to face meeting, then also discusses the time to call.

In topic segmentation, we are interested in identifying what portions of the conversation are about the same topic, i.e., clustering the sentences based on their topics. Figure 1.2 shows the same email conversation of Figure 1.1 annotated with topic information by our annotators. The right most column specifies a particular topic segmentation by assigning the same topic ID to sentences belonging to the same topic. We also use different colors to differentiate the topics.

[Figure 1.1: Sample email conversation from the BC3 email corpus [213].]

[Figure 1.2: The email conversation of Figure 1.1 with topic annotations.]

To our knowledge, no-one has studied this problem before. Therefore, there are no standard corpora and evaluation metrics available. Also, because of its asynchronous nature, topics in these conversations often do not change sequentially. Notice that the conversation in Figure 1.2 starts with a discussion on topic 1, then shifts to topic 2, and again revisits topic 1 in the last email. In other words, topics in asynchronous conversation are often interleaved, and the sequentiality constraint of topic segmentation in monolog and synchronous dialog does not hold anymore. As a result, the models that are successful in monolog or synchronous dialog may not be as successful when they are directly applied to asynchronous conversation.

Furthermore, as can be noticed in Figure 1.2, writing style varies among participants, and many people tend to use informal, short and ungrammatical sentences, thus making the discourse much less structured compared to written monologs.

One unique aspect of asynchronous conversation that, at first glance, may appear to carry all the necessary topic information is its subject headers. However, subject headers are often not enough and could sometimes even be misleading. For example, in the email conversation shown in Figure 1.2, participants keep talking about different topics using the same subject (i.e., Phone connection to ftof). This problem is more evident in more informal interactions like public blogs.

Asynchronous conversations have their own distinctive features that could be useful for topic segmentation. One of the most important indicators that we hypothesize to be very informative is conversation structure. As can be seen in our email conversation in Figure 1.2, participants often reply to a post and/or use quotations to talk about the same topic. Notice also that the use of quotations can express a conversational structure that is at a finer level of granularity than the one revealed by reply-to relations. To leverage this key information, the first challenge we face is capturing the conversation structure at the quotation (i.e., text fragment) level. In Section 1.1.3, we describe a method to extract this finer-grained conversational structure. Beside conversation structure, other conversation-specific features like sender, recipient and mentioning names could also be useful for topic segmentation.
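The actual construction of this fine-grained structure (the Fragment Quotation Graph) is presented in Section 1.1.3. Purely to fix the intuition, the simplified sketch below treats each maximal run of quoted ('>'-prefixed) or new lines in an email body as a fragment and links each new fragment to the quoted fragments adjacent to it; quotation depth, de-duplication of fragments across emails and the other details of the real method are deliberately ignored here.

    def fragments(body):
        """Split an email body into (is_quoted, text) fragments using the '>' quotation prefix."""
        frags, current, quoted = [], [], None
        for line in body.splitlines():
            is_quoted = line.lstrip().startswith(">")
            if quoted is None or is_quoted == quoted:
                current.append(line)
            else:
                frags.append((quoted, "\n".join(current)))
                current = [line]
            quoted = is_quoted
        if current:
            frags.append((quoted, "\n".join(current)))
        return frags

    def quotation_edges(body):
        """Edges from each new (unquoted) fragment to the quoted fragments next to it."""
        frags = fragments(body)
        edges = []
        for i, (is_quoted, _) in enumerate(frags):
            if is_quoted:
                continue
            for j in (i - 1, i + 1):  # neighbouring quoted fragments are the likely reply targets
                if 0 <= j < len(frags) and frags[j][0]:
                    edges.append((i, j))
        return frags, edges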
Now that we have identified the key distinctive features, the next challenge we face is incorporating these features into our topic segmentation models in a principled way. As described before, LCSeg [74] and LDA [26] are the two state-of-the-art unsupervised topic segmentation models developed for monolog and synchronous dialog from two different viewpoints. In this thesis, we propose a graph-theoretic clustering framework to incorporate the fine-grained conversational structure (i.e., a graph) into the LCSeg model. On the other hand, to incorporate the conversational structure into the LDA model, we propose to replace its standard Dirichlet prior by an informative Dirichlet Tree prior.

Although the unsupervised models have the key advantage of not requiring any labeled data, they can be limited in their ability to learn domain-specific knowledge from a possibly large and diverse set of features [62]. In contrast, the supervised framework serves as a viable option to combine a large number of features and optimize their relative weights for decision making, but relies on labeled data for training. The amount of labeled data required to achieve an acceptable performance is always an important factor to consider for choosing supervised vs. unsupervised. In this thesis, we propose a supervised topic segmentation model that outperforms all the unsupervised models, even when it is trained on a small number of labeled conversations. Our model uses a discriminative classifier to combine all the important features in a graph-theoretic clustering framework.

Topic Labeling

A topic label should give a brief high-level overview of the topic discussed in the topical segment. For example, the labels for the two topic segments in Figure 1.2 are phone connection to ftof and time to call. The task of topic labeling can be framed as a keyphrase generation (or extraction) problem, where our goal is to generate (or extract) phrases that are representative of the given text. In keyphrase extraction, we select the phrases verbatim from the given text, whereas in keyphrase generation, we generate novel phrases based on the extracted information. Given that the human-authored topic labels are abstractive in nature (see the examples above), keyphrase generation has more potential than keyphrase extraction to get closer to human-like topic labels. However, generation is computationally much more complex and challenging than extraction since it requires inference, aggregation and abstraction. Therefore, most existing approaches are extractive in nature.
Topic Labeling

A topic label should give a brief, high-level overview of the topic discussed in the topical segment. For example, the labels for the two topic segments in Figure 1.2 are phone connection to ftof and time to call. The task of topic labeling can be framed as a keyphrase generation (or extraction) problem, where our goal is to generate (or extract) phrases that are representative of the given text. In keyphrase extraction, we select the phrases verbatim from the given text, whereas in keyphrase generation, we generate novel phrases based on the extracted information. Given that human-authored topic labels are abstractive in nature (see the examples above), keyphrase generation has more potential than keyphrase extraction to get closer to human-like topic labels. However, generation is computationally much more complex and challenging than extraction, since it requires inference, aggregation and abstraction. Therefore, most existing approaches are extractive in nature.

Several supervised and unsupervised approaches to keyphrase extraction have been proposed, focusing only on monologs (e.g., paper abstracts). Generally, these methods operate in two steps: first, they find the candidate keyphrases in the text, then they rank (or filter) them based on their relevance to the text. Candidate keyphrases can be extracted using either an NLP chunker or n-gram sequences [91, 135, 137]. Supervised methods use a number of features (e.g., TF.IDF, position) in a classifier to filter these candidates [91, 135]. Unsupervised methods use different heuristics (e.g., TF.IDF, graph centrality) to rank the candidates [139]. Mei et al. [137] find that using an NLP chunker to select the candidates leads to poor results due to its inaccuracies, especially when it is applied to a new domain. Finding the right value of n in the n-gram sequences is also an issue, and Mihalcea and Tarau [140] claim that including all possible n-gram sequences excessively increases the search space. Instead, they follow a different approach, where they first select the words of a certain POS and rank them using a ranking method, and then, in a post-processing step, construct the keyphrases based on the co-occurrences of the top-ranked words in the text. Their approach with a graph-based (unsupervised) ranking method (also known as random walk) for ranking the words achieves state-of-the-art performance, outperforming supervised methods [139].

Topic Labeling in Asynchronous Conversation

A direct application of the ranking method proposed by Mihalcea and Tarau [140] to label a topical segment in an asynchronous conversation would consider the words in the segment as nodes in a graph, define the edges between the nodes based on the co-occurrence of the respective words in the text, and then run the PageRank algorithm [155] on the graph. Our hypothesis is that better topic labels can be identified if, in addition to co-occurrence relations, we also consider aspects that are specific to asynchronous conversations.

Our first finding is that the leading sentences of a topical segment carry informative clues for the topic labels, since this is where the speakers will most likely try to signal a topic shift and introduce the new topic. For example, in Figure 1.2, notice that in almost every case, the leading sentences of the topical segments cover the information conveyed by the respective topic labels.

Our second clue is again the finer-grained conversational structure (i.e., a graph) used for topic segmentation. Carenini et al. [35] successfully applied the PageRank algorithm to this graph to measure the importance of a sentence. Their finding implies that an important node in the conversation graph is likely to cover an important aspect of the topics discussed in the conversation. Our intuition is that, to be in the topic label, a keyword should not only co-occur with other keywords, but should also come from an important fragment in the graph.

Now, the challenge is how to incorporate these two conversation-specific features into the graph-based ranking (i.e., random walk) model. In this thesis, we propose and evaluate a biased random walk model [139] to incorporate the clues from the leading sentences of a topical segment, and a co-ranking model [238] to incorporate the fine-grained conversation structure.
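A minimal sketch of such a biased random walk is shown below: the teleportation distribution of a PageRank-style walk over the word co-occurrence graph is skewed toward words that appear in the leading sentences of the segment. The bias construction and the parameter values are illustrative assumptions made for this example; the actual models (including the co-ranking with the conversational structure) are developed in Chapter 2.

    import numpy as np

    def biased_textrank(co_occurrence, in_leading_sentence, damping=0.85,
                        bias_strength=0.8, iters=100):
        """Rank candidate words with a random walk biased toward words from
        the leading sentences of a topical segment.

        co_occurrence: (n x n) symmetric numpy array of word co-occurrence counts.
        in_leading_sentence: length-n 0/1 numpy vector marking words that appear
            in the leading sentences (the bias source); this definition of the
            bias is an assumption made for illustration.
        """
        co_occurrence = np.asarray(co_occurrence, dtype=float)
        n = co_occurrence.shape[0]
        # Column-normalize to obtain transition probabilities of the Markov chain.
        col_sums = co_occurrence.sum(axis=0)
        col_sums[col_sums == 0] = 1.0
        transition = co_occurrence / col_sums
        # Teleportation: mostly uniform, but partly redirected to leading-sentence words.
        uniform = np.full(n, 1.0 / n)
        bias = in_leading_sentence / max(in_leading_sentence.sum(), 1.0)
        teleport = (1 - bias_strength) * uniform + bias_strength * bias
        scores = np.full(n, 1.0 / n)
        for _ in range(iters):
            scores = (1 - damping) * teleport + damping * transition.dot(scores)
        return scores  # higher score = stronger topic-label keyword candidate

With a uniform teleportation vector this reduces to the standard TextRank-style ranking; the bias simply shifts the stationary distribution toward words introduced where topics tend to be announced.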
1.1.2 Rhetorical Structure

In addition to the high-level topical structure, a discourse exhibits a finer-level structure called rhetorical (or coherence) structure, which logically binds its units (i.e., clauses and sentences) together. The meaning of a discourse unit relates to the previous and the following ones.

Several formal theories of discourse have been proposed to describe the coherence structure of a text (e.g., [10, 122, 225]). Rhetorical Structure Theory (RST) [122], one of the most influential of them, represents texts by labeled hierarchical structures, called Discourse Trees (DTs). For example, consider the DT shown in Figure 1.3 for the following text taken from the RST-DT corpus [38]:

But he added: "Some people use the purchasers' index as a leading indicator, some use it as a coincident indicator. But the thing it's supposed to measure -- manufacturing strength -- it missed altogether last month."

The leaves of a DT correspond to contiguous atomic text spans, called elementary discourse units (EDUs). EDUs are clause-like units that serve as building blocks [38]. Adjacent EDUs are then related by coherence relations (e.g., Elaboration, Contrast), thereby forming larger units (represented by internal nodes), which in turn are also linked by coherence relations. Discourse units linked by a relation are further distinguished based on their relative importance in the text: the nucleus is the central part, whereas the satellites are peripheral ones. For example, in Figure 1.3, Elaboration is a relation between a nucleus (EDU 4) and a satellite (EDU 5), and Contrast is a relation between two nuclei (EDUs 2 and 3).

Figure 1.3: Discourse tree for two sentences. Each sentence contains three EDUs. Horizontal lines indicate text segments; satellites are connected to their nuclei by curved arrows.

As mentioned before, messages in asynchronous conversations can be long unless there is a length constraint (e.g., Twitter). For example, the average length of a comment in our blog corpus containing conversations from Slashdot is 11.7 sentences. Each message in an asynchronous conversation locally forms a coherent monolog, i.e., it exhibits a rhetorical structure. For example, consider the DT shown in Figure 1.4 for the first sentence of the fifth email in Figure 1.1 (we do not show the discourse tree for the entire email post to avoid visual clutter). The second EDU where I'll be Elaborates the first EDU Since it will be Mountain time, and the resulting span containing the first and the second EDUs is Elaborated by the third EDU 9am - 5pm Amsterdam time is 1am - 9am. Finally, the span containing the first three EDUs Explains the fourth EDU so I would participate the second half of the day.

Figure 1.4: Discourse tree for the first sentence of the fifth email in Figure 1.1.
Rhetorical structure provides a useful discourse structure that has been shown to be beneficial for a range of NLP applications including text summarization and compression [54, 118, 126, 194], text generation [163], sentiment analysis [112, 189], information extraction [130, 209] and question answering [215].

Rhetorical analysis in RST involves two subtasks: discourse segmentation is the task of breaking the text into a sequence of EDUs, and discourse parsing is the task of linking the discourse units into a labeled tree. Both discourse segmentation and parsing are considered to be challenging tasks, as we describe below.

Discourse Segmentation

For written texts, it is taken for granted that sentence boundaries are also EDU boundaries. So, discourse segmentation is the task of deciding whether there should be an EDU boundary after each word in a sentence except the last one. Discourse segmentation has a strong influence on the accuracy of a discourse parser. For example, Soricut and Marcu [191] found a 29% error reduction in parsing when a perfect segmentation is provided to the parser. However, finding EDUs is challenging because there is no general agreement as to what constitutes the EDUs or what their properties are [198]. For example, consider the following sentence:

John said that the journal paper won the best paper award.

How many EDUs are there? Since it contains two verbs, it is reasonable to think that it has two: John said and that the journal paper won the best paper award. But, at the same time, we would expect an EDU to be structurally complete. Here, John said is neither syntactically nor semantically complete. There is disagreement among researchers on this issue [198].

Existing approaches to discourse segmentation can be divided into two types: rule-based and supervised learning. Rule-based approaches use hand-crafted rules that are based on syntactic categories and POS tags [113, 210]. On the other hand, supervised approaches use a handful of lexical and syntactic features to learn a segmentation model from labeled data. For example, Soricut and Marcu [191] learn a generative model, Sporleder and Lapata [194] learn a boosting (ensemble) model, Fisher and Roark [72] learn a log-linear model, and Hernault et al. [84] learn an SVM. In general, supervised segmenters perform better than rule-based segmenters. However, the challenge is to come up with the right set of features. In this thesis, we propose a discriminative model that uses mainly syntactic features, including rules extracted from syntactic parse trees, POS tags and chunk tags.
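As an illustration of framing segmentation as a sequence of binary decisions, the sketch below trains a logistic regression (i.e., MaxEnt-style) classifier on simple lexico-syntactic features of each word position. The particular feature names are assumptions made for the example only; the model proposed in this thesis additionally uses rules extracted from syntactic parse trees and is described in Chapter 3.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def boundary_features(tokens, pos_tags, chunk_tags, i):
        """Features for the decision 'is there an EDU boundary after token i?'."""
        return {
            "word": tokens[i].lower(),
            "next_word": tokens[i + 1].lower(),
            "pos": pos_tags[i],
            "next_pos": pos_tags[i + 1],
            "pos_bigram": pos_tags[i] + "_" + pos_tags[i + 1],
            "chunk": chunk_tags[i],
            "next_chunk": chunk_tags[i + 1],
            "chunk_change": chunk_tags[i] != chunk_tags[i + 1],
        }

    def train_segmenter(sentences, boundary_labels):
        """sentences: list of (tokens, pos_tags, chunk_tags) triples.
        boundary_labels: for each sentence, 0/1 labels for every word position
        except the last one (1 = an EDU boundary follows that word)."""
        X, y = [], []
        for (tokens, pos, chunks), labels in zip(sentences, boundary_labels):
            for i in range(len(tokens) - 1):   # one decision per intra-sentence gap
                X.append(boundary_features(tokens, pos, chunks, i))
                y.append(labels[i])
        model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(X, y)
        return model

Because boundaries are rare relative to non-boundaries, in practice the classifier's decision threshold or class weights would also need attention; this sketch leaves them at their defaults.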
Discourse Parsing

Once the EDUs are identified, the discourse parsing problem is to determine which discourse units (EDUs or larger units) to relate (i.e., the structure), and what relations (i.e., the labels) to use in the process of building the DT. Specifically, it requires: a parsing model to explore the search space of possible structures and labels for their nodes; and a parsing algorithm for deciding among the candidates.

Often, discourse connectives like but, because and although convey clear information about the kind of relation linking two text segments. Earlier work developed rule-based (unsupervised) systems based on discourse connectives (e.g., [125]). However, this approach faces at least three serious problems. First, identifying discourse connectives is a difficult task of its own, because, depending on the usage, the same phrase may or may not signal a coherence relation. Second, coherence relations are often not explicitly signaled by discourse cues [127, 196]. Third, discourse connectives can be ambiguous in signaling relations [198].

Another line of research uses unambiguous discourse cues to automatically label a large corpus with coherence relations (e.g., although for Contrast), which could then be used to train a supervised classifier [127, 195]. To make the classifier work for implicit cases, they remove the connectives from the training instances. However, later studies confirm that classifiers trained on instances from which the original cue phrases have been stripped do not generalize well to implicit cases [24, 196]. Furthermore, this approach does not attempt to solve the actual parsing (i.e., building a hierarchical tree) problem; rather, it attempts to solve a tagging (i.e., flat) problem. To perform an effective and complete rhetorical parsing, one needs to employ supervised learning techniques based on human-annotated data.

Supervised approaches to discourse parsing can be judged based on two criteria: the type of model used as the parsing model, and the type of parsing algorithm used to build the discourse tree. Marcu [124] uses a C4.5 decision tree classifier as the parsing model in a shift-reduce (bottom-up) parsing algorithm. In the sentence-level discourse parser SPADE (http://www.isi.edu/licensed-sw/spade/), Soricut and Marcu [191] use a generative probabilistic parsing model with a CKY-like bottom-up parsing algorithm. Subba and Di-Eugenio [201] use a classifier based on Inductive Logic Programming (ILP) as the parsing model in a shift-reduce parsing algorithm. In the HILDA system (http://www.cs.toronto.edu/~weifeng/software.html), Hernault et al. [84] and Feng and Hirst [69] use two SVM classifiers -- one for the structure and one for the label -- in a cascade as the parsing model. They employ this cascaded parsing model iteratively to build the discourse tree in a bottom-up fashion.

The existing parsing models mentioned above disregard the structural inter-dependencies between the tree constituents. However, we hypothesize that, like syntactic parsing, discourse parsing is also a structured prediction problem, which involves predicting multiple variables (i.e., the structure and the relation labels) that depend on each other [188]. Recently, Feng and Hirst [69] also found these inter-dependencies to be critical for parsing performance. In this thesis, we use undirected conditional graphical models (i.e., CRFs) to capture structural dependencies between the tree constituents.

Another technical question related to parsing models is whether to employ a single unified model or two different models for parsing at the sentence level (i.e., intra-sentential) and at the document level (i.e., multi-sentential). Existing discourse parsers use a single model and do not discriminate between intra- and multi-sentential parsing. This approach has the advantages that it makes the parsing process easier and that the model gets more data to learn from. However, the feature set used in the parsing model may not generalize well to these two different parsing conditions, and there is empirical evidence that coherence relations are distributed differently intra-sententially vs. multi-sententially [38].

Our hypothesis is that a more effective approach could be to use two separate models, which would also allow us to exploit the strong correlation observed between the text structure and the DT structure.
However, this poses an additional challenge of combining the two stages of parsing effectively and efficiently. Although most sentences have a well-formed discourse sub-tree in the full DT (e.g., the second sentence in Figure 1.3), there are a few cases where rhetorical structures violate sentence boundaries. For example, the first sentence in Figure 1.3 does not have a well-formed discourse sub-tree, because the unit containing EDUs 2 and 3 merges with the next sentence, and only then is the resulting unit merged with EDU 1.

In this thesis, we combine our intra-sentential and multi-sentential parsers in two different ways. Since most sentences have a well-formed discourse sub-tree in the full DT, our first approach constructs a DT for every sentence using our intra-sentential parser, and then runs the multi-sentential parser on the resulting sentence-level DTs. Clearly, this approach disregards those cases where rhetorical structures violate sentence boundaries. To deal with those cases, our second approach builds sentence-level sub-trees by applying the intra-sentential parser on a sliding window covering two adjacent sentences and consolidating the results produced by overlapping windows. Then, the multi-sentential parser takes all these sentence-level sub-trees and builds a full rhetorical parse for the whole document.

Most of the existing discourse parsers apply greedy and sub-optimal parsing algorithms to build the DT. These algorithms offer a simple and efficient solution, but are not very effective in terms of accuracy. Therefore, a challenge we address in this thesis is to develop an optimal parsing algorithm that is also efficient.
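The sketch below shows the kind of CKY-style dynamic program that makes the search over discourse trees optimal rather than greedy. It assumes a hypothetical scoring function span_score(i, k, j) returning a log-probability (e.g., inferred from a parsing model) for merging the units spanning [i, k) and [k, j); this is a simplified illustration, not the exact algorithm of the parser presented in Chapter 3.

    from functools import lru_cache

    def best_discourse_tree(n, span_score):
        """Find the highest-scoring binary discourse tree over EDUs 0..n-1.

        span_score(i, k, j): assumed log-probability that the units covering
        [i, k) and [k, j) are merged by some coherence relation, so that
        summing scores corresponds to multiplying probabilities.
        """
        @lru_cache(maxsize=None)
        def best(i, j):
            if j - i == 1:                      # a single EDU: nothing to merge
                return 0.0, (i, j, None)
            best_val, best_tree = float("-inf"), None
            for k in range(i + 1, j):           # try every split point
                left_val, left_tree = best(i, k)
                right_val, right_tree = best(k, j)
                val = left_val + right_val + span_score(i, k, j)
                if val > best_val:
                    best_val = val
                    best_tree = (i, j, (left_tree, right_tree))
            return best_val, best_tree

        return best(0, n)

Because sub-spans are memoized, the procedure runs in time cubic in the number of EDUs while still returning the globally most probable tree under the given scores, in contrast to the greedy bottom-up strategies discussed above.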
1.1.3 Dialog Acts and Conversational Structure

As mentioned before, a conversation is a joint activity between two or more participants. The participants take turns, each of which consists of one or more utterances [177]. The utterances in a turn perform certain actions (e.g., asking a question, requesting something), which are called dialog acts (DAs) [12]. For instance, in the second email shown in Figure 1.5, the sentence At least one "people" would answers the question posed in the first email. Two-part structures connecting two DAs (e.g., Question-Answer, Request-Accept) are called adjacency pairs [180]. Dialog acts and adjacency pairs provide useful structures for conversation analysis that have been shown to be beneficial for a range of applications, including conversation summarization [148, 151] and conversational agents [5].

Now consider the following multi-party (synchronous) exchange:

- John: Anyone up for lunch? It's 12:10
- Ale: I'm in
  who else is joining?
- Carla: John, are you done with the report? You were supposed to finish it before lunch
- Kevin: I'll also join
- John: I'm almost done

Clearly, in this exchange, two simultaneous conversations are going on: one is about going for lunch, and the other is about the report. Notice that in multi-party conversations, it is not always obvious who is talking to whom. For example, a straightforward reading would mistakenly consider Kevin's utterance as a response to Carla, and John's last utterance as a response to Kevin. Identifying which utterance contributes to which conversation is known as disentanglement [63] (also called conversation structure extraction [1, 131] and thread detection [184]). It is considered an essential prerequisite for other higher-level conversation analysis (e.g., dialog act recognition). In the following, we give an overview of the two tasks: dialog act recognition and conversational structure extraction.

Dialog Act Recognition

Most of the previous work on dialog act modeling has focused on synchronous conversations, e.g., [102, 232] for chats, [57, 107] for meetings, and [14, 200] for phone conversations. The dominant approaches are mostly supervised, and use either simple classifiers (binary or multi-class) or more structured models like Hidden Markov Models (HMMs), Maximum Entropy Markov Models (MEMMs), and Conditional Random Fields (CRFs). Since turns in synchronous conversations occur one after the other with minimal delay, the conversation flow in these conversations exhibits sequential dependencies between adjacency pairs (e.g., a question followed by an answer, a request followed by a grant). Sequence labelers like HMMs and CRFs, which are capable of capturing these inter-dependencies between the dialog acts, generally perform better than the simple classifiers (e.g., MaxEnt, SVMs).

Figure 1.5: The email conversation of Figure 1.1, now annotated with dialog acts (DAs). The right most (DA) column specifies the dialog acts for the sentences: S stands for Statement, QY stands for Yes-no question, A stands for Accept response, and AM stands for Action motivator. The Frag. column specifies the fragments in the Fragment Quotation Graph (FQG) described in Section 1.1.3.

Dialog Act Recognition in Asynchronous Conversation

Unlike synchronous conversations, consecutive turns in asynchronous conversations can be far apart in time, and multiple turns can largely overlap [36]. Among the supervised approaches, Cohen et al. [46, 39] first use the term email speech act for classifying emails (not sentences) based on the writers' intentions (e.g., deliver, meeting). Jeong et al. [93] propose semi-supervised boosting to tag the sentences in email and forum conversations with DAs by adapting knowledge from annotated spoken conversations, i.e., meeting and phone conversations. However, their approach does not consider the sequential dependencies between the DA types.

Among recent attempts at unsupervised DA modeling, Ritter et al. [174] propose HMMs to cluster Twitter posts into DAs. Their simple HMM model tends to find some undesired topical clusters in addition to the DA clusters. Without labeled data, distinguishing DAs from topics is in fact a common challenge, because many of the features used for modeling DAs are also indicators of topics. Ritter et al. [174] propose an HMM+Topic model that tries to separate the DA indicators from the topic words. Since Twitter messages are short, HMMs are able to learn meaningful sequence dependencies from the reply-to structure of the conversation.

When messages are long, capturing sequential dependencies between the act types for sentence-level DA tagging adds further challenges, because the two components of an adjacency pair can be far apart in the sequence. Because of the asynchronous nature, the temporal order of the utterances often lacks these dependencies. This is true even when the reply-to structure between messages is considered. For instance, notice in Figure 1.5 that there are two other sentences in between the question asked in the first email (i.e., Are there ..) and the reply made in the second email (i.e., At least ..). This example also demonstrates that the use of quotations (see the second email) could help us in putting the components of adjacency pairs close to each other. However, this comes with another challenge of capturing the conversational structure at the quotation level. In the next section, we present a method to capture the conversational structure at the quotation level.

Conversational Structure Extraction

The fact that multi-party conversations are often made up of interwoven threads was first acknowledged by Rosé et al. [176]. Aoki et al. [7] report an average of 1.76 active conversations in an exchange involving 8 to 10 speakers. Elsner and Charniak [63] report an average of 2.75 conversations active at a time in multi-party chat. Several methods have been proposed to disentangle multi-party synchronous conversations (e.g., [63, 131, 184, 223]). We describe them briefly in Section 2.2.3.

Figure 1.6: (a) The email conversation of Figure 1.5 with its fragments. The real contents are abbreviated as a sequence of symbols. Arrows indicate reply-to links. (b) The corresponding Fragment Quotation Graph.

While disentanglement is necessary for many multi-party synchronous conversations, asynchronous media like email and blog services (e.g., Gmail, Slashdot, Twitter) generally organize comments into tree-structured threads using reply-to relations. In the absence of reply-to relations, automatic methods to uncover the thread structure have also been proposed (e.g., [220, 224]). However, the use of quotations in asynchronous conversations can express a conversational structure that is finer grained and can be more informative than the one revealed by reply-to relations [36]. For example, consider the relation between the new and quoted (marked with ">") text fragments in the second and fifth emails in Figure 1.5. The proximity between quoted and new text fragments can represent a conversational link that is more specific than the actual reply-to link. Carenini et al. [34] presented a method to capture this finer-level conversational structure in the form of a graph called the Fragment Quotation Graph (FQG). For example, Figure 1.6 shows the FQG for our email conversation in Figure 1.5. The nodes in the FQG represent text fragments and the edges represent reply relations between the fragments.

Carenini et al. [34, 35] show the benefits of using an FQG in email summarization. In this thesis, we generalize the FQG to any asynchronous conversation, and demonstrate how topic models and dialog act models can benefit significantly from this fine conversational structure of asynchronous conversation.
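The basic intuition behind the FQG -- that adjacency between new and quoted text fragments signals a reply link -- can be sketched as follows. The input representation (a quotation depth plus a fragment identifier for each fragment of each email) is an assumption made purely for illustration; the full construction by Carenini et al. [34], which also deals with issues such as hidden and overlapping fragments, is more involved.

    from collections import defaultdict

    def build_fqg(emails):
        """Build a simple fragment quotation graph (FQG) from an email thread.

        emails: list of emails, each a list of (depth, fragment_id) pairs in
            textual order, where depth is the quotation level of the fragment
            (number of '>' markers) and fragment_id identifies the fragment.
        An edge u -> v is added when a less-quoted fragment u sits next to a
        more-quoted fragment v, i.e., u is likely replying to v.
        """
        edges = defaultdict(set)
        for fragments in emails:
            for (d1, f1), (d2, f2) in zip(fragments, fragments[1:]):
                if d1 < d2:        # new text followed by quoted text
                    edges[f1].add(f2)
                elif d1 > d2:      # quoted text followed by new text
                    edges[f2].add(f1)
        return edges               # adjacency sets: replying fragment -> replied-to

    # Example: one email contains fragment 'a'; a second email quotes 'a'
    # (depth 1) and replies with new text 'b'.
    fqg = build_fqg([[(0, "a")], [(1, "a"), (0, "b")]])
    # fqg == {"b": {"a"}}

The resulting graph is at the granularity of text fragments rather than whole messages, which is what makes it finer grained than the reply-to thread structure.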
1.2 Computational Methodologies

Table 1.4 lists the computational methods used for different discourse analysis tasks in this thesis. These methods can be classified into two broad classes: graph-based methods and probabilistic graphical models.

Tasks                   Computational methods           Comments
Topic Segmentation      Graph-based clustering          N-cut model
                        LDA with Dirichlet Tree prior   Generative topic model
Topic Labeling          Graph-based ranking             Random walk model
Discourse Segmentation  Binary classifier (MaxEnt)      Discriminative model
Discourse Parsing       Multi-layered CRFs              Discriminative UGM
Dialog act modeling     HMMs (unsupervised)             Generative DGM

Table 1.4: Computational methods used for different discourse analysis tasks. N-cut stands for Normalized cut, UGM stands for Undirected Graphical Model, and DGM stands for Directed Graphical Model.

1.2.1 Graph-based Methods for NLP

The adoption of a graph-theoretic framework allows us to represent linguistic units as diverse as words, sentences and documents as nodes in a graph, and relations between them as edges in the graph. It then allows us to apply a variety of efficient algorithms on the graph to solve a wide range of NLP applications. Table 1.5 shows some widely used graph-based algorithms and their applications to various NLP problems. In this thesis, we use graph-based clustering and ranking models for topic segmentation and labeling in asynchronous conversation.

Tasks       Methods                    Applications
Clustering  Min cut / Max flow         Topic segmentation [121], Reference resolution [190], Chat disentanglement [63], Subjectivity detection [157]
Ranking     PageRank, HITS, ArcRank    Summarization [65], Keyphrase extraction [140], Passage retrieval [154], WSD [138], Lexical acquisition [228]
Learning    Semi-supervised, Manifold  Text categorization [208], Sentiment analysis [208], Dialog act tagging [202], POS tagging [203], MT [2]
Matching    Min-cost match             Textual entailment [78]

Table 1.5: Graph-based methods applied to NLP tasks. WSD stands for Word Sense Disambiguation and MT stands for Machine Translation.
Graph-based Clustering and Ranking

Let G = (V, E) be a weighted undirected graph, where the nodes V represent the linguistic units (e.g., sentences, words) and the edge weights w(x, y) represent some form of "similarity" between units x and y. The term "similarity" can refer to a number of things. For example, it can be a score of lexical similarity between two sentences x and y, it can be the confidence score of a classifier that determines how likely it is that sentences x and y belong to the same class, and so on.

Graph-based clustering (also called correlation clustering [15]) aims to partition the nodes of the graph into disjoint groups where, by some measure, the similarity among the nodes in a group is high and the similarity across different groups is low. Several objectives, and efficient algorithms to compute the globally optimal partition based on those objectives, have been proposed, e.g., min-cut and normalized cut (n-cut) [185].

Graph-based clustering has been used to solve a number of problems in discourse (see Table 1.5). One can encode different discourse-related information in terms of the edge weights. For example, Pang and Lee [157] encode contextual dependencies for subjectivity detection. Malioutov and Barzilay [121] use cosine similarity to encode lexical cohesion into their unsupervised topic segmentation model. In this thesis, we encode lexical cohesion as well as conversational structure into our unsupervised topic segmentation model. Furthermore, by defining the edge weights based on the confidence score of a classifier, we encode a large number of domain-specific features into our supervised topic segmentation model.

Graph-based ranking algorithms (e.g., PageRank [31], HITS [105]) are ways of measuring the importance of a node within a graph by taking into account global information computed from the entire graph, rather than relying only on local node-specific information [139]. TextRank [140] (or LexRank [65]), a version of PageRank for textual units, is the most popular graph-based ranking algorithm in NLP. A probabilistic interpretation of this algorithm can be given in terms of a random walk on a graph, where the graph represents the transition matrix of a Markov chain, and the ranking gives the stationary distribution of the chain.

The random walk framework allows us to incorporate knowledge from multiple sources as priors [208], biases [154] and co-ranking [238]. This ability enables us to incorporate different discourse phenomena into the ranking process. In this thesis, we use a biased random walk and a co-ranking framework to incorporate clues from the leading sentences and the FQG, respectively, for the task of topic labeling.

1.2.2 Probabilistic Graphical Models

Uncertainty is inevitable in many real-world scenarios because we never observe the world state fully, and even those aspects that we observe are sometimes noisy. As a consequence, we often employ probabilistic reasoning to design a real-world system. Probabilistic Graphical Models (PGMs) constitute a general class of probabilistic models with the following key properties [108]:

- A compact graph representation which is semantically intuitive.
- Efficient probabilistic reasoning in a general framework.
- Separation of reasoning from representation.
- A general learning framework.

Complex systems often involve multiple interrelated aspects (i.e., random variables), some of which are observed while some are not, and some of which interact with each other directly while others do not. Probabilistic graphical models define joint distributions over the random variables, which then allow us to reason about some query variables, possibly given observations about some others. The graph representation allows us to define arbitrary joint distributions intuitively and compactly by exploiting the interactions among the variables. The key insight of graphical modeling is that a distribution over many variables can often be represented as a product of local functions that each depend on a much smaller subset of variables. The reasoning algorithms work directly on the graph structure and are generally faster than working on the joint distribution explicitly. Separating reasoning from representation allows us to develop a general class of algorithms that can be applied to any model within a broad category; conversely, our model can be improved for a specific application without modifying the algorithms.

The attributes, their interdependencies and the parameters in a graphical model can either be defined by domain experts or learned automatically from data. PGMs provide a general learning framework that is very effective in practice.

Table 1.6 shows examples of the two families of PGMs: Directed Graphical Models (DGMs) and Undirected Graphical Models (UGMs). These models can be further distinguished based on how they are trained. Generative models define a joint distribution p(y, x | θ) over inputs and outputs, then use the (learned) model θ to infer the conditional p(y | x, θ). This approach has several advantages including efficient training, unsupervised modeling and handling missing data [145]. However, it also has crucial limitations. The dimensionality of x can be very large with complex dependencies, so modeling x can be difficult and may lead to intractable models, but ignoring the dependencies can lead to inaccurate models [204].

             Generative                       Discriminative
Directed     LDAs, HMMs, State space models   Maximum Entropy Markov Model
Undirected   Markov Random Fields             Conditional Random Fields

Table 1.6: Families of probabilistic graphical models.

A solution to this problem is a discriminative approach, which models the conditional distribution p(y | x, θ) directly. The key advantage of this approach is that dependencies involving only variables in x play no role in the conditional model. In general, discriminative methods are more accurate than generative ones, since they solve an easier problem and do not "waste effort" in modeling the observations [145]. Other advantages include the ability to leverage a large number of input features x, and the ability to relax strong independence assumptions.
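To make the generative/discriminative contrast concrete for a sequence labeling task such as dialog act tagging, one can compare the standard factorizations of an HMM and of a linear-chain CRF; the notation below is generic textbook notation, not specific to the models used in this thesis. The HMM (generative, directed) factorizes the joint distribution over a label sequence y and an observation sequence x as

\[
p(\mathbf{y}, \mathbf{x} \mid \theta) \;=\; \prod_{t=1}^{T} p(y_t \mid y_{t-1}, \theta)\, p(x_t \mid y_t, \theta),
\]

whereas the linear-chain CRF (discriminative, undirected) models the conditional distribution directly as a normalized product of local feature functions:

\[
p(\mathbf{y} \mid \mathbf{x}, \theta) \;=\; \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T} \exp\Big( \sum_{k} \theta_k\, f_k(y_t, y_{t-1}, \mathbf{x}, t) \Big).
\]

In the CRF, the observations x appear only as conditioning information inside the feature functions, which is precisely why arbitrary, overlapping input features can be used without modeling their dependencies.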
PGMs have been widely used in numerous NLP tasks because of their ability to predict multiple interrelated variables (i.e., structured output). For example, the dependencies between words in a sentence are modeled by Markov chains (i.e., language models) [100], the dependencies between the POS tags of the words in a sentence are modeled by HMMs or CRFs, the hierarchical dependencies between syntactic constituents in a parse tree are modeled by CRFs, and so on. Fundamental to many discourse analysis tasks is also the ability to predict multivariate outputs. For example, the contents of a news article follow a hidden topical structure [19]; the dialog acts of a conversation are interrelated [200]; discourse entities, their attributes and their mentions are interrelated [59]; and rhetorical structure and relations are interrelated [69]. Table 1.7 shows examples of probabilistic graphical models used in different discourse analysis tasks.

Tasks                               Graphical models used
Content (topic) modeling            HMMs [19], LDAs [61]
Dialog act modeling                 HMMs [200], CRFs [102]
Reference resolution                CRFs [59]
Shallow discourse parsing in PDTB   CRFs [75]

Table 1.7: Examples of graphical models used in discourse analysis. PDTB stands for Penn Discourse Treebank [164].

However, it is important to note that there is no universally best model -- a model that works well in one task may work poorly in another; this is essentially the no free lunch theorem [231]. As a result, there is now a growing interest in devising new graphical models for different tasks in discourse analysis. In this thesis, as summarized in Table 1.4, we use an LDA with a Dirichlet Tree prior for topic segmentation in asynchronous conversation, discriminative UGMs (i.e., CRFs) as the parsing models in our discourse parser, and generative DGMs (i.e., unsupervised HMMs) for modeling dialog acts in asynchronous conversation. As we describe in detail in the respective chapters, our choice of model for a specific task is driven by four considerations, namely the structure (or complexity) of the task, the amount of labeled data available for training the model, the number of features we want to incorporate into the model, and the (hidden) random variables we want to consider jointly in our model.

1.3 Our Contributions

Our contributions in this thesis aim to overcome the challenges for the different discourse analysis tasks in asynchronous conversations described in Section 1.1. Our key hypothesis is that to effectively address these technical challenges, we need to apply the sophisticated graph-based methods and probabilistic graphical models described in Section 1.2. In the following, we summarize our contributions.

1.3.1 Topic Segmentation and Labeling

In Chapter 2, we propose a complete computational framework for performing topic segmentation and labeling in asynchronous conversations. (Our corpora, annotation manual and source code for the computational models are publicly available from www.cs.ubc.ca/labs/lci/bc3.html.)

Since there was no previous study on topic segmentation and labeling in asynchronous conversation, there were no standard corpora and evaluation metrics available for research purposes. We present two new corpora of email and blog conversations annotated with topics, and evaluate annotator reliability using a new set of metrics, which are also used to evaluate the computational models.

For topic segmentation, we extend LDA [26] and LCSeg [74], two state-of-the-art unsupervised models, to incorporate a fine-grained conversational structure (i.e., the Fragment Quotation Graph (FQG)), generating two novel unsupervised models, LDA+FQG and LCSeg+FQG. We incorporate the FQG into LDA by replacing its standard Dirichlet prior with an informative Dirichlet Tree prior. On the other hand, we propose a graph-based clustering model to incorporate the FQG into LCSeg.
In addition to that, we also propose a novel supervised segmentation model that combines lexical, conversational and topic features using a classifier in the graph-based clustering framework.

For topic labeling, we propose to generate topic labels using an unsupervised extractive approach that identifies the most representative phrases in the text. Specifically, we propose two novel random walk models that respectively capture two forms of conversation-specific information: (i) the fact that the leading sentences in a topical cluster often carry the most informative clues, and (ii) the fine-grained conversational structure of the conversation, i.e., the FQG.

We evaluated our framework in a series of experiments. Experimental results for the topic segmentation task demonstrate that both LDA and LCSeg benefit significantly when they are extended to consider the FQG, with LCSeg+FQG being the best unsupervised model. The comparison of the supervised segmentation model with the unsupervised models shows that the supervised method outperforms the unsupervised ones even when using a limited number of labeled conversations, making it the best segmentation model overall. Remarkably, the segmentation decisions of LCSeg+FQG and the supervised models are also highly correlated with human annotations. The experiment on the topic labeling task reveals that the random walk model performs better when it exploits conversation-specific clues from the leading sentences and the conversational structure. The evaluation of the end-to-end system also shows promising results in both corpora when compared with human annotations.

1.3.2 Rhetorical Analysis

In Chapter 3, we propose a complete probabilistic discriminative framework for rhetorical analysis, comprising both a discourse segmenter and a discourse parser. (The source code and an online demo of our rhetorical analysis framework are available at http://alt.qcri.org/discourse/Discourse-Parser-Demo/.)

For discourse segmentation, we propose a novel discriminative model that achieves state-of-the-art performance using fewer features than the existing models. Our main contribution is to come up with the right set of features, including rules extracted from syntactic parse trees, POS tags and chunk tags.

For discourse parsing, our contributions aim to address the key limitations of the existing parsers (see Section 1.1.2), as we summarize below.

Existing discourse parsers model the structure and the labels of a discourse tree (DT) separately, and do not capture the sequential dependencies between the DT constituents. To address this, we propose a novel discourse parser based on probabilistic discriminative parsing models, expressed as Conditional Random Fields (CRFs) [205], to infer the probability of all possible DT constituents. The CRF models effectively represent the structure and the label of a DT jointly and, whenever possible, capture the sequential dependencies between the constituents.

While existing discourse parsers do not discriminate between intra-sentential and multi-sentential parsing, we believe that distinguishing between these two can result in a more effective parsing method, which can exploit the strong correlation observed between the text structure and the DT structure. Furthermore, two separate parsing models -- one for intra-sentential and one for multi-sentential -- could leverage their own informative feature sets and the fact that coherence relations are distributed differently intra-sententially vs. multi-sententially.
In order to develop a complete and robust discourse parser, we combine our intra-sentential and multi-sentential parsers in two different ways. Since most sentences have a well-formed discourse sub-tree in the full DT (e.g., the second sentence in Figure 1.3), our first approach constructs a tree for every sentence using our intra-sentential parser, and then runs the multi-sentential parser on the resulting sentence-level trees. However, this approach would disregard those cases where rhetorical structures violate sentence boundaries (e.g., the first sentence in Figure 1.3). Our second approach, in an attempt to deal with these cases, builds sentence-level sub-trees by applying the intra-sentential parser on a sliding window covering two adjacent sentences and by then consolidating the results produced by overlapping windows. After that, the multi-sentential parser takes all these sentence-level sub-trees and builds a full rhetorical parse for the whole document.

Finally, while existing discourse parsers apply greedy and sub-optimal parsing algorithms to build the discourse tree for a document, we apply an optimal parsing algorithm to find the most probable discourse tree.

While previous studies tested their approaches on only one corpus, we evaluate our discourse segmenter and parser on two very different genres: news articles and instructional manuals. The results show that our approach to discourse parsing significantly outperforms the state-of-the-art, often by a wide margin.

1.3.3 Dialog Act Modeling

In Chapter 4, we describe our unsupervised approaches to dialog act modeling in asynchronous conversation. In particular, we investigate a graph-theoretic deterministic framework and two probabilistic conversational models for clustering the sentences of an asynchronous conversation based on their dialog act types.

The graph-theoretic framework clusters sentences based on their lexical and structural similarity, but ignores the sequential dependencies between the acts. The probabilistic conversational models, on the other hand, frame the task as an unsupervised sequence labeling problem that we solve using variations of HMMs.

As described before, unlike synchronous conversations, asynchronous conversations often lack sequential dependencies between the act types in the temporal order of the utterances. We argue that sequences extracted from the conversational structure allow a more effective learning of the sequential dependencies. We show that this is the case for both email and forum conversations, in particular when for email we use the fine-grained conversational structure, i.e., the FQG.

Similar to Ritter et al. [174], we also observe that a simple unsupervised HMM, when applied to cluster sentences based on their dialog acts, tends to also find some irrelevant topical clusters. Ritter et al. [174] address this problem by proposing an HMM+Topic model which tries to separate the topic words from the dialog act indicators. In our work, we propose an HMM+Mix model, which not only separates the topics, but also improves the emission distribution by defining it as a mixture model.

We evaluate our models on two different datasets: email and forum posts. To our knowledge, we are the first to perform a quantitative evaluation of unsupervised dialog act models for asynchronous conversation. The empirical results demonstrate that (i) the graph-theoretic framework is not the right model for this task, (ii) the probabilistic conversational models learn better sequence dependencies when they are trained on sequences extracted from the graph structure of the conversation rather than on the temporal sequences, and (iii) HMM+Mix is a better conversational model than the simple HMM model.
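One simple way to realize the idea of training on structure-derived sequences is to read off sentence sequences along root-to-leaf paths of the conversation graph and feed those, rather than the raw temporal order of the sentences, to a sequence model such as an HMM. The sketch below assumes an adjacency-list representation of the graph and a mapping from fragments to their sentences; the exact path extraction used in Chapter 4 may differ, so this is only meant to illustrate the idea.

    def graph_sequences(children, sentences_of_fragment):
        """Enumerate sentence sequences along root-to-leaf paths of a
        conversation graph.

        children: dict mapping a fragment to the list of fragments that reply
            to it (edges followed from earlier to later fragments).
        sentences_of_fragment: dict mapping a fragment to its list of sentences.
        Returns one sentence sequence per root-to-leaf path.
        """
        all_fragments = set(sentences_of_fragment)
        reply_fragments = {r for rs in children.values() for r in rs}
        roots = all_fragments - reply_fragments   # fragments that reply to nothing
        sequences = []

        def walk(fragment, prefix):
            prefix = prefix + sentences_of_fragment.get(fragment, [])
            kids = children.get(fragment, [])
            if not kids:                          # leaf: a complete path
                sequences.append(prefix)
            for kid in kids:
                walk(kid, prefix)

        for root in roots:
            walk(root, [])
        return sequences

On such sequences, the two components of an adjacency pair (e.g., a question and its answer) tend to sit next to each other, which is what a sequence model needs in order to learn transition regularities between act types.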
Finally, in Chapter 5, we conclude the thesis with a summary of our contributions and directions for future work. We note that the exposition throughout this dissertation assumes that our readers are familiar with basic probabilistic statistical models, though not with the particular discourse processing tasks we study.

Chapter 2

Topic Segmentation and Labeling

In this chapter, we study the task of automatically identifying the high-level discourse structure, i.e., the topical structure of an asynchronous conversation. We present two new corpora of email and blog conversations annotated with topics, and evaluate annotator reliability for the topic segmentation and labeling tasks. We propose a complete computational framework for performing topic segmentation and labeling in asynchronous conversations. Our approach extends state-of-the-art methods by considering the fine-grained structure of an asynchronous conversation, along with other important features. We do so by applying recent graph-based methods for NLP. For topic segmentation, we propose two novel unsupervised models that exploit the fine-grained conversational structure, and a novel graph-theoretic supervised model that combines lexical, conversational and topic features. For topic labeling, we propose two novel (unsupervised) random walk models that respectively capture conversation-specific clues from two different sources: the leading sentences and the fine-grained conversational structure. Empirical evaluation shows that the segmentation and labeling performed by our best models outperform the state-of-the-art, and are highly correlated with human annotations. (This chapter is based on the journal article Joty et al. [98] (JAIR-2013). Portions of this work were also previously published in two peer-reviewed conference proceedings: Joty et al. [94] (EMNLP-2010) and Joty et al. [96] (ICWSM-2011).)

2.1 Introduction

What makes a topic in asynchronous conversation? As mentioned before, defining the term topic is not a trivial task, and by and large it depends on the target application. Since our ultimate goal is to be able to automatically generate summaries of asynchronous conversations (e.g., email, blog), our notion of topic is similar to that of Galley et al. [74] for meeting summarization. In particular, we consider a topic to be something about which the participants discuss or argue or express their opinions. In other words, a topic in an asynchronous conversation emulates an item (or issue) in a meeting agenda. For example, an email conversation about an upcoming meeting may discuss location, agenda and who should attend. A blog conversation in Slashdot (http://it.slashdot.org/story/09/05/28/1952214/hackers-breached-us-army-servers) that begins with a discussion on breaches of US army servers also covers the Iraq and Vietnam wars, hacker vs. cracker, and many other topics.

Multiple topics seem to occur naturally in social interactions, whether synchronous (e.g., meetings, chats) or asynchronous. In the naturally occurring ICSI multi-party meetings [92], Galley et al. [74] report an average of 7.5 topical segments per conversation.
In multi-party chat, Elsner and Charniak [63] report an average of 2.75 discussions active at a time. In the email and blog (asynchronous) conversational corpora that we present in this chapter, annotators found an average of 2.5 and 10.77 topics per email and blog conversation, respectively.

Topic segmentation refers to the task of grouping the sentences of an asynchronous conversation into a set of coherent topical clusters (or segments; in this chapter, we use the terms topical cluster and topical segment interchangeably), and topic labeling is the task of assigning a short description to each of the topical clusters to facilitate interpretation of the topics [165]. For example, in the sample truncated email conversation from our corpus shown in Figure 2.1, the majority of our three annotators found three different topics (or clusters). Likewise, in the truncated blog conversation shown in Figure 2.2, our annotators found six different topics. The right-most column in each figure specifies a particular segmentation by assigning the same topic ID (or cluster ID) to sentences belonging to the same topic. The topics in each figure are also differentiated using different colors. The topic labels assigned by the annotators are listed below each conversation (e.g., "Telecon cancellation", "Tag document", "Responding to I18N" in Figure 2.1).

Topic segmentation and labeling is often considered an essential prerequisite for higher-level conversation analysis [14] and has been shown to be useful in many language processing applications including text summarization [56, 80, 104], text generation [19], information extraction [3], and conversation visualization [117]. While extensive research has been conducted in topic segmentation for monolog (e.g., articles) and synchronous dialog (e.g., meetings), no one has studied the problem of segmenting and labeling asynchronous conversations. Therefore, there is no reliable annotation scheme, no standard corpus, and no agreed-upon metrics available. Also, it is our key observation that, because of its asynchronous nature and the use of quotations [51], topics in these conversations are often interleaved and do not change in a sequential way. That is, if we look at the temporal order of the sentences in a conversation, the discussion of one topic may appear to intersect with the discussion of others. As can be seen in Figure 2.1, after a discussion of topic 3 in the second and third emails, topics 1 and 2 are revisited in the fourth email, and then topic 3 is again brought back in the fifth email. Therefore, the sequentiality constraint of topic segmentation in monolog and synchronous dialog does not hold in asynchronous conversation. As a result, we do not expect models which have proved successful in monolog or synchronous dialog to be as effective when directly applied to asynchronous conversation.

Our contributions in this work aim to remedy these problems. First, we present two new corpora of email and blog conversations annotated with topics, and evaluate annotator reliability for the topic segmentation and labeling tasks using a new set of metrics, which are also used to evaluate the performance of the computational models. To our knowledge, these are the first such corpora that will be made publicly available. Second, we present a complete topic segmentation and labeling framework for asynchronous conversations.
Our approach extends state-of-the-art methods (for monologs and synchronous dialogs) by considering the fine-grained structure of the asynchronous conversation along with other conversational features. In doing so, we apply recent graph-based methods for NLP [139], such as min-cut and random walk on paragraph, sentence or word graphs.

Figure 2.1: Sample truncated email conversation from our email corpus. Each color indicates a different topic. The right most column specifies the topic assignments for the sentences. Topic labels: Topic 1 (green): Telecon cancellation, Topic 2 (magenta): TAG document, Topic 3 (blue): Responding to I18N.

Figure 2.2: Sample truncated blog conversation from our blog corpus. Each color indicates a different topic. The right most column (Topic) specifies the topic assignments for the sentences. The Fragment column specifies the fragments in the fragment quotation graph (see Section 2.3.1). Topic labels: Topic 1 (green): Free release of Daggerfall and reaction, Topic 2 (purple): Game contents or size, Topic 3 (orange): Bugs or faults, Topic 4 (magenta): Game design, Topic 5 (blue): Other gaming options, Topic 0 (red): 'OFF-TOPIC'.

For topic segmentation, we propose two novel unsupervised models that exploit, in a principled way, the fine-grained conversational structure beyond the lexical information.
We also propose a novel graph-theoretic supervised segmentationmodel that combines lexical, conversational, and topic features. For topic labeling,we propose to generate labels using an unsupervised extractive approach that iden-tifies the most representative phrases in the text. Specifically, we propose two novelrandom walk models that respectively capture two forms of conversation specificinformation: (i) the fact that the leading (i.e., first few) sentences in a topical clus-ter often carry the most informative clues, and (ii) the fine-grained conversationalstructure. To our knowledge, this is also the first comprehensive study to addressthe problem of topic segmentation and labeling in asynchronous conversation.Our framework was tested in a series of experiments. Experimental results inthe topic segmentation task show that the unsupervised segmentation models bene-fit when they consider the finer conversational structure of asynchronous conversa-tions. A comparison of the supervised segmentation model with the unsupervisedmodels reveals that the supervised method, by optimizing the relative weights ofthe features, outperforms the unsupervised ones even using only a few labeled con-versations. Remarkably, the segmentation decisions of the best unsupervised andthe supervised models are also highly correlated with human annotations. As forthe experiments on topic labeling, they show that the random walk model performsbetter when it exploits the conversation specific clues from the leading sentencesand the conversational structure. The evaluation of the end-to-end system alsoshows promising results in both corpora, when compared with human annotations.The remainder of this chapter is structured as follows. After discussing relatedwork in Section 2.2, we present our topic segmentation and labeling models inSection 2.3. We then describe our corpora and evaluation metrics in Section 2.4.The experiments and analysis are presented in Section 2.5. We summarize ourcontributions and consider directions for future work in Section 2.6.2.2 Related WorkThree research areas are directly related to our study: topic segmentation, topiclabeling, and extracting the conversation structure of asynchronous conversations.422.2.1 Topic SegmentationTopic segmentation has been extensively studied both for monologs and synchronousdialogs where the task is to divide the discourse into topically coherent sequentialsegments. Several unsupervised and supervised methods have been proposed. Seethe survey paper by Purver [165] for an excellent overview. The unsupervisedmodels primarily exploit the strong correlation between topic and lexical usage.These models can be categorized into two broad classes based on their underlyingintuitions: similarity-based models and probabilistic generative models.The key intuition behind similarity-based models is that sentences in a seg-ment are more lexically similar to each other than to sentences in the preceding orthe following segment. These approaches differ in: (1) how they measure lexicalsimilarity, and (2) how they use the similarity measures to perform segmentation.One such early approach is TextTiling [83], which still forms the baseline formany recent advancements. It operates in three steps: tokenization, lexical scoredetermination, and depth score computation. In the tokenization step, it formspseudo-sentences, which are fixed length sentences, each containing n stemmedwords. 
Then it considers blocks of k pseudo-sentences, and for each gap betweentwo consecutive pseudo-sentences it measures the cosine-based lexical similaritybetween the adjacent blocks by representing them as term frequency (TF) vectors.Finally, it measures the depth of the similarity valley for each gap, and assigns thetopic boundaries at the appropriate sentence gaps based on a threshold.When similarity is computed only on the basis of TF vectors, it can cause prob-lems because of sparseness, and because it treats the terms independently. Choiet al. [44] use Latent Semantic Analysis (LSA) to measure the sentence similarityand show that LSA-based similarity outperforms TF-based similarity. Unlike Text-Tiling, which uses a threshold to decide on topic boundaries, Choi et al. [44] usedivisive clustering to find topical segments. We use similarity measures based onboth TF and LSA as features in our supervised segmentation model.Another variation of the cohesion-based approach is LCSeg [74], which useslexical chains [143]. LCSeg first finds the lexical chains based on term repetitions,and weights those based on term frequency and chain length. The cosine similaritybetween two adjacent blocks? lexical chain vectors is then used as a measure of lex-43ical cohesion in a TextTiling-like algorithm to find the segments. LCSeg achievesresults comparable to the previous approaches (e.g., [44]) in both monolog (newsarticle) and synchronous dialog (meeting). Galley et al. [74] also propose a su-pervised model for segmenting meeting transcripts. They use a C4.5 probabilisticclassifier with lexical and conversational features and show that it outperforms theunsupervised method (LCSeg).Hsueh et al. [90] apply the models of [74] to both the manual transcripts andthe ASR (automatic speech recognizer) output of meetings. They perform segmen-tation at both coarse (topic) and fine (subtopic) levels. At the topic level, they getsimilar results as [74]? the supervised model outperforming LCSeg. However, atthe subtopic level, LCSeg surprisingly outperforms the supervised model indicat-ing that finer topic shifts are better characterized by lexical similarity alone.In our work, we initially show how LCSeg performs poorly, when applied tothe temporal ordering of asynchronous conversation. This is because, as we men-tioned earlier, topics in asynchronous conversations often do not change sequen-tially following the temporal order of the sentences. To address this, we propose anovel extension of LCSeg that leverages the fine conversational structure of asyn-chronous conversations. We also propose a novel supervised topic segmentationmodel for asynchronous conversation that achieves even higher segmentation ac-curacy by combining lexical, conversational, and topic features.Malioutov and Barzilay [121] use a minimum cut clustering model to seg-ment spoken lectures (i.e., spoken monolog). They form a weighted undirectedgraph where the nodes represent sentences and the weighted edges represent theTF.IDF-based cosine similarity between the sentences. Then the segmentation canbe solved as a graph partitioning problem with the assumption that the sentences ina segment should be similar, while sentences in different segments should be dis-similar. They optimize the normalized cut criterion [185] to extract the segments.In general, the minimization of the normalized cut criterion is NP-complete. How-ever, the sequentiality constraint of topic segmentation in monolog allows them tofind an exact solution in polynomial time. 
Their approach performs better than [44]in the corpus of spoken lectures. Since the sequentiality constraint does not holdin asynchronous conversation, we implement this model without this constraint byapproximating the solution, and compare it with our models.44Probabilistic generative models form another class of unsupervised segmen-tation models, which are based on the intuition that a discourse is a hidden sequenceof topics, each of which has its own characteristic word distribution. The distribu-tion changes with the change of a topic. Topic segmentation in these models is thetask to infer the most likely sequence of topics given the observed words. Variantsof Hidden Markov Models (HMMs) and Latent Dirichlet Allocations (LDAs)[26] are proposed for topic segmentation in monolog and synchronous dialog.Blei and Moreno [27] propose an aspect Hidden Markov Model (AHMM) toperform topic segmentation in written and spoken (i.e., transcribed) monologs, andshow that the AHMM model outperforms the HMM for this task. Purver et al.[167] propose a variant of LDA for segmenting meeting transcripts, and use thetop words in the topic-word distributions as topic labels. However, their approachdoes not outperform LCSeg. Eisenstein and Barzilay [62] propose another vari-ant of LDA by incorporating cue words into the (sequential) segmentation model.In a follow-up work, Eisenstein [61] proposes a constrained LDA model that usesmulti-scale lexical cohesion to perform hierarchical topic segmentation. Nguyenet al. [153] successfully incorporate speaker identity into a hierarchical nonpara-metric model for segmenting multi-party synchronous conversations (e.g., meet-ings, debates). In our work, we demonstrate how the general LDA model performsfor topic segmentation in asynchronous conversation and propose a novel extensionof LDA that exploits the fine conversational structure.2.2.2 Topic LabelingIn the first comprehensive approach to topic labeling, Mei et al. [137] proposemethods to label a multinomial topic model (e.g., the topic-word distributions re-turned by LDA). Crucial to their approach is how they measure the semantic sim-ilarity between a topic-word distribution and a candidate label extracted from thesame corpus. They perform this task by assuming another word distribution for thelabel and deriving the Kullback-Leibler (KL) divergence between the two distribu-tions. It turns out that this measure is equivalent to the weighted point-wise mutualinformation (PMI) of the topic-words with the candidate label, where the weightsare actually the probabilities in the topic-word distribution. They use Maximum45Marginal Relevance (MMR) [33] to select the labels which are relevant, but notredundant. When labeling multiple topic-word distributions, to find discriminativelabels, they adjust the semantic similarity scoring function such that a candidate la-bel which is also similar to other topics gets a lower score. In our work, we also useMMR to promote diversity in the labels for a topic. However, to get distinguishablelabels for different topical segments in a conversation, we rank the words so that ahigh scoring word in one topic should not have high scores in other topics.Recently, Lau et al. [111] propose methods to learn topic labels from Wikipediatitles. They use the top-10 words in each topic-word distribution to extract the can-didate labels from Wikipedia. Then they extract a number of features to representeach candidate label. 
The features are actually different metrics used in previous studies to measure the association between the topic words and the candidate label (e.g., PMI, t-test, chi-square test). They use Amazon Mechanical Turk to get human annotators to rank the top-10 candidate labels, and use the average scores (given by the annotators) to learn a (supervised) regression model. In a related work, Feng et al. [68] consider the task of classifying course discussions based on topics. They induce a topic profile (i.e., a list of candidate topics) from the course textbook and use a Rocchio-based classifier [123] to classify the discussion threads.

Zhao et al. [236] address the problem of topical keyphrase extraction from Twitter. Initially they use a modified Twitter-LDA model [237], which assumes a single topic assignment for a tweet, to discover the topics (i.e., topic-word distributions) in a corpus of Twitter conversations. Then, they use PageRank [155] to rank the words in each topic-word distribution. Finally, they perform a bi-gram test to generate keyphrases from the top ranked words in each topic.

While most of the above studies try to mine topics from the whole corpus, our problem is to find the topical segments and label them for a given conversation, where topics are closely related and distributional variations are subtle (e.g., "Game contents or size", "Game design" in Figure 2.2). Therefore, statistical association metrics like PMI, the t-test or the chi-square test may not be reliable in our case because of data scarcity. Also at the conversation level, the topics are so specific to a particular discussion (e.g., "Telecon cancellation", "TAG document", "Responding to I18N" in Figure 2.1) that exploiting external knowledge bases like Wikipedia as a source of candidate labels is not a reasonable option for us. In fact, none of the human-authored labels in our development set appears in Wikipedia as a title. Therefore, we propose to generate topic labels using a keyphrase extraction method that finds the most representative phrase(s) in the given text.

Several supervised and unsupervised methods have been proposed for keyphrase extraction (see [134] for a comprehensive overview). The supervised models (e.g., [91, 135]) follow the same two-stage framework. First, candidate keyphrases are extracted using n-gram sequences or a shallow parser (chunker). Second, a classifier filters the candidates. This strategy has been quite successful, but it is domain specific and labor intensive. Every new domain may require new annotations, which at times becomes too expensive and unrealistic. In contrast, our approach is to adopt an unsupervised paradigm, which is more robust across new domains, but still capable of achieving comparable performance to the supervised methods.

Mihalcea and Tarau [140] use a graph-based (unsupervised) random walk model to extract keyphrases from journal abstracts and achieve state-of-the-art performance [139].4 However, this model is generic and not designed to exploit properties of asynchronous conversations. We propose two novel random walk models to incorporate conversation specific information. Specifically, our models exploit information from two different sources: (i) from the leading sentences of the topical segments, and (ii) from the fine conversational structure of the conversation.

2.2.3 Conversational Structure Extraction

As mentioned earlier, there could be several simultaneous conversations (threads) going on in a multi-party interaction.
Recent work on synchronous conversationshas been focused on disentangling multi-party chat which has a linear structure.Shen et al. [184] and Wang and Oard [223] follow a similar approach. They applya single-pass clustering method which takes a new utterance and measures its dis-tance to the threads already detected, and based on a predefined threshold assignsthe new utterance either to the closest thread or to a newly created thread. To mea-sure the distance, they augment the traditional TF.IDF-based cosine similarity withadditional features computed from labeled data. Shen et al. [184] include two ad-ditional linguistic features, namely utterance type (e.g., declarative, interrogative)4This research was published before in [140].47and usage of personal pronouns, and Wang and Oard [223] include temporal (i.e.,time) and social (i.e., mentioning names) contexts in their measures of distance.Elsner and Charniak [63] propose a two-step method for disentangling multi-party chats. In the first step, a binary classifier determines how likely is that apair of utterances belong to the same conversation. In the second step, a (graph-based) correlation clustering method [15] finds the conversations by optimizinga criterion that tries to make sure that pairs of utterances likely to belong to thesame conversation, end up in the same conversation, while pairs of utterances thatare likely to be in different conversations, end up in different conversations. Theyuse three types of features in their classifier: chat-specific, discourse and content.In their follow-up work [64], they experiment with various local coherence models(e.g., entity grid [18]) for the disentanglement task and improve on their prior work.Mayfield et al. [131] propose methods to extract a hierarchical structure frommulti-party chat. The structure consists of information at three different levels:utterance, sequence and thread. They use a classifier to tag utterances as eithergiving or receiving information. For detecting the sequences and threads, they usea number of classifiers with constraints in an Integer Linear Programming frame-work. Eshghi and Healey [66] show the existence of fine-grained dialog contexts ina conversation. They present evidence that shows how these fine-grained dialoguecontexts are distinguished not in terms of topics but in terms of active participants.While disentanglement is necessary for many multi-party synchronous con-versations, asynchronous media like email and social media services (e.g., Gmail,Slashdot, Twitter) generally organize comments into tree-structured threads usingreply-to relations. In absence of the reply-to relations, automatic methods to un-cover the thread structure have also been proposed. Wang et al. [224] proposeunsupervised methods based on lexical similarity and proximity relations betweenmessages to uncover the hidden thread structure of newsgroup discussions. Morerecent work proposes supervised methods using classifiers and sequence labelers(e.g., Decision Trees, CRFs) with a number of useful features [11, 220].While the above approaches attempt to uncover the reply-to relations betweenmessages, the use of quotations in asynchronous conversations can express a con-versational structure that is finer grained and can be more informative than the onerevealed by reply-to relations [36]. For example, consider the relation between the48new text fragments and the quoted text fragments (i.e., marked with the quotationmark ?>?) in figures 2.1 and 2.2. 
The proximity between quoted and new text frag-ments can represent a conversational link between the two (i.e., they talk about thesame topic) that would not appear by only looking at the reply-to relations.Carenini et al. [34] previously presented a novel method to capture an emailconversation at this finer level by analyzing the embedded quotations in emails.A Fragment Quotation Graph (FQG) was formed, which was shown to be bene-ficial for email summarization [35] and dialog act modeling [95]. In this work,we generalize the FQG to any asynchronous conversation and demonstrate thattopic segmentation and labeling models can also benefit significantly from this fineconversational structure of asynchronous conversation.2.3 Topic Models for Asynchronous ConversationsDeveloping topic segmentation and labeling models for asynchronous conversa-tions is challenging partly because of the specific characteristics of these media.As mentioned earlier, unlike monolog (e.g., articles) and synchronous dialog (e.g.,meetings), topics in asynchronous conversations may not change in a sequentialway, with topics being interleaved. Furthermore, as can be noticed in figures 2.1and 2.2, writing style varies among participants, and many people tend to use in-formal, short and ungrammatical sentences, thus making the discourse much lessstructured. One aspect of asynchronous conversation that at first glance may appearto help topic modeling is that each message comes with a header. However, oftenheaders do not convey much topical information and sometimes they can even bemisleading. For example, in the blog conversation (Figure 2.2), participants keeptalking about different topics using the same title (i.e., ?nice nice nice?), whichdoes not convey any topic information. Arguably, all these unique properties ofasynchronous conversations limit the application of state-of-the-art techniques thathave been successful in monolog and synchronous dialog. Below, we first describethese techniques and then we present how we have extended them to effectivelydeal with asynchronous conversations.492.3.1 Topic Segmentation ModelsWe are the first to study the problem of topic segmentation in asynchronous conver-sation. Therefore, we first show how existing models, which are originally devel-oped for monolog and synchronous dialog, can be naively applied to asynchronousconversations. Then, by pointing out their limitations, we propose our novel topicsegmentation models for asynchronous conversations.Existing ModelsLCSeg [74] and LDA [26] are two state-of-the-art (unsupervised) models for topicsegmentation in monolog and synchronous dialog [165]. In the following, webriefly describe these models and how they can be directly applied to asynchronousconversations.Lexical Cohesion-based Segmenter (LCSeg)LCSeg is a sequential segmentation model originally developed for segmentingmeeting transcripts. It exploits the linguistic property called lexical cohesion, andassumes that topic changes are likely to occur where strong word repetitions startand end. It first computes lexical chains [143] for each non-stop word based onword repetitions.5 Then the chains are weighted according to their term frequencyand the chain length. The more populated and compact chains get higher scores.The algorithm then works with two adjacent analysis windows, each of a fixedsize k, which is empirically determined. 
At each sentence boundary, it computes the cosine similarity (or lexical cohesion function) between the two windows by representing each window as a vector of chain-scores of its words. Specifically, the lexical cohesion between windows $X$ and $Y$ is computed with:

\[
\mathrm{LexCoh}(X,Y) = \mathrm{cos\_sim}(X,Y) = \frac{\sum_{i=1}^{N} w_{i,X}\, w_{i,Y}}{\sqrt{\sum_{i=1}^{N} w_{i,X}^{2} \cdot \sum_{i=1}^{N} w_{i,Y}^{2}}} \tag{2.1}
\]

where $N$ is the number of chains and

\[
w_{i,\Omega} =
\begin{cases}
\mathrm{rank}(C_i) & \text{if chain } C_i \text{ overlaps } \Omega \in \{X, Y\}\\
0 & \text{otherwise}
\end{cases}
\]

(Footnote 5: One can also consider other lexical semantic relations (e.g., synonym, hypernym) in lexical chaining, but the best results account for only repetition.)

A sharp change at local minima in the resulting similarity (or lexical cohesion) curve signals a high probability of a topic boundary. The curve is smoothed, and for each local minimum a segmentation probability is computed based on its relative depth below its nearest peaks on either side. Points with the highest segmentation probability are then selected as hypothesized topic boundaries. This method is very similar to TextTiling [83] except that the similarity is computed based on the scores of the chains instead of term frequencies.

LCSeg can be directly applied to an asynchronous conversation by arranging components (e.g., emails, posts, tweets) based on their arrival time (i.e., their temporal order) and running the algorithm to get the topic boundaries.

Latent Dirichlet Allocation (LDA)

LDA is a generative model that relies on the fundamental idea that documents are admixtures of topics, and a topic is a multinomial distribution over words. It specifies the following distribution over words within a document:

\[
P(x_{ij}) = \sum_{k=1}^{K} P(x_{ij} \mid z_{ij} = k, b_k)\, P(z_{ij} = k \mid \pi_i) \tag{2.2}
\]

where $K$ is the number of topics, $P(x_{ij} \mid z_{ij} = k, b_k)$ is the probability of word $x_{ij}$ in document $i$ for topic $k$, and $P(z_{ij} = k \mid \pi_i)$ is the probability that the $k$th topic was sampled for the word token $x_{ij}$. We refer to the multinomial distributions $b_k$ and $\pi_i$ as topic-word and document-topic distributions, respectively. Figure 2.3 shows the resultant graphical model in plate notation for $N$ documents, $K$ topics and $M_i$ tokens in each document $i$. Note that $\alpha$ and $\beta$ are the standard Dirichlet priors on $\pi_i$ and $b_k$, respectively. Variational EM can be used to estimate $\pi$ and $b$ [26]. One can also use Gibbs sampling to directly estimate the posterior distribution over $z$, i.e., $P(z_{ij} = k \mid x_{ij})$; namely, the topic assignments for word tokens [199].

Figure 2.3: Graphical model for LDA in plate notation.

This model can be directly applied to an asynchronous conversation by considering each comment as a document. By assuming the words in a sentence occur independently, we can estimate the topic assignments for each sentence $s$ as:

\[
P(z_m = k \mid s) = \prod_{x_m \in s} P(z_m = k \mid x_m) \tag{2.3}
\]

Finally, the topic for sentence $s$ can be assigned by:

\[
k^{*} = \arg\max_{k} P(z_m = k \mid s) \tag{2.4}
\]
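Equations 2.3 and 2.4 amount to multiplying the word-level topic posteriors of a sentence and taking the argmax. The following is a minimal sketch of that step, assuming the word-level posteriors P(z = k | x) have already been estimated (e.g., by Gibbs sampling); the function name and data layout are illustrative and not taken from the thesis implementation.

```python
import numpy as np

def sentence_topic(word_posteriors):
    """Assign a topic to a sentence from word-level LDA posteriors.

    word_posteriors: list of length-K arrays, one per word token in the
    sentence, where entry k is P(z = k | word). Following Eqs. (2.3)-(2.4),
    the sentence-level score is the product of the word posteriors (computed
    in log space for numerical stability) and the sentence topic is the argmax.
    """
    log_scores = np.zeros(len(word_posteriors[0]))
    for p in word_posteriors:
        log_scores += np.log(np.asarray(p) + 1e-12)  # avoid log(0)
    return int(np.argmax(log_scores))

# Toy usage: a 3-topic model and a sentence with two word tokens.
posteriors = [np.array([0.7, 0.2, 0.1]), np.array([0.5, 0.4, 0.1])]
print(sentence_topic(posteriors))  # -> 0
```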
Limitations of Existing Models

The main limitation of the two models discussed above is that they make the bag-of-words (BOW) assumption, ignoring facts that are specific to a multi-party, asynchronous conversation. LCSeg considers only term frequency and how closely these terms occur in the temporal order of the sentences. If topics are interleaved and do not change sequentially in the temporal order, as is often the case in asynchronous conversations, then LCSeg would fail to find the topic segments correctly. On the other hand, the only information relevant to LDA is term frequency.

Several extensions of LDA over the BOW approach have been proposed. For example, Wallach [219] extends the model beyond BOW by considering n-gram sequences. Griffiths et al. [76] present an extension that is sensitive to word order and automatically learns the syntactic as well as semantic factors that guide word choice. Boyd-Graber and Blei [29] describe another extension to consider the syntax of the sentences.

We argue that these models are still inadequate for finding topical segments correctly in asynchronous conversations, especially when topics are closely related and their distributional variations are subtle (e.g., "Game contents or size" and "Game design"). To better identify the topics one needs to consider the features specific to asynchronous conversations (e.g., conversation structure, speaker, recipient). In the following, we propose our novel unsupervised and supervised topic segmentation models that incorporate these features.

Proposed Unsupervised Models

One of the most important indicators for topic segmentation in asynchronous conversation is its conversation structure. As can be seen in the examples (figures 2.1 and 2.2), participants often reply to a post and/or use quotations to talk about the same topic. Notice also that the use of quotations can express a conversational structure that is at a finer level of granularity than the one revealed by reply-to relations. In our corpora, we found an average quotation usage of 9.85 per blog conversation and 6.44 per email conversation. Therefore, we need to leverage this key information to get the best out of our models. Specifically, we need to capture the conversation structure at the quotation (i.e., text fragment) level, and to incorporate this structure into our segmentation models in a principled way.

In the following, we first describe how we can capture the conversation structure at the fragment level. Then we show how the unsupervised segmentation models LCSeg and LDA can be extended to take this conversation structure into account, generating two novel unsupervised models for topic segmentation.

Extracting Finer-level Conversation Structure

Since consecutive turns in asynchronous conversations can be far apart in time, when participants reply to a post or comment, a quoted version of the original message is often included (especially in email) by default in the draft reply in order to preserve context. Furthermore, people tend to break down the quoted message so that different questions, requests or claims can be dealt with separately. As a result, each message, unless it is at the beginning, will contain a mix of quoted and novel paragraphs (or fragments) that may well reflect a reply-to relationship between paragraphs that is at a finer level of granularity than the one explicitly recorded between messages. Carenini et al. [34] propose a method to capture this finer-level conversation structure in the form of a graph called Fragment Quotation Graph (FQG). Below, we demonstrate how to build a FQG for the sample blog conversation shown in Figure 2.2 following [34].

Figure 2.4: (a) The main Article and the Comments with the fragments for the example in Figure 2.2. Arrows indicate "reply-to" relations. (b) The corresponding Fragment Quotation Graph (FQG).

Figure 2.4(a) shows the same blog conversation, but for the sake of illustration, instead of showing the real content, we abbreviate it as a sequence of labels (e.g., a, b), each label corresponding to a text fragment (see the Fragment column in Figure 2.2). Building a FQG is a two-step process.
- Node creation: Initially, by processing the whole conversation, we identify the new and the quoted fragments of different depth levels. The depth level of a quoted fragment is determined by the number of quotation marks (e.g., >, >>). For instance, comment C1 contains a new fragment c and a quoted fragment b of depth level 1. C6 contains two new fragments k and l, and two quoted fragments i and j of depth level 1, and so on. Then in the second step, we compare the fragments with each other and based on their lexical overlap we find the distinct fragments. If necessary, we split the fragments in this step. For example, ef in C3 is divided into distinct fragments e and f when compared with the fragments of C4. This process gives 12 distinct fragments which constitute the nodes of the FQG shown in Figure 2.4(b).

- Edge creation: We create edges to represent likely replying relationships between fragments, assuming that any new fragment is a potential reply to its neighboring quotations of depth level 1. For example, for the fragments of C6 in Figure 2.4(a), we create two edges from k (i.e., (k,i), (k,j)) and one edge from l (i.e., (l,j)) in Figure 2.4(b). If a comment does not contain any quotation, then its fragments are linked to the new fragments of the comment to which it replies, capturing the original reply-to relation between comments.

Note that the FQG is only an approximation of the reply relations between fragments. In some cases, proximity may not indicate any connection and in other cases a connection can exist between fragments that are never adjacent in any comment. Furthermore, this process could lead to less accurate conversational structure when quotation marks (or cues) are not present. Nonetheless, Carenini et al. [35] showed that considering the FQG can be beneficial for email summarization, and recently, we showed its benefits in unsupervised dialog act modeling [95] (see Chapter 4). In this chapter, we show that topic segmentation (this section) and labeling (Section 2.3.2) models can also benefit significantly from this fine conversational structure. Minimizing the noise in FQGs is left as future work.
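The two-step FQG construction above can be sketched as follows, assuming each comment has already been split into fragments annotated with quotation depth and deduplicated, and that reply-to links between comments are known; the data layout and helper names are hypothetical, not the thesis code.

```python
def build_fqg(comments, reply_to):
    """Build a Fragment Quotation Graph (FQG).

    comments: dict comment_id -> ordered list of (depth, fragment_id) pairs,
              where depth 0 marks new text and depth >= 1 marks quoted text.
    reply_to: dict comment_id -> parent comment_id (or None).
    Returns (nodes, edges): edges link a new fragment to the quoted depth-1
    fragments adjacent to it, or to the new fragments of the parent comment
    when the comment contains no quotation.
    """
    nodes, edges = set(), set()
    for cid, frags in comments.items():
        nodes.update(f for _, f in frags)
        has_quote = any(d >= 1 for d, _ in frags)
        for i, (depth, frag) in enumerate(frags):
            if depth != 0:
                continue
            if has_quote:
                # link the new fragment to its neighbouring depth-1 quotes
                for j in (i - 1, i + 1):
                    if 0 <= j < len(frags) and frags[j][0] == 1:
                        edges.add((frag, frags[j][1]))
            elif reply_to.get(cid) is not None:
                # no quotation: fall back to the reply-to relation
                for d, parent_frag in comments[reply_to[cid]]:
                    if d == 0:
                        edges.add((frag, parent_frag))
    return nodes, edges

# Toy usage mirroring comment C6 of Figure 2.4: new fragments k and l with
# quoted fragments i and j of depth 1 around them.
comments = {"C5": [(0, "i"), (0, "j")],
            "C6": [(1, "i"), (0, "k"), (1, "j"), (0, "l")]}
print(sorted(build_fqg(comments, {"C5": None, "C6": None})[1]))
# -> [('k', 'i'), ('k', 'j'), ('l', 'j')]
```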
LCSeg with FQG (LCSeg+FQG)

If we examine the FQG carefully, the paths (considering the fragments of the first comment as root nodes) can be interpreted as subconversations, and topic shifts are likely to occur along the pathway as we walk down a path. We incorporate FQG into LCSeg in three steps.

- Path extraction: First, we extract all the paths of a FQG. For example, for the FQG in Figure 2.4(b), we extract the paths <a, j, l>, <b, c, e, g>, <b, c, d>, and so on.

- LCSeg application: We then run the LCSeg algorithm on each of the extracted paths separately and collect the segmentations. For example, when we apply LCSeg to the <b, c, e, g> and <b, c, d> paths in Figure 2.4(b) separately, we may get the following segmentations: <b, c | e, g> and <b, c | d>, where "|" denotes a segment boundary (Footnote 6: for convenience, we are showing the segmentations at the fragment level, but the segmentations are actually at the sentence level). Notice that a fragment can be in multiple paths (e.g., b, c), which will eventually cause its sentences to be in multiple segments. So, in the final step, we need a consolidation method.

- Consolidation: Our intuition is that sentences in a consolidated segment should appear together in a segment more often when LCSeg is applied in step 2, and if they do not appear together in any segment, they should at least be similar. To achieve this, we construct a weighted undirected graph G(V,E), where the nodes V represent the sentences and the edge weights w(x,y) represent the number of segments in which sentences x and y appear together; if x and y do not appear together in any segment, then their cosine similarity is used as edge weight. More formally,

\[
w(x,y) =
\begin{cases}
n, & \text{if } x \text{ and } y \text{ appear together in } n \text{ segments and } n > 0\\
\mathrm{cos\_sim}(x,y), & \text{if } n = 0
\end{cases}
\]

We measure the cosine similarity between sentences x and y as follows:

\[
\mathrm{cos\_sim}(x,y) = \frac{\sum_{w \in x,y} tf_{w,x} \cdot tf_{w,y}}{\sqrt{\sum_{x_i \in x} tf_{x_i,x}^{2}} \cdot \sqrt{\sum_{y_i \in y} tf_{y_i,y}^{2}}} \tag{2.5}
\]

where $tf_{a,s}$ denotes the term frequency of term $a$ in sentence $s$. The cosine similarity ($0 \leq \mathrm{cos\_sim}(x,y) \leq 1$) provides informative edge weights for the sentence pairs that are not directly connected by LCSeg segmentation decisions (Footnote 7: our work presented in [94] did not consider the cosine similarity when two sentences do not appear together in any of the segments; however, later we found out that including the cosine similarity offers more than 2% absolute gain in segmentation performance). The consolidation problem can be formulated as a k-way-mincut graph partitioning problem with the normalized cut (Ncut) criterion [185]:

\[
\mathrm{Ncut}_k(V) = \frac{\mathrm{cut}(A_1, V - A_1)}{\mathrm{assoc}(A_1, V)} + \frac{\mathrm{cut}(A_2, V - A_2)}{\mathrm{assoc}(A_2, V)} + \dots + \frac{\mathrm{cut}(A_k, V - A_k)}{\mathrm{assoc}(A_k, V)} \tag{2.6}
\]

where $A_1, A_2, \dots, A_k$ form a partition (i.e., disjoint sets of nodes) of the graph, and $V - A_k$ is the set difference between $V$ (i.e., the set of all nodes) and $A_k$. The $\mathrm{cut}(A,B)$ measures the total edge weight from the nodes in set A to the nodes in set B, and $\mathrm{assoc}(A,V)$ measures the total edge weight from the nodes in set A to all nodes in the graph. More formally:

\[
\mathrm{cut}(A,B) = \sum_{u \in A,\, v \in B} w(u,v) \tag{2.7}
\]

\[
\mathrm{assoc}(A,V) = \sum_{u \in A,\, t \in V} w(u,t) \tag{2.8}
\]

Note that the partitioning problem can be solved using any correlation clustering method (e.g., [15]). Previous work on graph-based topic segmentation [121] has shown that the Ncut criterion is more appropriate than just the cut criterion, which accounts only for total edge weight connecting A and B, and therefore favors cutting small sets of isolated nodes in the graph. However, solving Ncut is NP-complete. Hence, we approximate the solution following the method proposed in [185], which is time efficient and has been successfully applied to image segmentation in computer vision.

Notice that this approach makes a difference only if the FQG of the conversation contains more than one path. In fact, in our corpora we found an average number of paths of 7.12 and 16.43 per email and blog conversations, respectively.
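Before the normalized cut is applied, the consolidation step boils down to counting how often two sentences share a segment across the per-path LCSeg runs and falling back to cosine similarity otherwise. The following is a minimal sketch under the assumption that sentences are indexed globally and represented as term-frequency vectors; names and layout are illustrative, and the Ncut optimization itself (approximated following [185] in our work) is not shown here.

```python
import numpy as np

def consolidation_weights(path_segments, sent_vectors):
    """Edge weights for the LCSeg+FQG consolidation graph.

    path_segments: list of segmentations, one per FQG path; each segmentation
                   is a list of segments, and each segment is a list of global
                   sentence indices.
    sent_vectors:  (num_sentences x vocab) term-frequency matrix used for the
                   cosine fallback when two sentences never share a segment.
    Returns a symmetric weight matrix following the definition of w(x, y).
    """
    n = sent_vectors.shape[0]
    W = np.zeros((n, n))
    # count how many times two sentences end up in the same segment
    for segmentation in path_segments:
        for segment in segmentation:
            for a in segment:
                for b in segment:
                    if a != b:
                        W[a, b] += 1.0
    # cosine similarity as a fallback for pairs with zero co-segment count
    norms = np.linalg.norm(sent_vectors, axis=1) + 1e-12
    cos = (sent_vectors @ sent_vectors.T) / np.outer(norms, norms)
    W[W == 0] = cos[W == 0]
    np.fill_diagonal(W, 0.0)
    return W
```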
LDA with FQG (LDA+FQG)

A key advantage of probabilistic Bayesian models, such as LDA, is that they allow us to incorporate multiple knowledge sources in a coherent way in the form of priors (or regularizers). To incorporate FQG into LDA, we propose to regularize LDA so that two sentences in the same or adjacent fragments are likely to appear in the same topical cluster. The first step towards this aim is to regularize the topic-word distributions (i.e., $b$ in Figure 2.3) with a word network such that two connected words get similar topic distributions.

For now, assume that we are given a word network as an undirected graph G(V,E), with nodes V representing the words and the edges $(u,v) \in E$ representing the links between words $u$ and $v$. We want to regularize the topic-word distributions of LDA such that two connected words $u$ and $v$ in the word network have similar topic distributions (i.e., $b_k^{(u)} \approx b_k^{(v)}$ for $k = 1 \dots K$). The standard conjugate Dirichlet prior $\mathrm{Dir}(b_k \mid \beta)$, however, does not allow us to do that, because here all words share a common variance parameter, and are mutually independent except for the normalization constraint [141]. Recently, Andrzejewski et al. [6] describe a method to encode must-links and cannot-links between words using a Dirichlet Forest prior. A must-link between two words enforces the words to have similar distributions over topics. On the other hand, a cannot-link between two words enforces the words not to both have large probability within any topic, although it is allowed for one to have a large probability and the other small, or both small [6].

Our goal is just to encode the must-links. Therefore, we reimplemented their model with its capability of encoding just the (must-)links. Must-links between words such as (a,b), (b,c), or (x,y) in Figure 2.5(a) can be encoded into LDA using a Dirichlet Tree (DT) prior. Like the traditional Dirichlet, the DT prior is also a conjugate to the multinomial, but under a different parameterization. Instead of representing a multinomial sample as the outcome of a K-sided die, in the tree representation (e.g., Figure 2.5(b)), a sample (i.e., a leaf in the tree) is represented as the outcome of a finite stochastic process. The probability of a leaf (i.e., a word in our case) is the product of branch probabilities leading to that leaf. A DT prior is the distribution over leaf probabilities.

Let $\gamma_n$ be the edge weight leading into node $n$, $C(n)$ be the children of node $n$, $L$ be the leaves of the tree, $I$ be the internal nodes, and $L(n)$ be the leaves in the subtree under node $n$. We generate a sample $b_k$ from $\mathrm{DT}(\gamma)$ by drawing a multinomial at each internal node $i \in I$ from $\mathrm{Dir}(\gamma^{C(i)})$ (i.e., the edge weights from node $i$ to its children). The probability density function of $\mathrm{DT}(b_k \mid \gamma)$ is given by:

\[
\mathrm{DT}(b_k \mid \gamma) \propto \left( \prod_{l \in L} b_{kl}^{\gamma_l - 1} \right) \left( \prod_{i \in I} \Big( \sum_{j \in L(i)} b_{kj} \Big)^{\Delta(i)} \right) \tag{2.9}
\]

where $\Delta(i) = \gamma_i - \sum_{j \in C(i)} \gamma_j$, the difference between the in-degree and out-degree of an internal node $i$. Notice that when $\Delta(i) = 0$ for all $i \in I$, the Dirichlet tree distribution reduces to the standard Dirichlet distribution.

Figure 2.5: (a) Sample word network, (b) A Dirichlet Tree (DT) built from such a word network.

Suppose we are given the word network as shown in Figure 2.5(a). The network can be decomposed into a collection of chains (e.g., (a,b,c), (p), and (x,y)). For each chain containing multiple elements (e.g., (a,b,c), (x,y)), there is a subtree in the DT (Figure 2.5(b)), with one internal node (blank in the figure) and the words of the chain as its leaves. The weight from the internal node to each of its leaves is $\eta\beta$, where $\eta$ is the regularization strength and $\beta$ is the parameter of the standard symmetric Dirichlet prior on $b_k$. The root node of the DT then connects to the internal nodes with weight $|L(i)|\beta$. The leaves (words) for the single-element chains (e.g., (p)) are then connected to the root of the DT directly with weight $\beta$. Notice that when $\eta = 1$, $\Delta(i) = 0$, and it reduces to the standard LDA (i.e., no regularization). By tuning $\eta$ we control the strength of the regularization.

At this point what is left to be explained is how we construct the word network. To regularize LDA with a FQG, we construct a word network where a word is linked to the words in the same or adjacent fragments in the FQG. Specifically, if word $w_i \in frag_x$ and word $w_j \in frag_y$ and $w_i \neq w_j$, we create a link $(w_i, w_j)$ if $x = y$ or $(x,y) \in E_{fqg}$, where $E_{fqg}$ is the set of edges in the FQG. This implicitly compels two sentences in the same or adjacent fragments to have similar topic distributions, and to appear in the same topical segment.
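A minimal sketch of the word-network construction just described, assuming each fragment has already been reduced to its set of content words; identifiers are illustrative. The resulting must-links would then be decomposed into chains and encoded as a Dirichlet Tree prior as explained above.

```python
from itertools import combinations

def word_network_from_fqg(fragment_words, fqg_edges):
    """Build the must-link word network used to regularize LDA.

    fragment_words: dict fragment_id -> set of (non-stop) words in the fragment.
    fqg_edges: iterable of (x, y) fragment-id pairs from the FQG.
    A link (w_i, w_j) is created when two distinct words occur in the same
    fragment or in adjacent fragments of the FQG.
    """
    links = set()
    # words co-occurring within the same fragment
    for words in fragment_words.values():
        for wi, wj in combinations(sorted(words), 2):
            links.add((wi, wj))
    # words in adjacent fragments of the FQG
    for x, y in fqg_edges:
        for wi in fragment_words.get(x, ()):
            for wj in fragment_words.get(y, ()):
                if wi != wj:
                    links.add(tuple(sorted((wi, wj))))
    return links

# Toy usage with two adjacent fragments.
frags = {"k": {"thesis", "game"}, "j": {"dungeon", "game"}}
print(sorted(word_network_from_fqg(frags, [("k", "j")])))
# -> [('dungeon', 'game'), ('dungeon', 'thesis'), ('game', 'thesis')]
```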
This implicitly compels59two sentences in the same or adjacent fragments to have similar topic distributions,and to appear in the same topical segment.Proposed Supervised ModelAlthough the unsupervised models discussed in the previous section have the keyadvantage of not requiring any labeled data, they can be limited in their abilityto learn domain-specific knowledge from a possibly large and diverse set of fea-tures [62]. Beside discourse cohesion, which captures changes in content, thereare other important domain-specific distinctive features which signal topic change.For example, discourse markers (or cue phrases) (e.g., okay, anyway, now, so) andprosodic cues (e.g., longer pauses) directly provide clues about topic change, andhave been shown to be useful features for topic segmentation in monolog and syn-chronous dialog [74, 159]. We hypothesize that asynchronous conversations canalso feature their own distinctive characteristics for topic shifts. For example, fea-tures like sender and recipient are arguably useful for segmenting asynchronousconversations, as different participants can be more or less active during the dis-cussion of different topics. Therefore, as a next step to build an even more accuratetopic segmentation model for asynchronous conversations, we propose to combinedifferent sources of possibly useful information in a principled way.The supervised framework serves as a viable option to combine a large numberof features and optimize their relative weights for decision making, but relies onlabeled data for training. The amount of labeled data required to achieve an accept-able performance is always an important factor to consider for choosing supervisedvs. unsupervised. In this work, we propose a supervised topic segmentation modelthat outperforms all the unsupervised models, even when it is trained on a smallnumber of labelled conversations.Our supervised model is built on the graph-theoretic framework which hasbeen used in many NLP tasks, including coreference resolution [190] and chatdisentanglement [63]. This method works in two steps.? Classification: A binary classifier which is trained on a labeled dataset markseach pair of sentences of a conversation as same or different topics.? Graph partitioning: A weighted undirected graph G = (V,E) is formed,60where the nodes V represent the sentences in the conversation and the edge-weights w(x,y) denote the probability (given by the classifier) of the twosentences x and y to appear in the same topic. Then an optimal partition isextracted from the graph.Sentence pair classificationThe classifier?s accuracy in deciding whether a pair of sentences x and y is in thesame or different topics is crucial for the model?s performance. Note that sinceeach sentence pair of a conversation defines a data point, a conversation containingn sentences produces 1 + 2 + . . .+ (n? 
Sentence pair classification

The classifier's accuracy in deciding whether a pair of sentences x and y is in the same or different topics is crucial for the model's performance. Note that since each sentence pair of a conversation defines a data point, a conversation containing $n$ sentences produces $1 + 2 + \dots + (n-1) = \frac{n(n-1)}{2} = O(n^2)$ training examples. Therefore, a training dataset containing $m$ conversations produces $\sum_{i=1}^{m} \frac{n_i(n_i - 1)}{2}$ training examples, where $n_i$ is the number of sentences in the $i$th conversation. This quadratic expansion of training examples enables the classifier to achieve its best classification accuracy with only a few labeled conversations.

By pairing up the sentences of each email conversation in our email corpus, we got a total of 14,528 data points of which 58.8% are in the same class (i.e., same is the most likely in email), and by pairing up the sentences of each blog conversation in our blog corpus, we got a total of 572,772 data points of which 86.3% are in the different class (i.e., different is the most likely in blog) (Footnote 8: see Section 2.4 for a detailed description of our corpora; the class labels are produced by taking the maximum vote of the three annotators). To select the best classifier, we experimented with a variety of classifiers with the full feature set (Table 2.2). Table 2.1 shows the performance of the classifiers averaged over a leave-one-out procedure, i.e., for a corpus containing m conversations, train on m-1 conversations and test on the rest.

Classifier        Type             Regularizer   Accuracy Blog (Train / Test)   Accuracy Email (Train / Test)
KNN               non-parametric   -             62.7% / 61.4%                  54.6% / 55.2%
LR                parametric       l2            90.8% / 91.9%                  71.7% / 72.5%
LR                parametric       l1            86.8% / 87.6%                  69.9% / 67.7%
RMLR (rbf)        non-parametric   l2            91.7% / 82.0%                  91.1% / 62.1%
SVM (lin)         parametric       -             76.6% / 78.7%                  68.3% / 69.6%
SVM (rbf)         non-parametric   -             80.5% / 77.9%                  75.9% / 67.7%
Majority class    -                -             86.3% (different)              58.8% (same)

Table 2.1: Performance of the classifiers using the full feature set (Table 2.2). For each training set, the regularizer strength (or C in SVMs) was learned by 10-fold cross validation.

K-Nearest Neighbor (KNN) performs very poorly. Logistic Regression (LR) with l2 regularization delivers the highest accuracy on both datasets. Support Vector Machines (SVMs) [49] with linear and rbf kernels perform reasonably well, but not as well as LR. The Ridged Multinomial Logistic Regression (RMLR) [109], a kernelized LR, extremely overfits the data. We opted for the LR with l2 regularization because it not only delivers the best performance in terms of accuracy, but it is also very efficient. The limited memory BFGS (L-BFGS) fitting algorithm used in LR is efficient in terms of both time (quadratic convergence rate; fastest among the listed models) and space (O(mD), where m is the memory parameter of L-BFGS and D is the number of features).
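A minimal sketch of the sentence-pair expansion and the l2-regularized logistic regression classifier, using scikit-learn as a stand-in implementation; the feature function is left abstract since it bundles the lexical, conversational and topic features of Table 2.2.

```python
from itertools import combinations
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_examples(sentences, topic_ids, pair_features):
    """Expand one labeled conversation into sentence-pair examples.

    sentences:     list of sentences in the conversation.
    topic_ids:     gold topic id for each sentence.
    pair_features: function (i, j, sentences) -> feature vector for the pair,
                   a stand-in for the features of Table 2.2.
    A conversation with n sentences yields n(n-1)/2 examples.
    """
    X, y = [], []
    for i, j in combinations(range(len(sentences)), 2):
        X.append(pair_features(i, j, sentences))
        y.append(1 if topic_ids[i] == topic_ids[j] else 0)  # same vs. different
    return np.array(X), np.array(y)

def train_pair_classifier(X, y, C=1.0):
    # l2-regularized logistic regression; scikit-learn's default lbfgs solver
    # matches the L-BFGS fitting mentioned above.
    return LogisticRegression(penalty="l2", C=C, max_iter=1000).fit(X, y)
```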
Table 2.2 summarizes the full feature set and the mean test set accuracy (using leave-one-out) achieved with different types of features in our LR classifier (Footnote 9: we believe that a discourse feature like co-reference resolution would be helpful for topic segmentation; however, to our knowledge, there is no publicly available co-reference resolution system for asynchronous conversation, and the existing co-reference resolution systems developed for monolog are also limited in terms of accuracy).

Lexical features encode similarity between two sentences x and y based on their raw contents. Term frequency-based similarity is a widely used feature in previous work (e.g., TextTiling [83]). We compute this feature by considering two analysis windows, each of fixed size k. Let X be the window including sentence x and the preceding k-1 sentences, and Y be the window including sentence y and the following k-1 sentences. We measure the cosine similarity between the two windows by representing them as vectors of TF.IDF [178] values of the words. Another important domain specific feature that proved to be useful in previous research (e.g., [74]) is cue words (or discourse markers) that signal the presence of a topic boundary (e.g., "coming up", "joining us" in news). Since our work concerns conversations (not monologs), we adopt the cue word list derived automatically from a meeting corpus by Galley et al. [74]. If y answers or greets x then it is likely that they are in the same topic. Therefore, we use the Question Answer (QA) pairs and greeting words as two other lexical features.

Lexical — Blog: Accuracy 86.8, Precision 62.4, Recall 4.6; Email: Accuracy 59.6, Precision 59.7, Recall 99.8
  TFIDF1      TF.IDF-based similarity between x and y with window size k=1.
  TFIDF2      TF.IDF-based similarity between x and y with window size k=2.
  Cue Words   Either x or y contains a cue word.
  QA          x asks a question explicitly using "?" and y answers it using any of (yes, yeah, okay, ok, no, nope).
  Greet       Either x or y has a greeting word (hi, hello, thanks, thx, tnx, thank).

Conversation — Blog: Accuracy 88.2, Precision 81.6, Recall 20.5; Email: Accuracy 65.3, Precision 66.7, Recall 85.1
  Gap         The gap between y and x in number of sentence(s).
  Speaker     x and y have the same sender (yes or no).
  FQG1        Distance between x and y in the FQG in terms of fragment id (i.e., |frag_id(y) - frag_id(x)|).
  FQG2        Distance between x and y in the FQG in terms of number of edges.
  FQG3        Distance between x and y in the FQG in number of edges, but this time considering it as an undirected graph.
  Same/Reply  Whether x and y are in the same comment or one is a reply to the other.
  Name        x mentions y's speaker or vice versa.

Topic — Blog: Accuracy 89.3, Precision 86.4, Recall 17.3; Email: Accuracy 67.5, Precision 68.9, Recall 76.8
  LSA1        LSA-based similarity between x and y with window size k=1.
  LSA2        LSA-based similarity between x and y with window size k=2.
  LDA         LDA segmentation decision on x and y (same or different).
  LDA+FQG     LDA+FQG segmentation decision on x and y (same or different).
  LCSeg       LCSeg segmentation decision on x and y (same or different).
  LCSeg+FQG   LCSeg+FQG segmentation decision on x and y (same or different).
  LexCoh      Lexical cohesion between x and y.

Combined — Blog: Accuracy 91.9, Precision 78.8, Recall 25.8; Email: Accuracy 72.5, Precision 70.4, Recall 81.5

Table 2.2: Features with performance on test sets (using leave-one-out).

Conversational features capture conversational properties of an asynchronous conversation. Time gap and speaker are commonly used features for segmenting synchronous conversation [63, 74]. We encode similar information in asynchronous media by counting the number of sentences between x and y (in their temporal order) as the gap, and their senders as the speakers. The strongest baseline Speaker (see Section 2.5.1) also proves its effectiveness in asynchronous domains. The results in Section 2.5.1 also suggest that fine conversational structure in the form of FQG can be beneficial when it is incorporated into the unsupervised segmentation models. We encode this valuable information into our supervised segmentation model by computing three distance features on the FQG: FQG1, FQG2 and FQG3 (a sketch is given below). State-of-the-art email and blog systems use the reply-to relation to group comments into threads. If y's comment is the same as or a reply to x's comment, then it is likely that the two sentences talk about the same topic. Participants sometimes mention each other's name in multi-party conversations to make disentanglement easier [63]. We also use this as a feature in our supervised segmentation model.
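The three FQG distance features could be computed as in the following sketch, which assumes each sentence has been mapped to its fragment and that the FQG is available as adjacency lists; the encoding of unreachable pairs is an assumption, as the exact convention is not specified here.

```python
from collections import deque

def hop_distance(adj, src, dst):
    """Breadth-first number-of-edges distance in a fragment graph; None if unreachable."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == dst:
            return d
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None

def fqg_features(frag_of_x, frag_of_y, frag_index, directed_adj, undirected_adj):
    """Illustrative FQG distance features for a sentence pair.

    FQG1: difference of fragment ids, FQG2: edge distance in the directed FQG,
    FQG3: edge distance treating the FQG as undirected. `frag_index` maps a
    fragment to its integer id; unreachable pairs are encoded as -1 here,
    one plausible convention.
    """
    fqg1 = abs(frag_index[frag_of_y] - frag_index[frag_of_x])
    fqg2 = hop_distance(directed_adj, frag_of_x, frag_of_y)
    fqg3 = hop_distance(undirected_adj, frag_of_x, frag_of_y)
    return [fqg1,
            fqg2 if fqg2 is not None else -1,
            fqg3 if fqg3 is not None else -1]
```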
Topic features are complex and encode topic information from existing segmentation models. Choi et al. [44] used Latent Semantic Analysis (LSA) to measure the similarity between two sentences and showed that the LSA-based similarity yields better results than the direct TF.IDF-based similarity since it surmounts the problems of synonymy (e.g., car, auto) and polysemy (e.g., money bank, river bank). To compute LSA, we first construct a word-document matrix W for a conversation, where $W_{i,j}$ = the frequency of word $i$ in comment $j$ $\times$ the IDF score of word $i$. We perform truncated Singular Value Decomposition (SVD) of W: $W \approx U_k \Sigma_k V_k^T$, and represent each word $i$ as a $k$-dimensional vector $\mathbf{v}_i^k$ (Footnote 10: the value of $k$ was empirically set to one fourth of the number of comments based on our development set). Each sentence is then represented by the weighted sum of its word vectors. Formally, the LSA representation for sentence $s$ is $\mathbf{v}_s = \sum_{i \in s} tf_{s_i} \cdot \mathbf{v}_i^k$, where $tf_{s_i}$ = the term frequency of word $i$ in sentence $s$. Then just like the TF.IDF-based similarity, we compute the LSA-based similarity between sentences x and y, but this time by representing the corresponding windows (i.e., X and Y) as LSA vectors.

The segmentation decisions of the LDA, LDA+FQG, LCSeg and LCSeg+FQG models described in the previous section are also encoded as topic features (Footnote 11: the work presented in [96] did not include the segmentation decisions of the LDA+FQG and LCSeg+FQG models as features; however, including these features improves both classification accuracy and segmentation accuracy). As described in Section 2.3.1, LCSeg computes a lexical cohesion (LexCoh) function between two consecutive windows based on the scores of the lexical chains. Galley et al. [74] show a significant improvement when this function is used as a feature in the supervised (sequential) topic segmentation model for meetings. However, since our problem of topic segmentation is not sequential, we want to compute this function for any two given windows X and Y (not necessarily consecutive). To do that, we first extract the lexical chains with their scores and spans (i.e., beginning and end sentence numbers) for the conversation. The lexical cohesion function is then computed with the method described in Equation 2.1.

We describe our classifier's performance in terms of raw accuracy (correct decisions/total), precision and recall of the same class for different types of features averaged over a leave-one-out procedure (Table 2.2). Among the feature types, topic features yield the highest accuracy and same-class precision in both corpora (p < 0.01). Conversational features have also proved to be important and achieve higher accuracy than lexical features (p < 0.01). Lexical features have poor accuracy, only slightly higher than the majority baseline that always picks the most likely class. However, when we combine all the features, we get the best performance (p < 0.005). These results demonstrate the importance of topical and conversational features beyond the lexical features used by the existing segmentation models. When we compare the performance on the two corpora, we notice that while in blog the accuracy and the same-class precision are higher than in email, the same-class recall is much lower. Although this is reasonable given the class distributions in the two corpora (i.e., 13.7% and 58.8% of examples are in the same class in blog and email, respectively), surprisingly, when we tried to deal with this problem by applying the bagging technique [30], the performance does not improve significantly.
Note that some of the classification errors that occur in the sentence-pair classification phase are recovered in the graph partitioning step (see below). The reason is that the incorrect decisions will be outvoted by the nearby sentences that are clustered correctly.

Figure 2.6: Relative importance of the features averaged over leave-one-out.

We further analyze the contribution of individual features. Figure 2.6 shows the relative importance of the features based on the absolute values of their coefficients in our LR classifier. The segmentation decision of LCSeg+FQG is the most important feature in both domains. The Same/Reply feature is also an effective feature, especially in blog. In blog, the Speaker feature also plays an important role. The FQG2 feature (i.e., distance in number of edges in the directed FQG) is also effective in both domains, especially in email. The other two features on FQG (i.e., FQG1, FQG3) are also very relevant in email.

Finally, in order to determine how many annotated conversations we need to achieve the best segmentation performance, Figure 2.7 shows the classification error rate (incorrect decisions/total), tested on 5 randomly selected conversations and trained on an increasing number of randomly added conversations. Our classifier appears to achieve its best performance with a small number of labeled conversations. For blog, the error rate flattens with only 8 conversations, while for email, this happens with about 15. This is not surprising since blog conversations are much longer (average of 220.55 sentences) than email conversations (average of 26.3 sentences), generating a similar number of training examples with only a few conversations (a conversation with $n$ sentences produces $O(n^2)$ training examples).

Figure 2.7: Error rate vs. number of training conversations.

Graph partitioning

Given a weighted undirected graph G = (V,E), where the nodes V represent the sentences and the edge weights w(x,y) denote the probability (given by our classifier) of the two sentences x and y to appear in the same topic, we again formulate the segmentation task as a k-way-mincut graph partitioning problem with the intuition that sentences in a segment should discuss the same topic, while sentences in different segments should discuss different topics. We optimize the normalized cut criterion (i.e., Equation 2.6) to extract an optimal partition, as was done before for consolidating various segments in LCSeg+FQG.

2.3.2 Topic Labeling Models

Now that we have methods to automatically identify the topical segments in an asynchronous conversation, the next step in the pipeline is to generate one or more informative descriptions or labels for each segment to facilitate interpretations of the topics. We are the first to address this problem in asynchronous conversation.

Ideally, a topic label should be meaningful, semantically similar to the underlying topic, general and discriminative (when there are multiple topics) [137]. Traditionally, the top k words in a multinomial topic model like LDA are used to describe a topic.
However, as pointed out by Mei et al. [137], at the word level, topic labels may become too generic and impose cognitive difficulties on a user to interpret the meaning of the topic by associating the words together. For example, in Figure 2.2, without reading the text, from the words {release, free, reaction, Daggerfall}, it may be very difficult for a user to understand that the topic is about Daggerfall's free release and people's reaction to it. On the other hand, if the labels are expressed at the sentence level, they may become too specific to cover the whole theme of the topic [137]. Based on these observations, recent studies [111, 137] advocate for phrase-level topic labels, which are also consistent with the monolog corpora built as a part of the Topic Detection and Tracking (TDT) project (Footnote 12: http://projects.ldc.upenn.edu/TDT/). Note that we also observe a preference for phrase-level labels within our own asynchronous conversational corpora, in which annotators without specific instructions spontaneously generated topic labels at the phrase level. Considering all this, we treat phrase-level as our target level of granularity for a topic label.

Our problem is no different from the problem of keyphrase indexing [134] where the task is to find a set of keyphrases either from a given text or from a controlled vocabulary (i.e., domain-specific terminologies) to describe the topics covered in the text. In our setting, we do not have such a controlled vocabulary. Furthermore, exploiting generic knowledge bases like Wikipedia as a source of devising such a controlled vocabulary [134] is not a viable option in our case since the topics are very specific to a particular discussion (e.g., "Free release of Daggerfall and reaction", "Game contents or size" in Figure 2.2). In fact, none of the human-authored labels in our development set appears verbatim in Wikipedia. We propose to generate topic labels using a keyphrase extraction approach that identifies the most representative phrase(s) in the given text. We adapt a graph-based unsupervised ranking framework, which is domain independent, and without relying on any labeled data achieves state-of-the-art performance on keyphrase extraction [139]. Figure 2.8 shows our topic labeling framework. Given a (topically) segmented conversation, our system generates k keyphrases to describe each topic in the conversation. Below we discuss the different components of the system.

Figure 2.8: Topic labeling framework for asynchronous conversation.

Preprocessing

In the preprocessing step, we tokenize the text and apply a syntactic filter to select the words of a certain part-of-speech (POS). We use the state-of-the-art Illinois tagger (Footnote 13: available at http://cogcomp.cs.illinois.edu/page/software) to tokenize the text and annotate the tokens with their POS tags. We experimented with five different syntactic filters. They select (i) nouns, (ii) nouns and adjectives, (iii) nouns, adjectives and verbs, (iv) nouns, adjectives, verbs and adverbs, and (v) all words, respectively. The filters also exclude stopwords. The second filter, that selects only nouns and adjectives, achieves the best performance on our development set, which is also consistent with the finding of [140]. Therefore, this syntactic filter is used in our system.
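A minimal sketch of the preprocessing filter, using NLTK as a stand-in for the Illinois tagger used in our system; it implements option (ii), keeping only nouns and adjectives and discarding stopwords.

```python
import nltk
from nltk.corpus import stopwords

# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger");
#           nltk.download("stopwords")

def candidate_words(text):
    """Syntactic filter of the preprocessing step: keep nouns and adjectives."""
    stops = set(stopwords.words("english"))
    tokens = nltk.pos_tag(nltk.word_tokenize(text.lower()))
    return [w for w, tag in tokens
            if tag.startswith(("NN", "JJ")) and w.isalpha() and w not in stops]

print(candidate_words("Bethesda released the huge game Daggerfall for free."))
# e.g. ['bethesda', 'huge', 'game', 'daggerfall', 'free']
```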
Word Ranking

The words selected in the preprocessing step correspond to the nodes in our word graph. A direct application of the ranking method described in [140] would define the edges based on the co-occurrence relation between the respective words, and then apply the PageRank [155] algorithm to rank the nodes. We argue that co-occurrence relations may be insufficient for finding topic labels in asynchronous conversations. To better identify the labels, one needs to consider aspects that are specific to asynchronous conversations. In particular, we propose to incorporate two different forms of conversation-specific information into our graph-based ranking model: (1) informative clues from the leading sentences of a topical segment, and (2) the fine-grained conversational structure (i.e., the Fragment Quotation Graph (FQG)) of the conversation. In the following, we describe these two novel extensions in turn.

Incorporating Information from the Leading Sentences

In general, the leading sentences of a topic segment carry informative clues for the topic labels, since this is where the speakers will most likely try to signal a topic shift and introduce the new topic. Our key observation is that this is especially true for asynchronous conversations, in which topics are interleaved and less structured. For example, in Figure 2.2, notice that in almost every case, the leading sentences of the topical segments cover the information conveyed by the respective labels. This property is further confirmed in Figure 2.9, which shows the percentage of non-stopwords in the human-authored labels that appear in the leading sentences of the segments in our development set. The first sentence covers about 29% and 38% of the words in the gold labels in the blog and email corpora, respectively. The first two sentences cover around 35% and 45% of the words in the gold labels in blog and email, respectively. When we consider the first three sentences, the coverage increases to 39% and 49% for blog and email, respectively. The increment is smaller as we add more sentences.

Figure 2.9: Percentage of words in the human-authored labels appearing in leading sentences of the topical segments.

To leverage this useful information in our ranking model, we propose the following biased random walk model, where P(w|U_k), the score of a word w given a set of leading sentences U_k in topic segment k, is expressed as a convex combination of its relevance to the leading sentences U_k (i.e., \phi(w|U_k)) and its relatedness with other words in the segment:

    P(w|U_k) = \lambda \frac{\phi(w|U_k)}{\sum_{z \in C_k} \phi(z|U_k)} + (1 - \lambda) \sum_{y \in C_k} \frac{e(y,w)}{\sum_{z \in C_k} e(y,z)} P(y|U_k)        (2.10)

where the value of \lambda (0 \le \lambda \le 1), which we call the bias, is a trade-off between the two components and should be set empirically. For higher values of \lambda, we give more weight to the word's relevance to the leading sentences compared to its relatedness with other words in the segment. Here, C_k is the set of words in segment k, which represents the nodes in the graph. The denominators in both components are for normalization. We define \phi(w|U_k) as:

    \phi(w|U_k) = \log(tf_w^{U_k} + 1) \cdot \log(tf_w^{k} + 1)        (2.11)

where tf_w^{U_k} and tf_w^{k} are the number of times word w appears in U_k and in segment k, respectively. A similar model has proven to be successful in measuring the relevance of a sentence to a query in query-based sentence retrieval [4].
Recall that when there are multiple topics in a conversation, a requirement for the topic labels is that labels of different topics should be discriminative (or distinguishable) [137]. This implicitly indicates that a high-scoring word in one segment should not have high scores in other segments of the conversation. Keeping this in mind, we define the (undirected) edge weights e(y,w) in Equation 2.10 as follows:

    e(y,w) = tf_{w,y}^{k} \cdot \log \frac{K}{0.5 + tf_{w,y}^{\bar{k}}}        (2.12)

where K denotes the number of topics (or topic segments) in the conversation, and tf_{w,y}^{k} and tf_{w,y}^{\bar{k}} are the number of times words w and y co-occur in a window of size s in segment k and in the segments other than k in the conversation, respectively. Notice that this measure is similar in spirit to the TF.IDF metric [178], but it is at the co-occurrence level. The co-occurrence relationship between words captures syntactic dependencies and lexical cohesion in a text, and is also used in [140]. (Mihalcea and Tarau [140] use an unweighted graph for keyphrase extraction; in our experiments, we get better results with a weighted graph.)

Equation 2.10 above can be written in matrix notation as:

    \pi = [\lambda Q + (1 - \lambda) R]^{T} \pi = A^{T} \pi,        (2.13)

where Q and R are square matrices such that Q_{i,j} = \frac{\phi(j|U_k)}{\sum_{z \in C_k} \phi(z|U_k)} for all i, and R_{i,j} = \frac{e(i,j)}{\sum_{j \in C_k} e(i,j)}, respectively. Notice that A is a stochastic matrix (i.e., all rows add up to 1); therefore, it can be treated as the transition matrix of a Markov chain. If we assume each word is a state in a Markov chain, then A_{i,j} specifies the transition probability from state i to state j in the corresponding Markov chain. Another interpretation of A can be given by a biased random walk on the graph. Imagine performing a random walk on the graph, where at every time step, with probability \lambda, a transition is made to the words that are relevant to the leading sentences, and with probability 1 - \lambda, a transition is made to the related words in the segment. Every transition is weighted according to the corresponding elements of Q and R. The vector \pi we are looking for is the stationary distribution of this Markov chain and is also the (normalized) eigenvector of A for the eigenvalue 1. A Markov chain will have a unique stationary distribution if it is ergodic [182]. We can ensure that the Markov chain has this property by reserving a small probability for jumping to any other state from the current state [155]. (For simplicity, we do not make this random jump component explicit in our equations, but readers should keep in mind that all the transition matrices described in this chapter contain this component.) For larger matrices, \pi can be efficiently computed by an iterative method called the power method.
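To make the matrix formulation concrete, the following sketch builds Q and R from precomputed relevance scores \phi(w|U_k) and edge weights e(y, w) (passed in as hypothetical inputs) and runs the power method, adding a small uniform jump probability for ergodicity. It is a minimal illustration of Equations 2.10-2.13 under these assumptions, not the actual implementation.

```python
# Sketch of the biased random walk (Equations 2.10-2.13) via the power method.
# `phi` is a vector of relevance scores phi(w|U_k) and `e` an n x n matrix of
# edge weights e(y, w); both are assumed precomputed for one topic segment.
import numpy as np

def biased_random_walk(phi, e, lam=0.85, jump=0.05, iters=100, tol=1e-8):
    n = len(phi)
    # Q: every row is the normalized relevance-to-leading-sentences distribution.
    q_row = phi / phi.sum()
    Q = np.tile(q_row, (n, 1))
    # R: row-normalized co-occurrence weights.
    R = e / e.sum(axis=1, keepdims=True)
    A = lam * Q + (1.0 - lam) * R
    # A small uniform jump keeps the chain ergodic (cf. PageRank's damping).
    A = (1.0 - jump) * A + jump / n
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        new_pi = A.T @ pi
        if np.abs(new_pi - pi).sum() < tol:
            pi = new_pi
            break
        pi = new_pi
    return pi / pi.sum()

# Toy example with 4 words; word 0 is strongly tied to the leading sentences.
phi = np.array([2.0, 0.5, 0.5, 0.2])
e = np.array([[0.0, 1.0, 1.0, 0.2],
              [1.0, 0.0, 2.0, 0.1],
              [1.0, 2.0, 0.0, 0.3],
              [0.2, 0.1, 0.3, 0.0]])
print(biased_random_walk(phi, e))
```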
Incorporating Conversational Structure

In Section 2.3.1, we described how the fine conversational structure in the form of a Fragment Quotation Graph (FQG) can be effectively exploited in our topic segmentation models. We hypothesize that our topic labeling model can also benefit from the FQG. In previous work on email summarization, Carenini et al. [35] applied PageRank to the FQG to measure the importance of a sentence and demonstrated the benefits of using a FQG. This finding implies that an important node in the FQG is likely to cover an important aspect of the topics discussed in the conversation. Our intuition is that, to be in the topic label, a keyword should not only co-occur with other keywords, but it should also come from an important fragment in the FQG. We believe there is a mutually reinforcing relationship between the FQG and the Word Co-occurrence Graph (WCG) that should be reflected in the rankings. Our proposal is to implement this idea as a process of co-ranking [238] in a heterogeneous graph, where three random walks are combined together.

Figure 2.10: Three sub-graphs used for co-ranking: the fragment quotation graph G_F, the word co-occurrence graph G_W, and the bipartite graph G_FW that ties the two together. Blue nodes represent fragments, red nodes represent words.

Let G = (V, E) = (V_F \cup V_W, E_F \cup E_W \cup E_FW) be the heterogeneous graph of fragments and words. As shown in Figure 2.10, it contains three sub-graphs. First, G_F = (V_F, E_F) is the unweighted directed FQG, with V_F denoting the set of fragments and E_F denoting the set of directed links between fragments. Second, G_W = (V_W, E_W) is the weighted undirected WCG, where V_W is the set of words in the segment and E_W is the set of edge weights as defined in Equation 2.12. Third, G_FW = (V_FW, E_FW) is the weighted bipartite graph that ties G_F and G_W together, representing the occurrence relations between the words and the fragments. Here, V_FW = V_F \cup V_W, and weighted undirected edges in E_FW connect each fragment v_f \in V_F to each word v_w \in V_W, with the weight representing the number of times word v_w occurs in fragment v_f.

The co-ranking framework combines three random walks, one on G_F, one on G_W and one on G_FW. Let F and W denote the transition matrices for the (intra-class) random walks in G_F and G_W, respectively, and f and w denote their respective stationary distributions. Since G_FW is a bipartite graph, the (inter-class) random walk on G_FW can be described by two transition matrices, FW of size |V_F| x |V_W| and WF of size |V_W| x |V_F|. One intra-class step changes the probability distribution from (f, 0) to (F^T f, 0) or from (0, w) to (0, W^T w), while one inter-class step changes the distribution from (f, w) to (WF^T w, FW^T f) (see [238] for details). The coupling is regulated by a parameter \alpha (0 \le \alpha \le 1) that determines the extent to which the ranking of words and the ranking of fragments depend on each other. Specifically, the two update steps in the power method are:

    f^{t+1} = (1 - \alpha)(F^T f^t) + \alpha \, WF^T (FW^T WF^T) w^t        (2.14)
    w^{t+1} = (1 - \alpha)(W^T w^t) + \alpha \, FW^T (WF^T FW^T) f^t        (2.15)
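A compact sketch of these two coupled update steps is given below. The row-stochastic transition matrices F, W, FW and WF are assumed to be built beforehand from the FQG, the WCG and the fragment-word occurrence counts; the loop simply alternates Equations 2.14 and 2.15 until the two score vectors stabilize. The toy matrices at the end are arbitrary and only illustrate the expected shapes.

```python
# Sketch of the co-ranking power-method updates (Equations 2.14 and 2.15).
# F (|V_F| x |V_F|), W (|V_W| x |V_W|), FW (|V_F| x |V_W|) and WF (|V_W| x |V_F|)
# are assumed to be row-stochastic transition matrices built beforehand.
import numpy as np

def co_rank(F, W, FW, WF, alpha=0.4, iters=100, tol=1e-8):
    n_f, n_w = F.shape[0], W.shape[0]
    f = np.full(n_f, 1.0 / n_f)   # fragment scores
    w = np.full(n_w, 1.0 / n_w)   # word scores
    for _ in range(iters):
        # Eq. 2.14: intra-class step on fragments plus a word->fragment inter-class step.
        f_new = (1 - alpha) * (F.T @ f) + alpha * (WF.T @ (FW.T @ (WF.T @ w)))
        # Eq. 2.15: intra-class step on words plus a fragment->word inter-class step.
        w_new = (1 - alpha) * (W.T @ w) + alpha * (FW.T @ (WF.T @ (FW.T @ f)))
        f_new /= f_new.sum()
        w_new /= w_new.sum()
        if np.abs(f_new - f).sum() + np.abs(w_new - w).sum() < tol:
            f, w = f_new, w_new
            break
        f, w = f_new, w_new
    return f, w

# Toy usage: 2 fragments and 3 words with arbitrary transition probabilities.
F = np.array([[0.5, 0.5], [0.5, 0.5]])
W = np.full((3, 3), 1 / 3)
FW = np.array([[0.6, 0.2, 0.2], [0.2, 0.3, 0.5]])   # fragment -> word
WF = np.array([[0.7, 0.3], [0.4, 0.6], [0.2, 0.8]]) # word -> fragment
print(co_rank(F, W, FW, WF))
```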
We described the co-ranking framework above assuming that we have a WCG and its corresponding FQG. However, recall that while the WCG is built for a topic segment, the FQG described so far (Figure 2.4) is based on the whole conversation. In order to construct a FQG for a topic segment in the conversation, we take only those fragments (and the edges) from the conversation-level FQG that include only the sentences of that segment. This operation has two consequences. One, some conversation-level fragments may be pruned. Two, some sentences in a conversation-level fragment may be discarded. For example, the FQG for topic (segment) ID 1 in Figure 2.2 includes only the fragments a, h, i, j, and l, and the edges between them. Fragment j, which contains three sentences in the conversation-level FQG, contains only one sentence in the FQG for topic ID 1.

Phrase Generation

Once we have a ranked list of words for describing a topical segment, we select the top M keywords for constructing the keyphrases (labels) from these keywords. We take a similar approach to [140]. Specifically, we mark the M selected keywords in the text, and collapse sequences of adjacent keywords into keyphrases. For example, consider the first sentence, ".. 15th anniversary of the Elder Scrolls series .." in Figure 2.2. If "Elder", "Scrolls" and "series" are selected as keywords, since they appear adjacent in the text, they are collapsed into one single keyphrase "Elder Scrolls series". The score of a keyphrase is then determined by taking the maximum score of its constituents (i.e., keywords); a small sketch of this step is given below.

Rather than constructing the keyphrases in the post-processing phase, as we do, an alternative approach is to first extract the candidate phrases using either n-gram sequences or a chunker in the preprocessing, and then rank those candidates [91, 134]. However, determining the optimal value of n in the n-gram sequence is an issue, and including all possible n-gram sequences for ranking excessively increases the problem size. Mei et al. [137] also show that using a chunker leads to poor results due to the inaccuracies of the chunker, especially when it is applied to a new domain like ours.
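The collapsing step can be illustrated in a few lines: given the token sequence of the segment and the top-M keywords with their scores, maximal runs of adjacent keywords are merged into candidate keyphrases, and each keyphrase inherits the maximum score of its constituent words. Function and variable names here are illustrative only.

```python
# Sketch of phrase generation: collapse maximal runs of adjacent keywords
# into keyphrases and score each phrase by its best-scoring constituent word.
def generate_keyphrases(tokens, keyword_scores):
    """`tokens` is the token sequence of the segment; `keyword_scores` maps the
    top-M keywords (lowercased) to their random-walk scores."""
    phrases = {}
    current = []
    for token in tokens + [""]:           # empty sentinel flushes the last run
        if token.lower() in keyword_scores:
            current.append(token)
        elif current:
            phrase = " ".join(current)
            score = max(keyword_scores[t.lower()] for t in current)
            phrases[phrase] = max(score, phrases.get(phrase, 0.0))
            current = []
        else:
            current = []
    return sorted(phrases.items(), key=lambda kv: kv[1], reverse=True)

tokens = "Today is the 15th anniversary of the Elder Scrolls series".split()
scores = {"elder": 0.12, "scrolls": 0.10, "series": 0.08, "anniversary": 0.05}
print(generate_keyphrases(tokens, scores))
# [('Elder Scrolls series', 0.12), ('anniversary', 0.05)]
```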
Conversation-level Phrase Re-ranking

So far, we have extracted phrases only from the topic segment, ignoring the rest of the conversation. This method fails to find a label if some of its constituents appear outside the segment. For example, in our blog corpus, the phrase server security in the human-authored label server security and firewall does not appear in its topical segment, but appears in the whole conversation. In fact, in our development set, about 14% and 8% of the words in the blog and email labels, respectively, come from parts of the conversation that are outside the topic segment. Thus, we propose to extract informative phrases from the whole conversation, re-rank those with respect to the individual topics (or segments), and combine only the relevant conversation-level phrases with the segment-level ones.

We rank the words of the whole conversation by applying the ranking models described in Section 2.3.2 and extract phrases using the same method described in Section 2.3.2. Note that when we apply our biased random walk model to the whole conversation, there is no concept of leading sentences and no distinction between the topics. Therefore, to apply it to the whole conversation, we adjust our biased random walk model (Equation 2.10) as follows:

    P(w) = \sum_{y \in C_k} \frac{e(y,w)}{\sum_{z \in C_k} e(y,z)} P(y)        (2.16)

where e(y,w) = tf_{w,y} is the number of times words w and y co-occur in a window of size s in the conversation. On the other hand, the co-ranking framework, when applied to the whole conversation, combines two conversation-level graphs: the FQG of the conversation, and the WCG built for all words in the conversation.

To re-rank the phrases extracted from the whole conversation with respect to a particular topic in the conversation, we reuse the scores of the words in that topic segment (given by the ranking models in Section 2.3.2). As before, the score of a (conversation-level) phrase is determined by taking the maximum (segment-level) score of its constituents (words). If a word does not occur in the topic segment, its score is assumed to be 0.

Redundancy Checking

Once we have the ranked list of labels (keyphrases), the last step is to produce the final k labels as output. When selecting multiple labels for a topic, we expect the new labels to be diverse, without redundant information, to achieve broad coverage of the topic. We use the Maximum Marginal Relevance (MMR) [33] criterion to select the labels that are relevant, but not redundant. Specifically, we select the labels one by one, by maximizing the following MMR criterion each time:

    \hat{l} = \arg\max_{l \in W \setminus S} \left[ \lambda \, Score(l) - (1 - \lambda) \max_{l' \in S} Sim(l', l) \right]        (2.17)

where W is the set of all labels and S is the set of labels already selected as output. We define the similarity between two labels l' and l as Sim(l', l) = n_o / n_l, where n_o is the number of overlapping (modulo stemming) words between l' and l, and n_l is the number of words in l. The parameter \lambda (0 \le \lambda \le 1) quantifies the amount of redundancy allowed.
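The selection loop can be sketched as follows; `candidates` holds the ranking scores of the candidate labels and `word_overlap_sim` implements the word-overlap measure Sim defined above (stemming is omitted, and the default value of the trade-off parameter is arbitrary). This is a generic MMR implementation under those assumptions.

```python
# Sketch of MMR-based label selection (Equation 2.17).
def word_overlap_sim(l_prime: str, l: str) -> float:
    """Sim(l', l) = |overlapping words| / |words in l| (stemming omitted here)."""
    a, b = set(l_prime.lower().split()), l.lower().split()
    return sum(1 for w in b if w in a) / len(b)

def mmr_select(candidates: dict[str, float], k: int, lam: float = 0.75) -> list[str]:
    """Pick k labels that are relevant (high score) but not redundant."""
    selected: list[str] = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:
        def mmr(label: str) -> float:
            redundancy = max((word_overlap_sim(s, label) for s in selected), default=0.0)
            return lam * remaining[label] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        del remaining[best]
    return selected

labels = {"Elder Scrolls series": 0.9, "Elder Scrolls": 0.8, "free release": 0.6}
print(mmr_select(labels, k=2))  # ['Elder Scrolls series', 'free release']
```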
2.4 Corpora and Metrics

Due to the lack of publicly available corpora of asynchronous conversations annotated with topics, we developed the first corpora annotated with topic information.

2.4.1 Data Collection

For email, we selected our publicly available BC3 email corpus [213], which contains 40 email conversations from the World Wide Web Consortium (W3C) mailing list (http://research.microsoft.com/en-us/um/people/nickcr/w3c-summary.html). The BC3 corpus, previously annotated with sentence-level speech acts, subjectivity, and extractive and abstractive summaries, is one of a growing number of corpora being used for email research [36]. This corpus has an average of 5 emails per conversation and a total of 1024 sentences after excluding the quoted sentences. Each conversation also provides the thread structure based on reply-to relations.

For blog, we manually selected 20 conversations of various lengths, all short enough to still be feasible for humans to annotate, from the popular technology-related news website Slashdot (http://slashdot.org/). Slashdot was selected because it provides reply-to links between comments, allowing accurate thread reconstruction, and because the comments are moderated by the users of the site, so they are expected to have a decent standard. A conversation in Slashdot begins with an article (i.e., a short synopsis paragraph, possibly with a link to the original story), and is followed by a lengthy discussion section containing multiple threads of comments and single comments. This is unlike an email conversation, which contains a single thread of emails. The main article is assumed to be the root in the conversation tree (based on reply-to), while the threads and the single comments form the sub-trees in the tree. In our blog corpus, we have a total of 4,411 sentences. The total number of comments per blog conversation varies from 30 to 101 with an average of 60.3, the number of threads per conversation varies from 3 to 16 with an average of 8.35, and the number of single comments varies from 5 to 50 with an average of 20.25.

2.4.2 Topic Annotation

As noted by Purver [165], topic segmentation and labeling in general is a non-trivial and subjective task even for humans, particularly when the text is unedited and less organized. The conversational phenomenon called "schism" makes it even more challenging for conversations. During a schism, a new conversation takes birth from an existing one, not necessarily because of a topic shift but because some participants refocus their attention onto each other, and away from whoever held the floor in the parent conversation [177]. In the example email conversation shown in Figure 2.1, a schism takes place when the participants discuss the topic "responding to I18N". Not all our annotators agree on the fact that the topic "responding to I18N" swerves from the topic "TAG document".

To properly design an effective annotation manual and procedure, we performed a two-phase pilot study before carrying out the actual annotation. Our initial annotation manual was inspired by the AMI annotation manual used for topic segmentation of ICSI meeting transcripts (http://mmm.idiap.ch/private/ami/annotation/TopicSegmentationGuidelinesNonScenario.pdf). For the pilot study, we selected two blog conversations from Slashdot and five email conversations from the W3C corpus. Note that these conversations were not picked from our corpora. Later in our experiments we use these conversations as our development set for tuning different parameters of the computational models. In the first phase of the pilot study, five computer science graduate students volunteered to do the annotation, generating five different annotations for each conversation. We then revised our annotation manual based on their feedback and a detailed analysis of possible sources of disagreement. In the second phase, we tested our procedure with a university postdoc doing the annotation. See Appendix A.2 for details on our annotation manual.

We prepared two different annotation manuals, one for email and one for blog. We chose to do so for two reasons. (i) As discussed earlier, our email and blog conversations are structurally different and have their own specific characteristics. (ii) The email corpus already had some annotations (e.g., abstract summaries) that we could reuse for topic annotation, whereas our blog corpus is brand new without any existing annotation.

For the actual annotation, we recruited and paid three fourth-year cognitive science undergraduates, who are native speakers of English and also Slashdot bloggers. On average, they took about 7 and 28.5 hours to annotate the 40 email and 20 blog conversations, respectively. In all, we have three different annotations for each conversation in our corpora. For blog conversations, the task of finding topics was carried out in four steps:

1. The annotators read the whole conversation (i.e., article, threads of comments and single comments) and wrote a short summary (about 3 sentences) only for the threads.

2. They provided short high-level descriptions for the topics discussed in the conversation (e.g., "Game contents or size", "Bugs or faults"). These descriptions serve as reference topic labels in our work. The target number of topics and their labels were not given in advance, and the annotators were instructed to find as many or as few topics as needed to convey the overall content of the conversation.

3. They assigned the most appropriate topic to each sentence. However, if a sentence covered more than one topic, they labeled it with all the relevant topics according to their order of relevance. They used the predefined topic "OFF-TOPIC" if the sentence did not fit into any topic. Wherever appropriate they also used two other predefined topics: "INTRO" (e.g., "hi X") and "END" (e.g., "Best, X").

4. The annotators authored a single high-level 250-word summary of the whole conversation. This step was intended to help them remember anything they may have forgotten and to revise the annotations in the previous three steps.

For each email conversation in BC3, we already had three human-authored summaries. So, along with the actual conversations, we provided the annotators with those summaries to give them a brief overview of the discussion. After reading a conversation and the associated summaries, they performed tasks 2 and 3 following the same procedure they used for annotating blogs. The annotators carried out the tasks on paper.
We created the hierarchical thread view of a conversation based on the reply-to relations between the comments (or emails) using indentations, and printed each participant's information in a different color, as in Gmail.

In the email corpus, the three annotators found 100, 77 and 92 topics respectively (269 in total), and in the blog corpus, they found 251, 119 and 192 topics respectively (562 in total). Table 2.3 shows some basic statistics computed on the three annotations of the conversations. (We got 100% agreement on the two predefined topics "INTRO" and "END"; therefore, in all our computations we excluded the sentences marked as either "INTRO" or "END".) On average, we have 26.3 sentences and 2.5 topics per email conversation, and 220.55 sentences and 10.77 topics per blog conversation. On average, a topic in email conversations contains 12.6 sentences, and a topic in blog conversations contains 27.16 sentences. The average numbers of topics active at a time are 1.4 and 5.81 for email and blog conversations, respectively. The average entropy, which corresponds to the granularity of an annotation (as described in the next section), is 0.94 for email conversations and 2.62 for blog conversations. These statistics (i.e., the number of topics and the topic density) indicate that there is a substantial amount of segmentation (and labeling) to do.

                          Mean            Max            Min
                       Email   Blog    Email   Blog   Email   Blog
Number of sentences     26.3  220.55     55     430     13     105
Number of topics         2.5   10.77      7      23      1       5
Average topic length    12.6   27.16     35   61.17      3   11.67
Average topic density    1.4    5.81    3.1   10.12      1    2.75
Entropy                 0.94    2.62    2.7    3.42      0    1.58
Table 2.3: Statistics on three human annotations per conversation.

2.4.3 Evaluation (and Agreement) Metrics

In this section we describe the metrics used to compare different annotations. These metrics measure both how much our annotators agree with each other, and how well our models and various baselines perform. For a given conversation, different annotations can have different numbers of topics, different topic assignments of the sentences (i.e., the clustering) and different topic labels. Below we describe the metrics used to measure the segmentation performance, followed by the metrics used to measure the labeling performance.

Metrics for Topic Segmentation

As different annotations can group sentences into different clusters, agreement metrics widely used in supervised classification, such as the kappa statistic and the F1 score, are not applicable. Again, our problem of topic segmentation in asynchronous conversation is not sequential in nature. Therefore, the standard metrics widely used in sequential topic segmentation of monolog and synchronous dialog, such as P_k [21] and WindowDiff (WD) [162], are also not applicable. Rather, the one-to-one and local agreement metrics described in [63] are more appropriate for our segmentation task.

The one-to-one metric measures global agreement between two annotations by pairing up topical segments from the two annotations in a way that maximizes the total overlap (i.e., by computing the optimal max-weight bipartite matching), and then reports the percentage of overlap. The local agreement metric loc_k measures agreement within a context of k sentences. To compute the loc3 score for the m-th sentence in the two annotations, we consider the previous 3 sentences, m-1, m-2 and m-3, and mark them as either "same" or "different" depending on their topic assignment. The loc3 score between two annotations is the mean agreement on these "same" or "different" judgments, averaged over all sentences. See Appendix A.1 for a detailed description of these metrics with concrete examples.
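Both metrics can be sketched directly from these definitions: one-to-one solves a max-weight bipartite matching over the segment-overlap matrix (here via scipy's linear_sum_assignment), and loc3 compares same/different judgments within a three-sentence window. Annotations are represented as lists of per-sentence topic ids; this is an illustrative re-implementation, not the original evaluation code.

```python
# Sketch of the one-to-one and loc3 agreement metrics.
# An annotation is a list of topic ids, one per sentence (same length for both).
import numpy as np
from scipy.optimize import linear_sum_assignment

def one_to_one(ann1, ann2):
    t1, t2 = sorted(set(ann1)), sorted(set(ann2))
    overlap = np.zeros((len(t1), len(t2)))
    for a, b in zip(ann1, ann2):
        overlap[t1.index(a), t2.index(b)] += 1
    rows, cols = linear_sum_assignment(-overlap)      # max-weight matching
    return overlap[rows, cols].sum() / len(ann1)

def loc_k(ann1, ann2, k=3):
    agreements, total = 0, 0
    for m in range(len(ann1)):
        for j in range(1, k + 1):
            if m - j < 0:
                continue
            same1 = ann1[m] == ann1[m - j]
            same2 = ann2[m] == ann2[m - j]
            agreements += int(same1 == same2)
            total += 1
    return agreements / total if total else 1.0

a1 = [0, 0, 0, 1, 1, 2, 2, 2]
a2 = [5, 5, 5, 5, 7, 7, 7, 7]
print(one_to_one(a1, a2), loc_k(a1, a2))
```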
We report the annotators' agreement in the one-to-one and loc3 metrics in Table 2.4. For each human annotation, we measure its agreement with the two other human annotations separately, and report the mean agreements. For email, we get high agreement in both metrics, though the local agreement (average of 83%) is a little higher than the global one (average of 80%). For blog, the annotators have high agreement in loc3 (average of 80%), but they disagree more in one-to-one (average of 54%). A low one-to-one agreement in blog is quite acceptable since blog conversations are much longer and less focused than email conversations (see Table 2.3). By analyzing the two corpora, we also noticed that in blogs, people are more informal and often make implicit jokes (see Figure 2.2), which makes the discourse even more unstructured. As a result, the segmentation task in blogs is more challenging for humans as well as for our models. Note that in a similar annotation task for chat disentanglement, Elsner and Charniak [63] report an average one-to-one score of 53%. Since the one-to-one score for naive baselines (see Section 2.5.1) is much lower than the human agreement, this metric differentiates human-like performance from naive baselines. Therefore, computing one-to-one correlation with the human annotations is a legitimate evaluation for our models.

                  Mean           Max            Min
               Email   Blog   Email   Blog   Email   Blog
one-to-one      80.4   54.2   100.0   84.1    31.3   25.3
loc3            83.2   80.1   100.0   94.0    43.7   63.3
Table 2.4: Annotator agreement in one-to-one and loc3 on the two corpora.

When we analyze the sources of disagreement in the annotation, we find that by far the most frequent reason is the same as the one observed by Elsner and Charniak [63] for the chat disentanglement task; namely, some annotators are more specific (i.e., fine) than others (i.e., coarse). To determine the level of specificity in an annotation, similarly to [63], we use the information-theoretic concept of entropy. If we consider the topic of a randomly picked sentence in a conversation as a random variable X, its entropy H(X) measures the level of detail in an annotation. For topics k, each having length n_k, in a conversation of length N, we compute H(X) as:

    H(X) = -\sum_{k=1}^{K} \frac{n_k}{N} \log_2 \frac{n_k}{N}        (2.18)

where K is the total number of topics in the conversation. The entropy gets higher as the number of topics increases and the topics are evenly distributed in a conversation. In our corpora, it varies from 0 to 2.7 in email conversations and from 1.58 to 3.42 in blog conversations (Table 2.3). These variations demonstrate the differences in specificity between annotators, but do not determine their agreement on the general structure. To quantify this, we use the many-to-one metric proposed by Elsner and Charniak [63]. It maps each of the source clusters to the single target cluster with which it has the highest overlap, then computes the total percentage of overlap. This metric is asymmetrical, and it is not to be used for performance evaluation (one can easily optimize it by assigning a different topic to each of the source sentences). However, it provides some insight into the annotation specificity. For example, if one annotator splits a cluster of another annotator into multiple sub-clusters, then the many-to-one score from the fine to the coarse annotation is 100%. In our corpora, by mapping from fine (high-entropy) to coarse (low-entropy) annotations we get high many-to-one scores, with an average of 95% in email conversations and an average of 72% in blog conversations (Table 2.5). This suggests that the finer annotations have mostly the same topic boundaries as the coarser ones.

                  Mean           Max            Min
               Email   Blog   Email   Blog   Email   Blog
many-to-one     94.9   72.3    100    98.2    61.1   51.4
Table 2.5: Annotator agreement in many-to-one on the two corpora.
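Both quantities are easy to compute from an annotation represented as a list of per-sentence topic ids; the sketch below mirrors Equation 2.18 and the greedy cluster mapping just described (illustrative code with hypothetical function names).

```python
# Sketch: annotation entropy (Equation 2.18) and the many-to-one score.
import math
from collections import Counter

def annotation_entropy(annotation):
    """H(X) = -sum_k (n_k/N) log2 (n_k/N) over the topics of one annotation."""
    n = len(annotation)
    return -sum((c / n) * math.log2(c / n) for c in Counter(annotation).values())

def many_to_one(source, target):
    """Map each source cluster to the target cluster it overlaps most, then
    report the fraction of sentences that end up in the mapped cluster."""
    matched = 0
    for topic in set(source):
        positions = [i for i, t in enumerate(source) if t == topic]
        best = Counter(target[i] for i in positions).most_common(1)[0][1]
        matched += best
    return matched / len(source)

fine   = [0, 0, 1, 1, 2, 2, 3, 3]   # fine-grained annotation
coarse = [0, 0, 0, 0, 1, 1, 1, 1]   # coarse annotation
print(annotation_entropy(fine), many_to_one(fine, coarse))  # 2.0 1.0
```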
Metrics for Topic Labeling

Recall that we extract keyphrases from the conversation as topic labels. Traditionally, keyphrase extraction is evaluated using precision, recall and F-measure based on exact matches between the extracted keyphrases and the human-assigned keyphrases [135, 140]. However, it has been noted that this approach based on exact matches underestimates the performance [212]. For example, when compared with the reference keyphrase "Game contents or size", a credible candidate keyphrase "Game contents" gets evaluated as wrong by this metric. Therefore, recent studies [101, 234] suggest using n-gram-based metrics that account for near-misses, similar to the ones used in text summarization (e.g., ROUGE [114]) and machine translation (e.g., BLEU [158]).

Kim et al. [101] evaluated the utility of different n-gram-based evaluation metrics for keyphrase extraction and showed that the metric which we call mutual-overlap (m-o) correlates most with human judgments. (Kim et al. [101] call this metric R-precision (R-p), which is different from the actual definition of R-p for keyphrase evaluation given by [234]; originally, R-p is the precision measured when the number of candidate keyphrases equals the number of gold keyphrases.) Therefore, one of the metrics we use for evaluating our topic labeling models is m-o. Given a reference keyphrase p_r of length (in words) n_r, a candidate keyphrase p_c of length n_c, and n_o being the number of overlapping (modulo stemming) words between p_r and p_c, mutual-overlap is formally defined as:

    mutual-overlap(p_r, p_c) = \frac{n_o}{\max(n_r, n_c)}        (2.19)

This metric gives full credit to exact matches and morphological variants, and partial credit to two cases of overlapping phrases: (i) when the candidate keyphrase includes the reference keyphrase, and (ii) when the candidate keyphrase is a part of the reference keyphrase. Notice that m-o as defined above evaluates a single candidate keyphrase against a reference keyphrase. In our setting, we have a single reference keyphrase (i.e., topic label) for each topical cluster, but as mentioned before, we may want our models to extract the top k keyphrases. Therefore, we modify m-o to evaluate a set of k candidate keyphrases P_c against a reference keyphrase p_r as follows, calling it weighted-mutual-overlap (w-m-o):

    weighted-mutual-overlap(p_r, P_c) = \sum_{i=1}^{k} \frac{n_o}{\max(n_r, n_c^i)} S(p_c^i)        (2.20)

where S(p_c^i) is the normalized score (i.e., it satisfies 0 \le S(p_c^i) \le 1 and \sum_{i=1}^{k} S(p_c^i) = 1) of the i-th candidate phrase p_c^i \in P_c. For k = 1, this metric is equivalent to mutual-overlap, and for higher values of k, it takes the sum of k mutual-overlap scores, each weighted by its normalized score.

The w-m-o metric described above only considers word overlap and ignores other semantic relations (e.g., synonymy, hypernymy) between words. However, annotators, when writing the topic descriptions, may use words that are not taken directly from the conversation but are semantically related. For example, given a reference keyphrase "meeting agenda", its lexical semantic variants like "meeting schedule" or "meeting plan" should be treated as correct. Therefore, we also consider a generalization of w-m-o that incorporates lexical semantics.
We define weighted-semantic-mutual-overlap (w-s-m-o) as follows:

    weighted-semantic-mutual-overlap(p_r, P_c) = \sum_{i=1}^{k} \frac{\sum_{t_r \in p_r} \sum_{t_c \in p_c^i} \sigma(t_r, t_c)}{\max(n_r, n_c^i)} S(p_c^i)        (2.21)

where \sigma(t_r, t_c) is the semantic similarity between the nouns t_r and t_c. The value of \sigma(t_r, t_c) is between 0 and 1, where 1 denotes notably high similarity and 0 denotes little to none. Notice that, since this metric considers the semantic similarity between all possible pairs of nouns, its value can be greater than 100% (when presented as a percentage). We use the metrics (e.g., lin similarity, wup similarity) provided in the WordNet::Similarity package [161] for computing WordNet-based similarity, and always choose the most frequent sense for a noun. The results we get are similar across the similarity metrics. For brevity, we just discuss lin similarity here.
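A sketch of the two label-evaluation metrics is given below. For simplicity it scores all tokens rather than only nouns, skips stemming, and uses a trivial exact-match stand-in for the word-similarity function where a WordNet-based measure (e.g., lin similarity) would be plugged in; all of these simplifications are assumptions of the sketch.

```python
# Sketch of weighted-mutual-overlap (Eq. 2.20) and its semantic variant (Eq. 2.21).
def w_m_o(reference: str, candidates: list[tuple[str, float]]) -> float:
    """`candidates` is a list of (phrase, normalized_score) pairs; scores sum to 1."""
    ref = reference.lower().split()
    total = 0.0
    for phrase, score in candidates:
        cand = phrase.lower().split()
        overlap = sum(1 for w in cand if w in ref)   # stemming omitted
        total += score * overlap / max(len(ref), len(cand))
    return total

def w_s_m_o(reference, candidates, sim=lambda a, b: float(a == b)):
    """Same as w_m_o but sums a word-similarity sim(t_r, t_c) over all word pairs.
    In practice `sim` would be a WordNet-based measure such as lin similarity."""
    ref = reference.lower().split()
    total = 0.0
    for phrase, score in candidates:
        cand = phrase.lower().split()
        pair_sum = sum(sim(r, c) for r in ref for c in cand)
        total += score * pair_sum / max(len(ref), len(cand))
    return total

cands = [("game contents", 0.6), ("game size", 0.4)]
print(w_m_o("game contents or size", cands))   # partial credit for near-misses
```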
Metrics for End-to-End Evaluation

Just like the human annotators, our end-to-end system takes an asynchronous conversation as input, finds the topical segments in the conversation, and then assigns short descriptions (topic labels) to each of the topical segments. It would be fairly easy to compute agreement on topic labels based on mutual overlap if the number of topics and topical segments were fixed across the annotations of a given conversation. However, since different annotators (system or human) can identify a different number of topics and different clusterings of sentences, measuring annotator (model or human) agreement on the topic labels is not a trivial task. To solve this, we first map the clusters of one annotation (say A1) to the clusters of another (say A2) by the optimal one-to-one mapping described in the previous section. After that, we compute the w-m-o and w-s-m-o scores on the labels of the mapped (or paired) clusters. Formally, if l^1_i is the label of cluster c^1_i in A1 that is mapped to the cluster c^2_j with label l^2_j in A2, we compute w-m-o(l^1_i, l^2_j) and w-s-m-o(l^1_i, l^2_j).

Table 2.6 reports the human agreement for w-m-o and w-s-m-o on the two corpora. Similar to segmentation, we get higher agreement on labeling with both metrics on email. Plausibly, the reasons remain the same: the length and the characteristics (e.g., informal, less focused) of blog conversations make the annotators disagree more. However, note that these measures are computed based on one-to-one mappings of the clusters and may not reflect the agreement one would get if the annotators were asked to label the same segments.

               Mean           Max            Min
            Email   Blog   Email   Blog   Email   Blog
w-m-o        36.8   19.9   100.0   54.2     0.0    0.0
w-s-m-o      42.5   28.2   107.3   60.8     0.0    5.2
Table 2.6: Annotator agreement in w-m-o and w-s-m-o on the two corpora.

2.5 Experiments

In this section we present our experimental results. First, we show the performance of the topic segmentation models. Then we show the performance of the topic labeling models based on manual segmentation. Finally, we present the performance of the end-to-end system, i.e., when the topic labeler uses automatic segmentation.

2.5.1 Topic Segmentation Evaluation

This section presents the experiments on the topic segmentation task.

Experimental Setup for Topic Segmentation

We ran six different topic segmentation models on the corpora presented in Section 2.4. Our first model is the graph-based unsupervised segmentation model presented by Malioutov and Barzilay [121]. Since the sequentiality constraint of topic segmentation in monolog and synchronous dialog does not hold in asynchronous conversation, we implement this model without this constraint. Specifically, this model (call it M&B) constructs a weighted undirected graph G(V, E), where the nodes V represent the sentences and the edge weights w(x, y) represent the cosine similarity (Equation 2.5) between sentences x and y. It then finds the topical segments by optimizing the normalized cut criterion (Equation 2.6). Thus, M&B considers the conversation globally, but models only lexical similarity.

The other five models are LDA, LDA+FQG, LCSeg, LCSeg+FQG and the supervised model (SUP), as described in Section 2.3. The tunable parameters of the different models were set based on their performance on our development set. The hyperparameters alpha and beta in LDA were set to their default values (alpha = 50/K, beta = 0.01) as suggested in [199]. (The performance of LDA does not seem to be sensitive to the values of alpha and beta.) The regularization strength in LDA+FQG was set to 20. The parameters of LCSeg were set to their default values, since this setting delivers the best performance on the development set. For a fair comparison, we set the same number of topics per conversation in all of the models. If at least two of the three annotators agree on the topic number, we set that number; otherwise we set the floor value of the average topic number. The mean statistics of the six models' annotations are shown in Table 2.7. Comparing with the statistics of the human annotations in Table 2.3, we can see that these numbers are within the bounds of the human annotations. (Although the topic numbers per conversation are fixed for the different models, LDA and LDA+FQG may find fewer topics; see Equations 2.3 and 2.4.)

                  M&B     LDA   LDA+FQG   LCSeg   LCSeg+FQG    SUP
Email
  Topic number    2.41    2.10     1.90    2.41        2.41    2.41
  Topic length   12.41   13.3     15.50   12.41       12.41   12.41
  Topic density   1.90    1.83     1.60    1.01        1.39    1.42
  Entropy         0.99    0.98     0.75    0.81        0.93    0.98
Blog
  Topic number   10.6    10.65    10.65   10.65       10.65   10.65
  Topic length   20.42   20.32    20.32   20.32       20.32   20.32
  Topic density   7.38    9.39     8.32    1.00        5.21    5.30
  Entropy         2.54    3.33     2.37    2.85        2.81    2.85
Table 2.7: Mean statistics of the different models' annotations.

We also evaluate the following baselines, which any useful model should outperform.

- All different: Each sentence in the conversation constitutes a separate topic.
- All same: The whole conversation constitutes a single topic.
- Speaker: The sentences from each participant constitute a separate topic.
- Blocks of k (k = 5, 10, 15, 20, 25, 30): Each consecutive group of k sentences in the temporal order of the conversation constitutes a separate topic.

Results for Topic Segmentation

Table 2.8 presents the human agreement and the agreement of the models with the human annotators on our corpora. For each model annotation, we measure its agreement with the three human annotations separately using the metrics described in Section 2.4.3, and report the mean agreements. In the table, we also show the performance of the two best baselines: Speaker and Blocks of k.

Most of the baselines perform rather poorly.
All different is the worst baseline of all, with mean one-to-one scores of only 0.05 and 0.10, and mean loc3 scores of only 0.47 and 0.25, in the blog and email corpora, respectively. Blocks of 5 is one of the best baselines in email, but it performs poorly in blog, with a mean one-to-one of 0.19 and a mean loc3 of 0.54. On the contrary, Blocks of 20 is one of the best baselines in blog, but performs poorly in email. This is intuitive since the average number of topics and the average topic length in blog conversations (10.77 and 27.16) are much higher than those of email (2.5 and 12.6). All same is optimal for conversations containing only one topic, but its performance rapidly degrades as the number of topics increases. It has mean one-to-one scores of 0.29 and 0.28 and mean loc3 scores of 0.53 and 0.54 in the blog and email corpora, respectively. Speaker is the strongest baseline in both domains. (There are many anonymous authors in our blog corpus; we treat them as separate authors.) In several cases it beats some of the under-performing models.

                    Baselines                        Models                               Human
               Speaker  Blocks    M&B     LDA   LDA+FQG   LCSeg  LCSeg+FQG    SUP
                        of k
Email
  Mean 1-to-1    51.8    38.3     62.8    57.3     61.5    62.2      69.3    72.3    80.4
  Max 1-to-1     94.3    77.1    100.0   100.0    100.0   100.0     100.0   100.0   100.0
  Min 1-to-1     23.4    14.6     36.3    24.3     24.0    33.1      38.0    42.4    31.3
  Mean loc3      64.1    57.4     62.4    54.1     60.6    72.0      72.7    75.8    83.2
  Max loc3       97.0    73.1    100.0   100.0    100.0   100.0     100.0   100.0   100.0
  Min loc3       27.4    42.6     36.3    38.1     38.4    40.7      40.6    40.4    43.7
Blog
  Mean 1-to-1    33.5    32.0     30.0    25.2     28.0    36.6      46.7    48.5    54.2
  Max 1-to-1     61.1    46.0     45.3    42.1     56.3    53.6      67.4    66.1    84.1
  Min 1-to-1     13.0    15.6     18.2    15.3     16.1    23.7      26.6    28.4    25.3
  Mean loc3      67.0    52.8     54.1    53.0     55.4    56.5      75.1    77.2    80.1
  Max loc3       87.1    68.4     64.3    65.6     67.1    76.0      89.0    96.4    94.0
  Min loc3       53.4    42.3     45.1    38.6     46.3    43.1      56.7    63.2    63.3
Table 2.8: Segmentation performance of the two best baselines, the models, and the human annotators. In the Blocks of k column, k = 5 for email and k = 20 for blog.

In the email corpus, in the one-to-one metric, the models generally agree with the annotators more than the baselines do, but less than the annotators agree with each other. We observe a similar trend in the local metric loc3; however, on this metric, some models fail to beat the best baselines. Notice that human agreement for some of the annotations is quite low (see the Min scores), even lower than the mean agreement of the baselines. As explained before, this is due to the fact that some human annotations are much more fine-grained than others.

In the blog corpus, the agreements on the global metric (one-to-one) are much lower than those on the email corpus. The reasons were already explained in Section 2.4.3. We notice a similar trend in both metrics: some under-performing models fail to beat the baselines, while others perform better than the baselines, but worse than the human annotators.

The comparison among the models reveals a general pattern. The probabilistic generative models LDA and LDA+FQG perform disappointingly on both corpora. A reason could be the limited amount of data available for training. In our corpora, the average number of sentences per blog conversation is 220.55 and per email conversation is 26.3, which might not be sufficient for the LDA models [145]. An error analysis further reveals that nearby sentences in the segmentations performed by the LDA models get excessively distributed over topics.
A likely explanation is that the independence assumption made by these models, when computing the distribution over topics for a sentence from the distributions of its words, causes the local context to be excessively distributed over topics. If we compare the performance of LDA+FQG with the performance of LDA, we get a significant improvement with LDA+FQG in both metrics on both corpora (p < 0.01). (All tests of statistical significance were performed using paired t-tests.) The regularization with the FQG prevents the local context from being excessively distributed over topics.

The unsupervised graph-based model M&B performs better than the LDA models in most cases (i.e., except loc3 in blog) (p < 0.001). However, its performance is still far below that of the top-performing models like LCSeg+FQG and the supervised model. The reason is that, even though by constructing a complete graph this method considers the conversation globally, it only models lexical similarity and disregards other important features of asynchronous conversations like the fine conversational structure and the speaker.

Comparison of LCSeg with the LDA models and M&B reveals that LCSeg is in general a better model. LCSeg outperforms LDA by a wide margin in one-to-one on the two datasets and in loc3 on email (p < 0.001). The difference between LCSeg and LDA in loc3 on blog is also significant, with p < 0.01. LCSeg also outperforms M&B in most cases (p < 0.01), except in one-to-one on email. Since LCSeg is a sequential model, it extracts the topics keeping the context intact. This helps it achieve high loc3 agreement for shorter conversations like email conversations. But for longer conversations like blog conversations, it overdoes this (i.e., it extracts larger chunks of sentences as a topic segment) and gets low loc3 agreement. This is unsurprising if we look at its topic density in Table 2.7 on the two datasets: the density is very low in the blog corpus compared to the annotators and the other well-performing models. Another reason for its superior performance over the LDA models and M&B could be its term weighting scheme. Unlike the LDA models and M&B, which consider only repetition, LCSeg also considers how tightly the repetition happens. However, there is still a large gap in performance between LCSeg and the other top-performing models (LCSeg+FQG, SUP). As explained earlier, topics in an asynchronous conversation may not change sequentially in the temporal order of the sentences. If topics are interleaved, then LCSeg fails to identify them correctly. Furthermore, LCSeg does not consider other important features beyond lexical cohesion.

When we incorporate the FQG into LCSeg, we get a significant improvement in one-to-one on both corpora and in loc3 on blog (p < 0.0001). Even though the improvement in loc3 on email is not significant, the agreement is quite high compared to the other unsupervised models. Overall, LCSeg+FQG is the best unsupervised model. This supports our claim that sentences connected by reply-to relations in the FQG usually refer to the same topic.

Finally, when we combine all the features into our graph-based supervised segmentation model (SUP in Table 2.8), we get a significant improvement over LCSeg+FQG in both metrics across both domains (p < 0.01). Besides the features, this improvement might also be due to the fact that, by constructing a complete graph, this model considers relations between all possible sentence pairs in a conversation, which we believe is a key requirement for topic segmentation in asynchronous conversations.
The agreements achieved by the supervised model are also much closer to those of the human annotators.

An error analysis of the segmentations performed by LCSeg+FQG and the supervised model shows that although these models are accurate in clustering sentences that are long, they fail to do so for some short sentences, especially when the sentence does not contain any informative word (e.g., "excuse me?"). These errors are more frequent in the blog corpus because participants in blogs often make implicit jokes and use more informal language. Some of these errors are also due to inaccuracies in the sentence segmenter, which mistakenly finds short sentences. We understand that while most of the lexical and topic features are inactive (in the supervised model) for these short and uninformative sentences, the conversational features should at least give some hints through the edge weights in the graph for these instances. However, note that the most informative conversational feature, the FQG, could be too noisy in some cases to be informative. Therefore, we believe reducing the noise in the FQG could help us tackle some of the errors made by LCSeg+FQG and the supervised model.

2.5.2 Topic Labeling Evaluation

In this section we present the experimental evaluation of the topic labeling models when the models are provided with manual (or gold) segmentation. This allows us to judge their performance independently of the topic segmentation task.

Experimental Setup for Topic Labeling

As mentioned in Section 2.4, in the email corpus, the three annotators found 100, 77 and 92 topics (or topical segments) respectively (269 in total), and in the blog corpus, they found 251, 119 and 192 topics respectively (562 in total). The annotators wrote a short high-level description for each topic. These descriptions serve as reference topic labels in our evaluation. (Notice that in our setting, for each topic segment we have only one reference label to compare with; therefore, we do not show the human agreement on the labeling task in Tables 2.9 and 2.10.) The goal of the topic labeling models is to automatically generate such informative descriptions for each topical segment.

We compare our approach with two baselines. The first baseline, FreqBL, ranks the words in a topical segment according to their frequencies. The second baseline, LeadBL, expressed by Equation 2.11, ranks the words based on their relevance only to the leading sentences in a topical segment.

We also compare our model with two state-of-the-art keyphrase extraction methods. The first one is the unsupervised general TextRank model proposed by Mihalcea and Tarau [140] (call it M&T), which does not incorporate any conversation-specific information. The second one is the supervised model Maui proposed in [135]. Briefly, Maui first extracts all n-grams up to a maximum length of 3 as candidate keyphrases. Then a bagged decision tree classifier filters the candidates using nine different features. Due to the lack of labeled training data in asynchronous conversations, we train Maui on the human-annotated dataset released as part of the SemEval-2010 task 5 on automatic keyphrase extraction from scientific articles [103].
This dataset contains 244 scientific papers from the ACM digitallibrary, each comes with a set of author-assigned and reader-assigned keyphrases.The total number of keyphrases assigned to the 244 articles by both the authors andthe readers is 3705.We experimented with two different versions of our biased random walk modelthat incorporates informative clues from the leading sentences. One, BiasRW,does not include any conversation-level phrase (Section 2.3.2), and the other one,BiasRW+ does. The parameter Uk, the set of leading sentences, was empiricallyset to the first two sentences and the bias parameter ? was set to 0.85 based on ourdevelopment set.We experimented with four different versions of the co-ranking framework de-pending on what type of random walk is performed on the word co-occurrencegraph (WCG) and whether the model includes any conversation-level phrases. LetCorGen denote the co-ranking model with a general random walk on WCG, andCorBias denote the co-ranking model with a biased random walk on WCG. Thesetwo models do not include any conversation-level phrases while CorGen+ andCorBias+ do. The coupling strength ? and the co-occurrence window size s wereempirically set to 0.4 and 2, respectively, based on the development set. The dump-ing factor was set to its default value 0.85.Note that all the models (except Maui) and the baselines follow the same pre-processing and post-processing (i.e., phrase generation and redundancy checking)steps. The value of M in phrase generation was set to 25% of the total number ofwords in the cluster, and ? in redundancy checking was set to 0.35 based on thedevelopment set.Results for Topic LabelingWe evaluate the performance of different models using the metrics described inSection 2.4.3. Tables 2.9 and 2.10, respectively, show the mean weighted-mutual-92overlap (w-m-o) and weighted-semantic-mutual-overlap (w-s-m-o) scores of dif-ferent models (in percentage) for different values of k (i.e., number of output labels)on the two corpora.Both the baselines have proved to be strong, beating the existing models inalmost every case. This tells us that the frequency of the words in the topic segmentand their occurrence in the leading sentences carry important information for topiclabeling. Generally speaking, LeadBL is a better baseline for email, while for blogFreqBL is better than LeadBL.The supervised model Maui is the worst performer in both metrics on the twocorpora. Its performance is also consistently low across the corpora for any partic-ular value of k. A possible explanation is that Maui was trained on a domain (scien-tific articles), which is rather different from asynchronous conversations. Anotherreason may be that Maui does not consider any conversational features.The general random walk model M&T also delivers poor performance on ourcorpora, failing to beat the baselines in both measures. This indicates that therandom walk model based on only co-occurrence relations between the words isnot sufficient for finding topic labels in asynchronous conversations. It needs toconsider conversation specific information.By incorporating clues from the leading sentences, our biased random walkmodel BiasRW gives improved performance over the baselines in both metrics forall the values of k on the two corpora (p<0.05). 
This demonstrates the usefulness of considering the leading sentences as an information source for topic labeling in asynchronous conversation.

The general co-ranking model CorGen, by incorporating the conversational structure, outperforms the baselines in both metrics for all k on blog (p < 0.05), but fails to do so in many cases on email. On blog, there is also no significant difference between BiasRW and CorGen in w-m-o for all k (Table 2.9), but CorGen outperforms BiasRW in w-s-m-o (Table 2.10) for higher values of k (2, 3, 4, 5) (p < 0.05). On the other hand, on email, BiasRW always outperforms CorGen in both metrics for all k (p < 0.05). So we can conclude that, on blog, exploiting the conversational structure seems to be more beneficial than the leading sentences, whereas on email, we observe the opposite. The reason could be that the topic segments in blog are much longer than those of email (average length 27.16 vs. 12.6). Therefore, the FQGs of blog segments are generally larger and capture more information than the FQGs of email segments. Besides, email discussions are more focused than blog discussions. The leading sentences in email segments carry more informative clues than those of blog segments. This is also confirmed in Figure 2.9, where the leading sentences in email cover more of the human-authored words than they do in blog.

                    k=1            k=2            k=3            k=4            k=5
                Email   Blog   Email   Blog   Email   Blog   Email   Blog   Email   Blog
Baselines
  FreqBL        22.86  19.05   17.47  16.17   14.96  13.83   13.17  13.45   12.06  12.59
  LeadBL        22.41  18.17   18.94  15.95   15.92  13.75   14.36  12.61   13.76  11.93
Models
  M&T           15.87  18.23   12.68  14.31   10.33  12.15    9.63  11.38    9.07  11.03
  Maui          10.48  10.03    9.86   9.56    9.03   9.23    8.71   8.90    8.50   8.53
  BiasRW        24.77  20.83   19.78  17.28   17.38  15.06   16.24  14.53   15.80  14.26
  BiasRW+       24.91  23.65   20.36  19.69   18.09  17.76   16.20  16.78   15.78  15.86
  CorGen        17.60  20.76   15.32  17.64   15.14  15.78   14.23  15.03   14.08  14.75
  CorGen+       18.32  22.44   15.86  19.65   15.46  18.01   14.89  16.90   14.45  16.13
  CorBias       24.84  20.96   19.88  17.73   17.61  16.22   16.99  15.64   16.81  15.38
  CorBias+      25.13  23.83   20.20  19.97   18.21  18.33   17.15  17.28   16.90  16.55
Table 2.9: Mean weighted-mutual-overlap (w-m-o) scores for different values of k on the two corpora.

                    k=1            k=2            k=3            k=4            k=5
                Email   Blog   Email   Blog   Email   Blog   Email   Blog   Email   Blog
Baselines
  FreqBL        23.36  23.52   20.50  21.03   19.82  20.18   18.47  19.58   17.81  19.27
  LeadBL        24.99  21.19   21.69  20.61   20.40  19.49   19.57  18.98   19.17  18.71
Models
  M&T           18.71  22.08   16.25  19.59   14.62  17.91   14.29  17.27   14.06  16.92
  Maui          14.79  14.14   13.76  13.67   13.03  12.87   12.69  12.10   11.73  11.52
  BiasRW        28.87  24.63   24.76  22.51   22.48  21.36   21.67  20.95   21.28  20.78
  BiasRW+       27.96  24.51   24.71  23.05   22.56  22.88   21.19  22.08   20.82  21.73
  CorGen        23.66  24.69   21.97  23.83   21.51  22.86   20.98  22.37   20.44  22.22
  CorGen+       23.50  24.30   22.09  24.35   21.96  23.89   21.36  23.42   20.90  23.00
  CorBias       28.44  25.66   26.39  24.15   24.47  23.18   23.70  22.76   23.56  22.67
  CorBias+      27.97  25.26   26.34  24.19   24.69  23.60   23.65  23.44   23.23  23.20
Table 2.10: Mean weighted-semantic-mutual-overlap (w-s-m-o) scores for different values of k on the two corpora.

By combining the two forms of conversation-specific information into a single model, CorBias delivers improved performance over CorGen and BiasRW in both metrics. On email, CorBias is significantly better than CorGen for all k with both metrics (p < 0.01). On blog, CorBias achieves a significant improvement over BiasRW for higher values of k (3, 4, 5) with both metrics (p < 0.05). The two sources of information are complementary and help each other to overcome the domain-specific limitations of the respective models.
Therefore, one should exploit both information sources to build a generic, domain-independent system.

When we include the conversation-level phrases (the + versions), we get a significant improvement in w-m-o on blog (p < 0.01), but not on email. This may be because blog conversations have many more topical segments than email conversations (average topic number 10.77 vs. 2.5). Thus, there is little information for the label of a topical segment outside that segment in email conversations. However, note that including conversation-level phrases does not significantly hurt the performance in any case.

                    k=2            k=3            k=4            k=5
                Email   Blog   Email   Blog   Email   Blog   Email   Blog
Baselines
  FreqBL        27.02  23.69   29.79  24.29   31.12  24.88   31.25  25.58
  LeadBL        28.72  21.69   30.86  23.14   31.99  24.19   31.99  25.33
Models
  M&T           21.45  21.70   23.12  23.18   25.23  23.82   25.45  24.07
  Maui          14.00  14.85   15.57  17.33   17.15  19.23   18.40  20.03
  BiasRW        29.34  24.92   31.42  25.18   32.58  25.89   32.97  26.64
  BiasRW+       29.47  25.88   31.43  27.38   32.96  28.47   33.87  29.17
  CorGen        23.45  25.05   28.44  25.72   30.10  26.40   30.33  27.10
  CorGen+       24.56  25.87   28.46  26.61   31.14  27.63   32.91  28.50
  CorBias       28.98  25.27   30.90  26.41   32.24  27.14   33.25  27.65
  CorBias+      29.76  25.96   31.04  27.65   33.61  28.63   35.35  29.58
Table 2.11: Mean weighted-mutual-overlap (w-m-o) scores when the best of k labels is considered.

To further analyze the performance of the models, Table 2.11 shows the mean w-m-o scores when only the best of the k output labels is considered. This allows us to judge the models' ability to generate the best label in the top-k list. The results are much clearer here. Generally speaking, among the models that do not include conversation-level phrases, CorBias is the best model, while including conversation-level phrases improves the performance further.

Table 2.12 shows some examples from our test set where the system-generated (i.e., CorBias+) labels are very similar to the human-authored ones.

Human-authored                      System-generated (top 5)
Email
  Details of Bristol meeting        Bristol, face2face meeting, England, October
  Nashville conference              Nashville conference, Courseware developers, mid October, event
  Meeting agenda                    detailed agenda, main point, meetings, revision, wcag meetings
  Design guidelines                 general rule, design guidelines, accessible design, absolutes, forbid
  Contact with Steven               Steven Pemberton, contact, charter, status, schedule w3c
Blog
  faster than light (FTL) travel    FTL travel, need FTL, limited FTL, FTL drives, long FTL
  Dr. Paul Laviolette               Dr. Paul Laviolette, bang theory, systems theory, extraterrestial beacons, laugh
  Vietnam and Iraq warfare          Vietnam war, incapable guerrilla war, war information, war ii, vietnamese war
  Pulsars                           mean pulsars, pulsars slow time, long pulsars, relative pulsars, set pulsars
  Linux distributions               linux distro, linux support, major linux, viable linux
Table 2.12: Examples of human-authored and system-generated labels.
Paul Laviolette, bang theory, systems theory, extraterrestial beacons, laughVietnam and Iraq warfare Vietnam war, incapable guerrilla war, war information, war ii, vietnamese warPulsars mean pulsars, pulsars slow time, long pulsars, relative pulsars, set pulsarsLinux distributions linux distro, linux support, major linux, viable linuxTable 2.12: Examples of Human-authored and System-generated labels.labels are reasonable, although they get low w-m-o and w-s-m-o scores when com-pared with the human-authored labels.Human-authored System-generatedMeeting time and place October, mid October, timing, w3c timing issues, OttawaArchaeology religious site, burial site, ritual site, barrows tombBio of Al Al Gilman, standards reformer, repair interest group, ER IG, ER teamsBudget Constraints budget, notice, costs, smaller companies, travelFood choice roast turkey breast, default choices, small number, vegetable rataouille, lunchTable 2.13: Examples of System-generated labels that are reasonable but getlow scores.This is because most of the human-authored labels in our corpora are abstrac-tive in nature. Annotators often write their own labels rather than simply copyingkeyphrases from the text. In doing so, they rely on their expertise and generalworld knowledge that may go beyond the contents of the conversation. In fact,although annotators reuse many words from the conversation, only 9.81% of thehuman-authored labels in blog and 12.74% of the human-authored labels in emailappear verbatim in their respective conversations. Generating human-like labelswill require a deeper understanding of the text and robust textual inference, forwhich our extractive approach can provide some useful input. For an example, seeour recent paper [136], which describes such an abstractive approach.962.5.3 Full System EvaluationIn this section we present the performance of our end-to-end system. We firstsegment a given asynchronous conversation using our best topic segmenter (thesupervised model), and then feed its output to our best topic labeler (the CorBias+model). Table 2.14 presents the human agreement and the agreement of our sys-tem with the human annotators based on the best of k outputs. For each systemannotation we measure its agreement in w-m-o and w-s-m-o with the three humanannotations using the method described in Section 2.4.3.System Humank=1 k=2 k=3 k=4 k=5EmailMean w-m-o 19.19 23.62 26.19 27.06 28.06 36.84Max w-m-o 100.0 100.0 100.0 100.0 100.0 100.0Mean w-s-m-o 24.98 32.08 34.63 36.92 38.95 42.54Max w-s-m-o 108.43 108.43 108.43 108.43 108.43 107.31BlogMean w-m-o 9.71 11.71 14.55 15.83 16.72 19.97Max w-m-o 26.67 26.67 35.00 35.00 35.00 54.17Mean w-s-m-o 15.46 19.77 23.35 25.57 26.23 28.22Max w-s-m-o 47.10 47.28 47.28 48.54 48.54 60.76Table 2.14: Performance of the end-to-end system and human agreement.Notice that in email, our system gets 100% agreement in w-m-o metric forsome conversations. However, there is a substantial gap between the mean andthe max w-m-o scores. Similarly, in w-s-m-o, our system achieves a maximumof 108% agreement, but the mean varies from 25% to 39% depending on differentvalues of k. In blog, the w-m-o and w-s-m-o scores are much lower. The maximumscores achieved in w-m-o and w-s-m-o metrics in blog are only 35% and 49% (fork = 5), respectively. The mean w-m-o score varies from 10% to 17%, and the meanw-s-m-o score varies from 15% to 28% for different values of k. 
This demonstrates the difficulties of the topic segmentation and labeling tasks in blog conversations.

Comparing with Table 2.11, we can notice that inaccuracies in the topic segmenter affect the overall performance. However, our results are encouraging. Even though for lower values of k there is a substantial gap between our results and the human agreement, as the value of k increases, our results get closer to the human agreement, especially in w-s-m-o.

2.6 Conclusion and Future Directions

In this work we present two new corpora of email and blog conversations annotated with topics, which, along with the proposed metrics, will allow researchers to evaluate their work quantitatively. We also present a complete computational framework for topic segmentation and labeling in asynchronous conversation. Our approach extends state-of-the-art methods by considering a fine-grained structure of the asynchronous conversation, along with other conversational features. We do this by applying recent graph-based methods for NLP, such as min-cut and random walk, on paragraph, sentence or word graphs. (Our conversational corpora annotated with topics, the annotation manual and the source code for all the systems are publicly available from www.cs.ubc.ca/labs/lci/bc3.html.)

For topic segmentation, we extend LDA and LCSeg, two state-of-the-art unsupervised models, to incorporate a fine-grained conversational structure (i.e., the FQG), generating two novel unsupervised models, LDA+FQG and LCSeg+FQG, for asynchronous conversation. We incorporate the FQG into LDA by replacing its standard Dirichlet prior with an informative Dirichlet Tree prior. On the other hand, we propose a graph-based clustering model to incorporate the FQG into LCSeg. In addition, we also propose a novel supervised segmentation model that combines lexical, conversational and topic features using a classifier in the graph-based clustering framework. For topic labeling, we propose two novel random walk models that extract the most representative keyphrases from the text, by respectively capturing conversation-specific clues from two different sources: the leading sentences and the fine conversational structure, i.e., the FQG.

Experimental results in the topic segmentation task demonstrate that both LDA and LCSeg benefit significantly when they are extended to consider the FQG, with LCSeg+FQG being the best unsupervised model. The comparison of the supervised segmentation model with the unsupervised models shows that the supervised method outperforms the unsupervised ones even when using only a few labeled conversations, making it the best segmentation model overall. The outputs of LCSeg+FQG and the supervised model are also highly correlated with human annotations in both
On the other hand, the similarity-based sequential segmentation model LC-Seg tends to extract larger chunks of sentences as segments for long conversations.The incorporation of FQG helps both unsupervised models to tackle many of theseerrors. Combining all the important features in a graph-based supervised frame-work further reduces the errors. However, we believe even further error reductionis possible by reducing noise in the FQG.A comparison between system-generated and human-authored topic labels tellsus that extractive approaches may not be enough to generate human-like labels.Generating human-like labels will require a deeper understanding of the text androbust textual inference, for which our approach can provide some useful input.This work can be extended in many ways. Given that most of the human-authored labels are abstractive in nature, we plan to extend our labeling frameworkto generate more abstract human-like labels that could better synthesize the infor-mation expressed in a topic segment. A promising approach would be to rely onmore sophisticated methods for information extraction, combined with more se-mantics (e.g., phrase entailment) and data-to-text generation techniques. Anotherinteresting avenue for future work is to perform a more extrinsic evaluation of ourmethods. Instead of testing them with respect to a human gold standard, it wouldbe extremely interesting to see how effective they are when used to support otherNLP tasks, such as summarization and visualization. We are also interested inthe future to transfer our approach to other similar domains by domain adaptationmethods. We plan to work on both synchronous and asynchronous domains.99Chapter 3Rhetorical ParsingChapter 2 presented our complete computational framework for finding the high-level discourse structure i.e., the global topic structure of an asynchronous conver-sation. In addition to the global topic structure of a conversation, each comment (ormessage) locally exhibits a finer discourse structure called rhetorical structure,which logically binds the discourse units (i.e., clauses, sentences) together. In thischapter, we study rhetorical analysis, which seeks to capture this finer structure(a tree) of discourse. We propose a complete probabilistic discriminative frame-work for performing rhetorical analysis. Our framework comprises a discourse seg-menter and a discourse (rhetorical) parser.1 First, the discourse segmenter, which isbased on a binary classifier, identifies the elementary discourse units in a given text.Then, the discourse parser builds a discourse tree by applying an optimal parsingalgorithm to probabilities inferred from two Conditional Random Fields: one forintra-sentential parsing and the other for multi-sentential parsing. We present twoapproaches to combine these two stages of discourse parsing effectively. A seriesof empirical evaluations over two different datasets demonstrates that our discourseparser significantly outperforms the state-of-the-art, often by a wide margin.21In this chapter, we use the terms discourse parsing and rhetorical parsing interchangeably.2Portions of this work were previously published in two conference proceedings: Joty et al. [97](EMNLP-2012) and Joty et al. [99] (ACL-2013).1003.1 IntroductionA discourse of any kind is not formed of isolated and unrelated textual units, butof collocated, related and structured units. In Chapter 2, we have seen that anasynchronous conversation addresses a common topic, often covering a numberof subtopics. 
In addition, each message in a conversation locally forms a coher-ent monolog by binding its sentences logically ? each sentence follows smoothlyfrom the ones before and leads into the ones which come afterwards. In otherwords, the author ensures a consistent coherence structure to make the text inter-pretable as a whole. Without this, a text is just a sequence of non-sequiturs. Forexample, consider the following two texts.? It rained heavily during the night. The game is postponed.? It rained heavily during the night. I like maths.In the first text, the first sentence provides an Explanation for the second sentence,which makes the text interpretable as a whole. However, it is hard to establish sucha relation between the two sentences in the second text, and most readers will havedifficulties in understanding it. The reader will either reject it, simply calling it?incoherent? or spend some time to construct an explanation of what liking mathshas to do with raining heavily. By asking this, the reader is actually questioningthe coherence of the text.In a coherent text, discourse units (i.e., clauses, sentences) must be linked bymeaningful connections, connections like Explanation that are called coherencerelations. In rhetorical analysis, we seek to uncover this coherence structureunderneath the text, which has been shown to be beneficial for many NLP appli-cations including text summarization [54, 118, 126], sentence compression [194],text generation [163], sentiment analysis [112, 189] and question answering [215].Furthermore, rhetorical structure can be useful for other discourse analysis taskslike co-reference resolution using Veins theory of discourse [50].A number of formal theories of discourse have been proposed to describe thecoherence structure of a text [10, 52, 122, 129, 225]. Rhetorical Structure The-ory (RST) [122], one of the most influential of them, represents texts by labeledhierarchical structures, called Discourse Trees (DTs). For instance, let us consider101the same example we saw in Chapter 1 as shown in Figure 3.1 for the followingtext from the RST-DT corpus [38]:? But he added: ?Some people use the purchasers? index as a leading indicator,some use it as a coincident indicator. But the thing it?s supposed to measure? manufacturing strength ? it missed altogether last month.?But he added:"Some people use the purchasers? index as a leading indicator, some use it as a coincident indicator. But the thing it?s supposed to measure -- manufacturing strength --it missed altogether last month." <P>ElaborationSame-UnitContrastContrastAttribution(1)(2) (3)(4) (5)(6)Figure 3.1: Discourse tree for two sentences in RST-DT. Each of the sen-tences contains three EDUs. The second sentence has a well-formeddiscourse tree, but the first sentence does not have one.The leaves of a DT correspond to contiguous atomic text spans, called Elemen-tary Discourse Units (EDUs) (six in the example). Adjacent EDUs are connectedby rhetorical relations (e.g., Elaboration, Contrast), forming larger discourse units(represented by internal nodes), which in turn are also subject to this relation link-ing. Discourse units linked by a rhetorical relation are further distinguished basedon their relative importance in the text: nuclei are the core parts of the relationwhile satellites are peripheral or supportive ones. For example, in Figure 3.1,Elaboration is a relation between a nucleus (EDU 4) and a satellite (EDU 5), andContrast is a relation between two nuclei (EDUs 2 and 3). 
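To make the tree representation concrete, the discourse tree of the second sentence above can be written down as a small nested structure. This is only an illustrative sketch; the class and field names are ours and do not come from the thesis or from RST-DT tooling.

# Minimal sketch (illustrative names): the discourse tree of the second
# sentence in Figure 3.1. Leaves are EDU ids; internal nodes carry a
# coherence relation and the nuclearity of their two children.
from dataclasses import dataclass
from typing import Union

@dataclass
class DTNode:
    relation: str                  # e.g., "Elaboration", "Same-Unit"
    nuclearity: str                # "NS", "SN" or "NN"
    left: Union["DTNode", int]     # sub-tree or EDU id
    right: Union["DTNode", int]

# EDU 4 (nucleus) is elaborated by EDU 5 (satellite); the result is
# joined with EDU 6 by the multi-nuclear Same-Unit relation.
sentence2_dt = DTNode("Same-Unit", "NN",
                      DTNode("Elaboration", "NS", 4, 5),
                      6)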
Rhetorical analysis inRST involves two subtasks: discourse segmentation is the task of breaking thetext into a sequence of EDUs,3 and discourse parsing is the task of linking thediscourse units (EDUs and larger units) into a labeled hierarchical tree.While recent advances in automatic discourse segmentation have attained con-siderably higher accuracies [72], discourse parsing still poses significant challenges3Note that the discourse segmentation task considered in this chapter is different from the topicsegmentation task studied in the previous chapter.102[69] and the performance of the existing parsers [84, 191, 201] is still considerablyinferior compared to human gold-standard. Our work in this chapter aims to reducethis performance gap and take discourse parsing one step further. To this end, weaddress three key limitations of existing discourse parsers as follows.First, existing discourse parsers typically model the structure and the labelsof a DT separately, and also do not consider the sequential dependencies betweenthe discourse tree constituents, which has been recently shown to be critical [69].To address this limitation, as the first contribution, we propose a novel discourseparser based on probabilistic discriminative parsing models, expressed as Condi-tional Random Fields (CRFs) [205], to infer the probability of all possible dis-course tree constituents. The CRF models effectively represent the structure andthe label of a discourse tree constituent jointly, and whenever possible, capture thesequential dependencies between the constituents.Second, existing parsers apply greedy and sub-optimal parsing algorithms tobuild the DT for a document. To cope with this limitation, our CRF models supporta probabilistic bottom-up parsing algorithm which is non-greedy and optimal.Third, existing discourse parsers do not discriminate between intra-sententialparsing (i.e., building the DTs for the individual sentences) and multi-sententialparsing (i.e., building the DT for the whole document). However, we argue thatdistinguishing between these two conditions can result in more effective parsing.Two separate parsing models could exploit the fact that rhetorical relations aredistributed differently intra-sententially vs. multi-sententially. Also, they could in-dependently choose their own informative feature sets. As another key contributionof our work, we devise two different parsing components: one for intra-sententialparsing, the other for multi-sentential parsing. This provides for scalable, modularand flexible solutions, that can exploit the strong correlation observed between thetext structure (sentence boundaries) and the structure of the rhetorical tree.In order to develop a complete and robust discourse parser, we combine ourintra-sentential and multi-sentential parsers in two different ways. Since most sen-tences have a well-formed discourse sub-tree in the full DT (for example, the sec-ond sentence in Figure 3.1), our first approach constructs a DT for every sentenceusing our intra-sentential parser, and then runs the multi-sentential parser on theresulting sentence-level DTs. However, this approach would disregard those cases103where rhetorical structures violate sentence boundaries (also called ?leaky? bound-aries [218]). For example, consider the first sentence in Figure 3.1. It does nothave a well-formed discourse sub-tree because the unit containing EDUs 2 and 3merges with the next sentence and only then is the resulting unit merged with EDU1. 
Our second approach, in order to deal with the leaky cases, builds sentence-levelsub-trees by applying the intra-sentential parser on a sliding window covering twoadjacent sentences and by then consolidating the results produced by overlappingwindows. After that, the multi-sentential parser takes all these sentence-level sub-trees and builds a full rhetorical parse for the whole document.Our discourse parser assumes that the input text has been already segmentedinto elementary discourse units. As an additional contribution, we propose a noveldiscriminative approach to discourse segmentation that not only achieves state-of-the-art performance, but also reduces the time and space complexities by usingfewer features. Notice that the combination of our segmenter with our parser formsa complete probabilistic discriminative framework for rhetorical analysis.While previous parsers have been tested on only one corpus, we evaluate ourframework on texts from two very different genres: news articles and instructionalhow-to-do manuals. The results demonstrate that our approach to discourse pars-ing provides consistent and statistically significant improvements over previousmethods both at the sentence level and at the document level. The performanceof our final system compares very favorably to the performance of state-of-the-artdiscourse parsers.In the rest of the chapter, after discussing related work in Section 3.2, wepresent our rhetorical analysis framework in Section 3.3. In Section 3.4, we de-scribe our discourse parser. Then, in Section 3.5 we present our discourse seg-menter. The experiments and analysis, followed by future directions are discussedin Section 3.6. Finally, we summarize our contributions in Section 3.7.3.2 Related WorkRhetorical analysis has a long history. In this section, we provide a brief overviewof the approaches that follow RST as the theory of discourse, and that are relatedto our work; see the survey by Stede [198] for a detailed overview.1043.2.1 Unsupervised ApproachesA general issue in rhetorical analysis is having sufficient data to train a model ona particular genre. It may be preferable either to develop unsupervised models thatdo not require any labeled data or to collect the required data automatically.In his early work [125], Marcu presents a shallow approach relying on dis-course cues (e.g., because, but) and surface patterns. He uses hand-coded rules,derived from an extensive corpus study, to break the text into EDUs and to buildDTs for sentences first, then for paragraphs, and so on. Despite the fact that thiswork pioneered the field of rhetorical analysis, it has many limitations. First, identi-fying discourse cues is a difficult task on its own, because depending on the usage,the same phrase may or may not signal a coherence relation. Second, discoursesegmentation using only discourse cues fails to attain high accuracy [191]. Third,rhetorical tree structures do not always correspond to paragraph structures; for ex-ample, Sporleder and Lapata [193] report that more than 20% of the paragraphsin the RST-DT corpus of news articles [38] do not correspond to a discourse unit.Fourth, discourse cues are sometimes ambiguous; for example, but can signal Con-trast, Antithesis and Concession, and so on. Finally, a more serious problem withthe rule-based approach is that often rhetorical relations are not explicitly signaledby discourse cues. 
For example, in RST-DT, Marcu and Echihabi [127] found thatonly 61 out of 238 Contrast relations and 79 out of 307 Cause-Explanation re-lations were explicitly signaled by cue phrases. In the British National Corpus,Sporleder and Lascarides [196] report that half of the sentences lack a discoursecue. Other studies [179, 197, 201, 206] report even higher figures: about 60% ofdiscourse relations are not explicitly signaled. Rather than relying on hand-codedrules based on discourse cues and surface patterns, recent approaches employ ma-chine learning techniques with a large set of informative features.While some rhetorical relations need to be explicitly signaled by discourse cues(e.g., Concession) and some do not (e.g., Background), there is a large middleground of relations that may be signaled or not. For these ?middle ground? rela-tions, can we exploit features present in the signaled cases to automatically identifyrelations when they are not signaled? The idea is to use unambiguous discoursecues (e.g., although for Contrast, for example for Elaboration) to automatically105label a large corpus with rhetorical relations that could then be used to train a su-pervised model.4 A series of previous work has explored this idea. Marcu andEchihabi [127] first attempted to identify four broad classes of relations: Contrast,Elaboration, Condition, and Cause-Explanation-Evidence (CEV). They use a naiveBayes classifier based on word-pairs (w1,w2), where w1 occurs in the left segment,and w2 occurs in the right segment. Sporleder and Lascarides [195] include otherfeatures (e.g., words and their stems, POS tags, segment lengths, positions) in aboosting-based classifier (BoosTexter) to further improve classification accuracy.However, these studies evaluate classification performance on the instances whererhetorical relations are originally signaled (i.e., the cues were artificially removed).It is not clear how well this approach performs on the instances which are not origi-nally signaled. Subsequent studies [24, 192, 196] confirm that classifiers trained oninstances by stripping off the original cue phrases do not generalize well to implicitcases because they are linguistically quite different.Note that the above approach to identifying relations in absence of manuallylabeled data does not perform a complete rhetorical analysis. It only attempts toidentify a very small subset of coarser relations between two non-hierarchical dis-course segments. Arguably, in order to perform an effective and complete rhetori-cal analysis, one needs to employ supervised machine learning techniques.3.2.2 Supervised ApproachesIn [124], Marcu applies supervised machine learning techniques to build a dis-course segmenter and a shift-reduce discourse parser. Both the segmenter and theparser rely on C4.5 decision tree classifiers to learn the rules automatically fromthe data. The segmenter mainly uses shallow-syntactic (POS tags) and contextualfeatures. To learn the shift-reduce actions, the discourse parser encodes five typesof features: lexical (e.g, discourse cues), shallow-syntactic, similarity, operational(previous n shift-reduce operations) and rhetorical sub-structural features. 
Despitethe fact that this work has pioneered many of today?s machine learning approachesto discourse parsing, it has all the limitations mentioned in Section 3.1.4We categorize this approach as unsupervised because it does not rely on human-annotated data.106Soricut and Marcu [191] present the publicly available SPADE system5 thatcomes with probabilistic models for discourse segmentation and sentence-leveldiscourse parsing. Their segmentation and parsing models are based on lexico-syntactic patterns (features) extracted from the lexicalized syntactic tree of a sen-tence. The discourse parser uses an optimal parsing algorithm to find the mostprobable rhetorical tree structure for a sentence. SPADE was trained and tested onthe RST-DT corpus. This work, by showing empirically the connection betweensyntax and discourse at the sentence level, has greatly influenced all major con-tributions in this area ever since. However, it is limited in several ways. First,SPADE does not produce a full-text (document-level) parse. Second, its parsingmodel makes an independence assumption between the label and the structure ofa DT constituent, and it ignores the sequential and the hierarchical dependenciesbetween the constituents. Third, it relies only on lexico-syntactic features, and itfollows a generative approach to estimate the model parameters.Subsequent research addresses the question of how much syntax one reallyneeds in rhetorical analysis. Sporleder and Lapata [194] focus on the discoursechunking problem, comprising two subtasks: discourse segmentation and non-hierarchical nuclearity assignment. More specifically, they examine whether fea-tures derived via a POS tagger and chunker would be sufficient for these purposes.They formulate discourse chunking in two alternative ways. First, one-step classi-fication, where the discourse chunker, a multi-class classifier, assigns to each tokenone of the four labels: (i) B-NUC (beginning nucleus), (ii) I-NUC (inside nucleus),(iii) B-SAT (beginning satellite), and (iv) I-SAT (inside satellite). Therefore, thisapproach performs segmentation and nuclearity assignment simultaneously. Sec-ond, two-step classification, where in the first step, the discourse segmenter, a bi-nary classifier, labels each token as either B (beginning) or I (inside). Then, in thesecond step, a nuclearity labeler, another binary classifier, assigns nuclearity sta-tuses to the segments. The two-step approach avoids illegal chunk sequences likea B-NUC followed by an I-SAT or a B-SAT followed by an I-NUC, and in this ap-proach, it is easier to incorporate sentence-level properties like the requirement thata sentence must contain at least one nucleus. The evaluation on RST-DT shows that5http://www.isi.edu/licensed-sw/spade/107the two-step approach outperforms the one-step approach, and the performance iscomparable to the performance of SPADE.In follow-up work, Fisher and Roark [72] demonstrate over 4% absolute per-formance gain in discourse segmentation, by combining the features extracted fromthe syntactic tree with the ones derived via POS tagger and chunker. Using quite alarge number of features in a binary log-linear model they achieve state-of-the-artperformance in segmentation on the RST-DT test set.Recently, Hernault et al. [84] present the publicly available HILDA system6that comes with a segmenter and a parser based on Support Vector Machines(SVMs). The segmenter is a binary SVM classifier which uses the same lexico-syntactic features used in SPADE, but with more context. 
The discourse parseriteratively employs two SVM classifiers in pipeline to build a DT. In each iteration,a binary classifier first decides which of the adjacent units to merge, then a multi-class classifier connects the selected units with an appropriate relation label. Theyreport state-of-the-art performance in discourse parsing on the RST-DT corpus.On a different genre of instructional texts, Subba and Di-Eugenio [201] proposea shift-reduce parser that relies on a classifier for relation labeling. Their classifieruses Inductive Logic Programming (ILP) to learn first-order logic rules froma large set of features including the linguistically rich compositional semanticscoming from a semantic parser. They demonstrate that including compositionalsemantics with other features improves relation classification performance.Both HILDA and the ILP-based approach of Subba and Di-Eugenio [201] arelimited in several ways. First, they do not differentiate between intra-sententialparsing and multi-sentential parsing, and use a single uniform model in both sce-narios. Second, they take a greedy (sub-optimal) approach to construct a DT. Third,they disregard sequential dependencies between DT constituents, which has beenrecently shown to be critical by Feng and Hirst [69]. Furthermore, HILDA con-siders the structure and the labels of a DT separately. Our novel discourse parser,proposed in this chapter, addresses all these limitations of the existing parsers.6http://nlp.prendingerlab.net/hilda/108modelAlgorithmSentences segmented into EDUs Document-level discourse treeIntra-sententialparser Multi-sententialparsermodelAlgorithmSegmentation       modelSegmenter ParserDocumentFigure 3.2: Rhetorical analysis framework.3.3 Our Rhetorical Analysis FrameworkGiven a text, the first task in our rhetorical analysis pipeline (Figure 3.2) is to breakthe text into a sequence of Elementary Discourse Units (EDUs), i.e., discoursesegmentation. Since it is taken for granted that sentence boundaries are also EDUboundaries (i.e., EDUs do not span across multiple sentences), the segmentationtask boils down to finding EDU boundaries inside sentences.Once we have identified the EDUs, the discourse parsing problem is determin-ing which discourse units (EDUs or larger units) to relate (i.e., the structure), andhow to relate them (i.e., the labels or the rhetorical relations) in the process ofbuilding the hierarchical DT. Specifically, it requires: a parsing model to explorethe search space of possible structures and labels for their nodes; and a parsing al-gorithm for deciding among the candidates. Unlike previous studies [84, 124, 201],which follow a greedy approach, our approach to discourse parsing applies an op-timal parsing algorithm to the probabilities of all possible DT constituents to findthe most probable DT. A simple and straightforward strategy would be to use asingle unified parsing model for both sentence-level and document-level parsingwithout distinguishing the two cases, as was previously done in [84, 124, 201].However, this approach would be problematic in our case because of scalabilityand modeling issues. Note that the number of valid trees grows exponentially withthe number of EDUs in a document.7 Therefore, an exhaustive search over all thevalid discourse trees is often unfeasible, even for relatively small documents.For modeling, the problem is two-fold. 
On the one hand, it appears that rhetori-7For n+1 EDUs, the number of valid discourse trees is actually the Catalan number Cn.109cal relations are distributed differently intra-sententially vs. multi-sententially. Forexample, Figure 3.3 shows a comparison between the two distributions of the sixmost frequent relations on a development set containing 20 randomly selected doc-uments from the RST-DT corpus. Notice that relations Attribution and Same-Unitare more frequent than Joint in the intra-sentential case, whereas Joint is morefrequent than the other two in the multi-sentential case. On the other hand, differ-ent kinds of features are applicable and informative for intra-sentential vs. multi-sentential parsing. For example, syntactic features like dominance sets [191] areextremely useful for sentence-level discourse parsing, but are not even applicablein the multi-sentential case. Likewise, lexical chain features [193], that are usefulfor multi-sentential parsing, are not applicable at the sentence level.Elaboration Joint Attribution Same-Unit Contrast Explanation051015202530 Multi-sententialIntra-sententialFigure 3.3: Distributions of six most frequent relations in intra-sentential and multi-sentential parsing scenarios.Based on the above observations, our discourse parsing framework comprisestwo separate modules: an intra-sentential parser and a multi-sentential parser(Figure 3.2). First, the intra-sentential parser produces one or more discourse sub-trees for each sentence. Then, the multi-sentential parser generates a full DT for thedocument from these sub-trees. Both of our parsers have the same two components:a parsing model assigns a probability to every possible DT, and a parsing algorithmidentifies the most probable DT among the candidate DTs in that scenario. Whilethe two parsing models are rather different, the same parsing algorithm is shared bythe two modules. Staging multi-sentential parsing on top of intra-sentential pars-ing in this way allows us to exploit the strong correlation observed between the text110structure and the DT structure as explained in detail in Section 3.4.3.3.4 The Discourse ParserBefore describing our parsing models and the parsing algorithm in detail we intro-duce some terminology that we will use throughout the chapter.A DT can be formally represented as a set of constituents of the form R[i,m, j],where i ? m < j. This refers to a rhetorical relation R between the discourseunit containing EDUs i through m and the unit containing EDUs m+1 throughj. For example, the DT for the second sentence in Figure 3.1 can be representedas {Elaboration-NS[4,4,5], Same-Unit-NN[4,5,6]}. Notice that a relation R alsospecifies the nuclearity statuses of the discourse units involved, which can be oneof Nucleus-Satellite (NS), Satellite-Nucleus (SN) or Nucleus-Nucleus (NN).A common assumption made for building discourse trees effectively is thatthey are binary trees [60, 191]. That is, multi-nuclear relations (e.g., Joint, Same-Unit) involving more than two discourse units are mapped to a hierarchical right-branching binary tree. For example, a flat Joint(e1,e2,e3,e4) is mapped to a right-branching binary tree Joint(e1,Joint(e2,Joint(e3,e4))).3.4.1 Parsing ModelsAs mentioned before, the job of our intra-sentential and multi-sentential parsingmodels is to assign a probability to each of the constituents of all possible DTs atthe sentence level and at the document level, respectively. 
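To make "all possible DT constituents" concrete, the small sketch below (our illustration, not the thesis code) enumerates every candidate span triple [i, m, j], i.e., a relation holding between EDUs i..m and EDUs m+1..j; for a three-EDU sentence it yields exactly the four constituents listed for the first example sentence in the next paragraph.

# Illustrative sketch: enumerate all candidate constituents [i, m, j]
# for a sentence with n EDUs (a relation between EDUs i..m and m+1..j).
def candidate_constituents(n):
    spans = []
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            for m in range(i, j):
                spans.append((i, m, j))
    return spans

print(candidate_constituents(3))
# -> [(1, 1, 2), (1, 1, 3), (1, 2, 3), (2, 2, 3)]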
Formally, given the model parameters Θ, for each possible constituent R[i, m, j] in a candidate DT at the sentence or document level, the parsing model estimates P(R[i, m, j] | Θ), which specifies a joint distribution over the label R and the structure [i, m, j] of the constituent. For example, when applied to the sentences in Figure 3.1 separately, our intra-sentential parsing model with (learned) parameters Θ estimates P(R[1,1,2] | Θ), P(R[2,2,3] | Θ), P(R[1,2,3] | Θ) and P(R[1,1,3] | Θ) for the first sentence, and P(R[4,4,5] | Θ), P(R[5,5,6] | Θ), P(R[4,5,6] | Θ) and P(R[4,4,6] | Θ) for the second sentence, respectively, for all R ranging over the set of relations.

Intra-Sentential Parsing Model

Our novel probabilistic model for sentence-level discourse parsing is shown in Figure 3.4. The observed nodes U_j in a sequence represent the discourse units (EDUs or larger units). The first layer of hidden nodes are the structure nodes, where S_j ∈ {0, 1} denotes whether two adjacent discourse units U_{j-1} and U_j should be connected or not. The second layer of hidden nodes are the relation nodes, with R_j ∈ {1, ..., M} denoting the relation between two adjacent units U_{j-1} and U_j, where M is the total number of relations in the relation set. The connections between adjacent nodes in a hidden layer encode sequential dependencies between the respective hidden nodes, and can enforce constraints such as the fact that an S_j = 1 must not follow an S_{j-1} = 1. The connections between the two hidden layers model the structure and the relation of DT constituents jointly.

[Figure 3.4: A chain-structured DCRF as our intra-sentential parsing model. The observed unit sequence U_1, ..., U_t at level i is connected to a hidden structure sequence S_2, ..., S_t and a hidden relation sequence R_2, ..., R_t.]

Notice that the graphical model in Figure 3.4 is a chain-structured undirected graphical model (also known as a Markov Random Field (MRF)) with two hidden layers (chains). It becomes a Dynamic Conditional Random Field (DCRF) [205] when we directly model the hidden (output) variables by conditioning the clique potentials (factors) on the observed (input) variables:

P(R_{2:t}, S_{2:t} \mid x, \Theta) = \frac{1}{Z(x, \Theta)} \prod_{i=2}^{t-1} \phi_r(R_i, R_{i+1} \mid x, \Theta)\; \phi_s(S_i, S_{i+1} \mid x, \Theta)\; \phi_c(R_i, S_i \mid x, \Theta)    (3.1)

where {\phi_r} and {\phi_s} are the factors over the edges of the relation and structure chains, respectively, and {\phi_c} are the factors over the edges connecting the relation and structure nodes (i.e., between-chain edges). Here, x represents input features extracted from the observed variables, and Z(x, Θ) is the partition function. We use the standard log-linear representation of the factors:

\phi_r(R_i, R_{i+1} \mid x, \Theta) = \exp(\theta_r^T f(R_i, R_{i+1}, x))    (3.2)
\phi_s(S_i, S_{i+1} \mid x, \Theta) = \exp(\theta_s^T f(S_i, S_{i+1}, x))    (3.3)
\phi_c(R_i, S_i \mid x, \Theta) = \exp(\theta_c^T f(R_i, S_i, x))    (3.4)

where f(Y, Z, x) is a feature vector derived from the input features x and the local labels Y and Z, and θ_y is the corresponding weight vector.
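To ground the factorization in Equations (3.1)-(3.4), the sketch below computes the unnormalized score of one joint assignment (R_{2:t}, S_{2:t}) as a product of log-linear factors. It is a toy illustration only: the feature function and weight dictionaries are assumed inputs, and this is not the implementation used in the thesis.

import math

# Toy sketch of the DCRF factorization in Eqs. (3.1)-(3.4).
def log_linear(theta, feats):
    # exp(theta^T f), as in Eqs. (3.2)-(3.4); features are sparse dicts
    return math.exp(sum(theta.get(k, 0.0) * v for k, v in feats.items()))

def joint_score(R, S, x, theta_r, theta_s, theta_c, feat_fn):
    """Unnormalized score of one assignment R = (R_2..R_t), S = (S_2..S_t)."""
    score = 1.0
    for i in range(len(R) - 1):            # product over i = 2 .. t-1
        score *= log_linear(theta_r, feat_fn("rel",  R[i], R[i + 1], x))
        score *= log_linear(theta_s, feat_fn("str",  S[i], S[i + 1], x))
        score *= log_linear(theta_c, feat_fn("pair", R[i], S[i],     x))
    return score

# Dividing joint_score by its sum over all assignments (the partition
# function Z(x)) gives the conditional distribution of Eq. (3.1), from
# which the posterior marginals discussed below are computed.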
A DCRF is a generalization of linear-chain CRFs [110] to represent complex interactions between labels, such as when performing multiple labeling tasks on the same sequence. Recently, there has been an explosion of interest in CRFs for solving structured output classification problems, with many successful applications in NLP including syntactic parsing [71], syntactic chunking [183] and discourse chunking [75] in the Penn Discourse Treebank [164]. CRFs, being a discriminative approach to sequence modeling, have several advantages over their generative counterparts, such as Hidden Markov Models (HMMs) and MRFs, which first model the joint distribution p(y, x | Θ) and then infer the conditional distribution p(y | x, Θ). It has been advocated that discriminative models are generally more accurate than generative ones since they do not "waste resources" modeling complex distributions that are observed (i.e., p(x)); instead they focus on modeling what we care about, i.e., the distribution of labels given the data [145]. Other key advantages include the ability to incorporate arbitrary overlapping local and global features, and the ability to relax strong independence assumptions. Furthermore, CRFs surmount the label bias problem [110] of the Maximum Entropy Markov Model (MEMM), which is considered to be a discriminative version of the HMM.

To obtain the probability of the constituents of all candidate DTs for a sentence, we apply our parsing model recursively at different levels of the DT and compute the posterior marginals over the relation-structure pairs. To illustrate the process, let us assume that the sentence contains four EDUs. At the first (bottom) level, when all the discourse units are the EDUs, there is only one possible unit sequence, to which we apply our DCRF model (Figure 3.5(a)). We compute the posterior marginals P(R_2, S_2 = 1 | e_1, e_2, e_3, e_4, Θ), P(R_3, S_3 = 1 | e_1, e_2, e_3, e_4, Θ) and P(R_4, S_4 = 1 | e_1, e_2, e_3, e_4, Θ) to obtain the probability of the constituents R[1,1,2], R[2,2,3] and R[3,3,4], respectively. At the second level, there are three possible unit sequences (e_{1:2}, e_3, e_4), (e_1, e_{2:3}, e_4) and (e_1, e_2, e_{3:4}). Figure 3.5(b) shows their corresponding DCRF models. The posterior marginals P(R_3, S_3 = 1 | e_{1:2}, e_3, e_4, Θ), P(R_{2:3}, S_{2:3} = 1 | e_1, e_{2:3}, e_4, Θ), P(R_4, S_4 = 1 | e_1, e_{2:3}, e_4, Θ) and P(R_{3:4}, S_{3:4} = 1 | e_1, e_2, e_{3:4}, Θ) computed from the three sequences correspond to the probability of the constituents R[1,2,3], R[1,1,3], R[2,3,4] and R[2,2,4], respectively. Similarly, we attain the probability of the constituents R[1,1,4], R[1,2,4] and R[1,3,4] by computing their respective posterior marginals from the three possible sequences at the third (top) level.

[Figure 3.5: Our parsing model applied to the sequences at different levels of a sentence-level discourse tree. (a) The only possible sequence at the first level, (b) three possible sequences at the second level, (c) three possible sequences at the third level.]

At this point, what is left to be explained is how we generate all possible sequences for a given number of EDUs in a sentence. Algorithm 1 demonstrates how we do that. More specifically, to compute the probabilities of each DT constituent R[i, k, j], we need to generate sequences like (e_1, ..., e_{i-1}, e_{i:k}, e_{k+1:j}, e_{j+1}, ..., e_n) for 1 ≤ i ≤ k < j ≤ n. In doing so, we may generate some duplicate sequences. Clearly, the sequence (e_1, ..., e_{i-1}, e_{i:i}, e_{i+1:j}, e_{j+1}, ..., e_n) for 1 ≤ i ≤ k < j < n is already considered for computing the probability of R[i+1, j, j+1]. Therefore, it is a duplicate sequence that we exclude from our list of all possible sequences.

    Input: Sequence of EDUs: (e_1, e_2, ..., e_n)
    Output: List of sequences: L
    for i = 1 → n-1 do
        for j = i+1 → n do
            if j == n then                      // sequences at top and bottom levels
                for k = i → j-1 do
                    L.append((e_1, ..., e_{i-1}, e_{i:k}, e_{k+1:j}, e_{j+1}, ..., e_n))
                end
            else                                // sequences at intermediate levels
                for k = i+1 → j-1 do            // excludes duplicate sequences
                    L.append((e_1, ..., e_{i-1}, e_{i:k}, e_{k+1:j}, e_{j+1}, ..., e_n))
                end
            end
        end
    end
Algorithm 1: Generating all possible sequences for a sentence with n EDUs.
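A direct Python transcription of Algorithm 1 might look as follows. This is a sketch under the assumption that a unit e_{i:k} is represented by the pair of its start and end EDU ids; the names are ours, not those of the thesis implementation.

# Sketch of Algorithm 1: enumerate the unit sequences needed to score
# every constituent R[i, k, j], skipping the duplicates described above.
# A unit is a pair (start_edu, end_edu); the EDU e_i is the pair (i, i).
def all_sequences(n):
    sequences = []
    for i in range(1, n):                    # i = 1 .. n-1
        for j in range(i + 1, n + 1):        # j = i+1 .. n
            # at the top and bottom levels (j == n) k starts at i,
            # otherwise at i+1, which excludes duplicate sequences
            start_k = i if j == n else i + 1
            for k in range(start_k, j):      # k runs up to j-1
                seq = ([(x, x) for x in range(1, i)] +
                       [(i, k), (k + 1, j)] +
                       [(x, x) for x in range(j + 1, n + 1)])
                sequences.append(seq)
    return sequences

# For a four-EDU sentence this yields the sequences behind Figure 3.5,
# e.g. [(1, 2), (3, 3), (4, 4)] for scoring the constituent R[1, 2, 3].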
Once we acquire the probability of all possible DT constituents, the discourse sub-trees for the sentences are built by applying an optimal probabilistic parsing algorithm (Section 3.4.2) using one of the methods described in Section 3.4.3.

Multi-Sentential Parsing Model

Given the discourse units (sub-trees) for all the sentences in a document, a simple approach to build the rhetorical parse tree of the document would be to apply a new DCRF model, similar to the one in Figure 3.4 (with different parameters), to all the possible sequences generated from these units to infer the probability of all possible higher-order constituents. However, the number of possible sequences and their length increase with the number of sentences in a document. For example, assuming that each sentence has a well-formed DT, for a document with n sentences, Algorithm 1 generates O(n^3) sequences, where the sequence at the bottom level has n units, each of the sequences at the second level has n-1 units, and so on. Since the model in Figure 3.4 has a "fat" chain structure, we could use the forwards-backwards algorithm for exact inference in this model [204]. However, forwards-backwards on a sequence containing T units costs O(TM^2) time, where M is the number of relations in our relation set. This makes the chain-structured DCRF model impractical for multi-sentential parsing of long documents, since learning requires running inference on every training sequence, with an overall time complexity of O(TM^2 n^3) per document.

[Figure 3.6: A CRF as a multi-sentential parsing model. Two adjacent discourse units U_{t-1} and U_t at level i are observed; a single hidden structure node S_t and a single hidden relation node R_t are connected to them.]

Our model for multi-sentential parsing is shown in Figure 3.6. The two observed nodes U_{t-1} and U_t are two adjacent discourse units. The (hidden) structure node S ∈ {0, 1} denotes whether the two discourse units should be linked or not. The other hidden node R ∈ {1, ..., M} represents the relation between the two units. Notice that, similar to the model in Figure 3.4, this is also an undirected graphical model, and it becomes a CRF model if we directly model the labels by conditioning the clique potential φ on the input features x, derived from the observed variables:

P(R_t, S_t \mid x, \Theta) = \frac{1}{Z(x, \Theta)} \phi(R_t, S_t \mid x, \Theta)    (3.5)
\phi(R_t, S_t \mid x, \Theta) = \exp(\theta^T f(R_t, S_t, x))    (3.6)

where f(R_t, S_t, x) is a feature vector derived from the input features x and the labels R_t and S_t, and θ is the corresponding weight vector. Although this model is similar in spirit to the model in Figure 3.4, we now break the chain structure, which makes the inference much faster (i.e., a complexity of O(M^2)). Breaking the chain structure also allows us to balance the data for training (an equal number of instances with S = 1 and S = 0), which dramatically reduces the learning time of the model.

We apply our model to all possible adjacent units at all levels in the multi-sentential case, and compute the posterior marginals of the relation-structure pairs P(R_t, S_t = 1 | U_{t-1}, U_t, Θ) to obtain the probability of all possible DT constituents. Given the sentence-level discourse units, the following algorithm, which is a simplified variation of Algorithm 1, extracts all possible adjacent units for a document.

    Input: Sequence of units: (U_1, U_2, ..., U_n), where U_x[0] := start EDU ID of unit x, and U_x[1] := end EDU ID of unit x.
    Output: List of adjacent units: L
    for i = 1 → n-1 do
        for j = i+1 → n do
            for k = i → 
j?1 doLe f t =Ui[0] : Uk[1]Right =Uk+1[0] : U j[1]L.append ((Le f t,Right))endendendAlgorithm 2: Generating all possible adjacent units at all levels of adocument-level discourse tree.Both of our CRF models (intra-sentential and multi-sentential) are designedusing MALLET?s graphical model toolkit GRMM [132]. In order to avoid over-fitting, we regularize the CRF models with l2 regularization and learn the modelparameters using the limited-memory BFGS (L-BFGS) fitting algorithm.Features Used in our Parsing ModelsCrucial to parsing performance is the set of features used in the parsing models,as summarized in table 3.1. Table 3.1 also specifies in what parsing model eachfeature is used. Notice that some of the features are used in both models. Thefeatures are extracted from two adjacent discourse units Ut?1 and Ut . Most of thefeatures have been explored in previous studies (e.g., [84, 191, 194]). However, we117improve some of these as explained below.8 Organizational features Intra & Multi-SententialNumber of EDUs in unit 1 (or unit 2).Number of tokens in unit 1 (or unit 2).Distance of unit 1 in EDUs to the beginning (or to the end).Distance of unit 2 in EDUs to the beginning (or to the end).4 Text structural features Multi-SententialNumber of sentences in unit 1 (or unit 2).Number of paragraphs in unit 1 (or unit 2).8 N-gram features N?{1,2,3} Intra & Multi-SententialBeginning (or end) lexical N-grams in unit 1.Beginning (or end) lexical N-grams in unit 2.Beginning (or end) POS N-grams in unit 1.Beginning (or end) POS N-grams in unit 2.5 Dominance set features Intra-SententialSyntactic labels of the head node and the attachment node.Lexical heads of the head node and the attachment node.Dominance relationship between the two units.9 Lexical chain features Multi-SententialNumber of chains spanning unit 1 and unit 2.Number of chains start in unit 1 and end in unit 2.Number of chains start (or end) in unit 1 (or in unit 2).Number of chains skipping both unit 1 and unit 2.Number of chains skipping unit 1 (or unit 2).2 Contextual features Intra & Multi-SententialPrevious and next feature vectors.2 Sub-structural features Intra & Multi-SententialRoot nodes of the left and right rhetorical sub-trees.Table 3.1: Features used in our intra- and multi-sentential parsing models.Organizational features encode useful information about text organization asshown by duVerle and Prendinger [60]. We measure the length of the units as thenumber of EDUs and tokens in it. However, in order to better adjust to the lengthvariations, rather than computing their absolute numbers in a unit, we choose tomeasure their relative numbers with respect to their total numbers in the two units.For example, if the two units under consideration contain three EDUs in total, a118unit containing two of the EDUs will have a relative EDU number of 0.67. We alsomeasure the distances of the units in terms of the number of EDUs from the begin-ning and end of the sentence (or text in the multi-sentential case). Text structuralfeatures capture the correlation between text structure and rhetorical structure bycounting the number of sentence and paragraph boundaries in the units.Discourse cues (e.g., because, but), when present, signal rhetorical relationsbetween two text segments [106, 125]. However, recent studies [22, 84] suggestthat an empirically acquired lexical N-gram dictionary is more effective than afixed list of cue phrases, since this approach is domain independent and capable ofcapturing non-lexical cues such as punctuation. 
To build the lexical N-gram dic-tionary empirically from the training corpus we consider the first and last N tokens(N?{1,2,3}) of each unit and rank them according to their mutual informationwith the two labels, Structure and Relation.8 Intuitively, the most informative cuesare not only the most frequent, but also the ones that are indicative of the labelsin the training data [28]. In addition to the lexical N-grams we also encode thePOS tags of the first and last N tokens (N?{1,2,3}) in a unit as shallow-syntacticfeatures.Lexico-syntactic features dominance sets extracted from the Discourse Seg-mented Lexicalized Syntactic Tree (DS-LST) of a sentence has been shown to beextremely effective for intra-sentential discourse parsing in SPADE [191]. Fig-ure 3.7(a) shows the DS-LST (i.e., lexicalized syntactic tree with EDUs identified)for a sentence with three EDUs from the RST-DT corpus, and Figure 3.7(b) showsthe corresponding discourse tree. In a DS-LST, each EDU except the one withthe root node must have a head node NH that is attached to an attachment nodeNA residing in a separate EDU. A dominance set D (shown at the bottom of Fig-ure 3.7(a)) contains these attachment points (shown in boxes) of the EDUs in aDS-LST. In addition to the syntactic and lexical information of the head and attach-ment nodes, each element in D also includes a dominance relationship between theEDUs involved. The EDU with NA dominates (represented by ?>?) the EDU withNH .Soricut and Marcu [191] hypothesize that the dominance set (i.e., lexical heads,8In contrast, HILDA [84] ranks the N-grams by their frequencies in the training corpus.119NPS (to)VBNPDTTOVPNPINNP123D = { ((1, efforts/NP) > (2, to/S)), ((3, say/S) > (1, hamstrung/S)) }The bankDT NNS VPAUX VPwasVBN PPhamstrung in its effortsPRP$ NNSVPPPNNS IN NPDT VBG NNNPINPPPRP$ NNSNP PPTO NPDT NN ,NP VP .NNS VBPanalysts     saySNHNANANH(efforts)(hamstrung)(say)(its) (efforts)to       face        the challenges of a   changing market governmentto theby   its   links.,(a) The discourse segmented lexicalized syntactic tree (DS-LST) for a sentence inRST-DT. Boxed nodes form the dominance set D as shown at the bottom.to face the challenges of a changing market by its links to the government,analysts say.[1-2] ElaborationAttributionThe bank was hamstrung in its efforts(1) (2)(3)(b) Discourse tree for the above sentence.Figure 3.7: Dominance set features for intra-sentential discourse parsing.120syntactic labels and dominance relationships) carries the most informative cluesfor intra-sentential parsing. For instance, the dominance relationship between theEDUs in our example sentence is 3 > 1 > 2, which favors the DT structure [1,1,2]over [2,2,3]. In order to extract dominance set features for two adjacent units Ut?1and Ut , containing EDUs ei: j and e j+1:k, respectively, we first compute D from theDS-LST of the sentence. We then extract the element from D that holds acrossthe EDUs j and j+1. In our example, for the two units, containing EDUs e1 ande2, respectively, the relevant dominance set element is (1, efforts/NP)>(2, to/S).We encode the syntactic labels and lexical heads of NH and NA and the dominancerelationship as features in our intra-sentential parsing model.As described in Chapter 2, Lexical chains [143] are sequences of semanticallyrelated words that can indicate topical boundaries in a text. Features extracted fromlexical chains are also shown to be useful for finding paragraph-level discoursestructure [193]. 
For example, consider the text with 4 paragraphs (P1 to P4) inFigure 3.8(a). Now, let us assume that there is a lexical chain that spans the wholetext, skipping paragraphs P2 and P3, while a second chain only spans P2 and P3.This situation makes it more likely that P2 and P3 should be linked in the DT beforeany of them is linked with another paragraph. Therefore, the DT structure in Figure3.8(b) should be more likely than the structure in Figure 3.8(c).P P P P1 2 3 4 P P P P1 2 3 4 P P P P1 2 3 4(a) (b) (c)Figure 3.8: Correlation between lexical chains and discourse structure. (a)Lexical chains spanning paragraphs. (b) Two possible DT structures.One challenge in computing lexical chains is that words can have multiplesenses and semantic relationships depend on the sense rather than the word itself.Several methods have been proposed to compute lexical chains [17, 73, 86, 187].We follow the approach proposed by Galley and McKeown [73], that extracts lex-ical chains after performing Word Sense Disambiguation (WSD). In the prepro-121cessing step, we extract the nouns from the document and lemmatize them usingWordNet?s built-in morphy function [67]. Then, by looking up in WordNet we ex-pand each noun to all of its senses, and build a Lexical Semantic Relatedness Graph(LSRG). In a LSRG, the nodes represent noun-tokens with their candidate senses,and the weighted edges between senses of two different tokens represent one of thethree semantic relations: repetition, synonym and hypernym. For example, Figure3.9(a) shows a partial LSRG, where the token bank has two possible senses, namelymoney bank and river bank. Using the money bank sense, bank is connected withinstitution and company by hypernymy relations (edges marked with H), and withanother bank by a repetition relation (edges marked with R). Similarly, using theriver bank sense, it is connected with riverside by a hypernymy relation and withbank by a repetition relation. Nouns that are not found in WordNet are consideredas proper nouns having only one sense, and are connected by a repetition relation.bankcompanyinstitutionriversidebankRRHHHHHHSbankcompanyinstitutionriversidebank(bank, company, institution, bank)(riverside)(a) (b)Figure 3.9: Extracting lexical chains. (a) A Lexical Semantic RelatednessGraph (LSRG) for five noun-tokens. (b) Resultant graph after perform-ing WSD. The box at the bottom shows the lexical chains.We use this LSRG first to perform WSD, then to construct lexical chains. ForWSD, the weights of all edges leaving the nodes under their different senses aresummed up and the one with the highest score is considered to be the right sensefor the word-token. For example, if repetition and synonymy are weighted equally,and hypernymy is given half as much weight as either of them, the score of bank?stwo senses are: 1+ 0.5+ 0.5 = 2 for the sense money bank and 1+ 0.5 = 1.5 for122the sense river bank. Therefore, the selected sense for bank in this context is riverbank. In case of a tie, we select the sense that is most frequent (i.e., the first sensein WordNet). Note that this approach to WSD is different from [193], which takesa greedy approach.Finally, we prune the graph by only keeping the links that connect words withthe selected senses. At the end of the process, we are left with the edges that formthe actual lexical chains. For example, Figure 3.9(b) shows the result of pruningthe graph in Figure 3.9(a). The lexical chains extracted from the pruned graph areshown in the box at the bottom. 
Following [193], for each chain element, we keeptrack of the location (i.e., sentence Id) in the text where that element was found,and exclude chains containing only one element. Given two discourse units, wecount the number of chains that: hit the two units, exclusively hit the two units,skip both units, skip one of the units, start in a unit, and end in a unit.We also consider more contextual information by including the above featurescomputed for the neighboring adjacent unit pairs in the current feature vector. Forexample, the contextual features for units Ut?1 and Ut includes the feature vectorcomputed from Ut?2 and Ut?1 and the feature vector computed from Ut and Ut+1.We incorporate hierarchical dependencies between the constituents in a DT byrhetorical sub-structural features. For two adjacent discourse units Ut?1 and Ut ,we extract the roots of the two rhetorical sub-trees. For our example in Figure3.7(b), the root of the rhetorical sub-tree spanning over EDUs e1:2 is Elaboration-NS. However, this assumes the presence of a labeled DT, which is not the case whenwe apply the parser to a new text (sentence or document). This problem can be eas-ily solved by looping twice through building the parsing model and applying theparsing algorithm (see Section 3.4.2). We first build the model without consideringthe sub-structural features. Then we find the optimal DT employing our parsing al-gorithm. This intermediate DT will now provide labels for the sub-structures. Nextwe can build a new, more accurate model by including the sub-structure features,and run again the parsing algorithm to find the final optimal DT.In addition to the above features, we also experimented with other featuresincluding WordNet-based lexical semantics, subjectivity and TF.IDF-based cosinesimilarity. However, since such features did not improve parsing performance onour development set, they were excluded from our final set of features.1233.4.2 Parsing AlgorithmOur parsing models assign a probability to every possible DT constituent in theintra-sentential and multi-sentential scenarios. The job of the parsing algorithm isto find the most probable DT for the whole text. Formally, this can be written as,DT ? = argmaxDTP(DT |?) (3.7)where ? specifies the parameters of the parsing model (intra-sentential or multi-sentential). We implement a probabilistic CKY-like bottom-up algorithm for com-puting the most likely parse using dynamic programming (see [100] for details).Specifically, with n discourse units, we use the upper-triangular portion of the n?ndynamic programming table D, where cell D[i, j] (for i < j) stores:D[i, j] = P(r?[Ui(0),Uk?(1),U j(1)]) (3.8)where Ux(0) and Ux(1) are the start and end EDU Ids of discourse unit Ux, and(k?,r?) = argmaxi?k? j ; R?{1???M}P(R[Ui(0),Uk(1),U j(1)])?D[i,k]?D[k+1, j] (3.9)Recall that the notation R[Ui(0),Uk(1),U j(1)] refers to a rhetorical relation Rbetween the discourse unit containing EDUs Ui(0) through Uk(1) and the unit con-taining EDUs Uk(1)+ 1 through U j(1). In addition to D, which stores the proba-bility of the most probable constituents of a DT, we also simultaneously maintaintwo other n?n dynamic programming tables S and R for storing the structure (i.e.,Uk?(1)) and the relations (i.e., r?) of the corresponding DT constituents, respec-tively. For example, given 4 EDUs e1 ? ? ?e4, the S and R tables at the left of Figure3.10 together represent the DT shown at the right. 
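Before walking through that example, the bottom-up procedure behind Equations (3.8)-(3.9) can be sketched as follows. This is an illustrative reimplementation, not the thesis code: it assumes a function prob(rel, i, m, j) that returns P(R[i, m, j]) from the parsing model, and it indexes the tables directly by EDU ids rather than by discourse units.

# Sketch of the probabilistic CKY-like bottom-up parser (Eqs. 3.8-3.9).
# `prob(rel, i, m, j)` is an assumed interface to the parsing model.
def parse(n, relations, prob):
    D = {}   # D[i, j]: probability of the best parse of EDUs i..j
    S = {}   # S[i, j]: best split point k*
    R = {}   # R[i, j]: best relation r*
    for i in range(1, n + 1):
        D[i, i] = 1.0                            # a single EDU needs no relation
    for length in range(2, n + 1):               # span length in EDUs
        for i in range(1, n - length + 2):
            j = i + length - 1
            best_p, best_k, best_r = 0.0, None, None
            for k in range(i, j):                # split into i..k and k+1..j
                for rel in relations:
                    p = prob(rel, i, k, j) * D[i, k] * D[k + 1, j]
                    if p > best_p:
                        best_p, best_k, best_r = p, k, rel
            D[i, j], S[i, j], R[i, j] = best_p, best_k, best_r
    return D, S, R   # backtracking through S and R recovers the tree

Reading S[1, n] and R[1, n] and recursing into the two halves reconstructs the most probable discourse tree, which is exactly the walkthrough of Figure 3.10 that follows.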
For example, given 4 EDUs e1, ..., e4, the S and R tables at the left of Figure 3.10 together represent the DT shown at the right. To find the DT, we first look at the top-right entries in the two tables; here S[1,4] = 2 and R[1,4] = r2 specify that the two discourse units e_{1:2} and e_{3:4} should be connected by the relation r2 (the root in the DT). Then, we see how EDUs e1 and e2 should be connected by looking at the entries S[1,2] and R[1,2]; here S[1,2] = 1 and R[1,2] = r1 indicate that they should be connected by the relation r1 (the left non-terminal in the DT). Finally, to see how EDUs e3 and e4 should be linked, we look at the entries S[3,4] and R[3,4], which tell us that they should be linked by the relation r4 (the right non-terminal).

[Figure 3.10: The S and R dynamic programming tables (left), and the corresponding discourse tree (right).]

Note that, in contrast to previous studies on document-level discourse parsing [84, 126, 201], which use a greedy algorithm, our approach finds a discourse tree that is globally optimal. Our approach is also different from the sentence-level discourse parser SPADE [191]. SPADE first finds the tree structure that is globally optimal, then it assigns the most probable relations to the internal nodes. More specifically, the cell D[i, j] in SPADE's dynamic programming table stores

    D[i, j] = P([U_i(0), U_k(1), U_j(1)])                              (3.10)

where k = argmax_{i ≤ p < j} P([U_i(0), U_p(1), U_j(1)]). Disregarding the relation label R while populating D, this approach may find a discourse tree that is not globally optimal.

3.4.3 Document-level Parsing Approaches

Now that we have presented our intra-sentential and our multi-sentential parsers, we are ready to describe how they can be effectively combined to perform document-level rhetorical analysis. Recall that a key motivation for two-stage parsing is that it allows us to capture the strong correlation between text structure and discourse structure in a scalable, modular and flexible way. In the following, we describe two different approaches to model this correlation.

1S-1S (1 Sentence-1 Sub-tree)

A key finding from previous studies on sentence-level rhetorical analysis is that most sentences have a well-formed discourse sub-tree in the full DT [72, 191]. For example, Figure 3.11(a) shows 10 EDUs in 3 sentences (see boxes), where the DTs for the sentences obey their respective sentence boundaries. The 1S-1S approach aims to maximally exploit this finding. It first constructs a DT for every sentence using our intra-sentential parser, and then it provides our multi-sentential parser with the sentence-level DTs to build the rhetorical parse for the whole document.

[Figure 3.11: Two possible DTs for three sentences.]

Sliding Window

While the assumption made by 1S-1S clearly simplifies the parsing process, it totally ignores the cases where rhetorical structures violate sentence boundaries. For example, in the DT shown in Figure 3.11(b), sentence S2 does not have a well-formed sub-tree because some of its units attach to the left (4-5, 6) and some to the right (7). Vliet and Redeker [218] call these cases "leaky" boundaries. Even though less than 5% of the sentences have leaky boundaries in RST-DT, in other corpora this can be true for a larger portion of the sentences. For example, we observe over 12% of sentences with leaky boundaries in the Instructional corpus of Subba and Di-Eugenio [201].
However, we notice that in most cases where a DT structure violates sentence boundaries, the units of the sentence are merged with the units of its adjacent sentences, as in Figure 3.11(b). For example, this is true for 75% of the cases in our development set containing 20 news articles from RST-DT, and for 79% of the cases in our development set containing 20 how-to-do manuals from the Instructional corpus. Based on this observation, we propose a sliding window approach.

In this approach, our intra-sentential parser works with a window of two consecutive sentences, and builds a DT for the two sentences. For example, given the three sentences in Figure 3.11, our intra-sentential parser constructs a DT for S1-S2 and a DT for S2-S3. In this process, each sentence in a document except the first and the last will be associated with two DTs: one with the previous sentence (say DTp) and one with the next (say DTn). In other words, for each non-boundary sentence, we will have two decisions: one from DTp and one from DTn. Our parser consolidates the two decisions and generates one or more sub-trees for each sentence by checking the following three mutually exclusive conditions one after another (a code sketch of this consolidation step is given at the end of this subsection):

• Same in both: If the sentence under consideration has the same (in both structure and labels) well-formed sub-tree in both DTp and DTn, we take this sub-tree. For example, in Figure 3.12(a), S2 has the same sub-tree in the two DTs (one for S1-S2 and one for S2-S3). The two decisions agree on the DT for the sentence.

• Different but no cross: If the sentence under consideration has a well-formed sub-tree in both DTp and DTn, but the two sub-trees vary either in structure or in labels, we pick the most probable one. For example, consider the DT for S1-S2 in Figure 3.12(a) and the DT for S2-S3 in Figure 3.12(b). In both cases S2 has a well-formed sub-tree, but they differ in structure. We pick the sub-tree which has the higher probability in the two dynamic programming tables.

• Cross: If either or both of DTp and DTn segment the sentence into multiple sub-trees, we pick the one with more sub-trees. For example, consider the two DTs in Figure 3.12(c). In the DT for S1-S2, S2 has three sub-trees (4-5, 6, 7), whereas in the DT for S2-S3, it has two (4-6, 7). So, we extract the three sub-trees for S2 from the first DT. If the sentence has the same number of sub-trees in both DTp and DTn, we pick the one with higher probability in the dynamic programming tables.

[Figure 3.12: Extracting sub-trees for S2.]

At the end, the multi-sentential parser takes all these sentence-level sub-trees for a document, and builds a full rhetorical parse for the whole document.
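The consolidation step can be summarized with the following sketch (Python). Here subtrees(dt, s) and prob(t) are hypothetical helpers returning the sub-trees a window DT assigns to sentence s and the probability a sub-tree received in the dynamic programming tables; the tie-breaking by summed probability in the last case is one plausible reading of the procedure, not a definitive specification.

    # Sketch of consolidating the two window decisions for a non-boundary sentence s.
    def consolidate(dt_prev, dt_next, s, subtrees, prob):
        prev_trees = subtrees(dt_prev, s)   # sub-trees for s from the DT of (s-1, s)
        next_trees = subtrees(dt_next, s)   # sub-trees for s from the DT of (s, s+1)

        # Case 1: the same well-formed sub-tree in both windows
        if len(prev_trees) == len(next_trees) == 1 and prev_trees[0] == next_trees[0]:
            return prev_trees

        # Case 2: well-formed in both, but different -> keep the more probable one
        if len(prev_trees) == 1 and len(next_trees) == 1:
            return prev_trees if prob(prev_trees[0]) >= prob(next_trees[0]) else next_trees

        # Case 3: at least one window splits s -> keep the decision with more sub-trees,
        # breaking ties by the probabilities from the dynamic programming tables
        if len(prev_trees) != len(next_trees):
            return prev_trees if len(prev_trees) > len(next_trees) else next_trees
        p_prev = sum(prob(t) for t in prev_trees)
        p_next = sum(prob(t) for t in next_trees)
        return prev_trees if p_prev >= p_next else next_trees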
3.5 The Discourse Segmenter

Our discourse parser above assumes that the input text has already been segmented into a sequence of EDUs. However, discourse segmentation is also a challenging problem, and previous studies [72, 191] have identified it as a primary source of inaccuracy for discourse parsing. Beyond discourse parsing, segmentation itself can be useful in several NLP applications, including sentence compression [194] and textual alignment in machine translation [198]. Therefore, we have developed our own discourse segmenter, which not only achieves state-of-the-art performance, as shown later, but also reduces the time complexity by using fewer features.

3.5.1 Segmentation Model

Our discourse segmenter implements a binary classifier that decides, for each word-token (except the last) in a sentence, whether to place an EDU boundary after that word. We use a maximum entropy model to build a discriminative classifier. Specifically, we use a Logistic Regression (LR) classifier with l2 regularization:

    P(y | w, θ) = Ber(y | Sigm(θ^T x)) + λ θ^T θ                       (3.11)

where the output y ∈ {0, 1} denotes whether to put an EDU boundary (y = 1) or not (y = 0) after the word-token w, which is represented using a feature vector x. In the equation, Ber(·) and Sigm(·) refer to the Bernoulli distribution and the Sigmoid (also known as logistic) function, respectively. We learn the model parameters θ using the L-BFGS fitting algorithm, which, as described in Chapter 2, is time and space efficient. To avoid overfitting, we use 5-fold cross validation to learn the regularization strength parameter λ from the training data. We also use a simple bagging technique [30] to deal with the sparsity of boundary (i.e., y = 1) tags. Note that our first attempt at the segmentation task implemented a linear-chain CRF model [110] to capture the sequential dependencies between the tags in a discriminative way. However, the binary LR classifier, using the same set of features, not only outperforms the CRF model, but also reduces the time and space complexities. This is not surprising: given the sparsity of the boundary (i.e., y = 1) tags, Markov dependencies between tags do not deliver additional improvement. Also, since we could not balance the data by using techniques like bagging in the CRF model, this further degrades its performance.

3.5.2 Features Used in the Segmentation Model

Our set of features for discourse segmentation is mostly inspired by previous studies, but the features are used in a novel way, as we describe below.

Our first subset of features, which we call SPADE features, includes the lexico-syntactic patterns extracted from the lexicalized syntactic tree of the given sentence. These features replicate the features used in SPADE's segmenter, but they are used in a discriminative way. To decide on an EDU boundary after a word-token w_k, we find the lowest constituent in the lexicalized syntactic tree that spans over tokens w_i ... w_j such that i ≤ k < j. The production that expands this constituent in the tree, with the potential EDU boundary marked, forms the primary feature. For instance, to determine the existence of an EDU boundary after the word efforts in our sample sentence shown in Figure 3.7, the production NP(efforts) → PRP$(its) NNS(efforts) • S(to), extracted from the lexicalized syntactic tree in Figure 3.7(a), constitutes the primary feature, where • marks the potential EDU boundary.

SPADE predicts an EDU boundary if the relative frequency (i.e., the Maximum Likelihood Estimate (MLE)) of a potential boundary given the production in the training data is greater than 0.5. If the production has not been observed frequently enough, the unlexicalized version of the production, e.g., NP → PRP$ NNS • S, is used for prediction. If the unlexicalized version is also found to be rare, other variations of the production, depending on whether they include the lexical heads and how many non-terminals (one or two) they consider before and after the potential boundary, are examined one after another (see [72] for details).
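As a rough illustration, this SPADE-style thresholded back-off over production variants can be sketched as follows (Python). The counts, the minimum-frequency cutoff and the variants() enumeration of progressively more general production forms are simplifying assumptions for illustration, not SPADE's exact implementation.

    # Sketch of thresholded MLE back-off over production variants (simplified).
    def spade_predicts_boundary(production, boundary_count, total_count,
                                variants, min_count=5, threshold=0.5):
        # variants(production) yields the lexicalized production first, then its
        # progressively more general (e.g., unlexicalized) versions
        for p in variants(production):
            if total_count.get(p, 0) >= min_count:       # observed often enough?
                mle = boundary_count.get(p, 0) / total_count[p]
                return mle > threshold
        return False                                     # no variant seen frequently enough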
In contrast, we compute the MLE estimates for a primary production (feature) and its other variations, and use those directly as features, with or without binarizing the values.

Shallow syntactic features like chunk and POS tags have been shown to possess valuable clues for discourse segmentation [72, 194]. For example, it is less likely that an EDU boundary occurs within a chunk. We annotate the tokens of a sentence with chunk and POS tags using the state-of-the-art Illinois tagger9 and encode these as features in our model. Note that the chunker assigns each token a tag using the BIO notation, where B stands for the beginning of a particular phrase (e.g., noun phrase, verb phrase), I stands for inside of a particular phrase and O stands for outside of a phrase. The rationale for using the Illinois chunker is that it uses a larger set of tags (23 in total), and is thus more informative than most of the other existing taggers, which typically use only 5 tags (B-NP, I-NP, B-VP, I-VP and O).

EDUs are normally multi-word strings. Thus, a token near the beginning or end of a sentence is unlikely to be the end of a segment. Therefore, for each token we include its relative position (i.e., absolute position/total number of tokens) in the sentence and its distances to the beginning and end of the sentence as features.

It is unlikely that two consecutive tokens are tagged with EDU boundaries. Therefore, we incorporate contextual information for a token into our model by including the above features computed for its neighboring tokens.

We also experimented with different N-gram (N ∈ {1, 2, 3}) features extracted from the token sequence, POS sequence and chunk sequence. However, since such features did not improve segmentation accuracy on the development set, they were excluded from our final set of features.

9 Available at http://cogcomp.cs.illinois.edu/page/software

3.6 Experiments

This section presents our experimental results. First, we describe the corpora on which the experiments were performed and the evaluation metrics used to measure the performance of the discourse segmenter and the discourse parsers. Then we show the performance of our discourse segmenter, followed by the performance of our discourse parser.

3.6.1 Corpora

While previous studies on rhetorical analysis only report their results on a particular corpus, to demonstrate the generality of our method we experiment with texts from two very different genres: news articles and instructional how-to-do manuals.

Our first corpus is the standard RST-DT [38], which contains discourse annotations for 385 Wall Street Journal articles taken from the Penn Treebank corpus [128]. The corpus is partitioned into a training set of 347 documents and a test set of 38 documents. 53 documents, selected from both sets, were annotated by two annotators; based on these we measure human agreement. In RST-DT, the original 25 rhetorical relations defined by Mann and Thompson [122] are further divided into a set of 18 coarser relation classes with 78 finer-grained relations.

Our second corpus is the Instructional corpus prepared by Subba and Di-Eugenio [201], which contains discourse annotations for 176 how-to-do manuals on home-repair. The corpus was annotated with 26 informational relations (e.g., Preparation-Act, Act-Goal).

For our experiments with the intra-sentential parser, we extracted a sentence-level DT from a document-level DT by finding the subtree that exactly spans over the sentence.
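A minimal sketch of this extraction is given below (Python), using the same hypothetical tuple-based tree representation as the parsing sketch in Section 3.4.2: a sentence has a well-formed DT exactly when such a subtree exists.

    # Sketch: find the subtree of a document DT that exactly spans a sentence
    # (EDU ids first_edu..last_edu).  Trees are (relation, left, right) tuples
    # with ('EDU', i) leaves; helper names are hypothetical.
    def span(tree):
        if tree[0] == 'EDU':
            return (tree[1], tree[1])
        (lo, _), (_, hi) = span(tree[1]), span(tree[2])
        return (lo, hi)

    def sentence_subtree(doc_dt, first_edu, last_edu):
        if span(doc_dt) == (first_edu, last_edu):
            return doc_dt                          # the sentence has a well-formed DT
        if doc_dt[0] == 'EDU':
            return None
        return (sentence_subtree(doc_dt[1], first_edu, last_edu)
                or sentence_subtree(doc_dt[2], first_edu, last_edu))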
In RST-DT, by our count, 7321 out of 7673 sentences in the training set, 951 out of 991 sentences in the test set, and 1114 out of 1208 sentences in the doubly-annotated set have a well-formed DT. On the other hand, 3032 out of 3430 sentences in the Instructional corpus have a well-formed DT. This forms the corpora for our experiments with intra-sentential discourse parsing. However, the existence of a well-formed DT is not a necessity for discourse segmentation; therefore, we do not exclude any sentence in our discourse segmentation experiments.

3.6.2 Evaluation (and Agreement) Metrics

This section describes the metrics used to measure both how much the annotators agree with each other, and how well the systems perform when their outputs are compared with human annotations for the discourse analysis tasks.

Metrics for Discourse Segmentation

Since sentence boundaries are considered to be also the EDU boundaries, we evaluate segmentation performance with respect to the intra-sentential segment boundaries, which is a standard method for measuring segmentation accuracy [72, 191]. Specifically, if a sentence contains n EDUs, which corresponds to n-1 intra-sentential segment boundaries, we measure the segmenter's ability to correctly identify these n-1 boundaries. Let h be the total number of intra-sentential segment boundaries in the human annotation, m be the total number of intra-sentential segment boundaries in the model output, and c be the total number of correct segment boundaries in the model output. Then, we measure Precision (P), Recall (R) and F1-score for segmentation performance as follows:

    P = c/m,    R = c/h,    F1-score = 2PR/(P+R) = 2c/(h+m)

Metrics for Discourse Parsing

To evaluate parsing performance, we use the standard unlabeled and labeled precision, recall and F1-score as proposed by Marcu [126]. The unlabeled metrics measure how accurate the parser is in finding the right structure of the DT, while the labeled metrics measure the parser's ability to find the right labels (i.e., nuclearity and relation) in addition to the right structure. Assume, for example, that given the two sentences of Figure 3.1, our system generates the DT shown in Figure 3.13.

[Figure 3.13: A hypothetical system-generated discourse tree for the two sentences in Figure 3.1.]

Figure 3.14 shows the same human-annotated DT shown in Figure 3.1 (Figure 3.14(a)) and the same system-generated DT shown in Figure 3.13 (Figure 3.14(b)) when we align the two structures. For the sake of illustration, instead of showing the real EDUs, we only show their IDs. Notice that the automatic segmenter breaks the EDU marked 2-3 in the human annotation into two EDUs, and did not identify the break between EDUs 5 and 6.

[Figure 3.14: Measuring the accuracy of a rhetorical parser. (a) The human-annotated discourse tree. (b) The system-generated discourse tree.]

In Table 3.2, we list all constituents in the two DTs and their associated labels at the span, nuclearity and relation levels. The recall (R) and precision (P) figures are shown at the bottom of the table.

                        Spans              Nuclearity                Relations
    Constituents    Human   System      Human   System       Human          System
    1-1               *       *           S       S          Attribution    Attribution
    2-2                       *                   S                         Elaboration
    3-3                       *                   N                         Span
    4-4               *       *           N       N          Contrast       Contrast
    5-5               *                   N                  Span
    6-6               *                   S                  Elaboration
    7-7               *       *           N       N          Same-Unit      Joint
    2-3               *       *           N       N          Contrast       Contrast
    5-6               *       *           N       N          Same-Unit      Joint
    2-4               *       *           S       N          Contrast       Span
    5-7               *       *           N       N          Contrast       Contrast
    1-4                       *                   N                         Contrast
    2-7               *                   N                  Span
                  R = 7/10, P = 7/10   R = 6/10, P = 6/10   R = 4/10, P = 4/10

    Table 3.2: Measuring parsing accuracy (P = Precision, R = Recall).
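The computation behind Table 3.2 can be sketched as follows (Python). Each constituent is represented as a tuple of its EDU span, nuclearity status and relation label; which constituents are counted (e.g., whether the root is included) follows the convention of the table, and the helper names are illustrative only.

    # Sketch: unlabeled and labeled precision/recall over DT constituents.
    # Each constituent is (first_edu, last_edu, nuclearity, relation).
    def parseval(gold, system):
        g_span = {(c[0], c[1]) for c in gold}
        s_span = {(c[0], c[1]) for c in system}
        g_nuc  = {(c[0], c[1], c[2]) for c in gold}
        s_nuc  = {(c[0], c[1], c[2]) for c in system}
        g_rel  = {(c[0], c[1], c[3]) for c in gold}
        s_rel  = {(c[0], c[1], c[3]) for c in system}
        def pr(g, s):
            correct = len(g & s)
            return correct / len(s), correct / len(g)    # precision, recall
        return {'span':       pr(g_span, s_span),
                'nuclearity': pr(g_nuc, s_nuc),
                'relation':   pr(g_rel, s_rel)}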
Note that, following [126], the relation labels are assigned to the children nodes rather than to the parent nodes in the evaluation process, in order to deal with non-binary trees in human annotations. From this discussion, it is easy to see that if the number of EDUs is the same in the human and system annotations, and the discourse trees are binary, then we get the same figures for precision, recall and F1-score.

3.6.3 Discourse Segmentation Evaluation

In this section we present our experiments on discourse segmentation.

Experimental Setup for Discourse Segmentation

We compare the performance of our segmenter with the performance of the two publicly available segmenters, namely the segmenters of HILDA [84] and SPADE [191]. We also compare our results with the state-of-the-art results reported by Fisher and Roark [72] on the RST-DT test set. We ran HILDA with its default settings. For SPADE, we applied the same modifications to its default settings as described in [72], which delivers significantly improved performance over its original version. Specifically, in our experiments on RST-DT, we trained SPADE using the human-annotated syntactic trees extracted from the Penn Treebank, and during testing, we replaced the Charniak parser [42] with a more accurate reranking parser [43]. However, due to the lack of gold syntactic trees in the Instructional corpus, we trained SPADE on this corpus using the syntactic trees produced by the reranking parser. To avoid using the gold syntactic trees, we used the reranking parser in all our systems for both training and testing purposes. This syntactic parser was trained on the sections of the Penn Treebank not included in our test set.
Our seg-menter LR outperforms SPADE with an absolute F1-score improvement of 4.9% (p< 2.4e-06), and also achieves comparable results to the results of Fisher and Roark[72] (F&R), even though we use fewer features.11 Notice that human agreementfor this task is quite high, i.e., a F1-score of 98.3 computed on the doubly-annotatedportion of the RST-DT corpus mentioned in Section 3.6.1.RST-DT InstructionalStandard Test Set Doubly 10-fold 10-fold 10-foldHILDA SPADE F&R LR Human SPADE LR SPADE LRPrecision 77.9 83.8 91.3* 88.0* 98.5 83.7 87.5* 65.1 73.9*Recall 70.6 86.8 89.7* 92.3* 98.2 86.2 89.9* 82.8 89.7*F1-score 74.1 85.2 90.5* 90.1* 98.3 84.9 88.7* 72.8 80.9*Table 3.3: Segmentation results of different models on the two corpora. Per-formances significantly superior to SPADE are denoted by *.Since Fisher and Roark [72] only report their results on the RST-DT test set andwe did not have access to their system, we compare our approach with only SPADEwhen evaluating on a whole corpus based on 10-fold cross validation. On RST-DT,10The high segmentation accuracy reported in [84] is due to a less stringent evaluation metric.11Since we did not have access to the system or to the complete output/results of Fisher and Roark[72], we were not able to perform a significance test.135our segmenter delivers an absolute F1-score improvement of 3.8%, which repre-sents a more than 25% relative error rate reduction. The improvement is higheron the Instructional corpus with an absolute F1-score improvement of 8.1%, whichcorresponds to a relative error reduction of 30%. The improvements for both cor-pora are statistically significant (p < 3.0e-06). When we compare our results on thetwo corpora, we observe a substantial decrease in performance on the Instructionalcorpus. This could be due to a smaller amount of data in this corpus and the inac-curacies in the syntactic parser and taggers, which are trained on news articles. Apromising future direction would be to apply effective domain adaptation methods(e.g., easyadapt [53]) to improve segmentation performance in the Instructionaldomain by leveraging the rich data in the news domain (i.e., RST-DT).3.6.4 Discourse Parsing EvaluationThis section presents our experiments on discourse parsing. First, we describethe experimental setup. Then, we present the results. While presenting the per-formance of our discourse parser, we show a breakdown of intra-sentential vs.inter-sentential results, in addition to the aggregated results at the document level.Experimental Setup for Discourse ParsingIn our experiments on sentence-level (i.e., intra-sentential) parsing, we compareour approach with SPADE [191] on RST-DT, and with the ILP-based approach ofSubba and Di-Eugenio [201] on the Instructional corpus, since they are the state-of-the-art on the respective genres. For SPADE, we applied the same modificationsto its default settings as described in Section 3.6.3, which leads to improved per-formance. Similarly, in our experiments on document-level (i.e., multi-sentential)parsing, we compare our approach with HILDA [84] on RST-DT, and with theILP-based approach [201] on the Instructional corpus. The results for HILDAwere obtained by running the system with default settings on the same inputs weprovided to our system. Since we could not run the ILP-based system (not publiclyavailable), we report the performance presented in their paper.Our experiments on RST-DT use the 18 coarser relations (see Figure 3.15)defined by Carlson and Marcu [37] and also used in SPADE and HILDA. 
After at-136taching the nuclearity statuses (NS, SN, NN) to these relations, we get 41 distinctrelations.12 Our experiments on the Instructional corpus consider the same 26 pri-mary relations (e.g., Goal:Act, Cause:Effect) used by Subba and Di-Eugenio [201]and also treat the reversals of non-commutative relations as separate relations. Thatis, Goal-Act and Act-Goal are considered as two different relations. Attaching thenuclearity statuses to these relations gives 76 distinct relations.Evaluation of the Intra-sentential ParserThis section presents our experimental evaluation on intra-sentential parsing. First,we show the performance of the sentence-level discourse parsers when they areprovided with manual (or gold) segmentations. This allows us to judge the parsingperformance independently of the discourse segmentation task. Then, we show theend-to-end performance, i.e., the performance based on automatic segmentation.Intra-sentential parsing results based on manual segmentationTable 3.4 presents the intra-sentential parsing results when manual segmentationis used.13 Notice that our sentence-level parser DCRF consistently outperformsSPADE on the RST-DT test set in all three metrics, and the improvements arestatistically significant (p < 0.01). Especially, on the relation labeling task, whichis the hardest among the three tasks, we get an absolute F1-score improvementof 9.6%, which represents a relative error rate reduction of 29.6%. Our F1-scoreof 77.1 in relation labeling is also close to the human agreement of 83.0 on thedoubly-annotated data. Our results on the RST-DT test set are also consistent withthe mean scores over 10-folds on the whole RST-DT corpus.The improvements are higher on the Instructional corpus, where we compareour mean results over 10-folds with the reported results of the ILP-based system[201], giving absolute F1-score improvements of 4.8%, 15.5% and 10.6% in span,nuclearity and relations, respectively.14 In other words, our parser reduces the12Not all relations take all the possible nuclearity statuses. For example, Elaboration and Attribu-tion are mono-nuclear relations, and Same-Unit and Joint are multi-nuclear relations.13Recall from the discussion in Section 3.6.2 that precision, recall and F1-score are the same whenmanual segmentation is used.14Subba and Di-Eugenio [201] report their results based on an arbitrary split between training andtest sets. Since we did not have access to their particular split, we compare our model?s performance137RST-DT InstructionalStandard Test Set 10-fold Doubly Reported 10-foldScores SPADE DCRF DCRF Human ILP DCRFSpan 93.5 96.5* 95.3 95.7 92.9 98.3Nuclearity 85.8 89.4* 88.2 90.4 71.8 89.2Relation 67.6 79.8* 77.7 83.0 63.0 75.6Table 3.4: Intra-sentential parsing results based on manual segmentation.Performances significantly superior to SPADE are denoted by *.errors by 67.6%, 54.6% and 28.6% in span, nuclearity and relations, respectively.If we compare the performance of our DCRF parser on the two corpora, wenotice that our parser is more accurate in finding the right tree structure (see Spanrow in the table) on the Instructional corpus. This may be due to the fact that sen-tences in the Instructional domain are relatively short and contain fewer EDUs thansentences in the news domain, thus making it easier to find the right tree structure.However, when we compare the performance on the relation labeling task, we ob-serve a decrease on the Instructional corpus. 
This may be due to the small amountof data available for training and the imbalanced distribution of a large number ofdiscourse relations (i.e., 76 with nuclearity attached) in this corpus.Intra-sentential parsing results based on automatic segmentationIn order to evaluate the performance of the end-to-end sentence-level discourseanalysis systems, we feed the intra-sentential discourse parsers the output of theirrespective segmenters. Table 3.5 shows the (P)recision, (R)ecall, and (F1)-scoreresults for different metrics. We compare our results with SPADE on the RST-DTtest set. We achieve absolute F1-score improvements of 3.6%, 3.4% and 7.4% inspan, nuclearity and relation, respectively. These improvements are statisticallysignificant (p<0.001). Our system, therefore, reduces the errors by 15.5%, 11.4%,and 17.6% in span, nuclearity and relations, respectively. These results are alsoconsistent with the mean results over 10-folds on the whole RST-DT corpus.based on 10-fold cross validation with their reported results. Also, since we did not have access totheir system/output, we could not perform a significance test on the Instructional corpus.138RST-DT InstructionalTest set 10-fold 10-foldScores SPADE DCRF DCRF DCRFP R F1 P R F1 P R F1 P R F1Span 75.9 77.4 76.7 80.8* 84.0* 82.4* 79.6 80.7 80.1 73.5 80.7 76.9Nuc. 69.8 70.5 70.2 75.2* 78.1* 76.6* 73.9 76.5 75.2 64.6 71.0 67.6Rel. 57.4 58.5 58.0 66.1* 68.8* 67.5* 65.2 67.4 66.3 54.8 60.4 57.5Table 3.5: Parsing results using automatic segmentation. Performances sig-nificantly superior to SPADE are denoted by *.The rightmost column in the table shows our mean results over 10-folds on theInstructional corpus. We could not compare with the ILP-based approach [201]because no results were reported using an automatic segmenter. It is interesting toobserve how much our end-to-end system is affected by an automatic segmenter onboth RST-DT and the Instructional corpus (see Table 3.4 and Table 3.5). Neverthe-less, taking into account the segmentation results in Table 3.3, this is not surprisingbecause previous studies [191] have already shown that automatic segmentation isthe primary impediment to high accuracy discourse parsing. This demonstrates theneed for a more accurate segmentation model in the Instructional genre.Evaluation of the Complete ParserWe experiment with our full document-level discourse parser on the two corporausing the two parsing approaches described in Section 3.4.3, namely 1S-1S and thesliding window. On RST-DT, the standard split was used for training and testing.On the Instructional corpus, Subba and Di-Eugenio [201] used 151 documentsfor training and 25 documents for testing. Since we did not have access to theirparticular split, we took 5 random samples of 151 documents for training and 25documents for testing, and report the average performance over the 5 test sets.Table 3.6 presents F1-scores for our parsers and the existing systems on the twocorpora based on manual segmentation.15 On both corpora, our parsers, namely,1S-1S (TSP 1-1) and the sliding window (TSP SW), outperform existing systemsby a wide margin (p<7.1e-05 on RST-DT).16 On RST-DT, our parsers achieve ab-15Recall that Precision, Recall and F1-score are the same when manual segmentation is used.16Since we did not have access to the output or to the system of Subba and Di-Eugenio [201], we139solute F1-score improvements of 8%, 9.4% and 11.4% in span, nuclearity and re-lation, respectively, over HILDA. 
This represents relative error reductions of 32%,23% and 21% in span, nuclearity and relation, respectively. Our results are alsoclose to the upper bound, i.e. human agreement on this data set.On the Instructional genre, our parsers deliver absolute F1-score improvementsof 10.5%, 13.6% and 8.14% in span, nuclearity and relations, respectively, overthe ILP-based approach. Our parsers, therefore, reduce errors by 36%, 27% and13% in span, nuclearity and relations, respectively.RST-DT InstructionalMetrics HILDA TSP 1-1 TSP SW Human ILP TSP 1-1 TSP SWSpan 74.68 82.56* 82.84*? 88.70 70.35 80.67 81.88?Nuclearity 58.99 68.32* 68.30* 77.72 49.47 63.03 63.13Relation 44.32 55.83* 55.81* 65.75 35.44 43.52 43.60Table 3.6: Parsing results of different document-level parsers using man-ual (gold) segmentation. Performances significantly superior to HILDA(with p<7.1e-05) are denoted by *. Significant differences between TSP1-1 and TSP SW (with p<0.01) are denoted by ?.If we compare the performance of our parsers on the two corpora, we observehigher results on RST-DT. This can be explained in at least two ways. First, theInstructional corpus has a smaller amount of data with a larger set of relations (76when nuclearity attached). Second, some frequent relations are (semantically) verysimilar (e.g., Preparation-Act, Step1-Step2), which makes it difficult even for thehuman annotators to distinguish them [201].Comparison between our two models reveals that TSP SW significantly out-performs TSP 1-1 only in finding the right structure on both corpora (p<0.01).Not surprisingly, the improvement is higher on the Instructional corpus. A likelyexplanation is that the Instructional corpus contains more leaky boundaries (12%),allowing the sliding window approach to be more effective in finding those, with-out inducing much noise for the labels. This clearly demonstrates the potential ofTSP SW for datasets with even more leaky boundaries e.g., the Dutch [218] andthe German Potsdam [197] corpora.were not able to perform a significance test on the Instructional corpus.140Error analysis reveals that although TSP SW finds more correct structures, acorresponding improvement in labeling relations is not present because in a fewcases, it tends to induce noise from the neighboring sentences for the labels. Forexample, when parsing was performed on the first sentence in Figure 3.1 in isola-tion using 1S-1S, our parser rightly identifies the Contrast relation between EDUs2 and 3. But, when it is considered with its neighboring sentences by the slidingwindow, the parser labels it as Elaboration. A promising strategy to deal with thisand similar problems that we plan to explore in future, is to apply both approachesto each sentence and combine them by consolidating three probabilistic decisions,i.e. the one from 1S-1S and the two from the sliding window.Analysis on the importance of the featuresTo analyze the importance of the features, Table 3.7 presents the intra-sententialand multi-sentential parsing results based on manual segmentation on the RST-DTtest set using different subsets of features. Every new subset of features appearsto improve the performance. Specifically, for intra-sentential parsing, when weadd the Organizational features with the Dominance set features (see I2 column),we get about 2% absolute improvements in nuclearity and relations. 
With N-gramfeatures, the gain is even higher; 6% in relations and 3.5% in nuclearity for intra-sentential parsing (see I3), and 3.8% in relations and 3.1% in nuclearity for multi-sentential parsing (see M2). This demonstrates the utility of the N-gram features,which is also consistent with the findings of [60, 181]. The features extracted fromLexical chains have also proved to be useful for multi-sentential parsing. Theydeliver absolute improvements of 2.7%, 2.9% and 2.3% in span, nuclearity andrelations, respectively (see M3). Including the Contextual features further givesimprovements of 3% in nuclearity and 2.2% in relation for intra-sentential parsing(see I4), and 1.3% in nuclearity and 1.2% in relation for multi-sentential parsing(see M4). Notice that Sub-structural features are more beneficial for document-level parsing than they are for sentence-level parsing, i.e., an improvement of 2.2%vs. an improvement of 0.9%. This is not surprising because in general document-level discourse trees are much larger than sentence-level trees, making the sub-structural features more effective for document-level parsing.141Intra-sentential Multi-sentential (TSP 1-1)Scores I1 I2 I3 I4 I5 M1 M2 M3 M4 M5Span 91.3 92.1 93.3 94.6 96.5 74.2 75.8 78.5 80.9 82.6Nuclearity 78.2 80.3 83.8 86.8 89.4 60.6 63.7 65.6 66.9 68.3Relation 66.2 68.1 74.1 76.3 79.8 46.3 50.1 52.4 53.6 55.8Table 3.7: Parsing results using different subsets of features on RST-DT testset. Feature subsets for Intra-sentential parsing: I1 = {Dominance set},I2 = {Dominance set, Organizational}, I3 = {Dominance set, Organi-zational, N-gram}, I4 = {Dominance set, Organizational, N-gram, Con-textual}, I5 (all) = {Dominance set, Organizational, N-gram, Contex-tual, Sub-structural}. Feature subsets for Multi-sentential parsing: M1= {Organizational, Text structural}, M2 = {Organizational, Text struc-tural, N-gram}, M3 = {Organizational, Text structural, N-gram, Lexicalchain}, M4 = {Organizational, Text structural, N-gram, Lexical chain,Contextual}, M5 (all) = {Organizational, Text structural, N-gram, Lexi-cal chain, Contextual, Sub-structural}.Error Analysis on Relation Labeling and Future DirectionsTo further analyze the errors made by our parser on the hardest task of relationlabeling, Figure 3.15 presents the confusion matrix for TSP 1-1 on the RST-DTtest set. The relation labels are ordered according to their frequency in the RST-DT training set. In general, the errors are produced by two different causes actingtogether: (i) imbalanced distribution of the relations, and (ii) semantic similaritybetween the relations. The most frequent relation Elaboration tends to overshadowothers, especially the ones which are semantically similar (e.g., Explanation, Back-ground) and less frequent (e.g., Summary, Evaluation). Models sometimes failto distinguish relations that are semantically similar (e.g., Temporal:Background,Cause:Explanation).These observations suggest two ways to improve our parser. 
We would like to employ a more robust method (e.g., ensemble methods with bagging) to deal with the imbalanced distribution of relations, along with taking advantage of richer semantic knowledge (e.g., compositional semantics) to cope with the errors caused by semantic similarity between the relations.

[Figure 3.15: Confusion matrix for relation labels on the RST-DT test set. The Y-axis represents true and the X-axis represents predicted relations. The relations are Topic-Change (T-C), Topic-Comment (T-CM), Textual Organization (T-O), Manner-Means (M-M), Comparison (CMP), Evaluation (EV), Summary (SU), Condition (CND), Enablement (EN), Cause (CA), Temporal (TE), Explanation (EX), Background (BA), Contrast (CO), Joint (JO), Same-Unit (S-U), Attribution (AT) and Elaboration (EL).]

In the future, we plan to investigate to what extent discourse segmentation and discourse parsing can be jointly performed. We would also like to explore how our system performs on other genres like conversational (e.g., blogs, emails) and evaluative (e.g., customer reviews) texts. To address the problem of limited annotated data in various genres, we are planning to develop an online version of our system that will allow users to fix the parser's output and let the model learn from that feedback. A longer term goal is to extend our framework to also work with graph structures of discourse, as recommended by several recent discourse theories [230]. Once we achieve similar performance on graph structures, we will perform extrinsic evaluations to determine their relative utility for various NLP tasks.

3.7 Conclusion

In this chapter, we have presented a complete probabilistic discriminative framework for performing rhetorical analysis in the RST framework. Our discourse segmenter is a binary classifier based on a maximum entropy model. Our discourse parser applies an optimal parsing algorithm to probabilities inferred from two CRF models: one for intra-sentential parsing and the other for multi-sentential parsing. The CRF models effectively represent the structure and the label of discourse tree constituents jointly. Furthermore, the DCRF model for intra-sentential parsing captures the sequential dependencies between the constituents. The two separate models (i.e., one for intra-sentential parsing and the other for multi-sentential parsing) use their own informative feature sets and the distributional variations of the relations in the two parsing conditions.

We have also presented two novel approaches to effectively combine the intra-sentential and the multi-sentential parsing modules, which exploit the strong correlation observed between the text structure and the structure of the discourse tree. The first approach, 1S-1S, builds a DT for every sentence using the intra-sentential parser, and then runs the multi-sentential parser on the resulting sentence-level DTs.
On the other hand, to deal with leaky boundaries, our second approach buildssentence-level sub-trees by applying the intra-sentential parser on a sliding win-dow covering two adjacent sentences and then consolidating the results producedby overlapping windows. After that, the multi-sentential parser takes all thesesentence-level sub-trees and builds a full rhetorical parse for the whole document.Empirical evaluations on two different genres demonstrate that our approach todiscourse segmentation achieves state-of-the-art performance more efficiently us-ing fewer features. A series of experiments on the discourse parsing task shows thatboth our intra-sentential and multi-sentential parsers significantly outperform thestate-of-the-art, often by a wide margin. A comparison between our combinationstrategies reveals that the sliding window approach is more robust across domains.An error analysis informs us that although the sliding window approach findsmore correct tree structures, it tends to induce noise for the relation labels from theneighboring sentences in a few cases. Another analysis of the performance of ourdiscourse parser on the relation labeling task tells us that the relations that are very144frequent tend to mislead the identification of the less frequent ones, and the modelssometimes fail to distinguish relations that are semantically similar.145Chapter 4Dialog Act RecognitionIn addition to the coarse-grained topical structure discussed in Chapter 2 and thefine-grained rhetorical structure discussed in Chapter 3, asynchronous conversa-tions exhibit another form of discourse structure, which comprises the dialog actsand the conversational structure. The Fragment Quotation Graph (FQG) discussedin Chapter 2 provides a fine-grain conversational structure of an asynchronous con-versation by linking the text fragments in the messages based on their reply-torelations. In this chapter, we study dialog act recognition, which aims to iden-tify the communicative acts (e.g., question, request) performed by the participantsin the course of the conversation. We present three unsupervised approaches: agraph-theoretic deterministic framework and two probabilistic conversational mod-els (namely HMM and HMM+Mix) for modeling dialog acts in asynchronous con-versations. The deterministic models do not consider sequential dependencies be-tween the act types, while the probabilistic models do. First we show that capturingsequential dependencies between the act types is important in asynchronous con-versations as it is in synchronous domains (e.g., meetings, phone conversations).Then we demonstrate that the probabilistic models learn better sequential depen-dencies when they are trained on the sequences extracted from the conversationalstructure, rather than when they are trained on the sequences based on the temporalorder. A comparison between the probabilistic models confirms that HMM+Mix isa better conversational model than the simple HMM model.11This chapter is based on the peer-reviewed conference paper Joty et al. [95] (IJCAI-2011).1464.1 IntroductionWhat makes an asynchronous conversation different from a monolog? Althoughparticipants communicate in writing (as opposed to speech), the nature of the inter-action in asynchronous media is conversational, in the sense that once a participantinitiated the communication all the other contributions are replies to previous ones.That is, the participants take turns, each of which consists of one or more utter-ances. 
The utterances in a turn perform certain communicative actions (e.g., asking a question, answering a question, requesting something, offering an apology, greeting), which are called dialog acts2 (DAs) [12]. For instance, in the last email shown in Figure 4.1, the sentence "Yes - I could ..." answers the question posed in the third email. Two-part structures connecting two DAs (e.g., Question-Answer, Request-Accept) are called adjacency pairs [180].

2 Also known as speech acts.

[Figure 4.1: Sample truncated email conversation from our BC3 corpus. The rightmost column (i.e., DA) specifies the dialog act assignments for the sentences. The DA tags are defined in Table 4.1. The Fragment column specifies the fragments in the Fragment Quotation Graph (FQG) (see Figure 4.2).]

Uncovering the complex dialog structure of an asynchronous conversation is an important step towards deep conversational analysis. Annotating utterances with DAs as shown in Figure 4.1 provides an initial level of structure that has been shown to be useful for many applications in spoken dialog, including meeting summarization [148, 151], collaborative task learning agents [5], artificial companions for people to use the Internet [229] and flirtation detection in speed-dates [170]. We believe that similar benefits will also hold for written asynchronous conversations.

While considerable progress has been made in DA recognition for synchronous conversations (e.g., [102, 232] for chats, [57, 107] for meetings, [14, 171, 200] for phone conversations), very little work has been conducted in asynchronous domains, especially at the sentence level. The dominant approaches to DA recognition in synchronous domains have been mostly supervised, and use either simple classifiers (binary or multi-class) or more structured models like Hidden Markov Models (HMMs), Maximum Entropy Markov Models (MEMMs), and Conditional Random Fields (CRFs). Since turns in synchronous conversations occur one after the other with minimal delay, the conversation flow in these conversations exhibits sequential dependencies between adjacency pairs (e.g., question followed by answer, request followed by grant). Sequence labelers like HMMs, MEMMs and CRFs, which are capable of capturing these inter-dependencies between the dialog acts, generally perform better than the simple classifiers (e.g., MaxEnt, SVMs).

However, the supervised learning strategy is very domain specific and requires considerable labeled data, which can be labor intensive and expensive to acquire. Arguably, as the number of social media grows (e.g., email, blogs, Facebook) and the number of communicative settings in which people interact through these media also grows, the supervised paradigm of "label-train-test" becomes too expensive and unrealistic. Every novel use of a new media may require not only new annotations, but possibly also new annotation guidelines and new DA tagsets. In contrast, the approach we present in this thesis adopts an unsupervised paradigm, where DA recognition is considered as a two step process: first, clustering the sentences based on their DA types, and then, assigning an appropriate DA label to each cluster. This approach is more robust across new forms of media and new domains. In this chapter, we investigate a graph-theoretic deterministic framework and two probabilistic conversational models for clustering sentences based on their DAs. The DA label for each cluster needs to be then determined through other means, which we do not explore in this study, but may include minimal supervision.

The graph-theoretic framework, used previously for topic segmentation in Chapter 2, clusters sentences into DAs based on their lexical and structural similarity, but ignores sequential dependencies between the DA types. The performance of this model crucially depends on how one measures the similarity between two sentences.
We experimented with a wide range of similarity metrics,including TF.IDF-based cosine similarity, word subsequence kernel [32], extendedstring subsequence kernel [85] with part of speech (POS) tags, basic element-baseddependency similarity [88] and tree kernel-based deep syntactic similarity [48].Quite differently, the probabilistic conversational models frame DA cluster-ing as a sequence-labeling problem that can be solved by variations of HMMs withthe assumption that a conversation is a sequence of hidden DAs and each DA emitsan observed sentence. The performance of a probabilistic model depends on theaccuracies of the state transition distributions (i.e., sequence dependencies betweenDAs) and the act-emission (or observation) distributions. While the idea of usingprobabilistic models for sequence labeling to perform DA tagging is not new, wemake several key contributions in showing how it can be effectively applied toasynchronous conversations by dealing with critical limitations in previous work.149Unlike synchronous conversations, the conversational flow in asynchronousconversations often lacks sequential dependencies between the act types in its tem-poral order. For example, consider the email conversation shown in Figure 4.1.If we arrange the sentences (excluding the quoted sentences) as they arrive in theconversation, it becomes hard to capture the sequential dependencies between theact types because the two components of the adjacency pairs can be far apart in thesequence. This could lead to inaccurate sequential dependencies in the sequencelabelers, when they are applied to this sequence of the sentences. This example alsodemonstrates that the use of quotations (see the last email) can help us in puttingthe components of adjacency pairs close to each other. Therefore, we hypothesizethat the sequences extracted from the conversational structure of the asynchronousconversation, e.g., the Fragment Quotation Graph (FQG), are rather more effec-tive to accurately learn the sequence dependencies between the DA tags. That is,two neighboring sentences in the conversational structure are likely to be related(expressing related DAs), independently from their arrival time-stamp.Among recent attempts on unsupervised DA modeling, Ritter et al. [174] pro-pose HMMs to cluster the tweets in a Twitter conversation based on their dialogacts. They use a unigram language model as the act-emission distribution. How-ever, other features like speaker, relative position and sentence length have alsoproved to be beneficial in supervised settings [93, 102]. In this work, we modeleach observation (sentence) not only by its unigrams but also by its speaker, rela-tive position and length. Another crucial finding of Ritter et al. [174] is that theirsimple HMM conversational model tends to find some undesirable topical clustersin addition to the DA clusters. Without any supervision, distinguishing DAs fromtopics is in fact a common challenge, because many of features used for modelingDAs are also indicators of topics. Ritter et al. [174] address this problem by propos-ing an HMM+Topic model, which tries to separate the topic words from the DAindicators. In this work, we propose a more adequate HMM+Mix model, whichnot only explains away the topics, but also improves the act-emission distributionby defining it as a mixture model. We present the Expectation Maximization (EM)derivation to learn the parameters of this model.We evaluate our models on two different datasets: email and forum. 
In whatis to the best of our knowledge the first quantitative evaluation of unsupervised150DA tagging for asynchronous conversations, we show that (i) the graph-theoreticframework is not the right model for this task, (ii) the conversational models learnbetter sequential dependencies when they are trained on sequences extracted fromthe conversational structure rather than when they are trained on sequences basedon the temporal order of the sentences and (iii) HMM+Mix is a better conversa-tional model than the simple HMM model.The rest of the chapter is organized as follows. After discussing related workin Section 4.2, we present our corpora in Section 4.3. In Section 4.4 we presentthe graph-theoretic framework and its performance on clustering sentences basedon their DAs. We present our probabilistic conversational models in Section 4.5and their evaluation in Section 4.6. Finally, we conclude with future directions inSection 4.7.4.2 Related WorkThere has been little work on DA recognition in asynchronous conversation. Theapproaches can be broadly classified into supervised, semi-supervised and unsuper-vised methods. Cohen et al. [46] first use the term email speech act for classifyingemails based on their acts (e.g., deliver, meeting). However, their classifiers do notcapture the sequential dependencies between the act types. In their follow-up work,Carvalho and Cohen [39] address this limitation by using two different classifiers? one for content and one for context ? in an iterative collective classification algo-rithm. The content classifier only looks at the content of the message, whereas thecontext classifier takes into account both the content of the message and the dialogact labels of its parent and children in the thread structure of the email conversa-tion. Note that their inventory of dialog acts is very specific to a work environment,and their approaches operate at the email level, not at the sentence level. Identi-fication of adjacency pairs like question-answer pairs in email discussions usingsupervised learning methods was investigated in [173, 186]. Ferschke et al. [70]use DAs to analyze the collaborative process of editing Wiki pages, and apply su-pervised models to identify the DAs in Wikipedia Talk pages.Several semi-supervised methods have been proposed for DA recognition inasynchronous conversation. Jeong et al. [93] use semi-supervised boosting to tag151the sentences in email and forum discussions with DAs by adapting knowledgefrom annotated spoken conversations (i.e., MRDA-tagged meeting and DAMSL-tagged telephone conversations). Given a sentence represented as a set of trees(i.e., dependency tree, n-gram tree and POS tag tree), the boosting algorithm iter-atively learns the best feature set (i.e., sub-trees) that minimizes the errors in thetraining data. Note that this approach of adapting knowledge from synchronousdomains to asynchronous domains could be problematic because domain adapta-tion methods generally work better when the distance between the source and thetarget domains is minimal. Another crucial limitation is that they do not considerthe sequential dependencies between the act types, something we successfully ex-ploit in our work. Zhang et al. [235] also employ semi-supervised methods forDA recognition in twitter. They use a transductive SVM and a graph-based labelpropagation framework to leverage the knowledge from abundant unlabeled data.Among recent research on unsupervised DA modeling, Ritter et al. 
[174] propose two HMM-based unsupervised conversational models for modeling DAs in Twitter. In particular, they use a simple HMM model and an HMM+Topic model to cluster the Twitter posts (not the sentences) into DAs. Since they use a unigram language model to define the emission distribution, their simple HMM model tends to find some topical clusters in addition to the clusters that are based on DAs. The HMM+Topic model tries to separate the DA indicators from the topic words. By visualizing the types of conversations found by the two models, they show that the output of the HMM+Topic model is more interpretable than that of the HMM one; however, their classification accuracy is not empirically evaluated. Therefore, it is not clear whether these models are actually useful (i.e., beat the baseline), and which of the two models is a better DA tagger. Recently, Paul [160] proposes to use a mixed membership Markov model to cluster sentences based on their DAs, and shows that this model outperforms a simple HMM.

Our conversation models were inspired by the models of Ritter et al. [174], but we improve on those by making the following four key contributions: (i) we model at a finer level of granularity (i.e., at the sentence level as opposed to the post level) with a richer feature set including not only unigrams but also sentence relative position, sentence length and speaker, (ii) our models exploit the graph structure of the conversations, (iii) our HMM+Mix model not only explains away the topics (like HMM+Topic does), but also improves the emission distribution by defining it as a mixture model [23], and (iv) we provide the clustering accuracy of the models on two corpora (email and forum) by applying the one-to-one metric from [63].

4.3 Corpora

To show the generality of our methods, we experiment with two different types of asynchronous conversations: email and forum. Below we describe our datasets.

4.3.1 Dataset Selection and Clean Up

We used the same DA tagset and test datasets used by Jeong et al. [93]. The tagset, containing 12 act categories with their relative frequencies in the email and forum (test) corpora, is shown in Table 4.1. This inventory of DAs is originally adopted from the MRDA tagset [55]. We use this tagset because it is domain independent and suitable for sentence-level annotation [93]. Our test datasets include the 40 email conversations from our BC3 corpus [213] and 200 forum conversations from the travel forum site TripAdvisor (http://tripadvisor.com). As already mentioned in Chapter 2, the BC3 email conversations were originally selected from the W3C mailing list (http://research.microsoft.com/en-us/um/people/nickcr/w3c-summary.html).

Notice that the DA tags have similar distributions in the two corpora. Statement is the most frequent tag in both corpora, covering about 66% to 70% of the sentences. Polite mechanism, Yes-no question and Action motivator have frequencies in the range of 6% to 9% in the two corpora. Other DA tags have considerably lower frequencies in the two corpora. The κ agreements between two human annotators are 0.79 and 0.73 for the email and forum corpora, respectively.

Due to privacy issues, there are only a few corpora of asynchronous conversations available for training an unsupervised system. Since it is preferable to train and test such a system on similar data, we have chosen the W3C email corpus (as opposed to Enron) to train our models on email conversations (in contrast, Jeong et al. [93] train on Enron and test on BC3). W3C contains 23,957 email conversations. However, the raw data is too noisy (with system messages and signatures) to directly inform our models.
Tag   Description                    Email     Forum
S     Statement                      69.56%    65.62%
P     Polite mechanism                6.97%     9.11%
QY    Yes-no question                 6.75%     8.33%
AM    Action motivator                6.09%     7.71%
QW    Wh-question                     2.29%     4.23%
A     Accept response                 2.07%     1.10%
QO    Open-ended question             1.32%     0.92%
AA    Acknowledge and appreciate      1.24%     0.46%
QR    Or/or-clause question           1.10%     1.16%
R     Reject response                 1.06%     0.64%
U     Uncertain response              0.79%     0.65%
QH    Rhetorical question             0.75%     0.08%

Table 4.1: Dialog act tags and their relative frequencies in the two corpora.

We cleaned up the data with the intention of keeping only the headers, bodies and quotations. By processing the headers, we then reconstruct the thread structure of the email conversations.

In order to train our models on forum conversations, we crawled 25,000 forum threads from the same travel forum site, i.e., TripAdvisor. Our forum data is less noisy, but does not contain any thread structure (i.e., reply-to relations).

4.3.2 Dealing with Conversational Structure

In probabilistic conversational models, sequence dependencies between DA tags can be learned either from the simple temporal order of the utterances in a conversation, or from the sequences extracted from the graph structure of the conversation. We create a temporally ordered conversation by simply arranging its posts based on their arrival time. For the graph structure, we construct the Fragment Quotation Graph (FQG) [35] described in Chapter 2 as a fine-grained conversational structure of email conversations. For example, Figure 4.2 shows the FQG for the email conversation shown in Figure 4.1. Once again, for the sake of illustration, the real contents of the emails are abbreviated as a sequence of labels, each representing a text fragment (see the Fragment column in Figure 4.1). However, since the forum conversations in TripAdvisor do not contain any thread structure, and participants hardly quote from others' posts in this forum, we could not build an FQG for the forum threads. But we noticed that participants in this forum almost always respond to the initial post of the thread, and generally mention other participants' names (as a method of disentanglement) to respond to their posts. Therefore, we create the graph structure of a forum conversation with the simple assumption that a post usually responds to the initial post unless it mentions other participants' names. If it mentions other participants' names, we consider the most recent post from those participants to be the post it replies to. One can also employ more sophisticated methods, such as the ones described in [11, 220], to uncover the implicit thread structure of a forum conversation, which is beyond the scope of this thesis.

Figure 4.2: (a) The email conversation of Figure 4.1 with the fragments. Arrows indicate "reply-to" relations. (b) The corresponding FQG.

Our assumption is that two neighboring sentences in the graph structure of a conversation are likely to express related DAs. Notice that the paths in the graph (e.g., c-d-f-j in Figure 4.2(b)) capture the adjacency relations (e.g., question-answer, request-grant) between text fragments. Therefore, we extract the sentences along each of the paths as a sequence, on which we train our probabilistic conversational models. However, by doing this, the sentences in the nodes shared by multiple paths (e.g., c, e, g) are duplicated in multiple sequences.
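To make this sequence-extraction step concrete, the sketch below (illustrative only: the graph, fragment labels and sentence placeholders are hypothetical stand-ins, not the actual FQG construction of [35]) enumerates the root-to-sink paths of a small directed acyclic graph and emits one training sequence per path; fragments shared by several paths are duplicated across sequences, exactly as noted above.

def all_paths(graph, node, path=None):
    """Enumerate all maximal paths starting at `node` in a DAG."""
    path = (path or []) + [node]
    children = graph.get(node, [])
    if not children:                      # sink node: the path is complete
        return [path]
    return [p for child in children for p in all_paths(graph, child, path)]

# Toy graph resembling Figure 4.2(b): an edge u -> v means fragment v
# continues (replies to) fragment u.
fqg = {"a": ["c"], "b": ["c"], "c": ["d", "e"], "d": ["f"],
       "e": ["g"], "f": ["j"], "g": ["h"]}
roots = [n for n in fqg if not any(n in kids for kids in fqg.values())]

# Each fragment holds its sentences (placeholder strings here).
nodes = set(fqg) | {v for kids in fqg.values() for v in kids}
sentences = {frag: [frag + "-s1", frag + "-s2"] for frag in nodes}

# One sentence sequence per root-to-sink path; shared fragments (e.g. c)
# appear in several sequences, which the models must later reconcile.
sequences = [[s for frag in path for s in sentences[frag]]
             for root in roots for path in all_paths(fqg, root)]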
Section 4.5.4 describes how our conversational models deal with these duplicates.

4.4 Graph-theoretic Framework

Our first model for clustering sentences based on their DAs is built on the same graph-theoretic framework used earlier for topic segmentation in Chapter 2. As mentioned before, this framework has been successfully applied to many other NLP tasks, including chat disentanglement [63] and coreference resolution [190]. In this work, we investigate whether the same framework can be adapted to clustering the sentences of an asynchronous conversation based on their DAs.

4.4.1 Algorithm Description

In this framework, we first form a complete similarity graph G = (V, E), where the nodes V represent the sentences in a conversation and the edge weights represent the similarity between the nodes (i.e., for an edge (u, v) ∈ E, the edge weight w(u, v) represents how similar the sentences u and v are). Then we formulate the clustering problem as a k-way-mincut graph-partitioning problem, with the intuition that sentences in a cluster should be similar to each other, while sentences in different clusters should be dissimilar, which we solve again by optimizing the normalized cut (Ncut) criterion (Equation 2.6) described in Section 2.3.1. Note that, depending on the task, the performance of this framework depends on how one measures the similarity between two sentences. In the following, we briefly describe the similarity metrics we experimented with in our work.

Unigrams or bag-of-words (BOW) have been used quite extensively in previous work on DA recognition ([102, 174]). To measure the BOW-based similarity between two sentences, we represent each sentence as a vector of TF.IDF [178] values of its words and compute the cosine similarity (Equation 2.5) between the vectors. However, recall from the discussion in Chapter 2 that this model has been quite successful for finding topical clusters [121]. Although we retain the stopwords and punctuation, which are arguably useful for recognizing DAs, it may still find clusters that are based on topics rather than DAs. In an attempt to abstract away the topic words, we also use a variation of the above metric, where we mask (BOW-M) the nouns in the sentences (since nouns are arguably the most indicative of topics) and measure the cosine similarity as before.

The BOW and BOW-M similarity metrics do not consider the order of the words. One can use n-gram co-occurrences to account for the order of the words. The Word Subsequence Kernel (WSK) [32], which is an improvement over n-gram co-occurrences, considers the order by transforming the sentences into higher dimensional spaces and then measuring sentence similarity in that space. The Extended String Subsequence Kernel (ESK) [85], which is a simple extension of WSK, allows one to incorporate word-specific syntactic or semantic (e.g., word sense, POS tags) information into WSK. Since Jeong et al. [93] found n-grams and POS tags useful for DA recognition, we implement the WSK and the ESK with POS tags (ESK-P).

Jeong et al. [93] also found that sub-trees extracted from dependency trees are important features for DA recognition. We measure the dependency-based similarity between two sentences by first extracting their Basic Elements (BE) (i.e., head-modifier-relation triples) [88] from their corresponding dependency trees, and then by counting the number of BE co-occurrences in the two trees.
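As a small illustration of this last metric, the sketch below counts co-occurring BE triples between two sentences; the triples are hypothetical (in practice they would be extracted from the output of a dependency parser), and the resulting count would serve as an edge weight w(u, v) in the similarity graph.

from collections import Counter

def be_similarity(triples_a, triples_b):
    """Count co-occurring Basic Elements, i.e. (head, modifier, relation)
    triples shared by the two sentences (multiset intersection)."""
    ca, cb = Counter(triples_a), Counter(triples_b)
    return sum((ca & cb).values())

# Hypothetical BEs for two short sentences.
s1 = [("book", "the", "det"), ("recommend", "book", "dobj"),
      ("recommend", "you", "nsubj")]
s2 = [("book", "the", "det"), ("buy", "book", "dobj"), ("buy", "I", "nsubj")]
print(be_similarity(s1, s2))   # -> 1: only ("book", "the", "det") is shared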
The BEs encode some syntactic and semantic information, and deciding whether any two units match is considerably easier than with longer units (e.g., sentences, parse trees) [89].

Like the dependency tree, the sub-trees of the syntactic tree may also be important indicators for DA identification. To measure the syntactic similarity between two sentences, we first parse the sentences using the Charniak parser [41], and then compute the similarity between pairs of parse trees using the Tree Kernel (TK) function as described in [48].

4.4.2 Evaluation of the Graph-theoretic Model

We wish to compare the DAs automatically identified by our models with the human-labeled DAs. However, since unsupervised clustering techniques do not assign any label to the clusters, metrics widely used in supervised classification, such as the κ statistic or F1 score, are not applicable in our case. In this work, we propose to use the one-to-one metric [63] used earlier for measuring the performance of the topic segmentation models in Chapter 2. Briefly, given a human annotation and a model annotation, one-to-one optimally pairs up the clusters from the two annotations by maximizing the total overlap, and then reports the percentage of overlap found.

The number of DA categories available to the systems was fixed to 12. Table 4.2 shows the one-to-one accuracy of the graph-theoretic framework with various similarity metrics. The rightmost column shows the performance of the majority class baseline, which considers all the utterances in a corpus to be statements.

Corpus   BOW    BOW-M   WSK    ESK-P   BE     TK     BASELINE
Email    62.6   34.3    64.7   24.8    39.1   22.5   70.0
Forum    65.0   38.2    65.8   36.3    46.0   30.1   66.0

Table 4.2: One-to-one accuracy for different similarity metrics in the graph-theoretic framework.

A comparison between the performances of the BOW and BOW-M (i.e., BOW with nouns masked) systems demonstrates that BOW performs far better than BOW-M. This indicates that masking the nouns in an attempt to abstract away the topic words degrades the performance substantially. The WSK system performs slightly better than the BOW system, meaning that considering the order of the words in the similarity metric is useful. However, when we add the POS tags of the words in ESK (see ESK-P in Table 4.2), the performance degrades dramatically. This means that a similarity metric based on POS tags has an adverse effect on clustering sentences into DAs. The results of the BE and TK systems indicate that the shallow (dependency) and the deep syntactic similarities between sentences are also not useful for recognizing DAs in this framework. More interestingly, notice that all the systems fail to beat the majority class baseline, indicating that this framework is not the right model for recognizing DAs in asynchronous conversations.

4.5 Probabilistic Conversational Models

The graph-theoretic framework described above has three key limitations when applied to cluster the sentences of an asynchronous conversation based on their act types. First, this framework does not capture the potentially informative sequential structure of a conversation; for example, an answer usually follows a question, an accept usually follows a request, and so on. Second, this framework seems to be still confused by topics, even when we mask the nouns or consider the word order, POS tags and syntactic structures.
Third, unlike our probabilistic conversational models (described below), this framework does not allow us to incorporate other important features (e.g., speaker, position, length) beyond the lexical and syntactic similarity between the sentences in a principled way. To address these limitations, we propose probabilistic conversational models, which assume that a conversation is a Markov sequence of hidden DAs, with each DA emitting an observed sentence.

4.5.1 HMM Conversational Model

Figure 4.3 shows our first probabilistic conversational model in plate notation. An asynchronous conversation C_k is a sequence of hidden DAs D_i, and each DA generates an observed sentence X_i, represented by its: (i) bag-of-words (i.e., unigrams), shown in the W_i plate, (ii) speaker S_i, (iii) relative position P_i, and (iv) length L_i. These features take discrete values, and are modeled as multinomial distributions. Following Ritter et al. [174], for unigrams, we limit our vocabulary to the 5,000 most frequent words in the corpus. The relative position of a sentence is computed by dividing its position in the post by the number of sentences in the post. We then convert the relative positions and lengths to a sequence of natural numbers.

Figure 4.3: HMM conversational model

We place a symmetric Dirichlet prior (with hyperparameter α = 2) on each of the multinomials, i.e., the distributions over initial states, state transitions, unigrams, speakers, positions and lengths (the parameters are not shown in Figures 4.3 and 4.4 to reduce visual clutter), and compute a maximum a posteriori (MAP) estimate of the parameters using the Baum-Welch (EM) algorithm, with forward-backward providing the smoothed node and edge marginals for each sequence in the E-step. Specifically, given a sequence X_{1:T}, forward-backward computes:

\gamma_i(j) := P(D_i = j \mid X_{1:T}, \theta)  \qquad (4.1)

\xi_i(j, k) := P(D_{i-1} = j, D_i = k \mid X_{1:T}, \theta)  \qquad (4.2)

where the local evidence is given by:

P(X_i \mid D_i) = \Big[ \prod_j P(W_{i,j} \mid D_i) \Big] P(S_i \mid D_i)\, P(P_i \mid D_i)\, P(L_i \mid D_i)  \qquad (4.3)

4.5.2 HMM+Mix Conversational Model

The HMM conversational model described above is similar to the conversational model of [174], except that, in addition to the unigrams of a sentence, we also use its relative position, speaker and length to model the emission distribution. However, as they point out, without additional guidance this unsupervised model may find some undesired clusters. For example, their HMM model with the unigram feature tends to find some clusters that are based on topics rather than DAs. Since the features used in our conversational model are indicators of DAs as well as of topics (see Section 2.3.1 in Chapter 2), our model with the extended feature set may also find some unwanted clusters. Note that, similar to Ritter et al. [174], we have also noticed in our graph-theoretic framework that masking nouns in an attempt to abstract away the topics even degrades the clustering performance. As a solution, Ritter et al. [174] propose the HMM+Topic model to separate the influence of the topic words from the DA indicators. In this model, a word is generated from one of three hidden sources: (i) DA, (ii) Topic and (iii) General English. In contrast, we propose the HMM+Mix model, where the emission distribution is defined as a mixture model.
This model has two main advantages over the HMM+Topic model: first, by defining the emission distribution as a mixture model, we not only indirectly explain away the topics but also enrich the emission distribution, since mixture models can define multimodal distributions [23]; and second, learning and inference in this model are much easier (using EM), without requiring approximate inference techniques such as Gibbs sampling.

Figure 4.4: HMM+Mix conversational model

Figure 4.4 shows the extended HMM+Mix model, where the emission distributions are modeled as mixtures of multinomials, with M_i ∈ {1, ..., M} representing the mixture component. Notice also that each observed sentence X_i is now explained by two hidden causes (parents), i.e., its dialog act D_i and the mixture component M_i. Therefore, the two causes will try to explain each other away to explain an observed sentence. Similar to the previous model, we put symmetric Dirichlet priors (with hyperparameter α = 2) on the multinomials and compute MAP estimates of the parameters using EM, where the local evidence is given by:

P(X_i \mid D_i) = \sum_{M_i} P(M_i \mid D_i)\, P(X_i \mid D_i, M_i)  \qquad (4.4)

We define P(X_i \mid D_i, M_i) in a way similar to Naive Bayes:

P(X_i \mid D_i, M_i) = \Big[ \prod_j P(W_{i,j} \mid D_i, M_i) \Big] P(S_i \mid D_i, M_i)\, P(P_i \mid D_i, M_i)\, P(L_i \mid D_i, M_i)  \qquad (4.5)

In this model, for each sequence X_{1:T}, in addition to γ_i(j) and ξ_i(j, k) (Equations 4.1 and 4.2), we also need to compute γ_i(j, k) in the E-step:

\gamma_i(j, k) := P(D_i = j, M_i = k \mid X_{1:T}, \theta)  \qquad (4.6)

One can show that this is given by the following expression:

\gamma_i(j, k) = \gamma_i(j)\, \frac{P(M_i = k \mid D_i = j)\, P(X_i \mid D_i = j, M_i = k)}{\sum_m P(M_i = m \mid D_i = j)\, P(X_i \mid D_i = j, M_i = m)}  \qquad (4.7)

The details of the EM algorithm for this model are given in Appendix A.3.

4.5.3 Initialization in EM

In EM, we must ensure that we initialize the model parameters carefully to minimize the chance of getting stuck in poor local optima. We use multiple (10) restarts and pick the best solution based on the estimated likelihood of the data. For the first run, we ignore the Markov dependencies between the DAs, estimate the parameters of the emission distributions using the standard mixture model estimation method (i.e., using EM), and later use it to initialize the other parameters (i.e., the initial state distribution and the state transition distributions). For the other 9 runs, we randomly initialize the parameters. We use this initialization process in an attempt to ensure that our models are at least as good as the simple mixture model, which ignores the sequential structure of an asynchronous conversation.

4.5.4 Applying Conversational Models

We apply (i.e., train and test) the conversational models described above to the temporal order of the utterances and to the sequences extracted from the graph structure of the conversation. With the learned parameters, given a (test) sequence of sentences, we use Viterbi decoding to infer the most probable DA sequence. However, as described in Section 4.3.2, sequences extracted from the graph structure of a conversation may have duplicate sentences. As a result, when we run Viterbi decoding on them, we get multiple DA assignments for the same sentence (one for each position). We take the maximum vote to finally determine its DA.

4.6 Experiments

In our experiments with the email and forum corpora, we create 50 training samples from a corpus, each sample containing 12,000 randomly selected conversations. When selecting the conversations, we ensure that each conversation contains at least two posts.
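All results that follow are reported with the one-to-one metric introduced in Section 4.4.2. As a concrete point of reference, one way to compute it is sketched below, using hypothetical gold and predicted labels and the Hungarian algorithm (via SciPy) to find the cluster pairing that maximizes total overlap.

import numpy as np
from scipy.optimize import linear_sum_assignment

def one_to_one_accuracy(gold, pred):
    """Optimally pair gold and predicted clusters to maximize total overlap,
    then report the fraction of items covered by that pairing."""
    g_ids, g = np.unique(gold, return_inverse=True)
    p_ids, p = np.unique(pred, return_inverse=True)
    overlap = np.zeros((len(g_ids), len(p_ids)), dtype=int)
    for gi, pi in zip(g, p):
        overlap[gi, pi] += 1
    rows, cols = linear_sum_assignment(-overlap)   # negate to maximize
    return overlap[rows, cols].sum() / len(gold)

# Hypothetical gold DA tags and predicted cluster ids for eight sentences.
gold = ["S", "S", "QY", "A", "S", "P", "S", "QY"]
pred = [ 0,   0,   1,    1,   0,   2,   0,   1 ]
print(one_to_one_accuracy(gold, pred))   # -> 0.875 for this toy example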
For a corpus, we train our conversational models on each of the 50 training samples (each containing 12,000 conversations), and evaluate the performance on the test set each time. We train our models both on the temporal order of the utterances and on the order extracted from the graph structure of the conversation. The number of DAs available to the models was set to 12. In the HMM+Mix model, the number of mixture components M was empirically set to 3 (we experimented with M = {1, 2, 3, 4, 5}, and achieved the highest accuracy with M = 3).

             Email                    Forum
             Temporal   Graph         Temporal   Graph
BASELINE     70.00      70.00         66.00      66.00
HMM          73.45      76.81         69.67      74.41
HMM+Mix      76.73      79.66         75.61      78.35

Table 4.3: Mean one-to-one accuracy for various models on the two corpora.

Table 4.3 presents the mean one-to-one accuracy of our conversational models and the majority class baseline on the two corpora. Our models beat the baseline in both corpora, proving their effectiveness in DA recognition. When we compare the performances of the models on the two corpora, we notice a similar trend. Both models benefit significantly from the graph structure of the conversations (p < 0.05). The finer conversational structure of email conversations in the form of the FQG, and the assumed conversational structure of forum conversations, have proved to be beneficial for recognizing DAs in these corpora. A comparison between our models shows that the HMM+Mix model outperforms the simple HMM (p < 0.05). This indicates that the mixture model acting as the emission distribution not only explains the topics away but also defines a finer observation model.

4.7 Conclusion and Future Work

In our investigation of approaches for modeling DAs in asynchronous conversations, we have made several key contributions. We apply a graph-theoretic framework to the DA tagging task and compare it with probabilistic sequence-labeling models. Then, we show how, in the probabilistic models, the sequence dependencies can be learned more effectively by taking the conversational structure into account. After that, we successfully expand the set of sentence features considered in the act-emission distribution. Finally, we improve the act-emission distribution by applying a mixture model. Quantitative evaluation with human annotations shows that while the graph-theoretic framework is not the right model for this task, the probabilistic conversational models (i.e., HMM and HMM+Mix) are quite effective, and their benefits are more pronounced with the graph-structural conversational order as opposed to the temporal one. Comparison of the outputs of these models reveals that the HMM+Mix model can predict DAs better than the HMM model. In the future, we wish to experiment with Bayesian versions of these conversation models and also apply our models to other conversational modalities.

Chapter 5

Conclusion

As the Internet continues to grow, billions of people all over the world are now having conversations by writing in a growing number of social media in a seemingly unlimited variety of communicative settings. Effective processing of these asynchronous conversations can benefit both organizations and individuals.
As a consequence, end-user applications (e.g., summarization, information extraction) targeting conversations in social media are now growing at an accelerating pace, and since these conversations differ in many respects from traditional media (e.g., newspapers), NLP researchers are now facing new challenges in building tools (e.g., POS taggers, topic segmenters) to support these downstream applications.

In this thesis, we developed a new set of NLP tools for different discourse analysis tasks in asynchronous conversation, which can support many NLP applications including summarization and information extraction. In particular, we presented novel computational models for topic segmentation and labeling, rhetorical analysis and dialog act recognition in asynchronous conversation. Our initial hypothesis was that we can effectively address the technical challenges in these tasks by applying sophisticated graph-based methods and probabilistic graphical models. Graph-based methods allow us to encode different discourse-related information (e.g., conversational structure and other domain-specific features) in terms of the edge weights in the graph, so that all the information can be simultaneously taken into account while performing discourse analysis tasks. Similarly, probabilistic graphical models allow for efficient and sound reasoning about multiple interrelated random variables (i.e., structured output) in a compact but semantically intuitive graph representation. The successful application of these models to a complex task (e.g., a discourse analysis task in asynchronous conversations) is not a trivial matter. It requires identifying what aspects of the conversation should be represented for the target task, how they should be represented in the model, as well as how learning and inference in the resulting representations can be performed effectively. This in-depth investigation was done in this thesis for each of the discourse analysis tasks. Our hypothesis was successfully verified by showing that all the models we devised in this thesis outperform state-of-the-art techniques on widely accepted performance metrics.

In Chapter 2, we studied topic segmentation and labeling in asynchronous conversation. We annotated two new corpora, measured inter-annotator agreement and presented a complete computational framework for performing these tasks. We first showed that existing off-the-shelf tools are insufficient for these tasks, because they do not consider the conversation-specific features of asynchronous conversations. In response, we proposed two novel unsupervised topic segmentation models that take into account the fine-grained conversational structure, and a novel supervised topic segmentation model that combines lexical, conversational and topic features. We did so either by using a more informative prior or by using a graph-based clustering framework with no or little supervision. We demonstrated that unsupervised topic segmentation models benefit significantly when they consider the fine-grained conversational structure. Our comparison of the supervised segmentation model with the unsupervised models showed that the supervised method outperforms the unsupervised ones even using only a few labeled conversations.

For topic labeling, we proposed two novel graph-based ranking (i.e., random walk) models that respectively capture two forms of conversation-specific information: the informative clues from the leading sentences and the fine-grained conversational structure.
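As an illustration of the general idea behind such ranking models (not the exact formulation used in Chapter 2), the sketch below ranks the nodes of a hypothetical word co-occurrence graph with a random walk whose restart distribution is biased toward words from the leading sentences.

import numpy as np

def biased_random_walk(adj, bias, damping=0.85, iters=100):
    # Personalized PageRank: with probability `damping` follow a
    # co-occurrence edge, otherwise restart into the bias distribution
    # (here, words that appear in the conversation's leading sentences).
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1, keepdims=True)
    trans = np.divide(adj, deg, out=np.zeros_like(adj), where=deg > 0)
    bias = np.asarray(bias, dtype=float)
    bias = bias / bias.sum()
    rank = np.full(len(bias), 1.0 / len(bias))
    for _ in range(iters):
        rank = damping * (rank @ trans) + (1 - damping) * bias
    return rank

# Hypothetical word co-occurrence counts and a restart bias favouring
# the two words that occur in the leading sentence.
words = ["meeting", "agenda", "schedule", "lunch"]
adj = [[0, 3, 2, 0],
       [3, 0, 1, 0],
       [2, 1, 0, 1],
       [0, 0, 1, 0]]
bias = [1, 1, 0, 0]
scores = biased_random_walk(adj, bias)
print(sorted(zip(words, scores.round(3)), key=lambda x: -x[1]))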
We demonstrated that the random walk model performs significantly better when it exploits the conversation-specific clues from either of the two sources, and that the random walk model that considers both sources of information performs consistently well across domains. The overall performance of the end-to-end system is also promising when compared with human annotations.

We made our topic-annotated conversational corpora, annotation manual and software available online. Researchers from several organizations including Microsoft Research Asia, Cisco Research and the Qatar Computing Research Institute (QCRI) are now using our corpora as well as our software for research purposes. There is also an ongoing research project at the University of British Columbia on effective visualization of asynchronous conversations using our developed tools (more on this research can be found at https://www.cs.ubc.ca/cs-research/lci/research-groups/intelligent-user-interfaces).

In Chapter 3, we presented our complete discriminative framework for rhetorical analysis. We identified crucial limitations in the parsing model and in the parsing algorithm of the existing discourse parsers, and addressed them by proposing more adequate parsing models and an optimal parsing algorithm.

We employed probabilistic graphical models as our parsing models, which allow efficient learning and reasoning in a compact graph representation. While existing discourse parsers use a single unified parsing model, we proposed to use two different parsing models, one for intra-sentential parsing and one for multi-sentential parsing. More specifically, we proposed a Dynamic Conditional Random Field (DCRF) as our intra-sentential parsing model, which represents the structure and the labels of a discourse tree jointly and captures the sequential dependencies between the tree constituents. We demonstrated that our DCRF parsing model along with an optimal parsing algorithm outperforms the state-of-the-art by a wide margin in intra-sentential parsing. However, for multi-sentential parsing, we faced the challenge of scaling up the DCRF model to arbitrarily long documents. So, we proposed to break the chain structure in the DCRF model for multi-sentential parsing, which also allowed us to balance the training data. We showed that using two separate parsing models results in a more effective parsing method, because the models could leverage their own informative feature sets and the distinct distributions of the coherence relations.

A two-stage parsing approach also allowed us to exploit the strong correlation between the text structure and the tree structure. We proposed to capture this correlation by combining our intra-sentential and multi-sentential parsers in two different ways. While our first approach did not consider the cases where rhetorical structures violate sentence boundaries (i.e., leaky boundaries), our second approach attempted to tackle them using a sliding window over two consecutive sentences. We demonstrated that both of our approaches outperform the state-of-the-art by a large margin. A comparison between our approaches showed that the sliding window approach is more accurate in finding the right structure of the discourse trees, especially when it gets enough data to learn from.

Recently, we made a demo and the source code of our discourse parsing framework available online.
Researchers from several universities, including the University of Toronto and the University of Utah, are now using our discourse parser.

In Chapter 4, we presented both deterministic and probabilistic unsupervised approaches to dialog act modeling in asynchronous conversation. While the deterministic (i.e., graph-based clustering) models did not consider the sequential dependencies between the act types, the probabilistic conversational (graphical) models (i.e., HMMs) did. First, we demonstrated that, as in synchronous conversations, it is crucial to capture the sequential dependencies between the acts in asynchronous conversations. Then, we showed that the probabilistic conversational models learn better sequential dependencies when trained on the sequences extracted from the conversational structure, rather than on the temporal order of the sentences.

However, it turns out that the simple unsupervised HMM model tends to find some undesired topical clusters in addition to dialog act clusters. We addressed this by proposing a novel HMM+Mix model, which not only explains away the topics, but also improves the act-emission distribution by defining it as a mixture model.

5.1 Prospective Applications

There are many NLP applications that can benefit significantly from our extracted discourse structures. We briefly describe the most important ones below.

5.1.1 Conversation Summarization

One of the most important applications of discourse structures is text summarization. A number of summarization methods utilizing topic structures have been proposed for monologs (e.g., [20, 56, 80, 115]). Topic-based summarization methods have also been explored for summarizing meeting transcripts [104, 221]. Murray et al. [148, 151] show the benefits of using dialog acts in meeting summarization. A number of studies investigate the utility of rhetorical structure for measuring text importance in single-document summarization (e.g., [54, 118, 126]). Recently, Christensen et al. [45] propose to incorporate coherence structure into their multi-document summarization system for selecting and ordering sentences.

Most of the above approaches to summarization are extractive, where informative sentences are selected to form an abridged version of the source document(s). This approach has been quite successful for summarizing factual texts (e.g., news articles, biographies), and has also been applied to asynchronous conversation [35, 146, 169]. This approach has been by far the most popular in the field of summarization, largely because it does not require generating novel sentences.

Another potential approach to summarization is abstractive, which first extracts key information (e.g., entities) and possibly the discourse structures, and then generates novel texts based on the extracted information and structures. Recent studies [133, 149] have shown that although users find extractive summaries to be valuable for browsing documents, these summaries are less coherent than human-written abstracts, and users in general prefer abstracts over extracts. Thus, there is now a growing interest in abstractive summarization, even for conversations [151, 222]. Carenini et al. [36] postulate an abstractive summarization system for conversations that can take us closer to the human-style summarization of conversations.
To our knowledge, with the exception of a partial proposal presented in [150], no one has attempted to abstract asynchronous conversations yet, and no existing extractive methods use deep discourse structures (i.e., topic, coherence and dialog act structures), although Carenini et al. [34, 35] use a fine-grained conversational structure to measure text importance in their summarization system. We strongly believe that the discourse structures that we extract in this work will be very beneficial for both extractive and abstractive summarization approaches in asynchronous conversations. For an example of moving towards abstraction, see our recent paper [136], which describes how an abstract topic label can be generated for a given topical cluster.

5.1.2 Sentiment Analysis

Discourse structures can play an important role in sentiment analysis. A research problem in sentiment analysis is extracting fine-grained opinions about different aspects (or features) of a product. Several recent works (e.g., [112, 189]) have already considered exploiting the rhetorical structure for this task.

Another challenging problem in sentiment analysis is assessing the overall opinion expressed in a review, because not all sentences in a review contribute equally to the overall sentiment. For example, some sentences are subjective, while others are objective [157]; some express the main claims, while others support them [207]; some express opinions about the main entity, while others are about the peripherals. Discourse structures (i.e., rhetorical and topical structures) could be useful to capture the relative weights of the discourse units towards the overall sentiment. For example, the nucleus and satellite distinction along with the rhetorical relations could be useful to infer the relative weights of the connecting discourse units. Similarly, topical structures could be useful to distinguish between opinions expressed towards the main entity and opinions expressed towards the peripherals. Topic models along with sentiment analysis could be useful in many interesting applications, including analysis of perspectives (e.g., left vs. right, Palestinian vs. Israeli) [116, 152], prediction of election outcomes [211], and so on.

5.1.3 Information Extraction and Visualization

Topic models like LDAs coupled with effective user interfaces are now extensively used to facilitate efficient browsing and exploration of a large document collection [25, 40, 58, 117, 156]. Similarly, our topic segmentation and labeling models would allow us to effectively browse a possibly complex and long conversation.

Topic and rhetorical structures have been used to select the parts of a document which are relevant to extract named entities, the relations that hold between them, and the events where they play a role (e.g., [130, 142, 209]). Rhetorical structure has also been used for extracting answers to why-questions [215]. The dialog act structure has been used to detect action items in meetings [147, 166].

However, the extracted information (e.g., topics, sentiment, dialog acts) is of little use if it cannot be conveyed effectively to the user. Conversation visualization tools with interactive multimedia interfaces offer attractive solutions for conveying large and complex information using graphics and texts. Arguello and Rosé [8] developed such an interactive tool for visualizing topical segments in monolog and synchronous dialog (e.g., tutoring dialog). Recently, Rashid [172] proposes such a tool for browsing and summarizing meeting conversations. Dork et al.
[58] propose a visualization tool for monitoring events in Twitter. Effective visualization of the information extracted from asynchronous conversations is an active research area which can heavily benefit from our work.

5.1.4 Misc.

Among other applications of discourse structures, Machine Translation (MT) has received a resurgence of interest recently. Researchers believe that MT systems should consider discourse phenomena that go beyond the current sentence to ensure consistency in the choice of lexical items or referring forms, and the fact that source-language coherence relations between discourse units are also realized in the target language [82]. For a survey on discourse in MT, see [81]. A workshop on discourse in MT was arranged recently at the ACL conference (http://www.idiap.ch/workshop/DiscoMT/).

Another application for research is Computer-Supported Collaborative Learning (CSCL). Automatic methods using NLP techniques to analyze online discussions for CSCL (e.g., argumentation) have been proposed [9, 144, 175, 227]. We believe our discourse analysis tools can also be beneficial for these applications.

5.2 Future Directions

Moving forward, there are still many unsolved and fundamental challenges to be addressed in discourse analysis of asynchronous conversations. While specific avenues for future work on each discourse analysis task were discussed at the end of the corresponding chapters, here we briefly outline a number of general future directions that arise from the work presented in this thesis.

• Extrinsic Evaluation: In all the evaluations in this thesis, we measured how the discourse analysis models perform with respect to a human gold standard. It would be extremely interesting to see how effective our models are when the extracted discourse structures are exploited in the target end-user applications like conversation summarization and conversation visualization.

• Interactive Discourse Analysis: A key limitation researchers face working on discourse, especially if the discourse is of a new kind (e.g., asynchronous conversation), is scarcity of annotated data. To address this problem, in the future we would like to develop an online version of our systems that would allow users to fix the output of the system and let the model learn from that feedback.

• A Joint Model Integrating All Tasks: In this thesis, we considered the three discourse analysis tasks (i.e., topic modeling, rhetorical analysis, dialog structure analysis) separately. A promising avenue of investigation would be to integrate all these tasks in a mutually beneficial way using a joint model. To what extent these tasks can be jointly modeled is an open question for future research [36].

• Towards Coherent Summarization: A significant challenge in summarization is producing not only informative but also coherent summaries. Existing approaches usually follow a two-step approach, where a sentence ordering step follows a sentence selection step to make the summary coherent. However, we believe coupling these two steps into a joint process would be more effective, because if the sentences are selected first without paying any attention to coherence, then an expected (coherent) ordering of the selected sentences may not exist. Recently, Christensen et al. [45] attempt to perform sentence selection and ordering together using constraints on a discourse structure. However, they represent the discourse as an unweighted directed graph, which is shallow and not sufficiently informative in most cases.
Furthermore, their approach does not allow compression at the sentence level, which is often beneficial in summarization. In the future, we would like to investigate the utility of our discourse structures for performing sentence compression, selection and ordering in a joint process.

• Combining Cohesion and Coherence for Summarization: A line of previous research in unsupervised extractive summarization used cohesion as a measure of importance for sentence selection (e.g., [17, 35, 65]). These methods use simple representations that are based on lexical similarity. On the other hand, another line of research used coherence structure to measure the importance of a sentence (or clause) [54, 118, 126]. It would be interesting to see how a summarization approach that combines these two phenomena of discourse together in a single model performs. The work in this thesis would allow that, because our rhetorical structure (i.e., a discourse tree) provides the necessary coherence structure.

• Towards Discourse-Informed Sentiment Analysis: We would like to investigate how discourse structures could play a role in assessing the overall sentiment expressed in a review by aggregating the sentiments expressed in its components (e.g., clauses). In doing so, we see the need to consider both the topical and the rhetorical structures. Topical structure would allow us to distinguish opinions expressed towards main vs. peripheral entities. Similarly, rhetorical structure would allow us to distinguish between main and supportive claims.

• Applications in Social Media: Finally, we would like to work on problems in social computing that involve conversation understanding and generation. For example, imagine an autonomous system that monitors social media and identifies conversations about humanitarian crises (e.g., Hurricane Sandy, the earthquake in Pakistan). The system would extract necessary information (e.g., requesting or offering food, shelter), and in case it needs further information (or to verify the extracted information), the system would ask for it by generating a post in the conversation. The system would then map the offers to the corresponding requests and let the victims know about the offers.

Another research avenue in social computing that we would like to investigate is how to leverage the strengths of social media and traditional news media to provide a holistic view on a developing event. Users of social media as well as journalists of news articles may have very different perspectives (e.g., left vs. right, Democrat vs. Republican) on an event. We believe an aggregated view would be very useful for a general user.

Bibliography

[1] P. H. Adams and C. H. Martell. Topic Detection and Extraction in Chat. In Proceedings ICSC-2008, pages 581-588. IEEE, 2008. → pages 23

[2] A. Alexandrescu and K. Kirchhoff. Graph-based learning for statistical machine translation. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 119-127, Boulder, Colorado, June 2009. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/N/N09/N09-1014. → pages 28

[3] J. Allan. Topic Detection and Tracking: Event-based Information Organization, pages 1-16. Kluwer Academic Publishers, Norwell, MA, USA, 2002. → pages 8, 39

[4] J. Allan, C. Wade, and A. Bolivar. Retrieval and Novelty Detection at the Sentence Level.
In Proceedings of the 26th annual international ACMSIGIR conference on Research and development in informaion retrieval,SIGIR ?03, pages 314?321, Toronto, Canada, 2003. ACM. ? pages 71[5] J. Allen, N. Chambers, G. Ferguson, L. Galescu, H. Jung, and W. Taysom.PLOW: A Collaborative Task Learning Agent. In Proceedings of theTwenty-Second Conference on Artificial Intelligence, AAAI?07, pages22?26. AAAI, 2007. ? pages 22, 147[6] D. Andrzejewski, X. Zhu, and M. Craven. Incorporating DomainKnowledge into Topic Modeling via Dirichlet Forest Priors. In Proceedingsof the 26th Annual International Conference on Machine Learning, ICML?09, pages 25?32, Montreal, Quebec, Canada, 2009. ACM. ? pages 58[7] P. M. Aoki, M. H. Szymanski, L. D. Plurkowski, J. D. Thornton,A. Woodruff, and W. Yi. Where?s the ?party? in ?multi-party??: analyzingthe structure of small-group sociable talk. In CSCW ?06, pages 393?402,Banff, Canada, 2006. ACM. ? pages 26174[8] J. Arguello and C. Rose?. Infomagnets: Making sense of corpus data. InProceedings of the Human Language Technology Conference of theNAACL, Companion Volume: Demonstrations, pages 253?256, New YorkCity, USA, June 2006. Association for Computational Linguistics. ?pages 170[9] J. Arguello, B. S. Butler, E. Joyce, R. Kraut, K. S. Ling, C. Rose?, andX. Wang. Talk to me: foundations for successful individual-groupinteractions in online communities. In Proceedings of the SIGCHIConference on Human Factors in Computing Systems, CHI ?06, pages959?968, Montr&#233;al, Qu&#233;bec, Canada, 2006. ACM. ? pages171[10] N. Asher and A. Lascarides. Logics of Conversation. CambridgeUniversity Press, 2003. ? pages 17, 101[11] E. Aumayr, J. Chan, and C. Hayes. Reconstruction of ThreadedConversations in Online Discussion Forums. In Proceedings of the FifthInternational AAAI Conference on Weblogs and Social Media, ICWSM?11,pages 26?33. AAAI, 2011. ? pages 48, 155[12] J. L. Austin. How to do things with words. Harvard University Press,1962. ? pages 22, 147[13] S. Banerjee and A. I. Rudnicky. Segmenting meetings into agenda items byextracting implicit supervision from human note-taking. In Proceedings ofthe 12th international conference on Intelligent user interfaces, IUI ?07,pages 151?159, Honolulu, Hawaii, USA, 2007. ACM. ? pages 11[14] S. Bangalore, G. Di Fabbrizio, and A. Stent. Learning the Structure ofTask-Driven Human-Human Dialogs. In Proceedings of the 21stInternational Conference on Computational Linguistics and the 44thannual meeting of the Association for Computational Linguistics,COLING-ACL?06, pages 201?208. ACL, 2006. ? pages 8, 23, 39, 147[15] N. Bansal, A. Blum, and S. Chawla. Correlation Clustering. MachineLearning, 56:89?113, 2004. ISSN 1-3. ? pages 28, 48, 57[16] N. S. Baron. Always On: Language in an Online and Mobile World.Oxford ; New York : Oxford University Press, 2008. ? pages 1[17] R. Barzilay and M. Elhadad. Using Lexical Chains for TextSummarization. In Proceedings of the 35th Annual Meeting of the175Association for Computational Linguistics and the 8th European ChapterMeeting of the Association for Computational Linguistics, Workshop onIntelligent Scalable Test Summarization, pages 10?17, Madrid, 1997. ACL.? pages 121, 172[18] R. Barzilay and M. Lapata. Modeling local coherence: an entity-basedapproach. In Proceedings of the 43rd Annual Meeting on Association forComputational Linguistics, ACL ?05, pages 141?148, Ann Arbor,Michigan, 2005. Association for Computational Linguistics. ? pages 48[19] R. Barzilay and L. Lee. 
Catching the Drift: Probabilistic Content Models,with Applications to Generation and Summarization. In Proceedings of theHuman Language Technology Conference of the North American Chapterof the Association for Computational Linguistics, HLT-NAACL?04. ACL,2004. ? pages 31, 39[20] R. Barzilay and K. R. McKeown. Sentence Fusion for MultidocumentNews Summarization. Computational Linguistics, 31:297?328, September2005. ISSN 0891-2017. ? pages 168[21] D. Beeferman, A. Berger, and J. Lafferty. Statistical Models for TextSegmentation. In Machine Learning, volume 34, pages 177?210. KluwerAcademic Publishers, Feb. 1999. ? pages 9, 10, 81[22] O. Biran and O. Rambow. Identifying Justifications in Written Dialogs byClassifying Text as Argumentative. International Journal of SemanticComputing, 5(4):363?381, 2011. ? pages 119[23] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.? pages 153, 161[24] S. Blair-Goldensohn, K. McKeown, and O. Rambow. Building andRefining Rhetorical-Semantic Relation Models. In Proceedings of theHuman Language Technologies: The Annual Conference of the NorthAmerican Chapter of the Association for Computational Linguistics,HLT-NAACL?07, pages 428?435. ACL, 2007. ? pages 20, 106[25] D. Blei. Topic Modeling and Digital Humanities. Journal of DigitalHumanities, 2, 2013. ISSN 1. ? pages 170[26] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. The Journal ofMachine Learning Research, 3:993?1022, Mar. 2003. ISSN 1532-4435. ?pages 10, 13, 32, 45, 50, 51176[27] D. M. Blei and P. J. Moreno. Topic Segmentation with an Aspect HiddenMarkov Model. In Proceedings of the 24th annual international ACMSIGIR conference on Research and development in information retrieval,SIGIR ?01, pages 343?348, New Orleans, Louisiana, USA, 2001. ACM. ?pages 9, 10, 45[28] J. Blitzer. Domain Adaptation of Natural Language Processing Systems.PhD thesis, University of Pennsylvania, Pennsylvania, 2008. ? pages 119[29] J. Boyd-Graber and D. M. Blei. Syntactic Topic Models. In NeuralInformation Processing Systems, NIPS?08, 2008. ? pages 52[30] L. Breiman. Bagging Predictors. Machine Learning, 24(2):123?140, Aug.1996. ISSN 0885-6125. ? pages 65, 129[31] S. Brin and L. Page. Tha anatomy of a large-scale hypertextual web searchengine. In Computer Networks, volume 30(1-7), pages 107?117, 1998. ?pages 29[32] N. Cancedda, E. Gaussier, C. Goutte, and J. M. Renders. Word SequenceKernels. Journal of Machine Learning Research (JMLR), 3:1059?1082,2003. ? pages 149, 157[33] J. Carbonell and J. Goldstein. The Use of MMR, Diversity-basedReranking for Reordering Documents and Producing Summaries. InProceedings of the 21st annual international ACM SIGIR conference onResearch and development in information retrieval, pages 335?336,Melbourne, Australia, 1998. ACM. ? pages 46, 76[34] G. Carenini, R. T. Ng, and X. Zhou. Summarizing Email Conversationswith Clue Words. In Proceedings of the 16th international conference onWorld Wide Web, WWW?07, pages 91?100, Banff, Canada, 2007. ACM.? pages 26, 27, 49, 54, 169[35] G. Carenini, R. T. Ng, and X. Zhou. Summarizing Emails withConversational Cohesion and Subjectivity. In Proceedings of the 46thAnnual Meeting of the Association for Computational Linguistics: HumanLanguage Technologies, ACL-HLT?08, pages 353?361, OH, 2008. ACL.? pages 17, 27, 49, 55, 72, 154, 169, 172[36] G. Carenini, G. Murray, and R. Ng. Methods for Mining and SummarizingText Conversations, volume 3. Morgan Claypool, 2011. ? pages 1, 8, 25,26, 48, 77, 169, 172177[37] L. Carlson and D. Marcu. 
Discourse Tagging Reference Manual. TechnicalReport ISI-TR-545, University of Southern California InformationSciences Institute, 2001. URLhttp://www.isi.edu/?marcu/discourse/tagging-ref-manual.pdf. ? pages 136[38] L. Carlson, D. Marcu, and M. Okurowski. RST Discourse Treebank(RST-DT) LDC2002T07. Linguistic Data Consortium, Philadelphia, 2002.? pages 17, 21, 102, 105, 131[39] V. R. Carvalho and W. W. Cohen. On the Collective Classification of Email?Speech Acts?. In Proceedings of the 28th annual international ACMSIGIR conference on Research and development in information retrieval,SIGIR ?05, pages 345?352, New York, NY, USA, 2005. ACM. ? pages25, 151[40] A. Chaney and D. Blei. Visualizing Topic Models. In Proceedings of theSixth International AAAI Conference on Weblogs and Social Media,ICWSM?12, pages 419?422, 2012. ? pages 170[41] E. Charniak. A Maximum-Entropy-Inspired Parser. In Technical ReportCS-99-12, Brown University, Computer Science Department, 1999. ?pages 157[42] E. Charniak. A Maximum-Entropy-Inspired Parser. In Proceedings of the1st North American Chapter of the Association for ComputationalLinguistics Conference, NAACL?00, pages 132?139, Seattle, Washington,2000. ACL. ? pages 134[43] E. Charniak and M. Johnson. Coarse-to-Fine n-Best Parsing and MaxEntDiscriminative Reranking. In Proceedings of the 43rd Annual Meeting ofthe Association for Computational Linguistics, ACL?05, pages 173?180,NJ, USA, 2005. ACL. ? pages 134[44] F. Y. Y. Choi, P. W. Hastings, and J. Moore. Latent Semantic Analysis forText Segmentation. In Proceedings of the 2001 Conference on EmpiricalMethods in Natural Language Processing, EMNLP?01, pages 109?117,Pittsburgh, USA, 2001. ACL. ? pages 9, 10, 43, 44, 64[45] J. Christensen, Mausam, S. Soderland, and O. Etzioni. Towards CoherentMulti-Document Summarization. In Proceedings of the 2013 Conferenceof the North American Chapter of the Association for ComputationalLinguistics: Human Language Technologies, NAACL-HLT?13, pages1163?1173, Atlanta, Georgia, June 2013. ACL. ? pages 169, 172178[46] W. W. Cohen, V. R. Carvalho, and T. M. Mitchell. Learning to ClassifyEmail into ?Speech Acts?. In Proceedings of the 2004 Conference onEmpirical Methods in Natural Language Processing, EMNLP?04, pages309?316, 2004. ? pages 25, 151[47] M. Collins. Head-Driven Statistical Models for Natural Language Parsing.Computational Linguistics, 29(4):589?637, Dec. 2003. ISSN 0891-2017.? pages 135[48] M. Collins and N. Duffy. Convolution Kernels for Natural Language. InNeural Information Processing Systems, NIPS?01, pages 625?632,Vancouver, Canada, 2001. ? pages 149, 157[49] C. Cortes and V. N. Vapnik. Support Vector Networks. Machine Learning,20:273?297, 1995. ? pages 10, 61[50] D. Cristea, N. Ide, and L. Romary. Veins theory: A model of globaldiscourse cohesion and coherence. In In Proceedings of the 36th AnnualMeeting of the Association for Computational Linguistics and of the 17thInternational Conference on Computational Linguistics (COLING/ACL98),pages 281?285, 1998. ? pages 101[51] D. Crystal. Language and the Internet. Cambridge University Press, 2001.? pages 39[52] L. Danlos. D-STAG: a Discourse Analysis Formalism based onSynchronous TAGs. TAL, 50(1):111?143, 2009. ? pages 101[53] H. Daume. Frustratingly Easy Domain Adaptation. In Proceedings of the45th Annual Meeting of the Association for Computational Linguistics,ACL?07, pages 256?263, Prague, Czech Republic, 2007. ACL. ? pages136[54] H. Daume?, III and D. Marcu. A noisy-channel model for documentcompression. 
In Proceedings of the 40th Annual Meeting on Associationfor Computational Linguistics, ACL ?02, pages 449?456, Philadelphia,Pennsylvania, 2002. Association for Computational Linguistics.doi:10.3115/1073083.1073159. URLhttp://dx.doi.org/10.3115/1073083.1073159. ? pages 18, 101, 169, 172[55] R. Dhillon, S. Bhagat, H. Carvey, and E. Shriberg. Meeting RecorderProject: Dialog Act Labeling Guide. Technical report, ICSI Tech. Report,2004. URL http:179//www.icsi.berkeley.edu/ftp/global/pub/speech/papers/MRDA-manual.pdf.? pages 153[56] G. Dias, E. Alves, and J. G. P. Lopes. Topic Segmentation Algorithms forText Summarization and Passage Retrieval: an Exhaustive Evaluation. InProceedings of the 22nd national conference on Artificial intelligence -Volume 2, pages 1334?1339, Vancouver, BC, Canada, 2007. AAAI. ?pages 8, 39, 168[57] A. Dielmann and S. Renals. DBN based Joint Dialogue Act Recognition ofMultiparty Meetings. In Proceedings of the IEEE International Conferenceon Acoustics Speech and Signal Processing, ICASSP ?07, 2007. ? pages23, 147[58] M. Dork, D. Gruen, C. Williamson, and S. Carpendale. A VisualBackchannel for Large-Scale Events. IEEE Transactions on Visualizationand Computer Graphics, 16(6):1129?1138, Nov. 2010. ? pages 170, 171[59] G. Durrett, D. Hall, and D. Klein. Decentralized entity-level modeling forcoreference resolution. In Proceedings of the 51st Annual Meeting of theAssociation for Computational Linguistics (Volume 1: Long Papers), pages114?124, Sofia, Bulgaria, August 2013. Association for ComputationalLinguistics. ? pages 31[60] D. duVerle and H. Prendinger. A Novel Discourse Parser based on SupportVector Machine Classification. In Proceedings of the Joint Conference ofthe 47th Annual Meeting of the ACL and the 4th International JointConference on Natural Language Processing of the AFNLP, pages665?673, Suntec, Singapore, 2009. ACL. ? pages 111, 118, 141[61] J. Eisenstein. Hierarchical Text Segmentation from Multi-scale LexicalCohesion. In Proceedings of Human Language Technologies: The 2009Annual Conference of the North American Chapter of the Association forComputational Linguistics, NAACL ?09, pages 353?361, Boulder,Colorado, 2009. ACL. ? pages 9, 10, 31, 45[62] J. Eisenstein and R. Barzilay. Bayesian Unsupervised Topic Segmentation.In Proceedings of the Conference on Empirical Methods in NaturalLanguage Processing, EMNLP ?08, pages 334?343, Honolulu, Hawaii,2008. ACL. ? pages 10, 15, 45, 60180[63] M. Elsner and E. Charniak. Disentangling Chat. ComputationalLinguistics, 36:389?409, 2010. ISSN 3. ? pages 23, 26, 28, 38, 48, 60, 64,81, 82, 153, 156, 157[64] M. Elsner and E. Charniak. Disentangling Chat with Local CoherenceModels. In Proceedings of the 49th Annual Meeting of the Association forComputational Linguistics: Human Language Technologies - Volume 1,HLT ?11, pages 1179?1189, Portland, Oregon, 2011. ACL. ? pages 48[65] G. Erkan and D. Radev. LexRank: Graph-based Lexical Centrality asSalience in Text Summarization. Journal of Artificial IntelligenceResearch, 22:457?479, 2004. ? pages 28, 29, 172[66] A. Eshghi and P. G. Healey. What is conversation? distinguishing dialoguecontexts. In In THE ANNUAL MEETING OF THE COGNITIVE SCIENCESOCIETY (CogSci 2009), pages 1240?1245, 2009. ? pages 48[67] C. Fellbaum. WordNet - An Electronic Lexical Database. Cambridge, MA,1998. MIT Press. ? pages 122[68] D. Feng, J. Kim, E. Shaw, and E. Hovy. Towards modeling threadeddiscussions using induced ontology knowledge. 
In proceedings of the 21stnational conference on Artificial intelligence - Volume 2, AAAI?06, pages1289?1294. AAAI Press, 2006. ISBN 978-1-57735-281-5. URLhttp://dl.acm.org/citation.cfm?id=1597348.1597393. ? pages 46[69] V. Feng and G. Hirst. Text-level Discourse Parsing with Rich LinguisticFeatures. In Proceedings of the 50th Annual Meeting of the Association forComputational Linguistics, ACL ?12, pages 60?68, Jeju Island, Korea,2012. ACL. ? pages 21, 31, 103, 108[70] O. Ferschke, I. Gurevych, and Y. Chebotar. Behind the Article:Recognizing Dialog Acts in Wikipedia Talk Pages. In Proceedings of the13th Conference of the European Chapter of the Association forComputational Linguistics, EACL ?12, pages 777?786, Avignon, France,2012. ACL. ? pages 151[71] J. Finkel, A. Kleeman, and C. Manning. Efficient, Feature-based,Conditional Random Field Parsing. In Proceedings of the 46th AnnualMeeting of the Association for Computational Linguistics, ACL?08, pages959?967, Columbus, Ohio, USA, 2008. ACL. ? pages 113181[72] S. Fisher and B. Roark. The Utility of Parse-derived Features forAutomatic Discourse Segmentation. In Proceedings of the 45th AnnualMeeting of the Association for Computational Linguistics, ACL?07, pages488?495, Prague, Czech Republic, 2007. ACL. ? pages 19, 102, 108, 126,128, 130, 132, 134, 135[73] M. Galley and K. McKeown. Improving Word Sense Disambiguation inLexical Chaining. In Proceedings of the 18th International JointConference on Artificial Intelligence, IJCAI?03, pages 1486?1488,Acapulco, Mexico, 2003. ? pages 121[74] M. Galley, K. McKeown, E. Fosler-Lussier, and H. Jing. DiscourseSegmentation of Multi-party Conversation. In Proceedings of the 41stAnnual Meeting on Association for Computational Linguistics - Volume 1,ACL ?03, pages 562?569, Sapporo, Japan, 2003. ACL. ? pages 9, 10, 11,13, 32, 38, 43, 44, 50, 60, 62, 64, 65[75] S. Ghosh, R. Johansson, G. Riccardi, and S. Tonelli. Shallow DiscourseParsing with Conditional Random Fields. In Proceedings of the 5thInternational Joint Conference on Natural Language Processing,IJCNLP?11, pages 1071?1079, Chiang Mai, Thailand, 2011. AFNLP. ?pages 31, 113[76] T. L. Griffiths, M. Steyvers, D. M. Blei, and J. B. Tenenbaum. Integratingtopics and syntax. In Advances in Neural Information Processing Systems,NIPS?05, pages 537?544. MIT Press, 2005. ? pages 52[77] A. Gruenstein, J. Niekrasz, and M. Purver. Meeting structure annotation ?annotations collected with a general purpose toolkit. In Recent Trends inDiscourse and Dialogue, 39:247?274, 2008. ? pages 11[78] A. D. Haghighi, A. Y. Ng, and C. D. Manning. Robust textual inference viagraph matching. In Proceedings of the conference on Human LanguageTechnology and Empirical Methods in Natural Language Processing, HLT?05, pages 387?394, Vancouver, British Columbia, Canada, 2005.Association for Computational Linguistics. ? pages 28[79] M. Halliday and R. Hasan. Cohesion in English. Longman, London, 1976.? pages 5[80] S. Harabagiu and F. Lacatusu. Topic Themes for Multi-documentSummarization. In Proceedings of the 28th annual international ACM182SIGIR conference on Research and development in information retrieval,pages 202?209, Salvador, Brazil, 2005. ACM. ? pages 39, 168[81] C. Hardmeier. Discourse in statistical machine translation: A survey and acase study. Discours, 2012. ? pages 171[82] C. Hardmeier, J. Nivre, and J. Tiedemann. Document-wide decoding forphrase-based statistical machine translation. 
In Proceedings of the 2012Joint Conference on Empirical Methods in Natural Language Processingand Computational Natural Language Learning, EMNLP-CoNLL ?12,pages 1179?1190, Jeju Island, Korea, 2012. Association for ComputationalLinguistics. ? pages 171[83] M. A. Hearst. TextTiling: Segmenting Text into Multi-paragraph SubtopicPassages. Computational Linguistics, 23(1):33?64, March 1997. ISSN0891-2017. ? pages 8, 9, 43, 51, 62[84] H. Hernault, H. Prendinger, D. duVerle, and M. Ishizuka. HILDA: ADiscourse Parser Using Support Vector Machine Classification. Dialogueand Discourse, 1(3):1?33, 2010. ? pages 19, 21, 103, 108, 109, 117,119, 125, 134, 135, 136[85] T. Hirao, , J. Suzuki, H. Isozaki, and E. Maeda. Dependency-basedSentence Alignment for Multiple Document Summarization. InProceedings of the 20th international conference on computationallinguistics, COLING?04, pages 446?452, Geneva, Switzerland, 2004. ACL.? pages 149, 157[86] G. Hirst and D. St-Onge. Lexical Chains as Representation of Context forthe Detection and Correction of Malapropisms. In Christiane Fellbaum,editor, WordNet: An Electronic Lexical Database and Some of itsApplications, pages 305?332. MIT press, 1997. ? pages 121[87] J. Hobbs. Coherence and coreference. Cognitive Science, 3:67?90, 1979.ISSN 1. ? pages 5[88] E. Hovy, C. Y. Lin, and L. Zhou. A BE-based Multi-document Summarizerwith Query Interpretation. In Proceedings of Document UnderstandingConference, DUC?05, Vancouver, Canada, 2005. ? pages 149, 157[89] E. Hovy, C. Y. Lin, L. Zhou, and J. Fukumoto. Automated SummarizationEvaluation with Basic Elements. In Proceedings of the Fifth Conference onLanguage Resources and Evaluation, LREC?06, Genoa, Italy, 2006. ?pages 157183[90] P. Hsueh, J. D. Moore, and S. Renals. Automatic Segmentation ofMultiparty Dialogue. In the Proceedings of the 11th Conference of theEuropean Chapter of the Association for Computational Linguistics,EACL?06, Trento, Italy, 2006. ACL. ? pages 44[91] A. Hulth. Improved Automatic Keyword Extraction Given More LinguisticKnowledge. In Proceedings of the 2003 conference on Empirical methodsin natural language processing, EMNLP ?03, pages 216?223. ACL, 2003.? pages 16, 47, 75[92] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin,T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters. The ICSI MeetingCorpus. In Proceedings of IEEE International Conference on Acoustics,Speech, and Signal Processing (ICASSP-03), pages 364?367, 2003. ?pages 11, 38[93] M. Jeong, C.-Y. Lin, and G. G. Lee. Semi-supervised speech actrecognition in emails and forums. In Proceedings of the 2009 conferenceon Empirical methods in natural language processing, EMNLP?09, 2009.? pages 25, 150, 151, 153, 157[94] S. Joty, G. Carenini, G. Murray, and R. T. Ng. Exploiting ConversationStructure in Unsupervised Topic Segmentation for Emails. In Proceedingsof the conference on Empirical Methods in Natural Language Processing,EMNLP?10, pages 388?398, Massachusetts, USA, 2010. ACL. ? pages37, 56[95] S. Joty, G. Carenini, and C.-Y. Lin. Unsupervised Modeling of Dialog Actsin Asynchronous Conversations. In Proceedings of the twenty secondInternational Joint Conference on Artificial Intelligence, IJCAI?11,Barcelona, 2011. ? pages 49, 55, 146[96] S. Joty, G. Carenini, G. Murray, and R. T. Ng. Supervised TopicSegmentation of Email Conversations. In Proceedings of the FifthInternational AAAI Conference on Weblogs and Social Media, ICWSM?11,pages 530?533, Barcelona, Spain, 2011. AAAI. ? pages 37, 65[97] S. Joty, G. Carenini, and R. T. 
Ng. A Novel Discriminative Framework forSentence-Level Discourse Analysis. In Proceedings of the 2012 JointConference on Empirical Methods in Natural Language Processing andComputational Natural Language Learning, EMNLP-CoNLL ?12, pages904?915, Jeju Island, Korea, 2012. ACL. ? pages 100184[98] S. Joty, G. Carenini, and R. T. Ng. Topic Segmentation and Labeling inAsynchronous Conversations. Journal of Artificial Intelligence Research(JAIR), 47:521?573, 2013. ? pages 37[99] S. Joty, G. Carenini, R. T. Ng, and Y. Mehdad. Combining Intra- andMulti-sentential Rhetorical Parsing for Document-level DiscourseAnalysis. In Proceedings of the 51st Annual Meeting of the Association forComputational Linguistics, ACL ?13, Sofia, Bulgaria, 2013. ACL. ? pages100[100] D. Jurafsky and J. Martin. Speech and Language Processing, chapter 14.Prentice Hall, 2008. ? pages 4, 31, 124[101] S. Kim, T. Baldwin, and M. Kan. Evaluating N-gram Based EvaluationMetrics for Automatic Keyphrase Extraction. In Proceedings of the 23rdInternational Conference on Computational Linguistics, COLING?10,pages 572?580, Beijing, China, 2010. ACL. ? pages 83[102] S. N. Kim, L. Cavedon, and T. Baldwin. Classifying Dialogue Acts inOne-on-one Live Chats. In Proceedings of the 2010 Conference onEmpirical Methods in Natural Language Processing, EMNLP?10. ACL,2010. ? pages 23, 31, 147, 150, 156[103] S. N. Kim, O. Medelyan, M.-Y. Kan, and T. Baldwin. SemEval-2010 Task5 : Automatic Keyphrase Extraction from Scientific Articles. InProceedings of the 5th International Workshop on Semantic Evaluation,pages 21?26, Uppsala, Sweden, July 2010. ACL. ? pages 92[104] T. Kleinbauer, S. Becker, and T. Becker. Combining Multiple InformationLayers for the Automatic Generation of Indicative Meeting Abstracts. InProceedings of the Eleventh European Workshop on Natural LanguageGeneration, ENLG?07, pages 151?154, Stroudsburg, PA, USA, 2007.ACL. ? pages 8, 39, 168[105] J. M. Kleinberg. Authoratative sources in a hyperlinked environment. InACM, volume 46(5), 1999. ? pages 29[106] A. Knott and R. Dale. Using Linguistic Phenomena to Motivate a Set ofCoherence Relations. Discourse Processes, 18(1):35?62, 1994. ? pages119[107] J. Kola?r?. A Comparison of Language Models for Dialog Act Segmentationof Meeting Transcripts. In Proceedings of the 11th international185conference on Text, Speech and Dialogue, TSD ?08, pages 117?124, Berlin,Heidelberg, 2008. Springer-Verlag. ? pages 23, 147[108] D. Koller and N. Friedman. Probabilistic Graphical Models Principles andTechniques. The MIT Press, 2009. ? pages 3, 29[109] B. Krishnapuram, L. Carin, M. A. T. Figueiredo, S. Member, and E. J.Hartemink. Sparse Multinomial Logistic Regression: Fast Algorithms andReneralization Bounds. IEEE Transactions on Pattern Analysis andMachine Intelligence, 27:957?968, 2005. ISSN 6. ? pages 61[110] J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields:Probabilistic Models for Segmenting and Labeling Sequence Data. InProceedings of the Eighteenth International Conference on MachineLearning, pages 282?289, San Francisco, CA, USA, 2001. MorganKaufmann Publishers Inc. ? pages 10, 113, 129[111] J. Lau, K. Grieser, D. Newman, and T. Baldwin. Automatic Labelling ofTopic Models. In Proceedings of the 49th annual meeting on Associationfor Computational Linguistics, ACL?11, pages 1536?1545, Portland, USA,2011. ACL. ? pages 46, 68[112] A. Lazaridou, I. Titov, and C. Sporleder. A Bayesian Model for JointUnsupervised Induction of Sentiment, Aspect and DiscourseRepresentations. 
In Proceedings of the 51st Annual Meeting of theAssociation for Computational Linguistics, ACL ?13, Sofia, Bulgaria, 2013.ACL. ? pages 18, 101, 169[113] H. LeThanh, G. Abeysinghe, and C. Huyck. Generating DiscourseStructures for Written Texts. In Proceedings of the 20th internationalconference on Computational Linguistics, COLING ?04, Geneva,Switzerland, 2004. ACL. doi:10.3115/1220355.1220403. URLhttp://dx.doi.org/10.3115/1220355.1220403. ? pages 19[114] C.-Y. Lin. ROUGE: A Package for Automatic Evaluation of Summaries. InProceedings of Workshop on Text Summarization Branches Out, pages74?81, Barcelona, 2004. ? pages 83[115] C. Y. Lin and E. Hovy. The Automated Acquisition of Topic Signatures forText Summarization. In Proceedings of the 18th conference onComputational linguistics, pages 495?501. ACL, 2000. ? pages 168186[116] W.-H. Lin, E. Xing, and A. Hauptmann. A joint topic and perspectivemodel for ideological discourse. In Proceedings of the Europeanconference on Machine Learning and Knowledge Discovery in Databases -Part II, ECML PKDD ?08, pages 17?32, Antwerp, Belgium, 2008.Springer-Verlag. ISBN 978-3-540-87480-5. ? pages 170[117] S. Liu, M. X. Zhou, S. Pan, Y. Song, W. Qian, W. Cai, and X. Lian.TIARA: Interactive, Topic-based Visual Text Summarization and Analysis.ACM Trans. Intell. Syst. Technol., 3(2):25:1?25:28, Feb. 2012. ISSN2157-6904. ? pages 39, 170[118] A. Louis, A. Joshi, and A. Nenkova. Discourse Indicators for ContentSelection in Summarization. In Proceedings of the 11th Annual Meeting ofthe Special Interest Group on Discourse and Dialogue, SIGDIAL ?10,pages 147?156, Tokyo, Japan, 2010. ACL. ? pages 18, 101, 169, 172[119] W. Magdy. Tweetmogaz: a news portal of tweets. In Proceedings of the36th international ACM SIGIR conference on Research and development ininformation retrieval, SIGIR ?13, pages 1095?1096. ACM, 2013. ? pages2[120] D. Magerman. Statistical Decision-tree Models for Parsing. In Proceedingsof the 33rd annual meeting on Association for Computational Linguistics,ACL?95, pages 276?283, Cambridge, Massachusetts, 1995. ACL. ? pages135[121] I. Malioutov and R. Barzilay. Minimum Cut Model for Spoken LectureSegmentation. In Proceedings of the 21st International Conference onComputational Linguistics and the 44th annual meeting of the Associationfor Computational Linguistics, COLING-ACL?06, pages 25?32, Sydney,Australia, 2006. ACL. ? pages 9, 10, 28, 44, 57, 86, 156[122] W. Mann and S. Thompson. Rhetorical Structure Theory: Toward aFunctional Theory of Text Organization. Text, 8(3):243?281, 1988. ?pages 17, 101, 131[123] C. D. Manning, P. Raghavan, and H. Schutze. Introduction to InformationRetrieval. Cambridge University Press, 2008. ? pages 46[124] D. Marcu. A Decision-based Approach to Rhetorical Parsing. InProceedings of the 37th annual meeting of the Association forComputational Linguistics on Computational Linguistics, ACL?99, pages365?372, Morristown, NJ, USA, 1999. ACL. ? pages 20, 106, 109187[125] D. Marcu. The Rhetorical Parsing of Unrestricted Texts: A Surface-basedApproach. Computational Linguistics, 26:395?448, 2000. ? pages 20,105, 119[126] D. Marcu. The Theory and Practice of Discourse Parsing andSummarization. MIT Press, Cambridge, MA, USA, 2000. ? pages 18,101, 125, 132, 133, 169, 172[127] D. Marcu and A. Echihabi. An Unsupervised Approach to RecognizingDiscourse Relations. In Proceedings of the 40th Annual Meeting onAssociation for Computational Linguistics, ACL?02, pages 368?375. ACL,2002. ? pages 20, 105, 106[128] M. Marcus, B. Santorini, and M. 
Marcinkiewicz. Building a LargeAnnotated Corpus of English: The Penn Treebank. ComputationalLinguistics, 19(2):313?330, 1994. ? pages 131[129] J. Martin. English Text: System and Structure. John Benjamins,Philadelphia/Amsterdam, 1992. ? pages 101[130] M. Maslennikov and T.-S. Chua. A multi-resolution framework forinformation extraction from free text. In Proceedings of the 45th AnnualMeeting of the Association of Computational Linguistics, pages 592?599,Prague, Czech Republic, June 2007. Association for ComputationalLinguistics. URL http://www.aclweb.org/anthology/P07-1075. ? pages18, 170[131] E. Mayfield, D. Adamson, and C. P. Rose?. Hierarchical ConversationStructure Prediction in Multi-party Chat. In Proceedings of the 13thAnnual Meeting of the Special Interest Group on Discourse and Dialogue,SIGDIAL ?12, pages 60?69. ACL, 2012. ? pages 23, 26, 48[132] A. McCallum. MALLET: A Machine Learning for Language Toolkit.http://mallet.cs.umass.edu, 2002. ? pages 117[133] K. Mckeown, J. Hirschberg, M. Galley, and S. Maskey. From Text toSpeech Summarization. In Proceedings of the IEEE InternationalConference on Acoustics, Speech, and Signal Processing, ICASSP?05,2005. ? pages 169[134] O. Medelyan. Human-Competitive Automatic Topic Indexing. PhD thesis,The University of Waikato, Hamilton, New Zealand, 2009. ? pages 47,68, 75188[135] O. Medelyan, E. Frank, and I. H. Witten. Human-Competitive Taggingusing Automatic Keyphrase Extraction. In Proceedings of the 2009Conference on Empirical Methods in Natural Language Processing,EMNLP?09, pages 1318?1327, Singapore, 2009. ACL. ? pages 16, 47,83, 91[136] Y. Mehdad, G. Carenini, R. T. Ng, and S. Joty. Towards topic labeling withphrase entailment and aggregation. In Proceedings of the 2013 Conferenceof the North American Chapter of the Association for ComputationalLinguistics: Human Language Technologies, pages 179?189, Atlanta,Georgia, June 2013. Association for Computational Linguistics. ? pages96, 169[137] Q. Mei, X. Shen, and C. Zhai. Automatic labeling of Multinomial topicmodels. In Proceedings of the 13th ACM SIGKDD internationalconference on Knowledge discovery and data mining, pages 490?499,California, USA, 2007. ACM. ? pages 16, 45, 67, 68, 71, 75[138] R. Mihalcea. Unsupervised large-vocabulary word sense disambiguationwith graph-based algorithms for sequence data labeling. In Proceedings ofthe conference on Human Language Technology and Empirical Methods inNatural Language Processing, HLT ?05, pages 411?418, Vancouver,British Columbia, Canada, 2005. Association for ComputationalLinguistics. doi:10.3115/1220575.1220627. URLhttp://dx.doi.org/10.3115/1220575.1220627. ? pages 28[139] R. Mihalcea and D. Radev. Graph-based Natural Language Processingand Information Retrieval. Cambridge University Press, 2011. ? pages 3,16, 17, 29, 39, 47, 68[140] R. Mihalcea and P. Tarau. TextRank: Bringing Order into Text. InProceedings of the 2004 Conference on Empirical Methods in NaturalLanguage Processing, EMNLP?04, pages 404?411, Barcelona, Spain,2004. ? pages 16, 28, 29, 47, 69, 71, 74, 83, 91[141] T. Minka. The Dirichlet-tree Distribution. Technical report, JustsystemPittsburgh Research Center, 1999. URLhttp://research.microsoft.com/?minka/papers/dirichlet/minkadirtree.pdf. ?pages 58[142] Y. Mizuta, A. Korhonen, T. Mullen, and N. Collier. Zone analysis inbiology articles as a basis for information extraction. I. J. MedicalInformatics, pages 468?487, 2006. ? pages 170189[143] J. Morris and G. Hirst. 
Lexical Cohesion Computed by Thesaural Relationsas an Indicator of Structure of Text. Computational Linguistics, 17(1):21?48, 1991. ? pages 9, 43, 50, 121[144] J. Mu, K. Stegmann, E. Mayfiled, C. Rose?, and F. Fischer. The acodeaframework: Developing segmentation and classification schemes for fullyautomatic analysis of online discussions. International Journal OfComputer-supported Collaborative Learning, 7(2):285?305, 2012. ?pages 171[145] K. Murphy. Machine Learning A Probabilistic Perspective. The MITPress, 2012. ? pages 30, 31, 89, 113[146] G. Murray and G. Carenini. Summarizing Spoken and WrittenConversations. In Proceedings of EMNLP, Honolulu, Hawaii, 2008. ?pages 169[147] G. Murray and S. Renals. Detecting Action Items in Meetings. InProceedings of the 5th international workshop on Machine Learning forMultimodal Interaction, MLMI ?08, pages 208?213, Utrecht, TheNetherlands, 2008. Springer-Verlag. ? pages 170[148] G. Murray, S. Renals, J. Carletta, and J. Moore. Incorporating Speaker andDiscourse Features into Speech Summarization. In Proceedings of theHuman Language Technology Conference of the North American Chapterof the Association for Computational Linguistics, HLT-NAACL?06, 2006.? pages 22, 147, 168[149] G. Murray, T. Kleinbauer, P. Poller, T. Becker, S. Renals, and J. Kilgour.Extrinsic Summarization Evaluation: A Decision Audit Task. ACM Trans.Speech Lang. Process., 6:2:1?2:29, October 2009. ISSN 1550-4875. ?pages 169[150] G. Murray, G. Carenini, and R. Ng. Interpretation and Transformation forAbstracting Conversations. In Human Language Technologies: The 2010Annual Conference of the North American Chapter of the Association forComputational Linguistics, HLT ?10, pages 894?902, Los Angeles,California, 2010. ACL. ? pages 169[151] G. Murray, G. Carenini, and R. T. Ng. Generating and Validating Abstractsof Meeting Conversations: a User Study. In Proceedings of the 6thInternational Natural Language Generation Conference, INLG?10, 2010.? pages 22, 147, 168, 169190[152] D. Nguyen, E. Mayfield, and C. P. Rose?. An analysis of perspectives ininteractive settings. In Proceedings of the First Workshop on Social MediaAnalytics, SOMA ?10, pages 44?52, Washington D.C., District ofColumbia, 2010. ACM. ? pages 170[153] V.-A. Nguyen, J. Boyd-Graber, and P. Resnik. SITS: A HierarchicalNonparametric Model using Speaker Identity for Topic Segmentation inMultiparty Conversations. In Proceedings of the 50th Annual Meeting ofthe Association for Computational Linguistics: Long Papers - Volume 1,ACL ?12, pages 78?87, Jeju Island, Korea, 2012. ACL. ? pages 10, 45[154] J. Otterbacher, G. Erkan, and D. R. Radev. Using Random Walks forQuestion-focused Sentence Retrieval. In Proceedings of Human LanguageTechnology Conference and Conference on Empirical Methods in NaturalLanguage Processing, pages 915?922, Vancouver, Canada, 2005. ? pages28, 29[155] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank CitationRanking: Bringing Order to the Web. Technical Report 1999-66, StanfordInfoLab, 1999. ? pages 16, 46, 69, 72[156] S. Pan, M. X. Zhou, Y. Song, W. Qian, F. Wang, and S. Liu. OptimizingTemporal Topic Segmentation for Intelligent Text Visualization. InProceedings of the 2013 international conference on Intelligent userinterfaces, IUI ?13, pages 339?350, Santa Monica, California, USA, 2013.ACM. ? pages 170[157] B. Pang and L. Lee. A sentimental education: sentiment analysis usingsubjectivity summarization based on minimum cuts. 
In Proceedings of the42nd Annual Meeting on Association for Computational Linguistics, ACL?04, Barcelona, Spain, 2004. Association for Computational Linguistics.? pages 28, 170[158] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A Method forAutomatic Evaluation of Machine Translation. In Proceedings of the 40thAnnual Meeting on Association for Computational Linguistics, ACL?02,pages 311?318, Philadelphia, Pennsylvania, 2002. ACL. ? pages 83[159] R. J. Passonneau and D. J. Litman. Discourse Segmentation by Human andAutomated Means. Computational Linguistics, 23(1):103?139, Mar. 1997.ISSN 0891-2017. ? pages 60191[160] M. J. Paul. Mixed Membership Markov Models for UnsupervisedConversation Modeling. In Proceedings of the 2012 Joint Conference onEmpirical Methods in Natural Language Processing and ComputationalNatural Language Learning, EMNLP-CoNLL ?12, pages 94?104,Stroudsburg, PA, USA, 2012. ACL. ? pages 152[161] T. Pedersen, S. Patwardhan, and J. Michelizzi. WordNet::Similarity -Measuring the Relatedness of Concepts. In Proceedings of Fifth AnnualMeeting of the North American Chapter of the Association forComputational Linguistics (NAACL-04), pages 38?41, Boston, MA, 2004.? pages 84[162] L. Pevzner and M. A. Hearst. A Critique and Improvement of anEvaluation Metric for Text Segmentation. Computational Linguistics, 28(1):19?36, Mar. 2002. ISSN 0891-2017. ? pages 81[163] R. Prasad, A. Joshi, N. Dinesh, A. Lee, E. Miltsakaki, and B. Webber. ThePenn Discourse TreeBank as a Resource for Natural Language Generation.In Proceedings of the Corpus Linguistics Workshop on Using Corpora forNatural Language Generation, pages 25?32, Birmingham, U.K., 2005. ?pages 18, 101[164] R. Prasad, N. Dinesh, A. Lee, E. Miltsakaki, L. Robaldo, A. Joshi, andB. Webber. The Penn Discourse TreeBank 2.0. In Proceedings of the SixthInternational Conference on Language Resources and Evaluation (LREC),pages 2961?2968, Marrakech, Morocco, 2008. ELRA. ISBN2-9517408-4-0. ? pages x, 31, 113[165] M. Purver. Topic Segmentation. In G. Tur and R. de Mori, editors, SpokenLanguage Understanding: Systems for Extracting Semantic Informationfrom Speech, pages 291?317. Wiley, 2011. ISBN 978-0-470-68824-3. ?pages 9, 11, 38, 43, 50, 77[166] M. Purver, P. Ehlen, and J. Niekrasz. Detecting Action Items in Multi-partyMeetings: Annotation and Initial Experiments. In Proceedings of the Thirdinternational conference on Machine Learning for Multimodal Interaction,MLMI?06, pages 200?211, Bethesda, MD, 2006. Springer-Verlag. ?pages 170[167] M. Purver, K. P. Kording, T. L. Griffiths, and J. B. Tenenbaum.Unsupervised Topic Modelling for Multi-Party Spoken Discourse. InProceedings of the 21st International Conference on Computational192Linguistics and the 44th annual meeting of the Association forComputational Linguistics, COLING-ACL?06, pages 17?24, Sydney,Australia, 2006. ACL. ? pages 9, 10, 45[168] L. R. Rabiner. A Tutorial on Hidden Markov Models and SelectedApplications in Speech Recognition. In Proceedings of the IEEE, pages257?285, 1989. ? pages 212[169] O. Rambow, L. Shrestha, J. Chen, and C. Lauridsen. Summarizing EmailThreads. In Proceedings of the Human Language Technology Conferenceof the North American Chapter of the Association for ComputationalLinguistics, HLT-NAACL-Short ?04, pages 105?108. ACL, 2004. ? pages169[170] R. Ranganath, D. Jurafsky, and D. Mcfarland. It?s Not You, it?s Me:Detecting Flirting and its Misperception in Speed-Dates. 
In Proceedings ofthe 2009 Conference on Empirical Methods in Natural LanguageProcessing, EMNLP?09, pages 334?342, Singapore, 2009. ACL. ? pages147[171] V. K. Rangarajan Sridhar, S. Bangalore, and S. Narayanan. CombiningLexical, Syntactic and Prosodic Cues for Improved Online Dialog ActTagging. Comput. Speech Lang., 23:407?422, October 2009. ISSN0885-2308. ? pages 147[172] S. Rashid. A Visual Interface for Browsing and SummarizingConversations. Master?s thesis, University of British Columbia, Vancouver,2012. ? pages 170[173] S. Ravi and J. Kim. Profiling Student Interactions in Threaded Discussionswith Speech Act Classifiers. In Proceedings of AI in EducationConference, AIED?07, 2007. ? pages 151[174] A. Ritter, C. Cherry, and B. Dolan. Unsupervised Modeling of TwitterConversations. In Human Language Technologies: The 2010 AnnualConference of the North American Chapter of the Association forComputational Linguistics, HLT ?10, pages 172?180, LA, California, 2010.ACL. ? pages 25, 35, 150, 152, 156, 159, 160[175] C. Rose?, Y. chia Wang, J. Arguello, K. Stegmann, A. Weinberger, andF. Fischer. Analyzing collaborative learning processes automatically:Exploiting the advances of computational linguistics in computer-supported193collaborative learning. International Journal Of Computer-supportedCollaborative Learning, 3(3):237?271, 2008. ? pages 171[176] C. P. Rose?, B. Di Eugenio, L. S. Levin, and C. Van Ess-Dykema. Discourseprocessing of dialogues with multiple threads. In Proceedings of the 33rdannual meeting on Association for Computational Linguistics, ACL ?95,pages 31?38, Stroudsburg, PA, USA, 1995. Association for ComputationalLinguistics. doi:10.3115/981658.981663. URLhttp://dx.doi.org/10.3115/981658.981663. ? pages 26[177] H. Sacks, A. Schegloff, and G. Jefferson. A Simplest Systematics for theOrganization of Turn-taking for Conversation. Language, 50:696?735,1974. ? pages 22, 78[178] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval.McGraw-Hill, Inc., New York, NY, USA, 1986. ISBN 0070544840. ?pages 9, 62, 71, 156[179] H. Schauer and U. Hahn. Anaphoric Cues for Coherence Relations. InProceedings of the Conference on Recent Advances in Natural LanguageProcessing, RANLP ?01, pages 228?234, 2001. ? pages 105[180] A. Schegloff. Sequencing in conversational openings1. AmericanAnthropologist, 70(6):1075?1095, 1968. ISSN 1548-1433. ? pages 22,147[181] F. Schilder. Robust Discourse Parsing via Discourse Markers, Topicalityand Position. Natural Language Engineering, 8(3):235?255, June 2002.ISSN 1351-3249. ? pages 141[182] E. Seneta. Non-negative Matrices and Markov Chains. Springer-Verlag,1981. ? pages 72[183] F. Sha and F. Pereira. Shallow Parsing with Conditional Random Fields. InProceedings of the 2003 Conference of the North American Chapter of theAssociation for Computational Linguistics on Human LanguageTechnology - Volume 1, NAACL-HLT?03, pages 134?141, Edmonton,Canada, 2003. ACL. ? pages 113[184] D. Shen, Q. Yang, J.-T. Sun, and Z. Chen. Thread detection in dynamic textmessage streams. In Proceedings of the 29th annual international ACMSIGIR conference on Research and development in information retrieval,SIGIR ?06, pages 35?42, Seattle, Washington, USA, 2006. ACM. ISBN1-59593-369-7. ? pages 23, 26, 47194[185] J. Shi and J. Malik. Normalized Cuts and Image Segmentation. IEEETransactions on Pattern Analysis and Machine Intelligence, 22(8):888?905, 2000. ISSN 0162-8828. ? pages 28, 44, 56, 57[186] L. Shrestha and K. McKeown. 
Detection of Question-Answer Pairs inEmail Conversations. In Proceedings of the 20th international conferenceon Computational Linguistics, COLING ?04, Morristown, NJ, USA, 2004.ACL. ? pages 151[187] H. G. Silber and K. F. McCoy. Efficiently Computed Lexical Chains As anIntermediate Representation for Automatic Text Summarization.Computational Linguistics, 28(4):487?496, 2002. ? pages 121[188] N. A. Smith. Linguistic Structure Prediction. Synthesis Lectures on HumanLanguage Technologies. Morgan and Claypool, May 2011. ? pages 21[189] S. Somasundaran. Discourse-Level Relations for Opinion Analysis. PhDthesis, University of Pittsburgh, Pittsburgh, 2010. ? pages 18, 101, 169[190] W. M. Soon, H. T. Ng, and D. C. Y. Lim. A Machine Learning Approach toCoreference Resolution of Noun Phrases. Computational Linguistics, 27(4):521?544, Dec. 2001. ISSN 0891-2017. ? pages 28, 60, 156[191] R. Soricut and D. Marcu. Sentence Level Discourse Parsing UsingSyntactic and Lexical Information. In Proceedings of the 2003 Conferenceof the North American Chapter of the Association for ComputationalLinguistics on Human Language Technology - Volume 1, NAACL?03, pages149?156, Edmonton, Canada, 2003. ACL. ? pages 19, 21, 103, 105, 107,110, 111, 117, 119, 125, 126, 128, 132, 134, 135, 136, 139[192] C. Sporleder. Manually vs. Automatically Labelled Data in DiscourseRelation Classification. Effects of Example and Feature Selection. LDVForum, 22(1):1?20, 2007. ? pages 106[193] C. Sporleder and M. Lapata. Automatic Paragraph Identification: A Studyacross Languages and Domains. In Proceedings of the 2004 Conference onEmpirical Methods in Natural Language Processing, EMNLP ?04, pages72?79, 2004. ? pages 105, 110, 121, 123[194] C. Sporleder and M. Lapata. Discourse Chunking and its Application toSentence Compression. In Proceedings of the conference on HumanLanguage Technology and Empirical Methods in Natural LanguageProcessing, HLT-EMNLP?05, pages 257?264, Vancouver, BritishColumbia, Canada, 2005. ACL. ? pages 18, 19, 101, 107, 117, 128, 130195[195] C. Sporleder and A. Lascarides. Exploiting Linguistic Cues to ClassifyRhetorical Relations. In Proceedings of Recent Advances in NaturalLangauge Processing (RANLP), Bulgaria, 2005. ? pages 20, 106[196] C. Sporleder and A. Lascarides. Using Automatically Labelled Examplesto Classify Rhetorical Relations: An Assessment. Natural LanguageEngineering, 14(3):369?416, 2008. ISSN 1351-3249. ? pages 20, 105,106[197] M. Stede. The Potsdam Commentary Corpus. In Proceedings of theACL-04 Workshop on Discourse Annotation, Barcelona, 2004. ACL. ?pages 105, 140[198] M. Stede. Discourse Processing. Synthesis Lectures on Human LanguageTechnologies. Morgan And Claypool Publishers, 2011. ? pages 7, 19, 20,104, 128[199] M. Steyvers and T. Griffiths. Latent Semantic Analysis: A Road toMeaning, chapter Probabilistic Topic Models. Laurence Erlbaum, 2007. ?pages 51, 86[200] A. Stolcke, N. Coccaro, R. Bates, P. Taylor, C. Van Ess-Dykema, K. Ries,E. Shriberg, D. Jurafsky, R. Martin, and M. Meteer. Dialogue Act Modelingfor Automatic Tagging and Recognition of Conversational Speech.Computational Linguistics, 26:339?373, 2000. ? pages 23, 31, 147[201] R. Subba and B. Di-Eugenio. An Effective Discourse Parser that Uses RichLinguistic Information. In Proceedings of Human Language Technologies:The 2009 Annual Conference of the North American Chapter of theAssociation for Computational Linguistics, HLT-NAACL?09, pages566?574, Boulder, Colorado, 2009. ACL. ? 
pages 21, 103, 105, 108, 109,125, 126, 131, 136, 137, 139, 140[202] A. Subramanya and J. Bilmes. Semi-supervised learning with measurepropagation. J. Mach. Learn. Res., 12:3311?3370, Nov. 2011. ISSN1532-4435. ? pages 28[203] A. Subramanya, S. Petrov, and F. Pereira. Efficient graph-basedsemi-supervised learning of structured tagging models. In Proceedings ofthe 2010 Conference on Empirical Methods in Natural LanguageProcessing, EMNLP ?10, pages 167?176, Cambridge, Massachusetts,2010. Association for Computational Linguistics. ? pages 28196[204] C. Sutton and A. McCallum. An Introduction to Conditional RandomFields. Foundations and Trends in Machine Learning, 4(4):267?373, 2012.? pages 30, 116[205] C. Sutton, A. McCallum, and K. Rohanimanesh. Dynamic ConditionalRandom Fields: Factorized Probabilistic Models for Labeling andSegmenting Sequence Data. Journal of Machine Learning Research(JMLR), 8:693?723, 2007. ISSN 1532-4435. ? pages 34, 103, 112[206] M. Taboada. Discourse Markers as Signals (or Not) of RhetoricalRelations. Journal of Pragmatics, 38(4):567?592, 2006. ? pages 105[207] M. Taboada, J. Brooke, M. Tofiloski, K. Voll, and M. Stede. Lexicon-basedmethods for sentiment analysis. Comput. Linguist., 37(2):267?307, June2011. ISSN 0891-2017. ? pages 170[208] P. P. Talukdar and K. Crammer. New regularized algorithms fortransductive learning. In Proceedings of the European Conference onMachine Learning and Knowledge Discovery in Databases: Part II, ECMLPKDD ?09, pages 442?457, Bled, Slovenia, 2009. Springer-Verlag. ?pages 28, 29[209] S. Teufel and M. Moens. Summarizing scientific articles: experiments withrelevance and rhetorical status. Comput. Linguist., 28(4):409?445, Dec.2002. ISSN 0891-2017. ? pages 18, 170[210] M. Tofiloski, J. Brooke, and M. Taboada. A syntactic and lexical-baseddiscourse segmenter. In Proceedings of the ACL-IJCNLP 2009 ConferenceShort Papers, ACLShort ?09, pages 77?80, Stroudsburg, PA, USA, 2009.Association for Computational Linguistics. URLhttp://dl.acm.org/citation.cfm?id=1667583.1667609. ? pages 19[211] A. Tumasjan, T. O. Sprenger, P. G. Sandner, and I. M. Welpe. Predictingelections with twitter: What 140 characters reveal about political sentiment.In Proceedings of the Fourth International AAAI Conference on Weblogsand Social Media, pages 178?185, 2010. ? pages 170[212] P. D. Turney. Learning Algorithms for Keyphrase Extraction. InformationRetrieval, 2(4):303?336, May 2000. ISSN 1386-4564. ? pages 83[213] J. Ulrich, G. Murray, and G. Carenini. A Publicly Available AnnotatedCorpus for Supervised Email Summarization. In EMAIL-2008 Workshop,pages 428?435. AAAI, 2008. ? pages xiii, 12, 77, 153197[214] I. Varga, M. Sano, K. Torisawa, C. Hashimoto, K. Ohtake, T. Kawai, J.-H.Oh, and S. D. Saeger. Aid is Out There: Looking for Help from Tweetsduring a Large Scale Disaster. In Proceedings of the 51st Annual Meetingof the Association for Computational Linguistics, ACL ?13, Sofia, Bulgaria,2013. ACL. ? pages 1[215] S. Verberne, L. Boves, N. Oostdijk, and P. Coppen. EvaluatingDiscourse-based Answer Extraction for Why-question Answering. InProceedings of the 30th annual international ACM SIGIR conference onResearch and development in information retrieval, SIGIR?07, pages735?736, Amsterdam, The Netherlands, 2007. ACM. ? pages 18, 101, 170[216] P. Verna. The Blogosphere: Colliding with Social and Mainstream Media.eMarketer, 2010. URLhttp://www.emarketer.com/Reports/All/Emarketer 2000708.aspx. ? pages1[217] D. Vesset, B. McDonough, and M. Wardley. 
Worldwide business analyticssoftware 2010-2014 forecast and 2009 vendor shares. Technical report, idc,Stanford InfoLab, 2010. ? pages 2[218] N. Vliet and G. Redeker. Complex Sentences as Leaky Units in DiscourseParsing. In Proceedings of Constraints in Discourse, Agay-Saint Raphael,September 2011. ? pages 104, 126, 140[219] H. M. Wallach. Topic Modeling: Beyond Bag-of-Words. In Proceedings ofthe 23rd international conference on Machine learning, ICML ?06, pages977?984, Pittsburgh, Pennsylvania, 2006. ACM. ? pages 52[220] H. Wang, C. Wang, C. Zhai, and J. Han. Learning Online DiscussionStructures by Conditional Random Fields. In Proceedings of the 34thinternational ACM SIGIR conference on Research and development inInformation Retrieval, SIGIR ?11, pages 435?444, Beijing, China, 2011.ACM. ? pages 26, 48, 155[221] L. Wang and C. Cardie. Unsupervised Topic Modeling Approaches toDecision Summarization in Spoken Meetings. In Proceedings of the 13thAnnual Meeting of the Special Interest Group on Discourse and Dialogue,SIGDIAL ?12, pages 40?49, Seoul, South Korea, 2012. ACL. ? pages 168[222] L. Wang and C. Cardie. Domain-Independent Abstract Generation forFocused Meeting Summarization. In Proceedings of the 51st Annual198Meeting of the Association for Computational Linguistics, ACL?13, Sofia,Bulgaria, 2013. ACL. ? pages 169[223] L. Wang and D. W. Oard. Context-based Message Expansion forDisentanglement of Interleaved Text Conversations. In Proceedings ofHuman Language Technologies: The 2009 Annual Conference of the NorthAmerican Chapter of the Association for Computational Linguistics,NAACL ?09, pages 200?208, Boulder, Colorado, 2009. ACL. ? pages 26,47, 48[224] Y.-C. Wang, M. Joshi, W. Cohen, and C. Rose?. Recovering Implicit ThreadStructure in Newsgroup Style Conversations. In Proceedings of theInternational AAAI Conference on Weblogs and Social Media, ICWSM?08.AAAI, 2008. ? pages 26, 48[225] B. Webber. D-LTAG: Extending Lexicalized TAG to Discourse. CognitiveScience, 28(5):751?779, 2004. ? pages 17, 101[226] B. Webber, M. Egg, and V. Kordoni. Discourse structure and languagetechnology. Natural Language Engineering, 18:437?490, 10 2012. ISSN1469-8110. ? pages 2, 7[227] M. Wen and C. P. Rose. Understanding participant behavior trajectories inonline health support groups using automatic extraction methods. InProceedings of the 17th ACM international conference on Supportinggroup work, GROUP ?12, pages 179?188, Sanibel Island, Florida, USA,2012. ACM. ? pages 171[228] D. Widdows and B. Dorow. A graph model for unsupervised lexicalacquisition. In Proceedings of the 19th international conference onComputational linguistics - Volume 1, COLING ?02, pages 1?7, Taipei,Taiwan, 2002. Association for Computational Linguistics.doi:10.3115/1072228.1072342. URLhttp://dx.doi.org/10.3115/1072228.1072342. ? pages 28[229] Y. Wilks. Artificial Companions as a New Kind of Interface to the FutureInternet. OII Research Report No. 13, 2006. ? pages 147[230] F. Wolf and E. Gibson. Representing Discourse Coherence: ACorpus-Based Study. Computational Linguistics, 31:249?288, June 2005.ISSN 0891-2017. ? pages 143[231] D. H. Wolpert. The lack of a priori distinctions between learningalgorithms. Neural Comput., 8(7):1341?1390, Oct. 1996. ? pages 31199[232] T. Wu, F. M. Khan, T. A. Fisher, L. A. Shuler, and W. M. Pottenger. PostingAct Tagging Using Transformation-Based Learning. In Proceedings of theWorkshop on Foundations of Data Mining and Discovery, IEEEInternational Conference on Data Mining, 2002. ? pages 23, 147[233] J. 
Yamron, I. Carp, L. Gillick, S. Lowe, and P. van Mulbregt. A hidden markov model approach to text segmentation and event tracking. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 333–336, 1998. doi:10.1109/ICASSP.1998.674435. → pages 9, 10

[234] T. Zesch and I. Gurevych. Approximate Matching for Evaluating Keyphrase Extraction. In Proceedings of the 7th International Conference on Recent Advances in Natural Language Processing, RANLP'09, pages 484–489, Borovets, Bulgaria, 2009. → pages 83

[235] R. Zhang, D. Gao, and W. Li. Towards Scalable Speech Act Recognition in Twitter: Tackling Insufficient Training Data. In Proceedings of the Workshop on Semantic Analysis in Social Media, pages 18–27, Avignon, France, 2012. ACL. → pages 152

[236] W. X. Zhao, J. Jiang, J. He, Y. Song, P. Achananuparp, E.-P. Lim, and X. Li. Topical Keyphrase Extraction from Twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT'11, pages 379–388, Portland, Oregon, 2011. ACL. → pages 46

[237] W. X. Zhao, J. Jiang, J. Weng, J. He, E.-P. Lim, H. Yan, and X. Li. Comparing Twitter and Traditional Media using Topic Models. In Proceedings of the 33rd European Conference on Advances in Information Retrieval, ECIR'11, pages 338–349, Dublin, Ireland, 2011. Springer-Verlag. → pages 46

[238] D. Zhou, S. A. Orshanskiy, H. Zha, and C. L. Giles. Co-ranking Authors and Documents in a Heterogeneous Network. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, ICDM'07, pages 739–744, Washington, DC, USA, 2007. IEEE Computer Society. → pages 17, 29, 73, 74

Appendix A

Supporting Materials

A.1 Metrics for Topic Segmentation

A.1.1 One-to-One Metric

Consider the two annotations of the same conversation having 10 sentences (denoted by colored boxes) in Figure A.1(a). In each annotation, the topics are distinguished by different colors. For example, the model output has four topics, whereas the human annotation has three topics. To compute one-to-one accuracy, we take the model output and map its segments optimally (by computing the optimal max-weight bipartite matching) to the segments of the gold-standard human annotation. For example, the red segment in the model output is mapped to the green segment in the human annotation. We transform the model output based on this mapping and compute the percentage of overlap as the one-to-one accuracy. In our example, seven out of ten sentences overlap; therefore, the one-to-one accuracy is 70%.

Figure A.1: Computing one-to-one accuracy.
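The one-to-one accuracy can be made concrete with a small code sketch. The following is a minimal illustration (not taken from the thesis), assuming each annotation is represented as a list of topic ids, one per sentence, and using SciPy's linear_sum_assignment to compute the optimal max-weight bipartite matching between model and gold segments; the function name one_to_one_accuracy and the toy data are ours.

from collections import Counter

import numpy as np
from scipy.optimize import linear_sum_assignment


def one_to_one_accuracy(model, gold):
    """One-to-one accuracy between two annotations of the same conversation.

    `model` and `gold` are equal-length sequences of topic ids (one per
    sentence).  Model topics are mapped to gold topics with an optimal
    max-weight bipartite matching; the score is the fraction of sentences
    whose mapped model topic agrees with the gold topic.
    """
    assert len(model) == len(gold)
    m_ids, g_ids = sorted(set(model)), sorted(set(gold))
    # overlap[i, j] = number of sentences labeled m_ids[i] by the model
    # and g_ids[j] by the human annotator.
    overlap = np.zeros((len(m_ids), len(g_ids)), dtype=int)
    for (m, g), c in Counter(zip(model, gold)).items():
        overlap[m_ids.index(m), g_ids.index(g)] = c
    # linear_sum_assignment minimizes cost, so negate to maximize overlap.
    rows, cols = linear_sum_assignment(-overlap)
    return overlap[rows, cols].sum() / len(gold)


# A toy example in the spirit of Figure A.1 (not the actual figure data):
# 10 sentences, the model finds 4 topics, the human finds 3, and 7 of the
# 10 sentences overlap after the optimal mapping.
model_out = [1, 1, 1, 1, 2, 2, 3, 3, 4, 4]
human_ann = ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c']
print(one_to_one_accuracy(model_out, human_ann))  # 0.7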
A.1.2 Loc_k Metric

Consider the model output (at the left most column) and the human annotation (at the right most column) of the same conversation having 5 sentences (denoted by colored boxes) in Figure A.2. Similar to Figure A.1, the topics in an annotation are distinguished using different colors. Suppose we want to measure the loc3 score for the fifth sentence (marked with yellow arrows at the bottom of the two annotations). In each annotation, we look at the previous 3 sentences and transform them based on whether they have the same or a different topic. For example, in the model output one of the previous three sentences is the same (red), and in the human annotation two of the previous three sentences are the same (green), when compared with the sentence under consideration. In the transformed annotations, same topics are denoted by gray boxes and different topics are denoted by black boxes. We compute loc3 by measuring the overlap of the same-or-different judgments in the 3-sentence window. In our example, two of three overlap; therefore, the loc3 agreement is 66.6%.

Figure A.2: Computing loc3 accuracy.
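In the same spirit, here is an illustrative loc_k computation (again not from the thesis), with k = 3 as in the example above and the same list-of-topic-ids representation. The excerpt above only shows the judgment for a single sentence; how the per-pair judgments are aggregated over a whole conversation is an assumption here, and the sketch simply pools all of them.

def loc_k_agreement(model, gold, k=3):
    """Loc_k agreement between two annotations of the same conversation.

    For each sentence i and each of the (up to) k preceding sentences j,
    both annotations judge whether i and j have the same topic or different
    topics; the score is the fraction of these judgments on which the two
    annotations agree.
    """
    assert len(model) == len(gold)
    agree, total = 0, 0
    for i in range(len(model)):
        for j in range(max(0, i - k), i):
            same_in_model = model[i] == model[j]
            same_in_gold = gold[i] == gold[j]
            agree += int(same_in_model == same_in_gold)
            total += 1
    return agree / total if total else 1.0

# For the fifth sentence in the Figure A.2 example, two of the three
# same-or-different judgments in its 3-sentence window agree, i.e. 2/3 = 66.6%.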
A.2 Annotation Manual for Topic Segmentation and Labeling

A.2.1 Instructions for Finding Topics in Emails

Abstract

This document is a manual that instructs annotators on how to find topics in email conversations. After introducing some general ideas on the task, it explains how to perform the annotation step by step.

Introduction

The ultimate goal of this research is to be able to automatically generate summaries of email conversations. Finding topics is the first step towards this goal. It involves clustering the sentences of a conversation into a set of clusters which reveal the topics discussed in the conversation. Our assumption is that knowing the topic structure of conversations simplifies extraction of information and summarization.

A. Task Overview

The task of finding topics can be divided into two subtasks as described below:

A.1. Finding the discussed topics

You will be given a set of email conversations (or threads). Each conversation is associated with a human written summary that will give you a brief overview of the corresponding conversation. Read carefully the conversation first, then read the summary and make sure that you understand the conversation.

At this point, you should be able to find the underlying topics/issues discussed in the conversation. Our definition of "topic" is something about which the participant(s) discuss or argue or express their opinions. For example, an email thread about an upcoming meeting may contain a discussion about the "location and schedule", another discussion about the "meeting agenda", etc.

Here, you need to list the topics in the following format:

< Topic number = X, a short description of the topic >

For example,

< Topic number = 1, extending the meeting duration >
< Topic number = 2, scheduling the meeting >

The short description should provide a high-level overview on that topic. This can usually be based on a few keywords from the discussion, and needs to be detailed enough that someone else could figure out later what the topic was. In order to come up with a segment description, try to fill in the following statement with a specific phrase:

In this topic, people talk about ________.

For example, an email thread about a meeting could have topic descriptions such as "extending the meeting", "scheduling the meeting", "meeting agenda", etc. An email thread about arranging a conference can have topics such as "location and time", "registration", "food menu", "workshops", etc. An email thread about building a webpage that will contain a list of people working in a particular research area (e.g., User Modeling) all over the world, can have discussions about "people who are working", "creating a map", "design of the map" and so on.

The target number of topics for each conversation will not be given to you in advance, so you will need to find as many topics as you see fit and natural to convey the overall content structure of the conversation. You might be expecting us to tell you exactly how many topics each thread should have, but the truth is it is subjective and varies considerably. In theory each sentence says something different from the previous sentence and therefore it should be possible to mark a new topic, but, as we mentioned previously, we don't want this level of detail. There could be threads that discuss one topic extensively, and others that discuss a very large number of topics, all briefly. There is also no optimal length for a topic; there is a fine, yet subjective balance as to how many topics one could detect in a single thread, before it all seems too fragmented. For this reason, you should divide the thread into topics in the way that you find most natural. We will provide you with some examples to help you initially.

Sometimes you won't be sure whether to mark part of a conversation as one topic or two, because there are really two topics but they are related to each other. For instance, if a group was talking about an upcoming meeting, they might talk first about the "location" and then they might talk about the "time". If they talked about these two things separately, then they would be separate topics. However, if they discuss both location and time at the same time then the topic description should be something like "location and time".

A.2. Assigning topics to sentences

Here the task is, for each sentence (which is separated by a line break) in the thread, you have to identify the most appropriate topic to which it belongs. In general, one sentence should be labeled with only one topic; however, if you find sentences that you think cover more than one topic, please do label them with all the relevant topics. Again, if you find any sentence that doesn't fit into any topic, just label those as the predefined topic "OFF-TOPIC".

Wherever appropriate you should also make use of 2 other predefined topic labels: "INTRO" and "END". INTRO (e.g., "hi", "hello X", etc.)
signifies the section (usually at the beginning) of an email that people use to begin their email. Likewise, END (e.g., "Cheers", "Best", etc.) signifies the section (usually at the end) that people use to end their email.

In some emails you will find people quote (usually preceded by a > sign) from other's email(s). You do not need to label the quoted texts if you have already labelled these texts in any of the previous emails. However, you may find some quotes that are new in the current email. You should label those. These quotes actually come from emails not in this thread (also called hidden emails). An email thread may contain three to ten emails and it may take up to 25 minutes to find the topics in a thread, so allocate enough time to be able to do so without interruptions.

B. Examples

To help you through the process, in this section we have included two example annotations [1]. Please study these carefully. Here, we use tab or indentation to give the thread view. If you are facing any problem with this view and prefer to have an annotation tool (i.e., software that can be used to help humans to annotate), please let us know. If you have any questions/concerns while you are performing the annotation, do not hesitate to ask. Thanks for your help. Good luck!

[1] The examples are not shown here to save space.

A.2.2 Instructions for Finding Topics in Blogs

Abstract

This document is a manual that instructs annotators on how to find topics in blog conversations and summarize them. After introducing some general ideas on the task, it explains how to perform the annotation step by step.

Introduction

The ultimate goal of this research is to be able to automatically generate summaries of blog conversations. Finding topics is the first step towards this goal. It involves clustering the sentences of a blog discussion into a set of clusters which reveal the topics discussed in the conversation. Our assumption is that knowing the topic structure of conversations simplifies extraction of information and summarization.

A. Task Overview

The task of finding topics and summarizing a blog discussion can be divided into four different subtasks which are described below:

A.1. Writing a short summary (~3 sentences) of each thread:

You will be given a set of blog discussions from Slashdot [2]. Each discussion consists of an article followed by a number of threads and single comments. For example see the sample blogs provided in pages 4 and 8 (ignore texts in color).

A thread is a sequence of comments organized hierarchically according to their "reply-to" relation. A single comment is a comment to the article nobody replied to. The sentences in a comment were separated using an automatic segmentation tool which is not perfect in many cases (don't worry about it). At the end of each such sentence we have put a space to enter a number (i.e., Topic id ___) that will be required in step 3 (as described in A.3).

[2] http://slashdot.org/

In this initial A.1 step we ask you to first read through the article and then read carefully the threads and the single comments (ignoring [Topic id ___]). For each thread, once you finish reading it, write a short summary (~3 sentences) of the thread in the space provided just below the respective thread. Your summary should be short but as informative as possible. We provide examples of what we consider good and bad short summaries in example 2 (pages 11-19).
You need to read the single comments but do not need to summarize them.

To ease the process of finding topics in step 2 (Section A.2) and labelling them in step 3 (Section A.3) you can keep notes in the provided scratch paper as you discover new topics while reading through the article, threads and single comments in this step. This point will become clear once you read Sections A.2 and A.3. Please ask the experimenter if you have any question about this step.

A.2. Finding the discussed topics:

At this point, as you will have read the whole blog conversation and summarized all the threads, you should have a pretty good understanding of the blog content and should be able to find the underlying topics covered in the whole discussion. Our definition of "topic" is something about which the participant(s) discuss or argue or express their opinions. For example, a discussion about a new iPhone may contain topics such as "date of arrival in market", "touch screen", "music application", "charging and power", "outlook", "industrial espionage", etc. Note that even though a thread may have a single title, the sentences may discuss different topics. Even the sentences in the same comment may discuss different topics.

Here, you need to list the topics discussed in the following format:

< Topic number = X, a short description of the topic >

For example,

< Topic id 1: date of arrival >
< Topic id 2: touch screen >
< Topic id 3: music application >

And so on.

The short description should provide a high-level overview on that topic. This can usually be based on a few keywords from the discussion, but needs to be detailed enough that someone else could figure out later what the topic was. In order to come up with a segment description, try to fill in the following statement with a specific phrase:

Here, people talked about ________.

For example, a discussion about different issues of a country may include "security", "economy", "personnel", "industries and companies", "foreign policy", etc. A discussion about a new scientific contribution (e.g., a proof of P is not equal to NP) may have topics such as "the inventor", "the contribution itself", "objections from other researchers", "new ideas generated by the contribution", "possible applications", "implications for theoretical computer science" and so on. See B.1 and B.2 on pages 4 and 8 for examples of topics identified in sample blogs.

The target number of topics for each discussion will not be given to you in advance, so you will need to find as many topics as you see fit and natural to convey the overall content structure of the discussion. You might be expecting us to tell you exactly how many topics each discussion should have, but the truth is it is rather subjective and may vary considerably. In theory each sentence says something different from the previous sentence and therefore it should be possible to mark a new topic, but, as mentioned before, we don't want this level of detail. There could be conversations that discuss one or two topics extensively, and others that handle a very large number of topics, all briefly. There is also no optimal length for a topic; there is a fine, yet subjective balance as to how many topics one could detect in a blog discussion, before it all seems too fragmented. For this reason, you should divide the discussion into topics in the way that you find most natural.
We will provide you with some good and bad examples to help you initially.

Sometimes you won't be sure whether to mark part of a discussion as one topic or two, because there are really two topics but they are related to each other. For instance, if a group was talking about a country's issues, they might talk first about the "security" and then they might talk about the "economy". If they talked about these two things separately, then they would be separate topics. However, if they discuss both "security" and "economy" at the same time, possibly exploring how the two can be related, then a more appropriate topic description could be something like "security and economy". Studying Example 1 (Section B.1) and 2 (Section B.2) will help you with the task.

Ask the experimenter if you have any question at this point.

A.3. Assigning topics to sentences:

Now that you have identified the topics covered in the blog, for each sentence (which is separated by a line break) in the conversation, you have to identify the most appropriate topic to which it belongs. In general, one sentence should be labelled with only one topic; however, if you find a sentence that you think covers two topics almost equally (i.e., 50-50, 40-60, 60-40), please do label it with the two relevant topics and also mention the percentage of coverage. Again, if you find any sentence that is not related to the original article, just label those as the predefined label "OFF-TOPIC". In step 1 (Section A.1), when summarizing the conversations, if you find all the sentences in that conversation are OFF-TOPIC, you can write the summary something like: "the whole discussion is off the topic".

Wherever appropriate you should also use two other predefined topic labels: "INTRO" and "END". INTRO (e.g., "hi", "hello X", etc.) signifies the section (usually at the beginning) of a comment that people use to begin their contribution. Likewise, END (e.g., "Cheers", "Best", etc.) signifies the section (usually at the end) that people use to end their comment.

In some comments you will find people quote (usually preceded by a > sign) from other's comment(s). You do not need to label the quoted texts if you have already labelled these texts in any of the previous comments.

Finally, while you are annotating the sentences, if you feel you need to revise your topic list (e.g., adding a new topic, renaming an existing one), please do not hesitate to do so.

At this point please ask any questions you have to the experimenter.

A.4. Writing a 250 words summary of the whole conversation:

As a final step you will author a single high level 250 words summary for the whole blog conversation. The summary should be around 250 words. It is therefore critical to capture the important information of the discussion in a concise manner. For example please see the 250 words summaries of the blogs on pages 4 and 8.

A blog conversation may contain thirty (30) to one hundred (100) comments and depending on the length and the number of comments it may take 40-60 minutes to complete the above four tasks, so allocate enough time to be able to do so without interruptions.

B. Examples:

To help you through the process, in this section we have included two example annotations [3]. Please study these very carefully and ask questions if anything is unclear. Here, we use tab/indentation to give the conversations a thread view. Black represents the original content, Blue represents annotation, and Orange represents our comments.

[3] The examples are not shown here to save space.
If you are facing any problem with this view, please let us know.

If you have any questions/concerns while you are performing the annotation, do not hesitate to ask. Thanks for your help. Good luck!

A.3 EM for HMM+Mix model

The expected complete data log likelihood can be written as:

Q(\theta, \theta^{old}) = \sum_{k=1}^{K} E[N^1_k] \log \pi_k + \sum_{j=1}^{K} \sum_{k=1}^{K} E[N_{jk}] \log A_{jk} + \sum_{k=1}^{K} \sum_{m=1}^{M} E[N_{km}] \log C_{km} + \sum_{k=1}^{K} \sum_{m=1}^{M} \sum_{l} E[N_{kml}] \log B_{kml}    (A.1)

where the expected counts are given by:

E[N^1_k] = \sum_{n=1}^{N} p(D_{n,1} = k \mid X_n, \theta^{old})    (A.2)

E[N_{jk}] = \sum_{n=1}^{N} \sum_{i=1}^{T_n} p(D_{n,i} = j, D_{n,i+1} = k \mid X_n, \theta^{old})    (A.3)

E[N_{km}] = \sum_{n=1}^{N} \sum_{i=1}^{T_n} p(D_{n,i} = k, M_{n,i} = m \mid X_n, \theta^{old})    (A.4)

E[N_{kml}] = \sum_{n=1}^{N} \sum_{i=1}^{T_n} p(D_{n,i} = k, M_{n,i} = m \mid X_n, \theta^{old}) \, I(X_{n,i} = l)    (A.5)

A.3.1 E step:

In the E step, we compute the expected sufficient statistics mentioned above. Specifically, by running the forwards-backwards algorithm on each sequence we get the smoothed node and edge marginals:

\gamma_{n,i}(j) := p(D_{n,i} = j \mid X_{n,1:T_n}, \theta)    (A.6)

\xi_{n,i}(j, k) := p(D_{n,i} = j, D_{n,i+1} = k \mid X_{n,1:T_n}, \theta)    (A.7)

Rabiner [168] (page 267) shows that the joint probability

\gamma_{n,i}(j, k) := p(D_{n,i} = j, M_{n,i} = k \mid X_{n,1:T_n}, \theta) = \gamma_{n,i}(j) \, \frac{p(M_{n,i} = k \mid D_{n,i} = j) \, p(X_{n,i} \mid D_{n,i} = j, M_{n,i} = k)}{\sum_{m} p(M_{n,i} = m \mid D_{n,i} = j) \, p(X_{n,i} \mid D_{n,i} = j, M_{n,i} = m)}    (A.8)

A.3.2 M step:

The Maximum Likelihood Estimates (MLE) of the parameters are given by:

\hat{\pi}_k = \frac{E[N^1_k]}{N}    (A.9)

\hat{A}_{jk} = \frac{E[N_{jk}]}{\sum_{k'} E[N_{jk'}]}    (A.10)

\hat{C}_{km} = \frac{E[N_{km}]}{\sum_{m'} E[N_{km'}]}    (A.11)

\hat{B}_{kml} = \frac{E[N_{kml}]}{E[N_{km}]}    (A.12)
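To make the updates above concrete, the following is a compact NumPy sketch (not the thesis implementation) of the mixture responsibilities in Eq. (A.8) and the M-step updates in Eqs. (A.9)–(A.12). It assumes the smoothed node marginals (A.6) and edge marginals (A.7) have already been computed with the forwards-backwards algorithm; the function names and the list-of-arrays data layout are ours.

import numpy as np


def mixture_responsibilities(gamma_i, C, B, x_i):
    """Eq. (A.8): p(D_i = j, M_i = k | X) at one position.

    gamma_i : (K,)      smoothed state marginal at position i, Eq. (A.6)
    C       : (K, M)    p(M = m | D = k)
    B       : (K, M, V) p(word = l | D = k, M = m)
    x_i     : int       observed word id at position i
    """
    lik = C * B[:, :, x_i]                    # p(M=m|D=k) p(x_i|D=k,M=m)
    lik /= lik.sum(axis=1, keepdims=True)     # normalize over mixture components
    return gamma_i[:, None] * lik             # shape (K, M)


def m_step(gamma, xi, gamma_mix, X, V):
    """Eqs. (A.9)-(A.12), given the expected sufficient statistics.

    gamma     : list of (T_n, K) arrays       node marginals, Eq. (A.6)
    xi        : list of (T_n-1, K, K) arrays  edge marginals, Eq. (A.7)
    gamma_mix : list of (T_n, K, M) arrays    joint posteriors from Eq. (A.8)
    X         : list of int arrays (word ids in [0, V)), one per sequence
    """
    K, M = gamma_mix[0].shape[1:]
    N1k = sum(g[0] for g in gamma)                  # E[N^1_k], Eq. (A.2)
    Njk = sum(x.sum(axis=0) for x in xi)            # E[N_jk],  Eq. (A.3)
    Nkm = sum(gm.sum(axis=0) for gm in gamma_mix)   # E[N_km],  Eq. (A.4)
    Nkml = np.zeros((K, M, V))                      # E[N_kml], Eq. (A.5)
    for gm, x in zip(gamma_mix, X):
        for i, l in enumerate(x):
            Nkml[:, :, l] += gm[i]
    pi = N1k / len(gamma)                           # Eq. (A.9)
    A = Njk / Njk.sum(axis=1, keepdims=True)        # Eq. (A.10)
    C = Nkm / Nkm.sum(axis=1, keepdims=True)        # Eq. (A.11)
    B = Nkml / Nkm[:, :, None]                      # Eq. (A.12)
    return pi, A, C, B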
