UBC Theses and Dissertations
Exploring machine learning design options in discourse parsing Liao, Weicong 2015


Exploring Machine Learning Design Options in Discourse Parsing

by

Weicong Liao
B.Sc., Zhejiang University, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate and Postdoctoral Studies (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

April 2015

© Weicong Liao 2015

Abstract

Discourse parsing has recently attracted increasing interest among researchers, since it is very helpful for text understanding, sentiment analysis and other NLP tasks. In a well-written text, authors often use discourse to better organize the text, and sentences (or clauses) tend to interact with neighboring sentences (or clauses). Each piece of text locally exhibits a finer discourse structure called rhetorical structure, and a document can be organized into a discourse tree (this process is called discourse parsing), which seeks to capture the discourse structure and logically binds the sentences (or clauses) together.

However, useful as discourse parsing is, and although intra-sentential discourse parsing already achieves high performance, multi-sentential discourse parsing remains a big challenge in terms of both accuracy and efficiency. In addition, machine learning techniques have proven successful in many NLP tasks, including discourse parsing. Thus, in this thesis, we try to enhance the performance (e.g., accuracy, efficiency) of discourse parsing by using machine learning techniques. To this aim, we propose a novel two-step discourse parsing system, which first builds a discourse tree for a given text by applying optimal probabilistic parsing to probabilities inferred from learned conditional random fields (CRFs), and then uses learned log-linear models to tag all discourse relations onto the nodes of the discourse tree.

We analyze different aspects of the problem (e.g., sequential vs. non-sequential model, greedy vs. optimal parsing, joint vs. separate model) and discuss their trade-offs. We also carried out extensive experiments to study the usefulness of different feature families and over-fitting. We find that the most effective feature sets for different tasks differ: part-of-speech (POS) and context features are the most effective for intra- and multi-sentential structure prediction respectively, while ngram features are the most effective for both intra- and multi-sentential relation labeling. Moreover, over-fitting does occur in our experiments, so proper regularization is needed. Our final results show that our system achieves state-of-the-art F-scores of 86.2, 72.2 and 59.2 in structure, nuclearity and relation, and that it is more efficient than Joty's system [19] (training: 40 times faster; testing: 3 times faster).

Preface

This work is motivated by the past research of Shafiq Joty, Giuseppe Carenini, Raymond Ng and Yashar Mehdad on discourse parsing for the RST-DT corpus from the Wall Street Journal. The present writer conducted all the experiments and wrote most of the manuscript for this thesis. Raymond Ng and Giuseppe Carenini were the supervisory authors of this project and were involved throughout the project in concept formation and manuscript editing.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgments
1 Introduction
  1.1 Discourse Parsing
  1.2 Motivation and Contributions
  1.3 Outline
2 Discourse Parsing and Related Work
  2.1 Discourse Analysis and Rhetorical Structure Theory
  2.2 Log-linear Models and Conditional Random Fields
  2.3 Existing Discourse Parsers
  2.4 Model Reasoning and Choice
3 Our Discourse Parsing Framework
  3.1 System Overview
  3.2 Structure Prediction
  3.3 Log-linear Relation Labeling Model
  3.4 Features
4 Experiments and Analysis
  4.1 RST-DT Corpus
  4.2 Evaluation
  4.3 Experiments and Results Analysis
5 Conclusion and Future Work
Bibliography
Appendices
A Complete Discourse Relations

List of Tables

1.1 Highlights of three most recent research studies on discourse parsing
2.1 The set of coherence relations defined by Mann and Thompson (1988)
2.2 Comparison of three most recent discourse parsers
2.3 Intra-sentential parsing reasoning
2.4 Multi-sentential parsing reasoning
2.5 Result comparison of three recent discourse parsing systems
3.1 Structural features
3.2 Semantic features
4.1 18 rhetorical relation classes in RST-DT
4.2 Illustration for our pair-based evaluation approach on measuring correctness of the human-annotated and system-generated discourse trees in Figure 4.2 (a) and (b)
4.3 10-fold CV structure prediction results for intra-sentential parsing
4.4 Level-wise F-score
4.5 Comparison on system accuracy performance
4.6 Running time performance for our system
4.7 Comparison of running time performance on different systems
A.1 Definitions of Subject Matter Relations: Part I
A.2 Definitions of Subject Matter Relations: Part II
A.3 Definitions of Presentational Relations
A.4 Definitions of Multinuclear Relations

List of Figures

1.1 Discourse tree example for a sentence that has three EDUs
2.1 RST tree (discourse tree) examples
2.2 Discourse tree example for reasoning
3.1 Pipeline of our discourse parsing framework
3.2 Intra-sentential Linear Chain CRF Structure Model
3.3 Multi-sentential CRF Structure Model
3.4 Log-linear Relation Labeling Model
4.1 Distribution of most frequent rhetorical relations at the intra- and multi-sentential levels
4.2 Illustration for the constituent-based evaluation approach taken from [19]: (a) the human-annotated discourse tree; (b) the system-generated discourse tree; (c) measuring precision (P) and recall (R) of the discourse parser
4.3 Example for the first type of error in measuring relation labeling correctness: (a) the human-annotated discourse tree; (b) the system-generated discourse tree; (c) measuring precision (P) and recall (R) of the discourse parser
4.4 Example for the second type of error in measuring relation labeling correctness: (a) the human-annotated discourse tree; (b) the system-generated discourse tree; (c) measuring precision (P) and recall (R) of the discourse parser
4.5 Training vs. test F-score in intra-sentential structure prediction with default L2 (0.1): (a) using text organization and POS features (3846 parameters in total); (b) using text organization, POS and ngram features (48846 parameters in total)
4.6 Feature exploration results on structure prediction for intra-sentential parsing
4.7 Feature exploration results on structure prediction for multi-sentential parsing
4.8 Relation labeling test F-score on different feature sets for: (a) intra-sentential; (b) complete parser
4.9 Training vs. test pair-based nuclearity&relation F-score in relation labeling with default L2 (0.1) for: (a) intra-sentential; (b) multi-sentential
4.10 Training vs. test pair-based nuclearity&relation F-score in relation labeling using text organization and selected head and tail ngram features with default L2 (0.1) for: (a) intra-sentential; (b) multi-sentential
4.11 Feature exploration results on relation labeling for intra-sentential parsing
4.12 Feature exploration results on relation labeling for multi-sentential parsing
4.13 Test pair-based nuclearity&relation F-score on top-k relations in relation labeling

List of Abbreviations

BOW    Bag of Words
CRF    Conditional Random Field
CYK    Cocke–Younger–Kasami algorithm
DT     Discourse Tree
EDU    Elementary Discourse Unit
i.i.d. independent and identically distributed
NER    Named Entity Recognition
NLP    Natural Language Processing
POS    Part-of-Speech
RST    Rhetorical Structure Theory
RST-DT RST Discourse Treebank
SVM    Support Vector Machine
UGM    Uni-variate Graphical Model
Acknowledgments

I would like to offer my sincere gratitude to my advisers, Dr. Raymond T. Ng and Dr. Giuseppe Carenini, who both provided me with a perfect balance of guidance and freedom that allowed me to develop and pursue this thesis. During our discussions they both shared useful insights that drove me towards this research.

I would also like to express my gratitude to Dr. Shafiq Joty, who generously made his code publicly available, and to both Dr. Shafiq Joty and Dr. Shima Gerani for their help in running Dr. Shafiq Joty's code.

Last but not least, I would like to thank my family for their unconditional love, encouragement and support during all my studies.

Chapter 1

Introduction

In the past decade, the rapid growth of the Internet has led to an information explosion, which in turn has led to the rise of the term "big data". The "big data" challenge includes analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy violations for large and complex data sets. All kinds of data, such as images, videos, audio and text in different areas, are publicly available to all Internet users across the world. So scientists and researchers have started to think about questions like:

• Can we retrieve potentially valuable data and store them efficiently?
• How can we discover useful patterns from massive data sets, which may contain noisy or incorrect information?
• Can we do knowledge discovery accurately and efficiently without much manual labor?

All these questions have led to the rapid expansion of research areas and technologies such as information retrieval, data mining, machine learning, computer vision and natural language processing. In this thesis, we mainly focus our study on two of these areas: natural language processing and machine learning. More specifically, we focus on a text data problem, namely discourse parsing, which belongs to the domain of natural language processing, with the help of some well-known machine learning techniques.

1.1 Discourse Parsing

In [19], Joty et al. proposed the following:

"A well-written text is not merely a sequence of independent and isolated sentences (or clauses), but instead a sequence of structured and related sentences (or clauses), where the meaning of a sentence (or clause) relates to the previous and the following ones."

In other words, each sentence (or clause) follows smoothly from the ones before and leads into the ones after. For instance, consider the following two examples from [17]:

• "John took a train from Paris to Istanbul. He has family there."
• "John took a train from Paris to Istanbul. He likes spinach."

Readers find it easy to understand the first example, in which the second sentence provides background for the first sentence, while most of them have problems understanding the second example and consider it "incoherent". Most of the time, authors ensure a consistent coherence structure to make their text interpretable and logical (e.g., authors prefer writing text like the first example above rather than the second). This is the foundation of discourse analysis.

In discourse analysis, the goal is to uncover this kind of coherence structure underneath the text.
Several formal discourse theories have been proposed to describe this kind of coherence structure, by Mann and Thompson (1988), Martin (1992), Asher and Lascarides (2003), Webber (2004) and Danlos (2009) [51, 31, 1, 50, 9]. The most popular one, Rhetorical Structure Theory (RST), proposed by Mann and Thompson (1988) [51], represents texts by labeled hierarchical structures, called discourse trees (DTs), in which leaves correspond to elementary discourse units¹, internal nodes correspond to contiguous text spans (discourse spans), and connections between nodes correspond to discourse relations² between these text spans. For illustration, Figure 1.1 shows a discourse tree example for the sentence "The bank also says, it will use its network to channel investigations" from [40], which has three EDUs. More complex examples of discourse trees will be presented and explained in section 2.1. The process of generating a DT is called discourse parsing.

[Figure 1.1: Discourse tree example for a sentence that has three EDUs]

¹ EDUs are the basic unit we consider in discourse analysis and discourse parsing. As we will see in more detail in section 2.1, these typically correspond to clauses.
² We use the terms "discourse relation", "rhetorical relation" and "coherence relation" interchangeably in this thesis.

One might ask why discourse parsing is useful. Imagine there are thousands of reviews for a camera online and we want to do some textual analysis, like sentiment analysis, on how the users like or dislike the camera. If we treat each review as a bag of words (BOW), we may lose most of the structural and order information of the review. For example, if we use the BOW model to analyze this review: "It has great appearance but I don't like it since it has bad lens", we may get a list of words like "great", "don't like" and "bad". From these words, we just know that the user has mixed opinions (both good and bad) about the camera, but we cannot determine his opinion at a finer granularity. However, with discourse parsing, we can identify the coherence structure of this review: the first part of the sentence, "It has great appearance", is contradictory to the major part of the sentence, "I don't like it"; and the last part of the sentence, "it has bad lens", is the reason for the major part of the sentence, "I don't like it". In the end, we have a clearer picture of the user's opinion: he does not like the camera. As a matter of fact, the coherence structure we identify by discourse parsing has been shown to be beneficial for many Natural Language Processing (NLP) applications, including:

• Text summarization and compression: Louis, Joshi, and Nenkova (2010), Marcu (2000), Daumé and Marcu (2002), Sporleder and Lapata (2005) [26, 28, 29, 41]
• Text generation: Prasad et al. (2005) [37]
• Machine translation evaluation: Guzman et al. (2014) [14]
• Sentiment analysis: Somasundaran (2010), Lazaridou, Titov, and Sporleder (2013), Shima et al. (2014) [39, 25, 13]
• Information extraction: Teufel and Moens (2002), Maslennikov and Chua (2007) [46, 32]
• Question answering: Verberne et al. (2007) [48]

1.2 Motivation and Contributions

As discussed above, we can find many applications related to discourse parsing. However, manually building discourse trees for given texts is not practical, since it takes too much human effort. Thus, many researchers have started to build automatic systems for discourse parsing.
Conventionally, there are two major sub-tasks related to discourse parsing: (1) discourse segmentation, which takes raw text as input and segments it into EDUs, and (2) tree building, which takes EDUs as input and builds a discourse tree for them, representing the discourse structure and relations of the text. The first sub-task, discourse segmentation, is already close to the performance of human annotation, with state-of-the-art accuracy above 90% [20]. However, the second sub-task, tree building, is still far from perfect in terms of both accuracy and efficiency, especially at the multi-sentential level, which deals with many sentences, compared to the intra-sentential level, which deals with only one sentence. Thus, researchers have recently focused on the second sub-task; we also focus on the second sub-task and use the terms "discourse parsing" and "tree building" interchangeably in this thesis.

Some of the most recent research studies have shown substantial improvements on both intra- and multi-sentential discourse parsing. For example, Joty et al. [21] proposed a framework that uses CRF-based joint models and an optimal parsing strategy. This framework is the first to apply the popular and successful (in NLP) CRF machine learning algorithm proposed by Lafferty et al. [24], and its variation proposed by Sutton et al. [45], to discourse parsing, and it achieves relatively high accuracy in both intra- and multi-sentential parsing. However, the major bottleneck of this framework is its inefficiency, due to the fact that it uses a complex joint model for both structure and relation, which makes inference slow. Furthermore, it uses optimal parsing, which is O(n^3), where n is the number of sentences in the document. In this thesis, we argue that it is not necessary to model structure and relation jointly, because the joint model actually does not provide extra predictive power. The reason for this will be discussed in section 2.4.

More recently, similar work by Feng et al. [11] achieved linear time complexity in discourse parsing by adopting bottom-up greedy parsing and separating structure prediction from the relation labeling task. Their careful design involving constraints and post-editing, as well as the usage of linear chain CRFs, achieves good accuracy. However, a linear chain CRF is not only slow in terms of training and inference, but also not necessary for relation labeling. The reason for this will be discussed in section 2.4. Moreover, limiting the length of their linear chain CRF in multi-sentential parsing for scalability purposes seems ad hoc.

In another recent work by Ji and Eisenstein [18], the tree building process is formalized as a sequence of decision problems with the help of a transition-based shift-reduce parser. During training, their parser jointly learns a linear transformation from the BOW lexical features (unigrams) to a lower-dimensional latent space representation, and a large-margin decision classifier (SVM) that makes sequential greedy shift-reduce decisions. Their usage of unigrams shows good results for relation labeling; however, their greedy parsing approach, which only captures local information, fails to achieve good accuracy on structure prediction, which relies more on global context.
| | Joty et al. (2013) | Feng et al. (2013) | Ji and Eisenstein (2014) | We pursue |
|---|---|---|---|---|
| Discourse segmentation | ✓ | ✗ | ✗ | ✗ |
| Discourse parsing | ✓ | ✓ | ✓ | ✓ |
| Use all unigrams as features | ✗ | ✗ | ✓ | ✓ |
| Use separate structure and relation models | ✗ | ✓ | ✓ | ✓ |
| Use sequential model (linear chain CRF) | ✓ | ✓ | ✗ | ✓ |
| Use optimal parsing | ✓ | ✗ | ✗ | ✓ |
| High efficiency | ✗ | ✓ | N/A | ✓ |
| Feature exploration | ✓ (incrementally) | ✗ | ✗ | ✓ (fully) |
| Over-fitting exploration | ✗ | ✗ | ✗ | ✓ |

Table 1.1: Highlights of three most recent research studies on discourse parsing

Table 1.1 highlights the aspects studied in these most recent research studies on discourse parsing and presents what we are pursuing in this thesis (the rightmost column). In this thesis, we propose a novel discourse parsing framework and try to overcome the weaknesses of the previous studies mentioned above and shown in the table. It involves two major components: (1) a structure prediction component, which builds the discourse tree structure, without tagging discourse relations onto it, by applying an optimal parsing algorithm to probabilities inferred from learned CRF models; and (2) a relation labeling component, which tags all the discourse relations onto the tree built by the first component, using learned log-linear models. More details will be discussed in chapter 3.

We have four main contributions. Firstly, we perform deep analysis and comparisons with previous work from various angles, so that we know what options (e.g., sequential vs. non-sequential model, greedy vs. optimal parsing, joint vs. separate model, which families of features to use, and which regularizer to use to prevent over-fitting) researchers have when designing a discourse parsing system.

Secondly, we propose a novel discourse parsing framework that achieves state-of-the-art performance in terms of both accuracy and efficiency. More specifically, our system achieves overall F-scores of 86.2, 72.2 and 59.2 in terms of structure (span), nuclearity and relation correctness, and it is about 40 times faster in terms of training and 3 times faster in terms of testing than Joty's system (we do not have access to the other systems for efficiency comparison). Moreover, our system has a very simple architecture and addresses many limitations of previous studies, as follows: (1) it decomposes the problem into smaller parts (structure prediction and relation labeling), making our model much simpler and more efficient than Joty's; (2) it does not require hand-engineered rules and unnecessary constraints, compared to Feng's system, but still achieves better accuracy for both structure prediction and relation labeling while keeping high efficiency; (3) it finds the globally optimal DT, which has better structure correctness than the locally optimal DT found by Ji's system. In addition, our system adopts good observations and ideas from the previous studies and extends them as follows: (1) it makes use of unigrams and bigrams of the whole unit; (2) it uses a sequential linear chain CRF model when appropriate.

Thirdly, we point out issues with the standard constituent-based evaluation metric for measuring discourse parsing correctness and propose our pair-based evaluation metric to solve these issues.

Lastly, we carry out extensive experiments to explore different aspects of the problem which have not been examined before.
For instance, we study the usefulness of different families of features and find that, for structure prediction, the most effective features are part-of-speech (POS) features and contextual features for the intra- and multi-sentential levels respectively, while for relation labeling, the most effective features are ngram features for both the intra- and multi-sentential levels. Moreover, we study the over-fitting phenomenon and find that over-fitting does occur in both intra- and multi-sentential parsing. Thus, we use proper regularization in our models, resulting in better accuracy and smaller models. In general, we aim at finding out to what extent machine learning techniques can advance this challenging NLP problem.

1.3 Outline

In chapter 2, we give more background knowledge on discourse parsing, as well as some background on the machine learning techniques which we adopt in our research. We also give detailed analysis and summaries of the related research studies. In chapter 3, we describe all the necessary details of the components in our framework, including an intra-sentential linear chain CRF model and a multi-sentential CRF model for structure prediction, as well as a log-linear model for relation labeling. In chapter 4, our experimental results for both structure prediction and relation labeling are presented, followed by a deep analysis of features, over-fitting, etc. Finally, conclusions and future directions are discussed in chapter 5.

Chapter 2

Discourse Parsing and Related Work

In this chapter, we discuss background knowledge and related work in discourse parsing. Firstly, we give some background on discourse parsing in section 2.1, as well as some background on log-linear and CRF models in section 2.2. Then, we give a detailed description and a summary of related research in section 2.3, followed by a deep reasoning on our model choice in section 2.4.

2.1 Discourse Analysis and Rhetorical Structure Theory

In section 1.1, we gave a quick introduction to discourse parsing in order to provide some background for the discussion of our motivation. However, there are some more important details about discourse analysis and Rhetorical Structure Theory from Mann and Thompson [51], which we discuss in this section.

As we saw in section 1.1, the goal of discourse analysis is to uncover the coherence structure embedded in the text and relate different sentences (or clauses) with coherence relations. Then what is a coherence relation? We follow the definition of coherence relation by Stede in [42]:

Definition 1. "Coherence relation: A specific relationship, holding on the semantic or the pragmatic level of description, between adjacent units of text. Definitions can be given in semantic terms, or in terms of speaker intentions (as in Rhetorical Structure Theory, RST). The granularity of a postulated relation set can differ widely, but relatively common are the groups causality, similarity/contrast, and contiguity (temporal or other). In the case of RST, most relations are said to hold between a unit that is more important for the purposes of the speaker (nucleus) and one unit that is less important, or supportive in nature (satellite)." [42]

Following this definition, there are two issues that we need to address:

1. What is a basic unit?
In discourse analysis, the smallest unit that we consider is called an elementary discourse unit (EDU), which serves as a building block.

"It is a span of text, usually a clause, but in general ranging from minimally a noun phrase (NP) to maximally a sentence. It should by themselves be structurally complete and denotes a single event or type of event, serving as a complete, distinct unit of information that the subsequent discourse may connect to. Also an EDU may be structurally embedded in another." [42]

There are many different EDU definitions. In this thesis we use the popular RST-DT data set from [3], which has already segmented the EDUs for us.

2. What are the coherence relations?

The set of coherence relations is not well defined, but the one defined by Mann and Thompson (1988) [51] has proven effective for many purposes. Table 2.1 shows a short version of the relation set they defined. The relations are partitioned into Subject-Matter relations (also called Semantic relations) and Presentational relations (also called Pragmatic relations) based on their functionality. Subject-Matter relations express parts of the subject matter of the text, and Presentational relations facilitate the presentation process. The complete relation set and the relation definitions are shown in Appendix A.

[Table 2.1: The set of coherence relations defined by Mann and Thompson (1988)]

Given all the above details and building blocks about coherence, let us now move on to RST, a complete theory system proposed by Mann and Thompson (1988) [51], that organizes a text as an RST tree which uncovers its coherence structure.

In [42], Stede summarized the RST theory system as follows:

"In RST, Mann and Thompson propose that coherence relations are recursively embedded in one another, and they cover the text completely; no portion may be left behind, because then the text would be not coherent. Next, RST posits that relations always hold between adjacent text segments (EDUs or larger spans). The intuition is that there are no 'non-sequiturs' in the text: There must not be a segment that is not meaningfully connected to its left or to its right neighbor. Finally, they argue that a text segment should always fulfill a single function or 'play a single role' within its context. This amounts to the constraint that a node in the coherence-relational structure have exactly one parent." [42]

Following these theories, Mann and Thompson finally formalized discourse analysis as RST tree building. They defined the RST tree as follows:

Definition 2. "RST tree: A text is segmented into a sequence of EDUs. Coherence relations hold between adjacent EDUs, thus forming a node in the structure, and recursively between adjacent larger segments. Every node has exactly one parent (except for the root node), and all EDUs of the text take part in the tree structure, which does not have any crossing branches. Most relations assign different status (nucleus, satellite) to the segments. Almost all relations span over exactly two segments (exceptions are JOINT, LIST, SEQUENCE)." [42]

For illustration, consider the RST tree (we call it a DT afterward) shown in Figure 2.1 for the following text taken from the RST Discourse Treebank (RST-DT) corpus by Carlson, Marcu, and Okurowski (2002) [3]:
"But he added: 'Some people use the purchasers' index as a leading indicator, some use it as a coincident indicator. But the thing it's supposed to measure – manufacturing strength – it missed altogether last month.'"

The six leaves of this DT correspond to six EDUs. Adjacent EDUs are connected by discourse (or coherence) relations (e.g., ELABORATION, CONTRAST), forming larger discourse units (internal nodes), which in turn are also subject to this relation linking. "Discourse units linked by a rhetorical relation are further distinguished based on their relative importance in the text: nuclei are the core parts of the relation, while satellites are the supportive parts of the relation." [42] Here, ELABORATION is the relation between a nucleus (EDU 4) and a satellite (EDU 5), CONTRAST (on the left) is a relation between two nuclei (EDUs 2 and 3), and SAME-UNIT is a relation between two nuclei (the span of EDUs 4,5 and EDU 6). As we can see, all six EDUs are recursively connected together until a whole discourse tree, whose top node is an ATTRIBUTION relation between a satellite (EDU 1) and a nucleus (the span of EDUs 2,3,4,5,6), is formed.

[Figure 2.1: RST tree (discourse tree) examples]
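To make this tree representation concrete, the following is a minimal sketch of how such a discourse tree could be represented programmatically. It is not code from the thesis: the class name and field names are illustrative assumptions, and the relation joining EDUs 2–3 to EDUs 4–6 is not named in the running example, so it is left unspecified; only the relations and nucleus/satellite assignments described above follow the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class DTNode:
    """A discourse tree node: a leaf holds one EDU, an internal node joins two adjacent spans."""
    span: Tuple[int, int]              # first and last EDU index covered by this node
    relation: Optional[str] = None     # discourse relation joining the two children (None for leaves)
    nuclearity: Optional[str] = None   # status of (left, right) children, e.g. "NS", "SN", "NN"
    left: Optional["DTNode"] = None
    right: Optional["DTNode"] = None

def leaf(i):
    return DTNode(span=(i, i))

# Rebuilding the shape of the Figure 2.1 example (six EDUs).
e = {i: leaf(i) for i in range(1, 7)}
contrast    = DTNode((2, 3), "CONTRAST",    "NN", e[2], e[3])         # two nuclei
elaboration = DTNode((4, 5), "ELABORATION", "NS", e[4], e[5])         # nucleus + satellite
same_unit   = DTNode((4, 6), "SAME-UNIT",   "NN", elaboration, e[6])  # two nuclei
body        = DTNode((2, 6), None,          "NN", contrast, same_unit)  # relation not named in the text
root        = DTNode((1, 6), "ATTRIBUTION", "SN", e[1], body)         # satellite EDU 1, nucleus EDUs 2-6
```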
2.2 Log-linear Models and Conditional Random Fields

Machine learning has been a hot field for decades and it has been successfully applied to various areas like computer vision, natural language processing, search engines, bio-informatics, stock market analysis, recommender systems and computational advertising. In this thesis, some very successful machine learning techniques, namely log-linear models and CRFs, are applied to our discourse parsing problem. Thus, we give brief introductions to them in this section.

Log-linear models, one of the most famous families of machine learning models, are very widely used in natural language processing and have proven very successful in NLP tasks like part-of-speech (POS) tagging, noun-phrase (NP) chunking, named entity recognition (NER), parsing, sentiment analysis, speech recognition and relation extraction (RE). A key advantage of log-linear models is their flexibility in allowing richer representations of features than simple logistic regression, as can be seen below.

Assume we have a set of possible inputs X, which denotes random context, and a set of possible labels Y. The goal of a log-linear model is to model the conditional probability p(y|x) for any pair (x, y) such that x ∈ X and y ∈ Y. For example, in the language modeling task we have a finite set of possible words in the language; call this set V. The set Y is simply equal to V. The set X is the set of possible sequences w_1 ... w_{i-1} such that i ≥ 1 and w_j ∈ V for j ∈ 1 ... (i-1). Log-linear models are then defined as follows by Michael Collins [6]:

Definition 3. A log-linear model consists of the following components:

• A set X of possible inputs. It could be finite or infinite.
• A set Y of possible labels. It is assumed to be finite.
• A positive integer d specifying the number of features and parameters in the model.
• A function f : X × Y → R^d that maps any (x, y) pair to a feature vector f(x, y).
• A parameter vector θ ∈ R^d.

For any x ∈ X, y ∈ Y, the model defines a conditional probability:

p(y \mid x; \theta) = \frac{\exp(\theta \cdot f(x, y))}{\sum_{y' \in Y} \exp(\theta \cdot f(x, y'))} \qquad (2.1)

Here p(y|x; θ) is the probability of y conditioned on x, parametrized by θ; exp(x) = e^x; θ · f(x, y) = Σ_{k=1}^{d} θ_k f_k(x, y) is the inner product between θ and f(x, y); f(x, y) ∈ R^d is a feature vector for the pair (x, y), and each component f_k(x, y), for k = 1 ... d, is a feature. The features capture characteristics of the input x in conjunction with the label y. For example, in the language modeling task, where the input x is a sequence of words w_1 ... w_{i-1} and the label y is a word, a feature can be an indicator function which returns 1 if y = "model" and w_{i-1} is an adjective, and 0 otherwise. [6]

Each feature has an associated parameter, θ_k, whose value is estimated using training examples and is the key of the model. The learning process is normally a gradient descent optimization which maximizes the (regularized) sum of log-likelihoods objective function:

L(\theta) = \sum_{i=1}^{n} \log p(y_i \mid x_i; \theta) - \frac{\lambda}{2} \sum_{k} \theta_k^2 \qquad (2.2)

The CRF, proposed by Lafferty et al. [24], is a special form of log-linear model which tries to capture sequential dependencies that normal log-linear models do not capture. In CRFs (or linear chain CRFs; in this thesis we only discuss linear-chain-style CRFs), the key difference is that we model p(y_1 ... y_T | x_1 ... x_T) = p(y | x), i.e., the conditional probability of an entire label sequence y = y_1 ... y_T given an entire observation sequence x = x_1 ... x_T, instead of p(y|x) for a single pair (x, y). The feature vector is then defined to map an entire input sequence x_1 ... x_T paired with an entire state sequence y_1 ... y_T to some d-dimensional feature vector. Formally, the linear chain CRF is defined as follows in [44] by Sutton:

Definition 4. Let Y, X be random vectors, θ = {θ_k} ∈ R^K be a parameter vector, and {f_k(y, y', x_t)}_{k=1}^{K} be a set of real-valued feature functions. Then a linear-chain conditional random field is a distribution p(y|x) that takes the form:

p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \exp\left\{ \sum_{k=1}^{K} \theta_k \, f_k(y_t, y_{t-1}, x_t) \right\} \qquad (2.3)

Here Z(x) is an instance-specific normalization function (a.k.a. partition function) which ensures the distribution p sums to 1:

Z(x) = \sum_{y} \prod_{t=1}^{T} \exp\left\{ \sum_{k=1}^{K} \theta_k \, f_k(y_t, y_{t-1}, x_t) \right\} \qquad (2.4)

As we can see here, another key difference between linear chain CRFs and log-linear models is that the feature function f_k(y_t, y_{t-1}, x_t) in linear chain CRFs captures sequential information by including the state transition pair (y_t, y_{t-1}) as well as the observation x_t. Moreover, the multiplication over t = 1 ... T allows linear chain CRFs to aggregate information over the whole sequence (x_1 ... x_T, y_1 ... y_T). The learning process is the same as for log-linear models: a gradient descent process that optimizes the sum of log-likelihoods objective function. The prediction process is the famous Viterbi algorithm [49], which finds the most likely sequence of hidden states y* = argmax_y p(y|x) by dynamic programming.

However, it is important to mention the time and space issues of linear chain CRFs, which are usually a big deal in practice:

• Time issue: Inference in CRFs, which is necessary during both training and prediction, is done by the forward-backward algorithm (for computing the marginal distributions p(y_t, y_{t-1} | x)) and the Viterbi algorithm (for computing the most likely sequence of hidden states y* = argmax_y p(y|x)). Both algorithms have time complexity O(T·|Y|^2), which can be quite significant given a long sequence (T is big) or many possible labels (|Y| is big).

• Space issue: The space of possible values for y = y_1 ... y_T, i.e., |Y|^T, where T is the length of the sequence, is sometimes very big. Notice that the summation in (2.4) is over these |Y|^T possible assignments to y. For this reason, computing Z(x) is intractable in general, but much work exists on how to approximate it.
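To make Definitions 3 and 4 concrete, here is a small self-contained sketch (not from the thesis; the toy feature functions and weights are invented) that computes the linear-chain CRF probability of Equations (2.3) and (2.4) by brute-force enumeration of all label sequences. With T = 1 and no transition feature it reduces to the log-linear model of Equation (2.1); in practice the enumeration is replaced by the forward-backward and Viterbi algorithms, which run in O(T·|Y|^2).

```python
import math
from itertools import product

def crf_prob(xs, ys, labels, features, theta):
    """p(y|x) for a linear-chain CRF, by enumerating every label sequence (Eqs. 2.3-2.4).

    Each feature is a function f_k(y_t, y_prev, x_t); y_prev is None at the first position.
    """
    def unnormalized(seq):
        score, prev = 0.0, None
        for t, x in enumerate(xs):
            score += sum(th * f(seq[t], prev, x) for th, f in zip(theta, features))
            prev = seq[t]
        return math.exp(score)

    z = sum(unnormalized(seq) for seq in product(labels, repeat=len(xs)))  # partition function Z(x)
    return unnormalized(tuple(ys)) / z

# Toy example in the spirit of structure prediction: y_t = 1 means "units t and t+1 are connected".
labels = (0, 1)
features = [
    lambda y, prev, x: 1.0 if y == 1 and x == "adjacent-clauses" else 0.0,  # observation feature
    lambda y, prev, x: 1.0 if y == 1 and prev == 1 else 0.0,                # transition feature
]
theta = [1.5, -3.0]   # negative transition weight discourages two consecutive "connected" decisions
xs = ["adjacent-clauses", "adjacent-clauses"]
print(crf_prob(xs, (1, 0), labels, features, theta))  # probability of the label sequence C = (1, 0)
```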
2.3 Existing Discourse Parsers

2.3.1 Earlier Studies

Discourse parsing research dates back to the 1990s, when Marcu [27] proposed using machine learning techniques to build a shift-reduce discourse parser. While the major discourse parsers are based on supervised learning, discourse parsers based on unsupervised or semi-supervised learning have also been proposed, as they do not require (or require less) human-labeled data and have some unique characteristics. For example, Hernault et al. (2010) [15] proposed a semi-supervised method to compute the co-occurrence between different features, which is available from unlabeled data, and then extend the feature vector using this co-occurrence to improve the labeling of infrequent relations. However, we are not using any unsupervised or semi-supervised learning based approach in this thesis, and we limit our discussion to supervised learning based approaches.

In 2003, Soricut and Marcu [40] considerably improved the earlier work of Marcu in 1999 by introducing SPADE: Sentence-level Discourse parsing using Syntactic and Lexical Information. The main idea is to learn a generative probabilistic model to infer a probability for every constituent at different levels, and then use a bottom-up optimal parser to construct the optimal tree. In their model, they used a feature derived from the DS-LST (discourse segmented lexicalized syntactic tree), which carries syntactic and lexical information, and they claimed this feature carries strong predictive power for discourse parsing. Kenji [38] used a shift-reduce parser with lexical and syntactic features at both the sentence level and the document level to allow linear time parsing. Subba [43] focused on discourse parsing on an instruction corpus and used Inductive Logic Programming (ILP) to learn from first-order logic representations to determine discourse relations, and then used a shift-reduce parser to greedily build a discourse tree with not only linguistic cues and lexical information but also compositional semantics (when available) and segmented discourse structure data. Hernault [16] introduced a famous discourse parsing system, HILDA, which separately models structure prediction and relation labeling with two SVMs, and then iteratively alternates between choosing the two consecutive units with the highest score provided by the structure SVM model (S) and merging them with the relation inferred by the relation SVM model (R), until all units are merged. This is a greedy process with linear time complexity, and their SVM models use a rich feature set containing textual, lexical, syntactic and structural features. Later on, Feng et al. [10] improved HILDA by incorporating rich linguistic features from other resources and performing feature selection to avoid high dimensionality and sparsity. All these studies advanced discourse parsing to a new chapter. However, none of these studies tried any advanced machine learning models like linear chain CRFs to capture the sequential dependency in the data, and their accuracy and efficiency are still far from perfect. Thus, in our thesis, we are going to study some advanced machine learning models like CRFs and log-linear models.

2.3.2 Recent Works
More recently, three pieces of research, from Joty et al. (2012, 2013) [20, 21], Feng et al. (2013) [11] and Ji and Eisenstein (2014) [18], have advanced both the accuracy and the efficiency of discourse parsers to a higher level with the help of advanced machine learning techniques, like CRFs and representation learning. In order to make a better comparison, we summarize the major differences between the three systems in terms of features, models, learning process, parsing process and time/space complexity in the rows of Table 2.2.

| | Joty et al. (2013) | Feng et al. (2013) | Ji and Eisenstein (2014) |
|---|---|---|---|
| Used features | Textual organization; ngrams; dominance set; lexical chains; context; sub-structure | Features from Joty et al.; syntactic features; entity transitions; cue phrases; post-editing features (depth) | Unigrams; selected features from Joty et al.; head word from each EDU |
| Feature encoding | No encoding, but used as strings (binary) during training | Binary encoded | Learned linear projection to a lower-dimensional space (only unigrams are projected, then concatenated with the other features) |
| Intra- and multi-sentential models | Separate | Separate | Together |
| Parsing model | Linear chain DCRF for intra-sentential parsing; UGM for multi-sentential parsing | 4 linear-chain CRFs for intra/multi-sentential structure/relation prediction | Large-margin (SVM) decision classifier |
| Learning algorithm | L-BFGS | L-BFGS (not mentioned in their paper) | Alternating between standard SVM optimization and stochastic gradient descent |
| Parsing process | Optimal CKY parsing | Greedy bottom-up parsing | Greedy shift-reduce parsing |
| Intra- and multi-sentential combination | 1S-1S (one sentence one sub-tree) or sliding window | 1S-1S | N/A |
| Time complexity | O(n^3) | O(n) | N/A |
| Memory complexity | O(n^3 · d) * | O(n · d) | N/A |

(* n is the number of sentences in a document; d is the dimension of the feature space.)

Table 2.2: Comparison of three most recent discourse parsers

As we can see in the first column, Joty et al. (2013) proposed a discriminative framework which tries to build a discourse tree by applying an optimal parsing algorithm to the probabilities of all the constituents, inferred from two CRFs that model structure and relation jointly:

• a linear chain dynamic CRF (DCRF) for intra-sentential parsing
• a uni-variate graphical model (UGM), which does not model sequential dependency, for multi-sentential parsing

and then combining the parsing results from the above two models into the final discourse tree, linking intra- and multi-sentential parsing with two different approaches:

• 1S-1S (one sentence one sub-tree), which treats a sentence as a basic unit in multi-sentential parsing
• a sliding window, which is designed to treat cases where the discourse structure violates sentence boundaries.

They used many features, including textual organization, ngrams, dominance set, lexical chains, contextual and sub-structure features, to enhance the quality of the classifier. The main contribution of this work is its application of CRFs to the discourse parsing problem, showing rather high accuracy in both intra- and multi-sentential parsing in terms of both structure and relation, as can be seen in Table 2.5. However, the major bottleneck of this system, its inefficiency in terms of both speed and space, as shown in Tables 2.2 and 2.5, makes it cumbersome in large applications.
In the second column, Feng et al. (2013) proposed a linear time discourse parser based on Joty's system, with modifications in the following aspects: (1) additional features which are not used in Joty's system; (2) binary encoded features; (3) usage of the sequential model (linear chain CRF) in both intra- and multi-sentential parsing; (4) separate modeling of structure and relation; (5) constraints on 1-1 transitions and all-0 sequences in the structure CRF's decoding; (6) a greedy bottom-up parsing procedure that allows linear time parsing; (7) the novel idea of post-editing, which does a second parsing pass to incorporate information from upper-level discourse constituents. The main principle of the system is similar to Joty's, but because of the modifications above, it achieves better accuracy in terms of both structure and relation, as can be seen in Table 2.5, and it is more efficient in terms of both speed and space, as shown in Tables 2.2 and 2.5.

In the third column, Ji and Eisenstein (2014) proposed a very different approach, namely a representation learning approach to discourse parsing, which formalizes the discourse tree building process as a sequence of decision problems by using a transition-based shift-reduce parser. It jointly learns a linear transformation from the BOW lexical features (unigrams) to a lower-dimensional latent space representation, and a large-margin (SVM) decision classifier in this space that makes sequential/greedy shift-reduce decisions. The learning procedure alternates between (1) fixing the projection matrix and performing a standard SVM optimization, and (2) fixing the SVM parameters and performing gradient updates on the projection matrix. They used features from Joty's system to further enhance its accuracy. However, this paper does not seem to mention its running time complexity; its major contribution is its application of representation learning to discourse parsing, showing the best accuracy by far on relation labeling, as can be seen in Table 2.5. It also shows the usefulness of unigrams in relation labeling, which we may adopt in our system.

However, as pointed out before in section 1.2, all the studies described in this section have their own disadvantages (which will be discussed in more detail in the next section), and their accuracy and efficiency are still not perfect. Moreover, none of them has fully reported an exploration of the over-fitting issue and feature usefulness (Joty et al. did an incremental but not full exploration of feature usefulness). Thus, in our thesis, we are going to address their disadvantages and report our exploration of over-fitting issues and feature usefulness, in order to provide a better understanding of the discourse parsing problem.

2.4 Model Reasoning and Choice

As we saw in the previous section, the three most recent discourse parsing systems differ in many ways. In this section we further investigate their systems in terms of model characteristics, efficiency and accuracy, to motivate our model reasoning and choice.

First of all, we present two tables (Tables 2.3 and 2.4) for intra- and multi-sentential parsing reasoning (assuming training is already done) and one table (Table 2.5) for results comparison.
The reason why we separate the reasoning of intra-sentential parsing and multi-sentential parsing is that multi-sentential parsing is in principle more difficult, and models that can be applied to intra-sentential parsing sometimes cannot be applied to multi-sentential parsing due to scalability issues, which is the case in Joty et al.'s and Feng et al.'s systems. As a result, they decompose their pipelines into two separate components, an intra-sentential component and a multi-sentential component, to build different levels of the discourse tree, and then combine them together with some strategies. However, the principle and idea of intra-sentential parsing and multi-sentential parsing are very similar.

As we can see in both Tables 2.3 and 2.4, the discourse parsing models mainly differ in three factors (the first three columns) for both intra- and multi-sentential parsing: (1) sequential vs. non-sequential; (2) modeling structure and relation jointly vs. separately; (3) optimal vs. greedy parsing. These are the major factors, besides features, that affect the systems' accuracy and efficiency. We also put the three systems, as well as HILDA and our system, in the last column in order to see which direction researchers have gone in discourse parsing research in terms of these three factors.

Table 2.3: Intra-sentential parsing reasoning (n is the number of EDUs per sentence, ranging from 2 to 10; |R| = 41 is the number of discourse relations; |S| = 2, connected or not connected):

• Sequential, joint, optimal parsing. Time: feature preparation O(n^4), inference O(n^3), parsing O(n^4); space: O(n^4), model size O(n·M^2) with M = |R×S|. Feasibility: maybe not, since the joint model is too complex and inappropriate. System: Joty et al.
• Sequential, joint, greedy parsing. Time: feature preparation O(n), inference O(n), parsing O(n); space: O(n), model size O(n·M^2) with M = |R×S|.
• Sequential, separate, optimal parsing (for structure). Time: feature preparation O(n^4), inference O(n^3), parsing O(n^4); space: O(n^4), model size O(n·M^2) with M = |S|. Feasibility: yes, a good area to explore. System: our system.
• Sequential, separate, greedy parsing (for structure). Time: feature preparation O(n), inference O(n), parsing O(n); space: O(n), model size O(n·M^2) with M = |S|. Feasibility: yes. System: Feng et al.
• Non-sequential, joint, optimal parsing. Time: feature preparation O(n^3), inference O(n^3), parsing O(n^3); space: O(n^3), model size O(M^2) with M = |R×S|. Feasibility: yes.
• Non-sequential, joint, greedy parsing. O(n) time for all steps; O(n) space. Feasibility: yes. System: Ji and Eisenstein.
• Non-sequential, separate, optimal parsing (for structure). O(n^3) time for all steps; O(n^3) space. Feasibility: yes.
• Non-sequential, separate, greedy parsing (for structure). O(n) time for all steps; O(n) space. Feasibility: yes. System: HILDA.

Table 2.4: Multi-sentential parsing reasoning (n is the number of sentences per document, ranging from 3 to 190; |R| = 41; |S| = 2):

• Sequential, joint, optimal parsing. Time: feature preparation O(n^4), inference O(n^3), parsing O(n^4); space: O(n^4), model size O(n·M^2) with M = |R×S|. Feasibility: no, too high complexity.
• Sequential, joint, greedy parsing. Time: feature preparation O(n), inference O(n), parsing O(n); space: O(n), model size O(n·M^2) with M = |R×S|. Feasibility: probably not, since the joint model is too complex and inappropriate.
• Sequential, separate, optimal parsing (for structure). Time: feature preparation O(n^4), inference O(n^3), parsing O(n^4); space: O(n^4), model size O(n·M^2) with M = |S|. Feasibility: maybe, but only when n is small.
• Sequential, separate, greedy parsing (for structure). Time: feature preparation O(n), inference O(n), parsing O(n); space: O(n), model size O(n·M^2) with M = |S|. Feasibility: yes, a good area to explore. System: Feng et al.
• Non-sequential, joint, optimal parsing. Time: feature preparation O(n^3), inference O(n^3), parsing O(n^3); space: O(n^3), model size O(M^2) with M = |R×S|. Feasibility: yes. System: Joty et al.
• Non-sequential, joint, greedy parsing. O(n) time for all steps; O(n) space. Feasibility: yes. System: Ji and Eisenstein.
• Non-sequential, separate, optimal parsing (for structure). O(n^3) time for all steps; O(n^3) space. Feasibility: yes, very simple. System: our system.
• Non-sequential, separate, greedy parsing (for structure). O(n) time for all steps; O(n) space. Feasibility: yes. System: HILDA.
Table 2.5: Result comparison of three recent discourse parsing systems (the F-score metrics will be explained in chapter 4; running times are for the longest document in RST-DT, 180 sentences):

| System | Method | Span | Nuclearity | Relation | Feature generation | Parsing |
|---|---|---|---|---|---|---|
| Joty et al. | 1 Sentence 1 Sub-tree | 82.56 | 68.32 | 55.83 | 6400s | 1900s |
| Joty et al. | Sliding window | 83.84 | 68.9 | 55.87 | N/A | N/A |
| Feng et al. | Greedy linear-chain CRF | 84.9 | 69.9 | 57.2 | N/A | 41s |
| Feng et al. | Greedy linear-chain CRF with post-editing | 85.7 | 71 | 58.2 | N/A | 85s |
| Ji and Eisenstein | No projection (A = I) | 79.85 | 69.01 | 60.21 | N/A | N/A |
| Ji and Eisenstein | Concatenation form; learned linear projection | 82.08 | 71.13 | 61.63 | N/A | N/A |
| Ji and Eisenstein | General form; learned linear projection | 81.6 | 70.95 | 61.75 | N/A | N/A |
| Human | | 88.7 | 77.72 | 65.75 | | |

For the first factor, whether to use a sequential or non-sequential model, we clearly see in Tables 2.3, 2.4 and 2.5 that systems that use sequential models in both intra- and multi-sentential parsing (except Joty's system in multi-sentential parsing, due to the scalability issue) achieve better results on structure prediction (Feng et al. and Joty et al.). The intuition behind this is that, for the parsing problem, the decisions at each position on whether two adjacent units should be connected are always correlated, which means that i.i.d. (independent and identically distributed) models probably cannot model the data well. To see why the decision at each position is correlated, consider the very simple scenario in Figure 2.2. Since EDU2 and EDU3 are connected, each of them, as a single unit, can never be connected to other EDUs (e.g., EDU2 cannot be connected to EDU1, and EDU3 cannot be connected to the EDUs after it, i.e., EDU4, EDU5, EDU6), since every EDU as a single unit can only be connected to one other EDU or span. Thus, making the decision to connect EDU2 and EDU3 is equivalent to making the decisions not to connect EDU1 and EDU2, and not to connect EDU3 and the EDUs after it. In this sense, decisions are correlated, and to some extent determined by each other. On the other hand, text is sequential. Therefore, sequential models seem to be the best choice for modeling this correlation, and we are going to use a sequential model for structure prediction in our framework (only at the intra-sentential level, due to the scalability issue).

[Figure 2.2: Discourse tree example for reasoning]

However, the conclusion changes when it comes to relation labeling. We clearly see that the best relation labeling result is achieved by Ji and Eisenstein, who use an i.i.d. model (SVM) which makes decisions (of both structure and relation) independently, one after another, greedily. Although there is no clear statistical analysis or theory showing that there is no correlation between the relation label distributions of nearby connected units, it is nearly impossible to model the correlation correctly, even if there is one, if we are only using pair-wise sequential models (e.g., HMMs, linear-chain CRFs). To see why, consider the same example in Figure 2.2: since EDU2 and EDU3 are connected by relation R_{2,3} = CONTRAST, EDU1 and EDU2 will never have a chance to be connected and end up with relation R_{1,2} = NONE. More generally, R_{i,i+1} = SOME-RELATION implies R_{i-1,i} = NONE and R_{i+1,i+2} = NONE, which is 100% determined and does not need to be learned.
However, long-range or hierarchical dependencies may exist. For example, R_{2,3} = CONTRAST and R_{45,6} = SAME-UNIT can be correlated, and R_{4,5} = ELABORATION and R_{45,6} = SAME-UNIT can also be correlated. If we really want to capture this kind of correlation, if there is any, we need a skip-chain CRF model or models that consider long-range or hierarchical dependencies, which is not practical, or too expensive in terms of efficiency, for most scenarios when handling text. Thus, we argue in section 1.2 that it is not appropriate to use a linear chain CRF to model the correlation of relations, as a linear chain CRF does not capture the correlation even if there is one, and it has high complexity, as shown in column 6 of both reasoning tables. Therefore, we are going to use i.i.d. log-linear models for relation labeling in our system.

For the second factor, whether to model structure and relation jointly or separately, we argue that the joint model is inappropriate for the following two reasons. (1) The joint model increases the complexity of the model by at least one order of magnitude, thus making the model much bigger in terms of number of parameters and much slower for both training and inference. (2) The purpose of modeling S (structure) and R (relation) jointly is to model the dependency between S and R. But the dependency is actually totally fixed when S = 0 (S = 0 implies R = NONE), which does not need to be learned. That is, there will not be any data point with S = 1 and R = NONE, or with S = 0 and R = SOME-RELATION, and of course their associated model parameters will be 0 and never get updated during training, if no special technique is used. In this sense, it is unnecessary, and also inappropriate, to model S and R jointly at the expense of increased model complexity and decreased model efficiency. Moreover, decomposing the two tasks allows us to choose appropriate features and models for each task, and we can study each task more deeply, as we will do in this thesis.

Lastly, for the third factor, whether to use optimal or greedy parsing, there is no obvious result in Table 2.5 showing that one dominates the other. All we know is that, in principle, greedy parsing sometimes finds a locally optimal instead of a globally optimal solution. Moreover, here we are using an inferred probability from the parsing model, which is an estimated value (not ground truth), as the building block for our parsing algorithms. Thus, we cannot really conclude that optimal parsing always gives much higher accuracy, as the results shown in Table 2.5 confirm. However, when we can maintain acceptable efficiency, optimal parsing is always the first choice, because optimal parsing will always find a solution that is no worse than greedy parsing's.

In this chapter, some background knowledge on discourse parsing as well as log-linear and CRF models was presented, in order to provide the necessary background on the problem we are studying and the machine learning techniques we will be using. Then, a detailed summary of related research as well as a deep reasoning on the model choice were presented, in order to provide sufficient theoretical support for the design of our novel discourse parsing framework, which will be discussed in the next chapter.

Chapter 3

Our Discourse Parsing Framework

After giving the necessary background and discussion about discourse parsing in chapters 1 and 2, we now start describing our novel discourse parsing framework in this chapter. Firstly, we give a brief overview of the whole system and how each component is connected in section 3.1.
Then, we give detailed descriptions of all components in Sections 3.2 and 3.3. Finally, we present all the features we use, and some techniques for processing these features, in Section 3.4.

3.1 System Overview

In our discourse parsing framework, we separate intra-sentential parsing and multi-sentential parsing due to the scalability issue mentioned in Section 2.4 and, more importantly, due to the fact that, from a linguistic point of view, authors tend to use different rhetorical structures and relations to organize their text at the two different levels (intra-sentential vs. multi-sentential). For instance, authors tend to summarize at the multi-sentential level rather than the intra-sentential level, and a summary is almost always a sentence (or multiple sentences) rather than a segment within a sentence. As a result, the SUMMARY rhetorical relation should have a much higher probability of appearing at the multi-sentential level than at the intra-sentential level. In addition, separating the models allows us to use different sets of features or hyper-parameters for the different levels of parsing, to potentially obtain better performance. Thus, the statistical (machine learning) model we use to describe the rhetorical structure should be different for the intra- and multi-sentential levels.

Moreover, we separate structure prediction and relation labeling due to efficiency and, more importantly, due to the fact that it is not necessary to model them jointly, as discussed in Section 2.4. In other words, we should use different statistical (machine learning) models, as well as different features and hyper-parameters, for structure prediction and relation labeling.

Figure 3.1: Pipeline of our discourse parsing framework

The pipeline of our framework is shown in Figure 3.1. The input to our framework is a raw text document D that has one or more sentences S_1...S_i...S_n, and each sentence S_i is segmented into one or more EDUs EDU_i1...EDU_ij...EDU_im. Then, we apply a learned linear-chain CRF model Model_S-Intra to these EDUs, followed by a CYK-like (Cocke-Younger-Kasami) optimal parsing step, to generate intra-sentential level discourse trees T_S1...T_Si...T_Sn for S_1...S_i...S_n. The CYK algorithm is a bottom-up dynamic programming parsing algorithm for context-free grammars. Previous studies on intra-sentential discourse analysis indicate that most sentences have well-formed discourse sub-trees in the document DT [40]. Therefore, we next treat each sentence S_i as a building-block unit and apply another learned CRF model Model_S-Multi to these units, followed by another CYK-like parsing step, to generate a multi-sentential level discourse tree T_D for the whole document D (this is similar to the "one sentence, one sub-tree" approach from Joty et al. [21]). Up to this point, we have generated the complete tree structure (only) T_D for D, which means that the shape of our final discourse tree is determined; the only thing left is to determine which discourse relation should relate the two spans in each branch of T_D.
For this purpose, we apply two learned log-linear models, Model_R-Intra and Model_R-Multi, to tag discourse relations on all branches of T_D using information from the spans under consideration, and obtain our final discourse tree T_D-final for D.

3.2 Structure Prediction

In this section, we give more details about the structure prediction components of our discourse parsing framework. More specifically, we discuss our linear-chain CRF model for intra-sentential structure prediction (Model_S-Intra in Figure 3.1) and our CRF model for multi-sentential structure prediction (Model_S-Multi in Figure 3.1), which infer the probability that two arbitrary continuous spans of text are connected. We also discuss the CYK-like parsing algorithm, which uses these probabilities to construct the most probable discourse tree by dynamic programming.

3.2.1 Intra-sentential Linear-Chain CRF Structure Model

Assume that a raw text document D has one or more sentences S_1...S_i...S_n and each sentence is segmented into one or more EDUs, as shown in Figure 3.1. The job of the intra-sentential structure model Model_S-Intra is to infer p(C_{k+1} | U_k, U_{k+1}, S_i, θ_S-Intra) for all adjacent pairs U_k, U_{k+1}, using the EDUs within S_i as building blocks, in order to enable the CYK-like optimal parsing. Here p(C_{k+1} | U_k, U_{k+1}, S_i, θ_S-Intra) is the probability that two continuous spans of text U_k, U_{k+1} are connected, given model parameters θ_S-Intra; U_k, U_{k+1} stand for discourse units, which are text spans containing one or more EDUs within sentence S_i; C_{k+1} stands for connected(U_k, U_{k+1}), where C = 1 means connected and C = 0 means unconnected. As discussed in Section 2.4, C_{k+1} = 1 implies C_k = 0 and C_{k+2} = 0. Therefore, we adopt a linear-chain CRF model, which naturally captures this sequential dependency during learning, for intra-sentential structure prediction.

Figure 3.2: Intra-sentential linear-chain CRF structure model (a units sequence U_1...U_t at level i with binary connection nodes C_2...C_t)

Figure 3.2 shows our linear-chain CRF graphical model, which is similar to Feng et al.'s [11] and Joty et al.'s [21] (with all the R nodes and their associated edges removed). The first layer of the chain is composed of discourse unit nodes U, and the second layer is composed of binary connection nodes C indicating the probability of connecting adjacent discourse units. This model defines a conditional probability over the sequence U_1...U_j...U_t as follows (see Section 2.2 for the meaning of each term):

p(C_{2:t} | U_{1:t}, S_i, \theta_{S-Intra}) = \frac{1}{Z} \prod_{j=1}^{t-1} \exp\Big\{ \sum_{k=1}^{K} \theta_{S-Intra,k} \, f_k(U_j, U_{j+1}, S_i, C_{j+1}) \Big\}    (3.1)

In order to infer p(C_{k+1} | U_k, U_{k+1}, S_i, θ_S-Intra) for all adjacent pairs U_k, U_{k+1}, we first apply the above linear-chain CRF to all possible sequences at different levels, and then apply the forward-backward algorithm to each sequence to get the marginal distribution over the C nodes for each pair of adjacent units in that sequence. This process is the same as Joty et al.'s [19], except that we only need to consider one prediction layer of the chain (the binary nodes C) and rule out all the R nodes and their associated edges. For more details about the sequence generation algorithm, which generates all possible sequences, and how this process works in general, please refer to Section 4.1.2 in Joty's paper [19]. Some important things to notice are that our system generates O(n^3) sequences (where n is the number of EDUs in the sentence), and that it takes O(T) time (where T is the number of units in the sequence, which can go up to n) to run the forward-backward algorithm for each sequence, as discussed in Section 2.2.
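To make this inference step concrete, the sketch below shows how the per-position connection marginals could be read off a trained linear-chain CRF with python-crfsuite, a Python binding of the crfsuite package used to train our models (see Section 4.3.1). It is only an illustrative sketch under assumptions: the model file name, the label string '1' for "connected", and the upstream feature extraction are placeholders rather than our exact code.

    import pycrfsuite

    def connection_marginals(model_path, sequence_features):
        """Return p(C_{j+1} = 1) for every adjacent pair in one unit sequence.
        sequence_features: one feature dict per connection position C_2..C_t,
        produced by a feature-extraction step that is assumed to exist."""
        tagger = pycrfsuite.Tagger()
        tagger.open(model_path)        # e.g. "model_s_intra.crfsuite" (hypothetical name)
        tagger.set(sequence_features)  # load the sequence into the tagger
        # marginal(label, t) returns the forward-backward marginal probability
        # of that label at position t for the currently loaded sequence.
        return [tagger.marginal('1', t) for t in range(len(sequence_features))]

These marginals are exactly the quantities that are later fed to the CYK-like parser described in Section 3.2.3.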
Combining these two facts, the time and space complexity of this process is O(n^4). For intra-sentential structure prediction this is not a big issue, since each sentence does not have too many EDUs and n typically ranges from 1 to 10. However, it becomes a bottleneck for multi-sentential structure prediction, where n ranges from 1 to 1000 (or even more), since a document can have as many sentences as the author wants. This is the major reason why we adopt a much simpler CRF model for multi-sentential structure prediction, as we will see in the next section. Lastly, for training this linear-chain CRF, we use the same sequence generation algorithm to generate all possible sequences at different levels of the gold DT and then follow the standard CRF training procedure. For details about how to generate the training sequences for a given training instance, please refer to Figure 7 in Section 4.1.2 of Joty's paper [19].

3.2.2 Multi-sentential CRF Structure Model

Assume that a raw text document D has one or more sentences S_1...S_i...S_n, as shown in Figure 3.1. Similar to the intra-sentential structure model, the job of the multi-sentential structure model Model_S-Multi is to infer p(C_{k+1} | U_k, U_{k+1}, D, θ_S-Multi) for all adjacent pairs U_k, U_{k+1}, using the sentences within D as building blocks, in order to enable the CYK-like optimal parsing. Here p(C_{k+1} | U_k, U_{k+1}, D, θ_S-Multi) is the probability that two continuous spans of text U_k, U_{k+1} are connected, given model parameters θ_S-Multi; U_k, U_{k+1} stand for discourse units, which are text spans containing one or more sentences within D; C_{k+1} stands for connected(U_k, U_{k+1}), where C = 1 means connected and C = 0 means unconnected.

Figure 3.3: Multi-sentential CRF structure model (two adjacent units U_{t-1}, U_t at level i with a single connection node C_t)

Figure 3.3 shows our CRF model for multi-sentential structure prediction. Similar to the intra-sentential structure model, the first layer is composed of the two discourse units U_k, U_{k+1}, and the second layer is a single binary connection node C_{k+1} indicating the probability of connecting U_k, U_{k+1}. The key difference compared to the intra-sentential structure model is that we no longer use a linear-chain CRF, due to the scalability issue. In other words, we cut the sequential dependency edges between the C nodes in the graphical model, which results in a simpler i.i.d. model. We will show later, in Section 4.3.4, that by using this simpler graphical model, which does not consider sequential dependency, we are able to considerably decrease the training/inference time and still achieve high accuracy. This model defines a conditional probability for an adjacent pair of units U_k, U_{k+1} as follows (see Section 2.2 for the meaning of each term):

p(C_{k+1} | U_k, U_{k+1}, D, \theta_{S-Multi}) = \frac{1}{Z} \exp\Big\{ \sum_{m=1}^{K} \theta_{S-Multi,m} \, f_m(U_k, U_{k+1}, D, C_{k+1}) \Big\}    (3.2)

In order to infer p(C_{k+1} | U_k, U_{k+1}, D, θ_S-Multi) for all adjacent pairs U_k, U_{k+1}, we apply the above CRF to all possible pairs of units at different levels to get the marginal distributions over the associated C nodes. This process is almost the same as Joty et al.'s [19], except that we only need to consider one prediction variable C and rule out the variable R and its associated edges. For more details about the pair generation algorithm, which generates all possible adjacent pairs of units, and how this process works in general, please refer to Section 4.1.3 in Joty's paper [19].
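Because the sequential edges are gone, scoring a single pair under Equation (3.2) reduces to a two-class log-linear (logistic) computation over that pair's features. The following minimal sketch illustrates the math; the per-label weight vectors and the feature extraction are assumed to be given (in practice the model is trained and applied through crfsuite, so this is an illustration rather than our actual code path).

    import numpy as np

    def connect_probability(theta_connect, theta_not_connect, pair_features):
        """p(C = 1 | U_k, U_{k+1}, D) for one adjacent pair, as in Eq. (3.2).
        theta_connect / theta_not_connect: learned weight vectors for C = 1 and C = 0.
        pair_features: numpy feature vector f(U_k, U_{k+1}, D)."""
        score_1 = np.dot(theta_connect, pair_features)      # unnormalized log-score for C = 1
        score_0 = np.dot(theta_not_connect, pair_features)  # unnormalized log-score for C = 0
        m = max(score_0, score_1)                            # subtract max for numerical stability
        z = np.exp(score_0 - m) + np.exp(score_1 - m)        # partition function Z
        return np.exp(score_1 - m) / z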
As pointed out in the previous section, our system generates O(n^3) pairs (where n is the number of sentences in the document), and for each pair it takes constant time to run the inference, since we no longer need forward-backward inference. As a result, the time and space complexity of this process is O(n^3). Although the time and space complexity is lower here than for intra-sentential structure prediction, the actual running time is much longer for multi-sentential structure prediction, since n ranges from 1 to 1000 (or even more), compared to 1 to 10 for intra-sentential structure prediction. To fundamentally reduce the running time, the only way would be to use greedy parsing (in this thesis we use CYK-like optimal parsing), which requires inference for only O(n) pairs of units. The reason for this will become clearer after we describe our parsing strategy in the next section.

3.2.3 CYK-like Parsing Algorithm

As shown in Figure 3.1, the CYK-like parsing components immediately follow the intra- and multi-sentential structure models. The job of the parsing component is to find a DT for D based on the information provided by the intra- and multi-sentential structure models. Similar to Joty et al. (2013) [21], we adopt a probabilistic CKY-like optimal parsing algorithm that uses dynamic programming to compute the most likely parse (or most probable DT) [22] for D. Formally, the goal of the CKY-like optimal parsing algorithm is to find:

DT^* = \arg\max_{DT} \; p(DT \mid \theta_{S-Intra}, \theta_{S-Multi})    (3.3)

More specifically, given n units (e.g., EDUs for intra-sentential parsing, sentences for multi-sentential parsing), our CYK-like parsing algorithm uses two n*n 2-D dynamic programming tables to store the probabilities and structures of partial, locally optimal DTs. For example, cell[i, j] stores the probability and structure of the partial, locally optimal DT for the partial document containing unit i, unit i+1, ..., unit j-1, unit j. Our CYK-like parsing algorithm is the same as Joty et al.'s [19], except that we have only two dynamic programming tables instead of three, since we are only considering structure prediction at this point. Any computation related to relation labeling is performed in a following step (see Figure 3.1). For more details about how and why this algorithm works, and how to compute each cell iteratively, please refer to Section 4.2 in Joty's paper [19].

One important thing to notice is that the time complexity of this algorithm is O(n^3), since it takes O(n) time to iterate over n units to pick the best splitting point when computing a cell (here n is the number of EDUs or sentences in the sentence or document, respectively). Despite the high complexity, the algorithm gives the globally optimal DT, assuming the information provided by the previous steps (e.g., the connection probability of each pair of units) is accurate. However, in practice, our intra- and multi-sentential structure models may not give very accurate probabilities, since the models are trained with a rather small amount of data (as we will see in Chapter 4). Thus, whether we should use the optimal parsing strategy, at the expense of increased time complexity, remains an open question. Some studies (e.g., Feng et al. [11]) actually use greedy parsing and achieve both high accuracy and efficiency.
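Before weighing that trade-off, the sketch below illustrates the dynamic program just described. It is a simplified illustration, not our exact implementation: connect_prob(i, k, j) is a hypothetical helper standing in for the connection probability of spans [i..k] and [k+1..j] obtained from the structure models, and nuclearity bookkeeping is omitted.

    def cky_like_parse(n, connect_prob):
        """Probabilistic CKY-like parsing over units 0..n-1.
        best[i][j]  : probability of the best partial DT covering units i..j
        split[i][j] : split point chosen for that best partial DT (None for leaves)
        Returns the probability of the most probable DT and the split table,
        from which the tree shape can be read off recursively."""
        best = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
        split = [[None] * n for _ in range(n)]
        for length in range(2, n + 1):                  # span length
            for i in range(0, n - length + 1):
                j = i + length - 1
                for k in range(i, j):                   # O(n) candidate split points
                    p = best[i][k] * best[k + 1][j] * connect_prob(i, k, j)
                    if p > best[i][j]:
                        best[i][j], split[i][j] = p, k
        return best[0][n - 1], split

The three nested loops over spans and split points give exactly the O(n^3) time complexity mentioned above.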
If we adopted a greedy bottom-up parsing approach that greedily merges the pair of units with the highest probability at each step, as Feng et al. do in [11], we would only need O(n) connection probabilities of unit pairs at each iteration, and this greedy process runs exactly n-1 iterations to obtain the full DT at the intra-sentential level. Thus, the time complexity of intra-sentential structure prediction would drop to O(n^2). For multi-sentential structure prediction, we would only need amortized O(1) connection probabilities of unit pairs at each iteration. The reason is that at the first iteration we need to apply our multi-sentential CRF structure model to all n-1 unit pairs to determine which pair to merge, but after that we only need to apply the model to obtain the connection probabilities between the newly merged unit (as one new unit) and its two neighboring units at each remaining iteration. This greedy process also runs exactly n-1 iterations to obtain the full DT, so the time complexity of multi-sentential structure prediction would drop to O(n). For more details on how greedy parsing works, please refer to [11]. With fewer connection probabilities to infer, the time complexities of the intra- and multi-sentential probability inference processes described in Sections 3.2.1 and 3.2.2 would drop from O(n^4) and O(n^3) to O(n^3) and O(n), respectively. This would allow us to use a more complex linear-chain CRF model for multi-sentential structure prediction, which might enhance accuracy (the inference time complexity would then return to O(n^3); in Feng's paper [11], they limit the length of their linear-chain CRF, keeping the time complexity at O(n)). However, optimal parsing has been shown to be more robust and accurate than greedy parsing in many scenarios, including discourse parsing. So there is a trade-off here: we can use either optimal parsing with a simple CRF multi-sentential structure model, or greedy parsing with a more complex linear-chain CRF multi-sentential structure model. In this thesis, we choose the former combination and leave the latter for future work.

3.3 Log-linear Relation Labeling Model

In this section, we give more details about the relation labeling components of our discourse parsing framework. More specifically, we discuss our log-linear models for intra-sentential relation labeling (Model_R-Intra in Figure 3.1) and multi-sentential relation labeling (Model_R-Multi in Figure 3.1), which predict discourse relations given information about the two spans under consideration. We describe our log-linear models for the intra- and multi-sentential levels together, since they have the same graphical model and we use the same learning and inference algorithms for them. The reason for using two separate models is discussed in the first paragraph of Section 3.1. However, for comparison purposes, we also experiment with a unified relation labeling model that does not distinguish between the intra- and multi-sentential cases. More details on this will come in Section 4.3.3.

Assume that we have already generated the discourse tree T_D (structure only) for the whole document D, as shown in Figure 3.1, which means that the shape of our final discourse tree is determined. The job of the log-linear relation models Model_R-Intra and Model_R-Multi is to infer p(R_{k+1} | U_k, U_{k+1}, θ_R) for the connected pair U_k, U_{k+1} in each branch (both intra- and multi-sentential) of T_D.
Here p(R_{k+1} | U_k, U_{k+1}, θ_R) is the probability distribution of the discourse relation between two connected spans of text U_k, U_{k+1}, given model parameters θ_R; U_k, U_{k+1} stand for discourse units, which are text spans containing one or more EDUs (or sentences) within S_i (or D); R_{k+1} stands for the discourse relation that relates U_k, U_{k+1}, together with the associated nuclearity statuses. For instance, R_{k+1} = NS-Elaboration means that satellite U_{k+1} elaborates nucleus U_k.

Figure 3.4: Log-linear relation labeling model (two adjacent units U_{t-1}, U_t at level i with a single discourse relation node R_t)

Figure 3.4 shows our log-linear model for intra- and multi-sentential relation labeling. Similar to the multi-sentential structure model, the first layer is composed of the two discourse units U_k, U_{k+1}, and the second layer is a single multinomial node R_{k+1} indicating the probability distribution of the discourse relation between U_k, U_{k+1}. As discussed in Section 2.4, we do not use a linear-chain CRF for relation labeling, so there are no sequential dependency edges between the R nodes in the graphical model. This model defines a conditional probability for an adjacent pair of units U_k, U_{k+1} as follows (see Section 2.2 for the meaning of each term):

p(R_{k+1} | U_k, U_{k+1}, \theta_R) = \frac{1}{Z} \exp\Big\{ \sum_{m=1}^{K} \theta_{R,m} \, f_m(U_k, U_{k+1}, R_{k+1}) \Big\}    (3.4)

In order to tag the discourse relations in a DT, we only need to extract all connected pair branches (there are n-1 of them for a document with n EDUs), apply the above model to them, and pick the discourse relations with the highest probabilities. Thus, the tagging process is extremely fast and its time complexity is O(n). For training, the process is exactly the same.

3.4 Features

In this section, we present all the features we consider in both the structure prediction and relation labeling models of our discourse parsing framework. These features have either been found to be useful for discourse parsing in previous work, or been proven useful in many other NLP applications. We categorize the features into a structural group and a semantic group in order to reflect their intended functionality: whether they help structure prediction or relation labeling. However, the actual contribution of each group of features remains unknown until we run the experiments. Finally, we discuss feature selection, since the feature dimension for our data set is in the hundreds of thousands.

3.4.1 Structural Features

Table 3.1 shows the list of structural features we consider in our system, exemplified for the second and third EDUs of the following sample sentence from RST-DT (notice that EDU_BREAK marks an EDU boundary at that location, so there are 3 EDUs in this sentence):

"Energetic and concrete action has been taken in Colombia during the past 60 days against the mafiosi of the drug trade , EDU_BREAK but it has not been sufficiently effective , EDU_BREAK because , unfortunately , it came too late . "

Among the listed features: dominance set features are extracted from the Discourse Segmented Lexicalized Syntactic Tree (DS-LST) and are for intra-sentential parsing only; lexical chain features are based on sequences of semantically related words that can indicate topical boundaries in a text and are for multi-sentential parsing only; the depth-of-units feature is obtained once we get the unlabeled DT after structure prediction; sub-structure features are obtained after we finish the whole process of structure prediction and relation labeling, and are used in an adjustment step on the relation labels, similar to Joty [19].
The other features are straightforward and are used for both intra- and multi-sentential parsing. All the structural features we use here are motivated by Joty et al. and Feng et al. [19, 11]; for more details on these features, please refer to their papers [19, 11].

Feature family      | Description                                                                         | Examples (rounded to 1-2 digits)
Text Organization   | Number of EDUs in unit 1 (or unit 2)                                                | 1
                    | Relative number of EDUs, tokens in unit 1 (or unit 2)                               | 0.3, 0.2
                    | Relative distance of unit 1 (or unit 2) to the beginning/end                        | 0.3, 0.3
                    | Relative number of sentences, paragraphs in unit 1 (or unit 2) *                    | 0.5, 0.0
                    | Whether they are in the same paragraph *                                            | true
Dominance Set       | Lexical labels of the head node and the attachment node                             | because, effective
                    | Syntactic heads of the head node and the attachment node                            | SBAR, VP
                    | Dominance relationship between unit 1 and unit 2                                    | unit1
Syntactic           | Head and tail n-gram POS tags of unit 1 and 2 (n = 1, 2)                            | CC, CC_PRP
Lexical Chain *     | Normalized number of chains starting in unit 1 and ending in unit 2                 | 0.08
                    | Normalized number of chains starting (or ending) in unit 1 (or unit 2)              | 0.25, 0.25, 0.67, 0.17
                    | Normalized number of chains skipping both unit 1 and unit 2                         | 0.0
                    | Normalized number of chains skipping unit 1 (or unit 2)                             | 0.0, 0.02
Context             | Previous and next feature vectors                                                   | ... (long vector), ... (long vector)
Sub-structure       | Root rhetorical relation labels of the left and right sub-trees of unit 1 (unit 2)  | leaf, leaf
Depth               | Depth of the units in the DT (labeled or unlabeled)                                 | 7
* multi-sentential parsing only; not applicable to the sample sentence used here
Table 3.1: Structural features

3.4.2 Semantic Features

Table 3.2 shows the list of semantic features we consider in our system, exemplified for the second and third EDUs of the same sentence above. Among the listed features: cue phrases come from a phrase list based on the connectives collected by Knott and Dale (1994) [23]; the ngram features are as described in the table and are used for both intra- and multi-sentential parsing. All the semantic features we use here, except for ngrams of the whole unit and the unit representation vector, are motivated by Joty et al. and Feng et al. [19, 11]; for more details and explanations of these features, please refer to their papers [19, 11]. To the best of our knowledge, no one has reported results for using ngrams of the whole unit or a unit representation vector as features for discourse parsing (Ji et al. [18] report results using unigrams of the whole unit, but not other ngrams of the whole unit or unit representation vector features). The two paragraphs below provide the motivation and more explanation for them.

Feature family             | Description                                                                              | Examples
Cue Phrase                 | Cue phrases appearing in unit 1 (or unit 2)                                              | but; not; because; too
Ngrams                     | Selected head and tail ngrams of unit 1 and 2 (n = 1, 2, 3)                              | <e>; </e> *; <e>_but; ,_</e>; <e>_but_it; unknown **
                           | Unigrams of the whole unit                                                               | <e>; but; it; has; not; been; sufficiently; effective; ,; </e>
                           | Selected bigrams of the whole unit                                                       | it_has
Unit Representation Vector | Component-wise product of the word representation vectors of all words in unit 1 and 2  | [-0.005118, -0.013534, ..., 0.001883, 0.007816] (200 dimensions)
* <e> and </e> stand for the beginning and the end of a sentence.
** uninformative unigrams, bigrams and trigrams are treated as "unknown" in our system to reduce the feature dimension; more details in the next section.
Table 3.2: Semantic features
Motivated by the fact that Ji et al. [18] achieve a high (above 60) F-score for relation labeling by using only unigrams of the whole unit (the two units under consideration) and some basic features, we also want to include such features in our relation labeling log-linear model, and we believe that ngrams of the whole unit provide useful semantic indicators for certain rhetorical relations. There are two types of features we introduce for this purpose. The first type is selected bigrams of the whole unit (besides unigrams of the whole unit, which Ji has already used). We can easily obtain them by tokenizing the two units: if there are 20 tokens in the two units, there will be 20 unigram features and no more than 19 bigram features.

The second type is the unit representation vector. Recently, word representation learning has proven successful in many NLP applications, including discourse parsing in one of the works motivating ours, Ji [18]. Conventionally, we take a word and convert it to a symbolic id, which is then transformed into a sparse feature vector with only one entry on. This feature vector has the same length as the size of the vocabulary. However, this representation of a word suffers from data sparsity and does not capture similarity between different words. These limitations have prompted researchers such as Mikolov et al. [33], Turian et al. [47] and Collobert et al. [7] to investigate distributed representations of words in a lower-dimensional latent space. Their distributed word representations have been shown to significantly improve and simplify many NLP applications [7, 8], so we want to use them as features in our discourse parsing system, too. However, we cannot use them directly, since the basic unit we are considering is an EDU (or a sentence) instead of a word. To overcome this, we compose the word vectors in an EDU (or a sentence) by adopting the multiplicative composition approach, in which we take the component-wise products of all the word vectors in the EDU (or sentence). By doing this, we model the meaning of an EDU (or a sentence) in the word representation latent space. This approach achieved relatively good results in several tasks, as shown in [34]. The dimension of the resulting vector, which we call the unit representation vector and use as features, is the same as the dimension of the word vectors (often from 10 to 2000).

3.4.3 Feature Selection

In many NLP- or genetics-related machine learning tasks, we are usually dealing with potentially huge feature sets (e.g., millions of features), and discourse parsing is no exception. In this setting, we naturally ask these questions:
• Are all these features actually helpful?
• Are our ML models capable of handling such high-dimensional data efficiently?
• Is there a way to pick better features for our task?
Obviously, not all the features we use are helpful, and some of them might be noisy or irrelevant to our task. And as discussed before, efficiency is a big concern for our task: we cannot just throw as many features as we want into our models, since the running time of the training/inference processes (for our CRFs and log-linear models) is slow and roughly proportional to the feature dimension. In this context, feature selection seems to be a good solution.

For our task, bigrams of the whole unit are one of the major contributors to the feature dimension, and many of them do not provide additional predictive power if we already use their two associated unigrams. To select the bigrams that are more likely to provide extra power, we rank all bigrams according to their log-likelihood ratios, which measure how likely two unigrams are to appear together rather than separately. The log-likelihood ratio log λ of a bigram w_1 w_2 is defined as follows:

\log \lambda = \log L(c_{1,2}, c_1, p) + \log L(c_2 - c_{1,2}, N - c_1, p) - \log L(c_{1,2}, c_1, p_1) - \log L(c_2 - c_{1,2}, N - c_1, p_2)    (3.5)

Here L(k, m, x) = x^k (1-x)^{m-k}; c_1, c_2 and c_{1,2} are the frequency counts of w_1, w_2 and w_1 w_2 in the corpus; N is the total number of words in the corpus; and p = c_2 / N, p_1 = c_{1,2} / c_1, p_2 = (c_2 - c_{1,2}) / (N - c_1).
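This ranking can be computed directly with nltk's collocation utilities, which implement the same log-likelihood-ratio statistic. The sketch below is illustrative only: the frequency cut-off and the number of bigrams kept are assumptions made for the example, not the exact settings used in our system.

    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    def select_bigrams(corpus_tokens, top_k=5000, min_freq=3):
        """Rank all bigrams in the corpus by log-likelihood ratio and keep the
        top_k most informative ones; the rest are later mapped to the 'unknown'
        placeholder (see Table 3.2)."""
        finder = BigramCollocationFinder.from_words(corpus_tokens)
        finder.apply_freq_filter(min_freq)   # drop very rare bigrams first
        measures = BigramAssocMeasures()
        # score_ngrams returns [((w1, w2), score), ...] sorted by descending score
        ranked = finder.score_ngrams(measures.likelihood_ratio)
        return set(bigram for bigram, _ in ranked[:top_k])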
Besides the log-likelihood ratio, we could also use chi-square or pointwise mutual information as the metric to rank our bigrams, as well as other features. No matter which metric we choose, the whole point is to select the more informative portion of our features and reduce the feature dimension. In addition, we also perform feature selection on the head and tail ngram features.

Other than directly ranking and selecting features, there are also two alternative ways to reduce the feature dimension. One is to use the least absolute shrinkage and selection operator (lasso), i.e., an L1-regularizer, to shrink the weights of some features to zero. We experiment with this method, as will be discussed in Section 4.3. The other is a stepwise bi-directional feature elimination procedure, which combines forward selection and backward elimination. However, this method is extremely slow, since we need to re-train and re-test our model every time we add or eliminate features. Thus, we do not use this method to select features for our problem. But we do adopt the idea behind this procedure and experiment with different combinations of feature families to explore feature usefulness, as will be discussed in Section 4.3.

In this chapter, we presented a brief overview of our discourse parsing system, detailed descriptions of the structure prediction and relation labeling components, and explanations of the features we use in our system. Given all these details about our system, we can now move on to the experiments and evaluation, in order to verify its performance, as discussed in the next chapter.

Chapter 4
Experiments and Analysis

In this chapter, we give a detailed description of our experiments. Firstly, we describe the RST-DT data set we use for our experiments in Section 4.1. Then, we discuss the standard constituent-based evaluation approach and its weaknesses, and propose our own pair-based evaluation approach, in Section 4.2. Finally, we present and analyze our experimental procedure and results in Section 4.3.

4.1 RST-DT Corpus

The corpus we use for experimental evaluation is the standard RST-DT data set (Carlson, Marcu, and Okurowski) [4], containing discourse annotations for 385 Wall Street Journal news articles from the Penn Treebank corpus (Marcus et al.) [30]. It is partitioned by Carlson et al. into a training set of 347 articles containing 7300 sentences and a test set of 38 articles containing 990 sentences. 53 of the 385 articles were randomly selected to be annotated by two annotators, and from these we obtain the human agreement figures. On average, each article has 20 sentences (ranging from 2 to 100) and each sentence has 2.8 EDUs (ranging from 1 to 10), which makes the average number of EDUs per article 56.
In RST-DT, the 25 rhetorical relations defined by Mann and Thompson in 1988 are further divided into the set of 18 rhetorical relation classes shown in Table 4.1. Notice that since we do rhetorical relation labeling and nuclearity labeling at the same time, which means that we attach a nuclearity status to each of the 18 rhetorical relation classes during classification, we have a total of 41 class labels (e.g., NS-Elaboration) for classification. As we mentioned before, the distribution of rhetorical relations varies between the intra- and multi-sentential levels. Figure 4.1 shows the distribution of the most frequent rhetorical relations at the intra- and multi-sentential levels.

Elaboration, Joint, Attribution, Same-Unit, Contrast, Explanation, Background, Temporal, Cause, Enablement, Comparison, Evaluation, Topic-Change, Topic-Comment, Condition, Textual-Organization, Manner-Means, Summary
Table 4.1: 18 rhetorical relation classes in RST-DT

Figure 4.1: Distribution of the most frequent rhetorical relations at the intra- and multi-sentential levels (frequency counts per relation for the two levels)

4.2 Evaluation

In this section, we describe the metrics used to measure how well discourse parsing systems perform when compared to human annotations. Conventionally, the discourse parsing community follows the standard unlabeled and labeled precision, recall and F-score proposed by Marcu [28]. The unlabeled metric measures how accurate the discourse parser is at finding the right structure (or shape) of the DT, while the labeled metrics measure how well the discourse parser identifies the nuclearity statuses and labels the discourse relations of connected tree nodes (or text spans, in the discourse parsing context), in addition to the right structure. We call this standard approach constituent-based, since it measures goodness by the proportion of correct constituents. We do not go into detail about this standard evaluation method, since we use exactly the same procedure as Joty in [19]. Figure 4.2 shows an illustration of this method, taken from [19]. As can be seen in the figure, in terms of structure (Span) correctness, there are seven matched constituents (1-1, 4-4, 7-7, 2-3, 5-6, 2-4, 5-7) out of ten gold constituents (1-1, 4-4, 5-5, 6-6, 7-7, 2-3, 5-6, 2-4, 5-7, 2-7) and ten predicted constituents (1-1, 2-2, 3-3, 4-4, 7-7, 2-3, 5-6, 2-4, 5-7, 1-4); thus the structure (Span) precision, recall and F-score are all 7/10. The labeled nuclearity and relation precision, recall and F-score are computed similarly. For more details, please refer to Section 6.2.2 in [19]. However, there are some limitations to this standard approach, as discussed below in Section 4.2.1, so we propose a very similar pair-based evaluation approach, which computes the unlabeled and labeled precision, recall and F-score over pairs of nodes in DTs, to address those limitations.

Figure 4.2: Illustration of the constituent-based evaluation approach, taken from [19]: (a) the human-annotated discourse tree; (b) the system-generated discourse tree; (c) measuring precision (P) and recall (R) of the discourse parser.

4.2.1 Our Pair-based Approach

Although the constituent-based metric is widely used for discourse parsing evaluation, it has two major weaknesses.
Firstly, for measuring structure prediction correctness, the constituent-based evaluation approach always gives credit to the leaves of the system-generated DTs when gold discourse segmentation is used, since all leaf-level constituents are trivially correct. For example, constituents (1), (4) and (7) in Figure 4.2(b) are trivially correct in structure, since they are leaf-level constituents in both the human-annotated and the system-generated DT (see rows 1, 4 and 7 of the Spans column in Figure 4.2(c)). This boosts the structure F-score by a large margin, since half of the tree nodes (or constituents, in the discourse parsing context) in a DT are leaves. As a result, the constituent-based approach often hides the details of structural errors in system-generated DTs. However, this weakness is minor compared to the second one below, since the approach still gives a strictly higher structure F-score to better DTs. Thus, we still use the standard constituent-based metric in our experimental analysis of structure prediction (Section 4.3.2) and of the complete parser (Section 4.3.4), for comparison purposes.

Figure 4.3: Example of the first type of error in measuring relation labeling correctness: (a) the human-annotated discourse tree (EDU1 as satellite and EDU2 as nucleus, related by Elaboration); (b) the system-generated discourse tree (the same nuclearity, but the relation labeled Background); (c) measuring precision (P) and recall (R) of the discourse parser: Span P = R = 2/2, Nuclearity P = R = 2/2, Relation P = R = 1/2.

Secondly, and more importantly, for measuring nuclearity and relation labeling correctness, the constituent-based approach sometimes gives credit to constituents that are not correctly labeled, and sometimes does not give credit to correctly labeled discourse relations. Consider the example shown in Figure 4.3 for the first type of error: although the system incorrectly labels the discourse relation between constituents (1) and (2) as BACKGROUND, we still get a relation F-score of 0.5, because constituent (2) is trivially correct, due to the fact that we always attach Span to the nucleus (constituent (2) in this example) for single-nuclear relations. Consider the example shown in Figure 4.4 for the second type of error: although the system correctly labels the discourse relation between constituents (1) and (2) as Elaboration, we get a relation F-score of 0, since the system does not identify the correct nucleus and satellite among constituents (1) and (2).

Figure 4.4: Example of the second type of error in measuring relation labeling correctness: (a) the human-annotated discourse tree; (b) the system-generated discourse tree (both label the relation Elaboration, but with the nucleus and satellite swapped); (c) measuring precision (P) and recall (R) of the discourse parser: Span P = R = 2/2, Nuclearity P = R = 0/2, Relation P = R = 0/2.

From both types of errors, we can clearly see that the constituent-based evaluation approach gives unreasonable F-scores, and sometimes gives a lower relation-labeling F-score to a better DT (in these examples, it gives 0.5 to an incorrect relation label and 0 to a correct one).
Thus, we use our pair-based metric in the experimental analysis of nuclearity and relation labeling for our own system in Section 4.3.

As we saw in the previous paragraphs, all the errors made by the constituent-based approach are simply a result of using the constituent as the basic unit of evaluation. So what about choosing another basic unit, say a pair of constituents? We do exactly this and propose a pair-based evaluation approach as follows: as shown in Table 4.2, our pair-based approach measures how accurate the system is at finding correct pairs of constituents in terms of structure, nuclearity and relation. It only gives credit to those pairs in the system-generated DT that fully match the corresponding pairs in the human-annotated DT. For instance, among the five gold constituent pairs ((2-3)-4, 5-6, (5-6)-7, (2-4)-(5-7), 1-(2-7)) and the five predicted constituent pairs (2-3, (2-3)-4, (5-6)-7, 1-(2-4), (1-4)-(5-7)) for the example in Figure 4.2, only the pairs (2-3)-4 and (5-6)-7 match in terms of structure, and only the pair (2-3)-4 matches in terms of nuclearity and relation. Thus, the structure and nuclearity&relation F-scores are 2/5 and 1/5, respectively. We can clearly see that this approach solves all the issues of the constituent-based approach.

Constituent pair | Span (Human) | Span (System) | Nuclearity&Relation (Human) | Nuclearity&Relation (System)
2-3              |              | *             |                             | SN-Elaboration
(2-3)-4          | *            | *             | NN-Contrast                 | NN-Contrast
5-6              | *            |               | NS-Elaboration              |
(5-6)-7          | *            | *             | NN-Same-Unit                | NN-Joint
1-(2-4)          |              | *             |                             | SN-Attribution
(2-4)-(5-7)      | *            |               | SN-Contrast                 |
(1-4)-(5-7)      |              | *             |                             | NN-Contrast
1-(2-7)          | *            |               | SN-Attribution              |
Correctness      | Span: P = 2/5, R = 2/5       | Nuclearity&Relation: P = 1/5, R = 1/5
Table 4.2: Illustration of our pair-based evaluation approach, measuring the correctness of the human-annotated and system-generated discourse trees in Figure 4.2 (a) and (b).
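As an illustration, the pair-based precision and recall can be computed in a few lines of code once each DT is represented as a mapping from constituent pairs to their nuclearity-and-relation labels. This is a hedged sketch of the metric, not our evaluation script; the dictionary representation of a tree is an assumption made for the example.

    def pair_based_scores(gold_pairs, pred_pairs):
        """gold_pairs / pred_pairs: dicts mapping a constituent pair,
        e.g. ("2-3", "4-4"), to its nuclearity+relation label, e.g. "NN-Contrast".
        Returns (span P, span R, nuclearity&relation P, nuclearity&relation R)."""
        span_hits = len(set(gold_pairs) & set(pred_pairs))
        label_hits = sum(1 for pair, label in pred_pairs.items()
                         if gold_pairs.get(pair) == label)
        return (span_hits / len(pred_pairs), span_hits / len(gold_pairs),
                label_hits / len(pred_pairs), label_hits / len(gold_pairs))

On the example of Table 4.2, this gives a span precision and recall of 2/5 and a nuclearity&relation precision and recall of 1/5.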
4.3 Experiments and Results Analysis

4.3.1 Experiment Setup and Preprocessing

In order to focus our study on discourse parsing (tree building), we use gold EDU segmentation in our experiments, so we do not need to worry about segmentation. For the discourse parsing task, we perform some preprocessing on the RST-DT data set: similar to Joty et al. and Feng et al. [11, 19], non-binary relations are converted into cascades of right-branching binary relations, and the original rhetorical relations are mapped to coarse-grained ones (the 18 rhetorical relation classes shown in Table 4.1). We also extract the necessary features using some existing packages, as follows:
• We compute lexical chains for a document following the approach proposed by Galley and McKeown [12], which extracts lexical chains after performing word sense disambiguation.
• We compute dominance set features using Joty's script from his paper [19].
• We use the Charniak reranking parser [5], one of the state-of-the-art syntactic parsers, to parse our documents for the POS tagging features.
• For word representation vectors, we choose those from Turian et al. [47], since they cover most of the words in our data set and their dimension (200) is ideal for our task, given that the size and training speed of our models are linear (or worse) in the number of features.
• Moreover, we make extensive use of the Python nltk (natural language toolkit) [2] in our feature extraction and selection.
For structure prediction and relation labeling at both the intra- and multi-sentential levels, our models are trained using crfsuite, an open-source CRF package developed by Naoaki Okazaki [35], which has the following characteristics:
• It provides fast training and tagging.
• It supports both L1 and L2 regularization, and uses the OWL-QN (Orthant-Wise Limited-memory Quasi-Newton) method to optimize loss functions with L1 terms. We need this when training our models with both L1 and L2 regularization.
• It supports inference of marginal probabilities for all labels at all positions in a sequence. We need these for our CYK-like parsing.
• It explicitly supports numerical features, which we need since we have numerical features such as the length of an EDU, whereas most other open-source CRF packages, such as Mallet-grmm, CRF++ and CRFSGD, do not (as of 2013). Mallet-grmm does support numerical features implicitly if a small code patch is added, which we did for comparison purposes.
For all other processing, including the CYK-like parsing and the evaluation, we use our own scripts, which will be made publicly available soon.

4.3.2 Structure Prediction Results

In this section, we present and analyze the performance of our system on discourse structure prediction, along with our exploration of the over-fitting phenomenon, feature usefulness, etc. (some structure prediction results will be presented later in Section 4.3.4). One thing to notice is that, in this section, we use the constituent-based F-score for structure prediction results, for comparison purposes.
Then,we run our structure prediction on the standard split of the data set with our ownfeature set {Intra-sentential: text organization, dominance set, POS, context, headand tail ngrams, unigrams of the whole unit; Multi-sentential: text organization,POS, lexical chain, context, head and tail ngrams}, dense feature setting12 andthe default hyper-parameter setting; and get the structure (Span) F-scores of 95.0for intra-sentential parsing and 84.4 for the complete parser (both intra and multi-sentential parsing), which are already impressive compared to human agreement at95.7 and 88.7.However, when we examine our intra and multi-sentential structure models, wefind out that there are 220 thousands and 110 thousands parameters respectively,which is very large compared to the number of training samples (around 70000 and15000 respectively). Thus, we suspect that there may be some over-fitting in ourmodels and we run the over-fitting exploration experiment.Figure 4.5 shows the training v.s. test structure (Span) F-score for two modelsusing text organization, POS features (3846 parameters in total) and using textorganization, POS, ngram features (48846 parameters in total) respectively in intra-sentential structure prediction with default L2 (0.1) and increasing L1. We canclearly see that there is no over-fitting in Figure 4.5(a) since the training F-score isalmost the same as the test F-score no matter what value we set for L1. However,there is a little bit over-fitting in Figure 4.5(b) at the beginning (L1=0, 48846 non-zero parameters) where the training F-score is much higher than the test F-score.But as L1 coefficient increases and the number of non-zero parameters decreases,the training F-score approaches the test F-score, which indicates that we minimize11L1,L2 are the coefficients for L1,L2 penalty term12A dense feature is state features f (xi,yi)s that associate all of possible combinations betweenattribute x’s and label y’s which even do not occur in the training data. It improves our structureF-score by 0.3%-0.5% at the expenses of increasing 10% training/inference time.454.3. Experiments and Results Analysis 93.0 92.8 92.3 91.2 92.4 92.4 92.4 91.7 80.082.084.086.088.090.092.094.096.098.0100.03846 2110 632 850 1 10 100Structure F-score Number of non-zero parameters (above) L1 coefficient (below) (a) TrainTest98.6 95.9 92.8 91.3 91.5 93.0 92.5 91.3 80.082.084.086.088.090.092.094.096.098.0100.048846 10976 1014 920 1 10 100Structure F-score Number of non-zero parameters (above) L1 coefficient (below) (b) TrainTestFigure 4.5: Training v.s. test F-score in intra-sentential structure prediction withdefault L2 (0.1): (a) Using text organization and POS features (3846 parameters intotal). (b) Using text organization, POS and ngram features (48846 parameters intotal)over-fitting. The intuition behind this is simple: we have more features (the ngramfeatures) in Figure 4.5(b) than in Figure 4.5(a) (48846 v.s. 3846) and the number offeatures in Figure 4.5(b) is comparable to the number of training samples (48846v.s around 70000), so our model over-fits the data without proper regularization.Once L1 is big enough (e.g., 10), the number of non-zero parameters in Figure4.5(b) decreases to 1014 and is much less than the sample size (around 70000),464.3. Experiments and Results Analysisso there is no over-fitting anymore. Thus, in order to avoid over-fitting, we chooseproper L1 by cross validation in our final structure prediction models. 
However, forcomparison purpose (in section 4.3.4), we just pick L1 as 10 (as we prefer smallermodels with lower space and time complexity) in order to show that even withouttuning, our system still achieves state-of-the-art structure prediction performance.Since our full feature set for structure prediction is quite large (220 thousandsand 110 thousands for intra and multi-sentential respectively), besides using reg-ularizer to control over-fitting, we also want to explore the usefulness of differentfeature families so that we can have a better understanding of features, identify use-ful and less useful features, and potentially remove less useful features when nec-essary. We design our feature exploration experiment as follows: (1) We choosetext organization as baseline feature set f0 and carry out experiments on f0 withdifferent L1. (2) We create different feature sets f1... fi by adding each featurefamily (except those we have chosen already) to f0 and carry out experiments onf1... fi with different L1. (3) For each feature set in f1... fi that shows significantimprovement to f0, we treat it as new f0 and go back to step (2) to explore a newspace of feature sets recursively. By doing this, we explore necessary combinationsof feature families, which is enough for our purpose of identifying useful and lessuseful features, instead of every possible combination of feature families, whichwould take an extremely long time to run.Figure 4.6 and 4.7 shows the training v.s. test structure (Span) F-score for mod-els using different combinations of feature families we explored in intra-sententialstructure prediction with L1 = 1,L2 = 0.1 and in multi-sentential structure predic-tion with L1 = 10,L2 = 0.1. There are some abbreviations for feature families inthe two figures (and the figures in the next section): Org stands for text organi-zation; Dom stands for dominance set; Ngrams stands for selected head and tailngrams; Lex stands for lexical chain; All stands for our full feature set for structureprediction (mentioned at the beginning of this section). We can see that differ-ent families of features play different roles in intra and multi-sentential structureprediction. More specifically, for intra-sentential structure prediction:• Text organization, head and tail ngrams and POS features all help, as can beseen in Figure 4.6 column Org v.s. Org + Ngrams, Org + POS.• Context features also help, as can be seen in Figure 4.6 column Org + POSv.s Org + POS + Context and column Org + Ngrams + POS v.s. Org +Ngrams + POS + Context.• It is not clear whether dominance set features help, as can be seen in Figure4.6 column Org v.s. Org + Dom and column Org + Ngrams + POS v.s. Org+ Dom + Ngrams + POS.474.3. Experiments and Results Analysis                           88.2 89.1 95.4 92.8 95.9 95.8 95.8 97.6 98.1 88.5 89.0 92.4 92.4 93.0 92.8 95.1 95.2 95.9 80.082.084.086.088.090.092.094.096.098.0100.032 10134 45032 3846 48846 58948 10448 125518 226238Org Org + Dom Org +NgramsOrg + POS Org +Ngrams +POSOrg + Dom +Ngrams +POSOrg + POS +ContextOrg +Ngrams +POS +ContextAllStructure F-score Number of features (above) Combinations of feature families (below) TrainTestFigure 4.6: Feature exploration results on structure prediction for intra-sententialparsing.• Besides text organization features which we choose as the baseline features,POS features seem to be the most effective features in terms of both theirstrong predictive power and small size (384632= 3814), as can be seen inFigure 4.6 column Org + POS. 
For multi-sentential structure prediction:
• Text organization, head and tail ngrams, POS and lexical chain features all help by a small margin, as can be seen in Figure 4.7, column Org vs. Org + Ngrams, Org + POS and Org + Lex.
• Context features help the most, as can be seen in Figure 4.7, column Org + Ngrams + POS + Lex vs. All (Org + Ngrams + POS + Lex + Context). The intuition behind this is that, in multi-sentential parsing, local information is not enough to determine the discourse structure; we need more context before and after.
• The training F-score is lower than the test F-score. A possible explanation is that the F-score here measures the final structure correctness of the DT, not the intermediate correctness (the classification accuracy of the merging decisions for all possible pairs of units), so it is not unreasonable to obtain a higher test F-score. Indeed, when we examine the intermediate correctness of whether two units should be merged or not, we do find that the training accuracy is higher than the test accuracy, as expected.
• Multi-sentential structure prediction has much lower performance than intra-sentential structure prediction, since multi-sentential coherence is much harder to detect and there are many more possible DT configurations as the document size grows.

Figure 4.7: Feature exploration results on structure prediction for multi-sentential parsing.
  Feature combination (number of features): Train / Test structure F-score
  Org (36): 53.3 / 63.0; Org + Ngrams (38570): 53.0 / 65.3; Org + POS (2208): 52.5 / 64.0; Org + Lex (52): 53.2 / 65.3; Org + Ngrams + POS + Lex (40754): 52.9 / 64.5; All (113144): 52.9 / 69.6

We also tried using unigrams of the whole unit as features in multi-sentential structure prediction (as we did in intra-sentential structure prediction). But it turned out to be very inefficient (10 times slower than without the unigrams of the whole unit) and there was no F-score gain (-0.8%). Thus, we exclude them from our experiments and conclude that semantic features do not help structure prediction much, especially at the multi-sentential level; we also do not use bigrams, trigrams or other semantic features in our structure prediction, in order to keep our models efficient.

Finally, as can be seen in the last column of both Figures 4.6 and 4.7, adding more feature families always helps in terms of accuracy, provided we apply proper regularization to the model. However, to keep structure prediction not only accurate but also efficient, only the most informative feature families, i.e., the last column of Figures 4.6 and 4.7, will be used when comparing with other competing systems later in Section 4.3.4.

4.3.3 Relation Labeling Results

In this section, we present and analyze the performance of our system on discourse relation labeling (including nuclearity labeling), along with our exploration of the over-fitting phenomenon, feature usefulness, and the trend of labeling correctness with respect to label frequency (some relation labeling results will be presented later in Section 4.3.4). One thing to notice is that, in this section, we first use the constituent-based F-score and then switch to the pair-based F-score for relation labeling results.
Similar to structure prediction, we also carried out a 10-fold CV, which leads to the same conclusion about the randomness of the RST-DT data set with respect to discourse relation labels (we omit the results here). Then, we run our relation labeling on the standard split of the data set with several different feature sets (e.g., the standard feature set from Joty, and the standard feature set plus one of the following: sub-structure, depth, cue phrase, unit representation vector, and more ngrams), the dense feature setting and the default hyper-parameters (L1 = 0, L2 = 0.1). Figure 4.8 shows the results, both for intra-sentential relation labeling and for the complete parser (both intra- and multi-sentential relation labeling). Some abbreviations for feature families are used in the figure: Std stands for the standard feature set from Joty; Sub-Str stands for sub-structure; Dep stands for depth; Cue stands for cue phrase; Vec stands for the unit representation vector features; more Ngrams stands for all head (and tail) ngrams plus unigrams (and selected bigrams) of the whole unit.

Figure 4.8: Relation labeling test F-score on different feature sets for: (a) intra-sentential; (b) the complete parser.
  (a) Nuclearity / Relation F-score: Std 87.0 / 78.2; Std + Sub-Str 86.8 / 78.0; Std + Dep 87.0 / 78.2; Std + Cue 86.9 / 77.9; Std + Vec 86.3 / 76.5; Std + more Ngrams 87.1 / 78.5
  (b) Nuclearity / Relation F-score: Std 69.1 / 56.4; Std + Sub-Str 68.9 / 56.2; Std + Dep 69.1 / 56.4; Std + Cue 69.3 / 56.2; Std + Vec 68.7 / 55.4; Std + more Ngrams 69.1 / 56.9

There are two major conclusions we draw from these results:
• With default regularization, on top of the standard feature set, none of the features helps nuclearity and relation labeling except more ngrams (all head and tail ngrams, unigrams and selected bigrams of the whole unit), as can be seen by comparing the last column to the first column in Figure 4.8(a) and (b). The potential reasons are as follows: (1) the sub-structure, depth and unit representation vector features are noisy; (2) the cue phrase features do not provide much additional power if we already have the ngram features; (3) we did not apply proper regularization to the models, and there is over-fitting. Thus, for simplicity and efficiency, we exclude these features from our later experiments and from our comparison against competing systems in Section 4.3.4.
• As can be seen in the last column of Figure 4.8(a) and (b), we obtain nuclearity/relation F-scores of 87.1/78.53 for intra-sentential relation labeling and 69.06/56.89 for the complete parser (both intra- and multi-sentential relation labeling), compared to a human agreement of 90.4/83.0 and 77.72/65.75. This is not bad, but we can certainly improve it further, as we will see later.
We also experimented with a unified model that does not distinguish between intra- and multi-sentential relation labeling. The nuclearity/relation F-scores of the unified model, which allows data sharing, are 87.03/78.26 for intra-sentential relation labeling and 68.95/56.30 for the complete parser (both intra- and multi-sentential relation labeling), and are pretty much the same as those of the two separate models shown above in Figure 4.8.
A possible explanation for this is that, although we claimed before that we should treat intra and multi-sentential parsing separately, in practice our log-linear model and the features we use are capable of capturing the differences. One thing to notice is that from now on, we will be using our own pair-based nuclearity&relation F-score (credit is only given to pairs with both the correct nuclearity status and the correct relation label) in this section. Moreover, we only consider the correctly retrieved spans when computing the F-scores, in order to eliminate the effect of structure prediction.

As mentioned before, we may face the over-fitting issue here, since we have a very large feature set. Similar to the last section, we run an over-fitting exploration experiment. Figure 4.9 shows the training vs. test pair-based nuclearity&relation F-scores for models with large feature sets (more than a hundred thousand parameters) in intra and multi-sentential relation labeling with the default L2 (0.1). There are some new abbreviations for feature families in the figure: HT_Ngrams stands for selected head and tail ngrams; W_Unigram stands for unigrams of the whole unit; W_Bigram stands for selected bigrams of the whole unit; All stands for the standard feature set from Joty plus W_Unigram and W_Bigram; All - Context stands for All except context features. We can clearly see that there is large over-fitting in both Figure 4.9(a) and (b), since the training F-scores are almost perfect (100 or 99.9) while the test F-scores are far apart.

To confirm our conclusion, Figure 4.10 shows the training vs. test pair-based nuclearity&relation F-scores for models using text organization and selected head and tail ngram features (1153542 and 741116 parameters in total) in both intra and multi-sentential relation labeling with the default L2 (0.1) and increasing L1. The reason for over-fitting is very straightforward: the number of features in Figure 4.10 is much larger than the number of training samples (1153542 and 741116 vs. around 11000 and 7000), so our model over-fits the data without proper regularization. Once L1 is big enough (e.g., 10), the number of non-zero parameters in Figure 4.10 decreases to 273 and 307, which is much smaller than the sample size (around 11000 and 7000), so there is no over-fitting anymore. Thus, in order to avoid over-fitting, we choose a proper L1 by cross validation in our final relation labeling models. However, for comparison purposes (in Section 4.3.4), we simply pick L1 as 0 (intra-sentential) and 10 (multi-sentential, as we prefer smaller models with lower space and time complexity) in order to show that even without tuning, our system still achieves high relation labeling performance.

Figure 4.9: Training vs. test pair-based nuclearity&relation F-score in relation labeling with default L2 (0.1) for: (a) Intra-sentential. (b) Multi-sentential.
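To make the effect of the L1 term concrete, the following is a minimal sketch, not our actual training code, of fitting an L1+L2 (elastic-net) regularized log-linear classifier and counting its non-zero parameters. The feature matrix X, the label vector y, the data sizes and the mapping of our (L1, L2) coefficients onto scikit-learn's C and l1_ratio parametrization are assumptions made purely for illustration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical sparse binary training data: 1000 samples, 3000 features, 6 labels.
    rng = np.random.default_rng(0)
    X = (rng.random((1000, 3000)) < 0.002).astype(float)
    y = rng.integers(0, 6, size=1000)

    # Elastic-net penalty: a larger share of L1 (higher l1_ratio) drives more
    # coefficients exactly to zero, i.e., yields a smaller effective model.
    clf = LogisticRegression(penalty="elasticnet", solver="saga",
                             l1_ratio=0.9, C=0.1, max_iter=200)
    clf.fit(X, y)
    print("non-zero parameters:", np.count_nonzero(clf.coef_))

Re-running the sketch with a smaller l1_ratio (mostly L2) keeps far more coefficients non-zero, which mirrors the behaviour we observe in Figure 4.10 as L1 grows.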
Figure 4.10: Training vs. test pair-based nuclearity&relation F-score in relation labeling using text organization and selected head and tail ngram features with default L2 (0.1) for: (a) Intra-sentential. (b) Multi-sentential.

Similar to structure prediction, we also carry out a feature exploration experiment to study the usefulness of features in relation labeling. One thing to notice is that we only study text organization, dominance set, selected head and tail ngrams, unigrams of the whole unit, selected bigrams of the whole unit, lexical chain and context features here, since, as mentioned before, the sub-structure, depth, unit representation vector and cue phrase features are excluded from our experiments. Figures 4.11 and 4.12 show the training vs. test pair-based nuclearity&relation F-scores for models using different combinations of the feature families we explored, in intra-sentential relation labeling with L1 = 1, L2 = 0.1 and in multi-sentential relation labeling with L1 = 10, L2 = 0.1. We can see that different families of features play different roles in intra and multi-sentential relation labeling.

Figure 4.11: Feature exploration results on relation labeling for intra-sentential parsing.

More specifically, for intra-sentential relation labeling:

• Text organization, dominance set, head and tail ngram, unigram (or bigram) of the whole unit, and POS features all help, as can be seen in Figure 4.11, column Org vs. Org + Dom, Org + POS, Org + HT_Ngrams and Org + W_Unigram + W_Bigram.

• Context features do not help, as can be seen in Figure 4.11, column All - CONTEXT vs. All. The potential reason for this is that most sentences only have 2-3 EDUs, so there is not much contextual information to learn or model.

• Besides the text organization features, which we choose as the baseline features, head and tail ngram features seem to be the most effective, as can be seen in Figure 4.11, column Org + HT_Ngrams, since they have a strong capability of capturing semantic information for discourse relation labeling.

• One thing to notice is that all the conclusions above still hold if we use the constituent-based F-score instead of the pair-based F-score. However, we omit the constituent-based results here. For a comparison between the constituent-based and pair-based F-scores, please refer to Table 4.4 in the next section.
For multi-sentential relation labeling:

• Text organization, POS, lexical chain, head and tail ngram, and unigram (or bigram) of the whole unit features all help, as can be seen in Figure 4.12, column Org vs. Org + POS, Org + Lex, Org + HT_Ngrams and Org + W_Unigram + W_Bigram.

• Context features also help, as can be seen in Figure 4.12, column All - CONTEXT vs. All.

• Besides the text organization features, which we choose as the baseline features, head and tail ngram features seem to be the most effective, as can be seen in Figure 4.12, column Org + HT_Ngrams, since they have a strong capability of capturing semantic information for discourse relation labeling.

• Performance is much lower than for intra-sentential relation labeling, since multi-sentential level coherence is much harder to detect.

• One thing to notice is that if we use the constituent-based F-score instead of the pair-based F-score, all the conclusions above still hold with one exception: context features do not help any more (the constituent-based nuclearity/relation F-scores drop from 44.1/23.9 to 43.8/23.8). However, we believe that context features should be helpful in multi-sentential relation labeling, since a document has many sentences and sequential or long-range rhetorical dependencies do exist at the multi-sentential level. In addition, context features have been shown to be helpful by other researchers (e.g., Joty et al. [19]). Thus, this observation in a way empirically shows the weaknesses of the constituent-based evaluation approach.

Figure 4.12: Feature exploration results on relation labeling for multi-sentential parsing.

As can be seen in the last column of both Figures 4.11 and 4.12, adding more feature families always helps if we apply proper regularization to the model. However, to keep relation labeling not only accurate but also efficient, only the most informative feature families, i.e., the second-to-last column in Figure 4.11 and the fourth column in Figure 4.12, will be used when comparing with other competing systems later in Section 4.3.4.

As mentioned before, we have an uneven distribution of relation labels in RST-DT. Thus, we want to examine whether the frequency of rhetorical relations affects the accuracy. Figure 4.13 shows the average pair-based nuclearity&relation F-scores on the standard test set for the top-1, top-2, top-10, 2-10 and all rhetorical relations (with respect to frequency), for both intra and multi-sentential relation labeling. The feature sets we use are the ones mentioned in the previous paragraph, and the hyper-parameter settings are the same as in the feature exploration experiments above: L1 = 1, L2 = 0.1 for the intra-sentential level and L1 = 10, L2 = 0.1 for the multi-sentential level. As can be seen in Figure 4.13, the F-score decreases as we take less frequent relations into consideration.
However, there is one exception, as can be seen in Figure 4.13, column Top-2 vs. Top-1 for intra-sentential relation labeling: the top-2 F-score is higher than the top-1 F-score, indicating that the second most frequent relation (Attribution) has a higher F-score than the most frequent relation (Elaboration) at the intra-sentential level. The reason for this is straightforward: Attribution is very easy to classify, as it can be triggered by ngrams like "he says", whereas Elaboration is not. In general, more frequent relations get higher F-scores, since we have more training samples for them.

Figure 4.13: Test pair-based nuclearity&relation F-score on top-k relations in relation labeling.

4.3.4 Complete Parser Analysis and Comparison

After our discussion of some results for structure prediction and relation labeling, we now present some more results for the complete parser, involving both parts, in this section. Firstly, we examine the test F-score (with the default feature and hyper-parameter settings) at different levels of the DT. As shown in Table 4.4, the F-scores on structure, nuclearity and relation all decrease as spans go higher in the DT, under both constituent-based and pair-based evaluation. This is not surprising, since higher level spans (branches) in the tree have more units (EDUs or sentences), making it harder for our models to find the right discourse structure as well as the right discourse relations. Moreover, the correctness of higher level spans depends on the correctness of lower level spans.

Span level (height) in the DT | Leaf | 1 | 2 | 3 | 4 | 5 | Average
Structure F-score, constituent-based | 100 | 90.4 | 79.8 | 74.4 | 64.0 | 56.2 | 86.2
Structure F-score, pair-based | N/A | 90.4 | 69.9 | 57.6 | 43.1 | 34.8 | 61.6
Nuclearity F-score, constituent-based | 88.5 | 74.5 | 63.7 | 49.7 | 44.3 | 40.0 | 72.2
Relation F-score, constituent-based | 76.0 | 57.4 | 48.9 | 34.8 | 31.0 | 21.5 | 58.9
Nuclearity&Relation F-score, pair-based | N/A | 72.1 | 54.3 | 40.6 | 27.9 | 20.5 | 46.4
* All F-scores are rounded to one decimal place.
Table 4.4: Level-wise F-score

Then, there is an important observation from our experiments: we are able to achieve good enough accuracy by using only state features (functions) f(x_i, y_i) and transition features (functions) f(y_i, y_{i+1}), instead of clique features (functions) f(x_i, y_i, y_{i+1}) with higher-order complexity, in our CRFs and log-linear models. Please refer back to Section 2.2 for what "state", "transition" and "feature (function)" mean.
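As a rough illustration of the distinction, here is a minimal sketch, in plain NumPy rather than our actual implementation, of the unnormalized score a linear-chain CRF assigns to a label sequence when only state features f(x_i, y_i) and transition features f(y_i, y_{i+1}) are used; the feature dimensionality, label set and weights are invented for the example.

    import numpy as np

    # Toy setup: 2 labels (e.g., merge / no-merge) and 4 features per position.
    n_labels, n_feats = 2, 4
    rng = np.random.default_rng(0)
    w_state = rng.normal(size=(n_labels, n_feats))   # one weight vector per label
    w_trans = rng.normal(size=(n_labels, n_labels))  # one weight per label bigram

    def sequence_score(x_seq, y_seq):
        """Unnormalized log-score of y_seq given x_seq under a linear-chain CRF
        restricted to state features f(x_i, y_i) and transitions f(y_i, y_{i+1})."""
        state_part = sum(w_state[y] @ x for x, y in zip(x_seq, y_seq))
        trans_part = sum(w_trans[a, b] for a, b in zip(y_seq[:-1], y_seq[1:]))
        return state_part + trans_part

    x_seq = rng.normal(size=(3, n_feats))  # three adjacent-unit observations
    print(sequence_score(x_seq, [1, 0, 1]))

Clique features f(x_i, y_i, y_{i+1}) would instead require weights indexed jointly by the observation and the label bigram, which is where the extra complexity comes from.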
Now let us compare our system against several competing systems in terms of accuracy. Table 4.5 shows the F-scores for our discourse parsing system as well as for other competing systems from Joty et al., Feng et al. and Ji and Eisenstein [19, 11, 18].

System and method | Intra-sentential: Structure (Span) / Nuclearity / Relation | Complete parser: Structure (Span) / Nuclearity / Relation
Our system | 96.0 / 89.0 / 80.0 | 86.2 / 72.2 / 59.2
Joty et al., 1 Sentence 1 Sub-tree | 96.5 / 89.4 / 79.8 | 82.6 / 68.3 / 55.8
Joty et al., Sliding window | - | 83.8 / 68.9 / 55.7
Feng et al., greedy linear-chain CRF | N/A | 84.9 / 69.9 / 57.2
Feng et al., greedy linear-chain CRF with post-editing | N/A | 85.7 / 71.0 / 58.2
Ji and Eisenstein, concatenation form | N/A | 82.1 / 71.1 / 61.6
Ji and Eisenstein, general form | N/A | 81.6 / 71.0 / 61.7
Human agreement | 95.7 / 90.4 / 83.0 | 88.7 / 77.7 / 65.7
* N/A means we do not have access to their intra-sentential results, or they do not distinguish between intra and multi-sentential parsing.
Table 4.5: Comparison on system accuracy performance

The feature sets and hyper-parameter settings we use for our system are not tuned and are the ones mentioned before: for intra-sentential parsing, we use the feature set {text organization, dominance set, POS, context, head and tail ngrams, unigrams of the whole unit} with hyper-parameters {L1 = 10, L2 = 0.1} for structure prediction, and the feature set {text organization, dominance set, POS, head and tail ngrams, unigrams (and selected bigrams) of the whole unit} with hyper-parameters {L1 = 0, L2 = 0.1} for relation labeling; for multi-sentential parsing, we use the feature set {text organization, POS, lexical chain, context, head and tail ngrams} with hyper-parameters {L1 = 10, L2 = 0.1} for structure prediction, and the feature set {text organization, head and tail ngrams} with hyper-parameters {L1 = 10, L2 = 0.1} for relation labeling.

As can be seen in the table, for intra-sentential parsing both our system and Joty's achieve close-to-human-agreement performance. For complete parsing (both intra and multi-sentential), our system achieves the best F-scores in structure and nuclearity, and they are very close to human agreement (97.2% and 92.9% of it, respectively). However, our system achieves a lower F-score in relation labeling than the best result from Ji and Eisenstein's system; the potential reason is that their mechanism of using unigrams together with representation learning techniques improves the quality of the BOW features for relation labeling. In summary, our system achieves the best combination of F-scores (structure, nuclearity and relation) compared to the other systems.

Finally, let us present some efficiency results for our system as well as a comparison between different discourse parsing systems. The performance of our system is measured in experiments on a desktop with an Intel(R) Core(TM)2 Duo E8135 CPU @ 2.66GHz and 4GB of RAM. Table 4.6 shows the running time of the different components of our system for both training and testing.

Component | Train (7300 sentences, 347 documents) | Test, average (per sentence for intra, per document for multi)
Structure prediction, intra-sentential | 450 s | 0.007 s
Structure prediction, multi-sentential | 80 s | 50 s
Relation labeling, intra-sentential | 200 s | 0.002 s
Relation labeling, multi-sentential | 120 s | 0.04 s
Table 4.6: Running time performance for our system

As can be seen in the table, it takes much longer (450 seconds) to train the intra-sentential structure model than the other models. The intuition behind this is that we are using a linear-chain CRF for intra-sentential structure prediction, which is inherently slow, as explained before in Section 2.2, while we are using the simpler log-linear models for multi-sentential structure prediction and for intra and multi-sentential relation labeling. However, for testing, multi-sentential structure prediction takes much longer on average (50 seconds per document). The reason for this is that we need to infer probabilities for O(n^3) pairs of units in structure prediction, and n ranges from 1 to 1000 (or even more) at the multi-sentential level compared to 1 to 10 at the intra-sentential level; in contrast, intra and multi-sentential relation labeling take only linear time.
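To illustrate where the O(n^3) probabilities come from, here is a schematic sketch of the kind of CKY-style dynamic program that optimal probabilistic parsing implies: every span [i, j] is scored over all split points k, so a probability is needed for every (i, k, j) triple. The function pair_prob and the assumption that subtree scores simply multiply are simplifications for illustration, not our exact parsing model.

    from functools import lru_cache

    def best_tree(n, pair_prob):
        """Schematic CKY-style search for the highest-scoring binary discourse tree
        over units 0..n-1.  pair_prob(i, k, j) is assumed to give the probability
        of merging span [i, k] with span [k+1, j]; subtree scores are multiplied."""
        @lru_cache(maxsize=None)
        def best(i, j):
            if i == j:
                return 1.0, (i, j)  # a single unit is already a (trivial) subtree
            score, tree = 0.0, None
            for k in range(i, j):  # O(n) splits per span, O(n^2) spans -> O(n^3) probabilities
                left_s, left_t = best(i, k)
                right_s, right_t = best(k + 1, j)
                s = left_s * right_s * pair_prob(i, k, j)
                if s > score:
                    score, tree = s, (left_t, right_t)
            return score, tree
        return best(0, n - 1)

    # Toy usage with a made-up probability function that prefers splitting after the left unit.
    print(best_tree(4, lambda i, k, j: 0.9 if k == i else 0.4))

A greedy parser would instead commit to one merge at a time, which is exactly the accuracy/efficiency trade-off discussed in the next chapter.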
Then, Table 4.7 shows a comparison of the running time performance of our system and some other state-of-the-art systems.

System and method | Train, intra-sentential (7300 sentences, 347 documents) | Train, multi-sentential | Test, longest document (180 sentences) in RST-DT
Our system | 650s | 200s | 80s
Joty et al., 1 Sentence 1 Sub-tree | 26000s | 7000s | 1900s
Joty et al., Sliding window | - | N/A | N/A
Feng et al., greedy linear-chain CRF | N/A | N/A | 41s
Feng et al., greedy linear-chain CRF with post-editing | N/A | N/A | 85s
Ji and Eisenstein | N/A | N/A | N/A
* N/A means we do not have access to the running time.
Table 4.7: Comparison of running time performance on different systems

As can be seen in the table, our system is 40 times faster in training and 20 times faster in testing (for the longest document, with 180 sentences, in RST-DT) than Joty's. Thus, we can safely conclude that our separation of structure prediction and relation labeling results in much higher efficiency and lower memory consumption in both training and testing, as well as better accuracy for both structure and relation labels on the standard test data, as can be seen in Tables 4.5 and 4.7. Moreover, our system has comparable testing efficiency (for the longest document, with 180 sentences, in RST-DT) to Feng's system, which requires hand-engineered rules and constraints. In summary, our system achieves the best, or close to the best, performance in every dimension (e.g., efficiency, accuracy, training, test, structure, relation) without tuning.

Chapter 5

Conclusion and Future Work

In this thesis, we summarize several discourse parsing research studies and analyze them from various angles to provide a better understanding of the discourse parsing problem. Based on our analysis and reasoning, we propose a novel two-step discourse parsing system, which first builds a discourse tree for a given text by applying optimal probabilistic parsing to probabilities inferred from learned CRFs, and then uses learned log-linear models to tag all discourse relations on the nodes of the discourse tree.

We carry out extensive experimental studies on different aspects of our system and find that: for structure prediction, the most effective features are POS features at the intra-sentential level and context features at the multi-sentential level, while for relation labeling, the most effective features are head and tail ngram features at both the intra and multi-sentential levels. Moreover, we find that over-fitting does occur in both intra and multi-sentential parsing, and we apply proper regularization in our models, resulting in better accuracy and smaller models. Not surprisingly, more frequent relations have higher accuracy, and accuracy decreases as the level goes higher in the DT. Finally, empirical evaluation shows that our system achieves state-of-the-art performance in terms of both accuracy and efficiency compared to other systems, and it is very close (above 90%) to the human agreement.

However, there are some open issues and challenges in discourse parsing which we have not fully explored and solved. One issue is that, although our system is rather efficient, it may not be fast enough if we want to do textual analysis online (as opposed to offline).
The fundamental way to lower the time complexity of discourse parsing is to use greedy parsing instead of optimal parsing. Thus, in future work, we may want to study how we can achieve the same accuracy with greedy parsing by using more advanced techniques and strategies.

Another issue is the measure of similarity between rhetorical relations. For example, ELABORATION and BACKGROUND have high similarity, since a piece of text T1 which is a background of another piece of text T2 also provides some degree of elaboration on T2. Imagine a scenario where:

• T1 is a background of another piece of text T2
• T1 provides some degree of elaboration on T2
• A human annotator relates T1 and T2 with BACKGROUND
• DPS1 (discourse parsing system #1) relates T1 and T2 with BACKGROUND
• DPS2 relates T1 and T2 with ELABORATION
• DPS3 relates T1 and T2 with LIST

We can say DPS1 is correct and DPS3 is incorrect. But we cannot say that DPS2 and DPS3 make equally bad choices for the rhetorical relation between T1 and T2. In fact, we cannot even say DPS2 is incorrect, since there is indeed some elaboration on T2 from T1. Thus, in this kind of scenario, we do need a measure of similarity between rhetorical relations to determine whether a discourse parsing system really makes an incorrect choice of rhetorical relation. However, this problem is beyond the scope of this thesis.

From the machine learning perspective, a big challenge of discourse parsing is the imbalanced distribution of rhetorical relations, together with the data scarcity for infrequent rhetorical relations (not enough samples). As can be seen in Section 4.1, the frequency of relations varies from 10 to 7000. This poses a big challenge for a discourse parsing system trying to find correct rhetorical relation labels for samples with infrequent rhetorical relations. Thus, in future work, we may use sampling techniques like down-sampling and up-sampling to handle the imbalance issue, as sketched below.
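As a minimal sketch of the up-sampling idea (not a component of the current system), the following duplicates training instances of infrequent relation labels until every label is as frequent as the most common one; the (features, label) data format is a made-up placeholder.

    import random
    from collections import Counter

    def upsample(samples, seed=0):
        """Duplicate instances of infrequent relation labels at random until every
        label has as many training instances as the most frequent label.
        `samples` is a list of (features, relation_label) pairs (placeholder format)."""
        rng = random.Random(seed)
        by_label = {}
        for feats, label in samples:
            by_label.setdefault(label, []).append((feats, label))
        target = max(len(group) for group in by_label.values())
        balanced = []
        for group in by_label.values():
            balanced.extend(group)
            balanced.extend(rng.choices(group, k=target - len(group)))
        rng.shuffle(balanced)
        return balanced

    data = [({"ngram": "he says"}, "Attribution")] * 5 + [({"ngram": "because"}, "Cause")] * 2
    print(Counter(label for _, label in upsample(data)))  # both labels now have 5 instances

Down-sampling would instead discard instances of the frequent labels; either way, the resampling should be applied to the training folds only, never to the test set.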
We may also use information from other rhetorical relation data sets, such as the Penn Discourse Treebank (PDTB) [36], to enhance our relation labeling component. More specifically, we can use that information in a semi-supervised manner to enhance the quality of our feature representation, or in a supervised manner to increase the training sample size.

Finally, we hope that our analysis of discourse parsing is helpful not only for researchers who need to design their own discourse parsing systems, but also for researchers who study other NLP problems and want to achieve better performance on their systems. Thus, we will make our system demo and source code publicly available (https://sites.google.com/site/liaoweicong/projects).

Bibliography

[1] Nicholas Asher and Alex Lascarides. Logics of conversation. Cambridge University Press, 2003.
[2] Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python. O'Reilly Media, Inc., 2009.
[3] Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. RST Discourse Treebank. LDC2002T07. Philadelphia: Linguistic Data Consortium, 2002.
[4] Lynn Carlson, Mary Ellen Okurowski, Daniel Marcu, Linguistic Data Consortium, et al. RST Discourse Treebank. Linguistic Data Consortium, University of Pennsylvania, 2002.
[5] Eugene Charniak and Mark Johnson. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 173–180. Association for Computational Linguistics, 2005.
[6] Michael Collins. Log-linear models. Columbia University Natural Language Processing Tutorial, 2013.
[7] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.
[8] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537, 2011.
[9] Laurence Danlos. D-STAG: a formalism for discourse analysis based on SDRT and using synchronous TAG. In Proceedings of the 14th Conference on Formal Grammar (FG'09), 2009.
[10] Vanessa Wei Feng and Graeme Hirst. Text-level discourse parsing with rich linguistic features. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 60–68. Association for Computational Linguistics, 2012.
[11] Vanessa Wei Feng and Graeme Hirst. A linear-time bottom-up discourse parser with constraints and post-editing. ACL (1), 2013.
[12] Michel Galley and Kathleen McKeown. Improving word sense disambiguation in lexical chaining. In IJCAI, volume 3, pages 1486–1488, 2003.
[13] Shima Gerani, Yashar Mehdad, Giuseppe Carenini, Raymond T. Ng, and Bita Nejat. Abstractive summarization of product reviews using discourse structure.
[14] Francisco Guzmán, Shafiq Joty, Lluís Màrquez, and Preslav Nakov. Using discourse structure improves machine translation evaluation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 687–698, Baltimore, Maryland, June 2014. Association for Computational Linguistics.
[15] Hugo Hernault, Danushka Bollegala, and Mitsuru Ishizuka. A semi-supervised approach to improve classification of infrequent discourse relations using feature vector extension. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 399–409. Association for Computational Linguistics, 2010.
[16] Hugo Hernault, Helmut Prendinger, Mitsuru Ishizuka, et al. HILDA: a discourse parser using support vector machine classification. Dialogue & Discourse, 1(3), 2010.
[17] Jerry R. Hobbs. Coherence and coreference. Cognitive Science, 3(1):67–90, 1979.
[18] Yangfeng Ji and Jacob Eisenstein. Representation learning for text-level discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13–24. Association for Computational Linguistics, 2014.
[19] Shafiq Joty, Giuseppe Carenini, and Raymond T. Ng. CODRA: A novel discriminative framework for rhetorical analysis.
[20] Shafiq Joty, Giuseppe Carenini, and Raymond T. Ng. A novel discriminative framework for sentence-level discourse analysis. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 904–915. Association for Computational Linguistics, 2012.
[21] Shafiq R. Joty, Giuseppe Carenini, Raymond T. Ng, and Yashar Mehdad. Combining intra- and multi-sentential rhetorical parsing for document-level discourse analysis. In ACL (1), pages 486–496, 2013.
[22] Dan Jurafsky and James H. Martin. Speech & Language Processing, chapter 14. Prentice Hall, 2008.
[23] Alistair Knott and Robert Dale. Using linguistic phenomena to motivate a set of coherence relations. Discourse Processes, 18(1):35–62, 1994.
[24] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 2001.
[25] Angeliki Lazaridou, Ivan Titov, and Caroline Sporleder. A Bayesian model for joint unsupervised induction of sentiment, aspect and discourse representations. In ACL (1), pages 1630–1639, 2013.
[26] Annie Louis, Aravind Joshi, and Ani Nenkova. Discourse indicators for content selection in summarization. In Proceedings of the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 147–156. Association for Computational Linguistics, 2010.
[27] Daniel Marcu. A decision-based approach to rhetorical parsing. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, pages 365–372. Association for Computational Linguistics, 1999.
[28] Daniel Marcu. The theory and practice of discourse parsing and summarization. MIT Press, 2000.
[29] Daniel Marcu and Abdessamad Echihabi. An unsupervised approach to recognizing discourse relations. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 368–375. Association for Computational Linguistics, 2002.
[30] Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. The Penn Treebank: annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, pages 114–119. Association for Computational Linguistics, 1994.
[31] James R. Martin. English text: System and structure. John Benjamins Publishing, 1992.
[32] Mstislav Maslennikov and Tat-Seng Chua. A multi-resolution framework for information extraction from free text. In Annual Meeting-Association for Computational Linguistics, volume 45, page 592. Citeseer, 2007.
[33] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[34] Jeff Mitchell and Mirella Lapata. Vector-based models of semantic composition. In ACL, pages 236–244, 2008.
[35] Naoaki Okazaki. CRFsuite: a fast implementation of conditional random fields (CRFs), 2007.
[36] Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K. Joshi, and Bonnie L. Webber. The Penn Discourse Treebank 2.0. In LREC. Citeseer, 2008.
[37] Rashmi Prasad, Aravind Joshi, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, and Bonnie Webber. The Penn Discourse Treebank as a resource for natural language generation. In Proc. of the Corpus Linguistics Workshop on Using Corpora for Natural Language Generation, pages 25–32, 2005.
[38] Kenji Sagae. Analysis of discourse structure with syntactic dependencies and data-driven shift-reduce parsing. In Proceedings of the 11th International Conference on Parsing Technologies, pages 81–84. Association for Computational Linguistics, 2009.
[39] Swapna Somasundaran. Discourse-level relations for opinion analysis. PhD thesis, University of Pittsburgh, 2010.
[40] Radu Soricut and Daniel Marcu. Sentence level discourse parsing using syntactic and lexical information. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 149–156. Association for Computational Linguistics, 2003.
[41] Caroline Sporleder and Mirella Lapata. Discourse chunking and its application to sentence compression. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 257–264. Association for Computational Linguistics, 2005.
[42] Manfred Stede. Discourse processing. Synthesis Lectures on Human Language Technologies, 4(3):1–165, 2011.
[43] Rajen Subba and Barbara Di Eugenio. An effective discourse parser that uses rich linguistic information. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 566–574. Association for Computational Linguistics, 2009.
[44] Charles Sutton and Andrew McCallum. An introduction to conditional random fields. arXiv preprint arXiv:1011.4088, 2010.
[45] Charles Sutton, Andrew McCallum, and Khashayar Rohanimanesh. Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. The Journal of Machine Learning Research, 8:693–723, 2007.
[46] Simone Teufel and Marc Moens. Summarizing scientific articles: experiments with relevance and rhetorical status. Computational Linguistics, 28(4):409–445, 2002.
[47] Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. Association for Computational Linguistics, 2010.
[48] Suzan Verberne, Lou Boves, Nelleke Oostdijk, and Peter-Arno Coppen. Evaluating discourse-based answer extraction for why-question answering. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 735–736. ACM, 2007.
[49] Andrew J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, 1967.
[50] Bonnie Webber. D-LTAG: Extending lexicalized TAG to discourse. Cognitive Science, 28(5):751–779, 2004.
[51] William Mann and Sandra Thompson. Rhetorical structure theory: Towards a functional theory of text organization. Text, 8(3):243–281, 1988.

Appendix A

Complete Discourse Relations

Relation name | Constraints on either S or N individually | Constraints on N + S | Intention of W
Circumstance | on S: S is not unrealized | S sets a framework in the subject matter within which R is intended to interpret N | R recognizes that S provides the framework for interpreting N
Condition | on S: S presents a hypothetical, future, or otherwise unrealized situation (relative to the situational context of S) | Realization of N depends on realization of S | R recognizes how the realization of N depends on the realization of S
Elaboration | none | S presents additional detail about the situation or some element of subject matter which is presented in N or inferentially accessible in N | R recognizes S as providing additional detail for N; R identifies the element of subject matter for which detail is provided
Evaluation | none | on N + S: S relates N to the degree of W's positive regard toward N | R recognizes that S assesses N and recognizes the value it assigns
Interpretation | none | on N + S: S relates N to a framework of ideas not involved in N itself and not concerned with W's positive regard | R recognizes that S relates N to a framework of ideas not involved in the knowledge presented in N itself
Means | on N: an activity | S presents a method or instrument which tends to make realization of N more likely | R recognizes that the method or instrument in S tends to make realization of N more likely
Non-volitional Cause | on N: N is not a volitional action | S, by means other than motivating a volitional action, caused N; without the presentation of S, R might not know the particular cause of the situation; a presentation of N is more central than S to W's purposes in putting forth the N-S combination | R recognizes S as a cause of N
Table A.1: Definitions of Subject Matter Relations: Part I

Relation name | Constraints on either S or N individually | Constraints on N + S | Intention of W
Non-volitional Result | on S: S is not a volitional action | N caused S; presentation of N is more central to W's purposes in putting forth the N-S combination than is the presentation of S | R recognizes that N could have caused the situation in S
Otherwise | on N: N is an unrealized situation; on S: S is an unrealized situation | Realization of N prevents realization of S | R recognizes the dependency relation of prevention between the realization of N and the realization of S
Purpose | on N: N is an activity; on S: S is a situation that is unrealized | S is to be realized through the activity in N | R recognizes that the activity in N is initiated in order to realize S
Solutionhood | on S: S presents a problem | N is a solution to the problem presented in S | R recognizes N as a solution to the problem presented in S
Unconditional | on S: S conceivably could affect the realization of N | N does not depend on S | R recognizes that N does not depend on S
Unless | none | S affects the realization of N; N is realized provided that S is not realized | R recognizes that N is realized provided that S is not realized
Volitional Cause | on N: N is a volitional action or else a situation that could have arisen from a volitional action | S could have caused the agent of the volitional action in N to perform that action; without the presentation of S, R might not regard the action as motivated or know the particular motivation; N is more central to W's purposes in putting forth the N-S combination than S is | R recognizes S as a cause for the volitional action in N
Volitional Result | on S: S is a volitional action or a situation that could have arisen from a volitional action | N could have caused S; presentation of N is more central to W's purposes than is presentation of S | R recognizes that N could be a cause for the action or situation in S
Table A.2: Definitions of Subject Matter Relations: Part II
Relation name | Constraints on either S or N individually | Constraints on N + S | Intention of W
Antithesis | on N: W has positive regard for N | N and S are in contrast; because of the incompatibility that arises from the contrast, one cannot have positive regard for both of those situations; comprehending S and the incompatibility between the situations increases R's positive regard for N | R's positive regard for N is increased
Background | on N: R won't comprehend N sufficiently before reading the text of S | S increases the ability of R to comprehend an element in N | R's ability to comprehend N increases
Concession | on N: W has positive regard for N; on S: W is not claiming that S does not hold | W acknowledges a potential or apparent incompatibility between N and S; recognizing the compatibility between N and S increases R's positive regard for N | R's positive regard for N is increased
Enablement | on N: N presents an action by R (including accepting an offer), unrealized with respect to the context of N | R comprehending S increases R's potential ability to perform the action in N | R's potential ability to perform the action in N increases
Evidence | on N: R might not believe N to a degree satisfactory to W; on S: R believes S or will find it credible | R's comprehending S increases R's belief of N | R's belief of N is increased
Justify | none | R's comprehending S increases R's readiness to accept W's right to present N | R's readiness to accept W's right to present N is increased
Motivation | on N: N is an action in which R is the actor (including accepting an offer), unrealized with respect to the context of N | Comprehending S increases R's desire to perform the action in N | R's desire to perform the action in N is increased
Preparation | none | S precedes N in the text; S tends to make R more ready, interested or oriented for reading N | R is more ready, interested or oriented for reading N
Restatement | none | on N + S: S restates N, where S and N are of comparable bulk; N is more central to W's purposes than S is | R recognizes S as a restatement of N
Summary | on N: N must be more than one unit | S presents a restatement of the content of N that is shorter in bulk | R recognizes S as a shorter restatement of N
Table A.3: Definitions of Presentational Relations

Relation name | Constraints on each pair of N | Intention of W
Conjunction | The items are conjoined to form a unit in which each item plays a comparable role | R recognizes that the linked items are conjoined
Contrast | No more than two nuclei; the situations in these two nuclei are (a) comprehended as the same in many respects, (b) comprehended as differing in a few respects, and (c) compared with respect to one or more of these differences | R recognizes the comparability and the difference(s) yielded by the comparison being made
Disjunction | An item presents a (not necessarily exclusive) alternative for the other(s) | R recognizes that the linked items are alternatives
Joint | None | None
List | An item is comparable to others linked to it by the List relation | R recognizes the comparability of the linked items
Multinuclear Restatement | An item is primarily a reexpression of one linked to it; the items are of comparable importance to the purposes of W | R recognizes the reexpression by the linked items
Sequence | There is a succession relationship between the situations in the nuclei | R recognizes the succession relationships among the nuclei
Table A.4: Definitions of Multinuclear Relations
