UBC Theses and Dissertations

Improve classification on infrequent discourse relations via training data enrichment Jiang, Kailang 2016

Full Text

Improve Classification on Infrequent Discourse Relations via Training Data Enrichment

by

Kailang Jiang

B. Eng., Shanghai Jiao Tong University, 2014

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

November 2016

© Kailang Jiang, 2016

Abstract

Discourse parsing is a popular technique widely used in text understanding, sentiment analysis, and other NLP tasks. However, for most discourse parsers, the performance varies significantly across different discourse relations. In this thesis, we first validate the underfitting hypothesis, i.e., the less frequent a relation is in the training data, the poorer the performance on that relation. We then explore how to increase the number of positive training instances, without resorting to manually creating additional labeled data. We propose a training data enrichment framework that relies on co-training of two different discourse parsers on unlabeled documents. Importantly, we show that co-training alone is not sufficient. The framework requires a filtering step to ensure that only "good quality" unlabeled documents can be used for enrichment and re-training. We propose and evaluate two ways to perform the filtering. The first is to use an agreement score between the two parsers. The second is to use only the confidence score of the faster parser. Our empirical results show that the agreement score can help to boost the performance on infrequent relations, and that the confidence score is a viable approximation of the agreement score for infrequent relations.

Preface

This dissertation is an original intellectual product of the author, Kailang Jiang. The author conducted all the experiments and wrote the manuscript, under the supervision of Dr. Giuseppe Carenini and Dr. Raymond Ng.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
1.1 Discourse Parsing
1.2 Motivation
1.3 Approach and Contributions
1.4 Outline
2 Related Work
2.1 Existing Discourse Parsers
2.2 Training Data Expansion in Discourse Parsing
2.2.1 Training Data Expansion for Implicit Relations
2.2.2 Training Data Expansion for Infrequent Relations
2.3 Co-training
3 Enrichment Approach
3.1 Workflow
3.2 Selection of Discourse Parsers
3.3 Enrichment Process
3.4 Filtering at Finer Granularity
4 Empirical Evaluation
4.1 Datasets
4.2 The Underfitting Hypothesis: Performance vs Frequency
4.3 Effect of Enrichment on Infrequent Relations
4.4 The Impact of the Filtering Threshold
4.5 Using the Confidence Score to Approximate the Agreement Score
4.6 Adding Enriched Training Instances in an Iterative Manner
4.7 Filtering at a Finer Granularity
5 Conclusion
Bibliography

List of Tables

Table 1.1: 18 Discourse Relations in RST-DT Dataset
Table 4.1: Relative F-scores Improvements (%) on the Top-8 Infrequent Relations
Table 4.2: Relative F-scores Improvements (%) on the Top-8 Infrequent Relations
Table 4.3: Relative F-scores Improvements (%) at Different Filtering Granularities

List of Figures

Figure 1.1: Discourse Tree of the Example Sentence
Figure 3.1: Workflow of Our Enrichment Approach
Figure 3.2: Filter at Different Granularity
Figure 4.1: Distribution of the Most Frequent and the Least Frequent 5 Relations in RST-DT
Figure 4.2: Performance versus Frequency for Each Relation
Figure 4.3: Relative F-score Improvements on Different Relations
Figure 4.4: Actual Number of Training Instances Enriched (%)
Figure 4.5: Changes in Relative F-score with Varying Filtering Agreement Score Threshold
Figure 4.6: The Impact of More Unlabeled Resources
Figure 4.7: Agreement Score vs Confidence Score
Figure 4.8: Overall F-score Improvements with Different Enriched Data Quality via Confidence Score

Acknowledgments

I would like to offer my wholehearted gratitude to everyone who has inspired or supported my work during my master's studies.

Special thanks to my supervisors, Dr. Giuseppe Carenini and Dr. Raymond Ng! Thank you for teaching me how to do research and how to write a paper, for providing so many good ideas and suggestions during every meeting, for giving me very detailed feedback on everything, and for always being so patient and encouraging.

Thanks to all my fellow students and friends at UBC for sharing your useful experiences and suggestions on study, research, and life, and for all your concern, company, and support.

Particular thanks to my good friends and my parents who are far away.
Yourunconditional love and support are the reason why I want to become a better person.viiiChapter 1Introduction“Clauses and sentences rarely stand on their own in an actual discourse; rather,the relationship between them carries important information that allows the dis-course to express a meaning as a whole beyond the sum of its individual parts.Discourse analysis seeks to uncover this coherence structure.” (Joty, et al., 2015)[16]Research and application of natural language processing (NLP) has been grow-ing rapidly in the past decade, and the value of discourse structure and relation inNLP is getting more and more attention. Discourse parsing, which discovers howsentences and clauses are connected together, is now widely used in many NLPtasks, including text understanding [2], machine translation evaluation [11], senti-ment analysis [3], text summarization [10], etc.Studies in the past decade on discourse parsing, such as [3], [16], have greatlyimproved the performance of discourse parsing in general. However, it has beenobserved that the performance across the discourse relations varies significantly[3], and that poor performance may be linked to underfitting, i.e., a lack of trainingdata [16]. In this thesis, we investigate the underfitting hypothesis and study howto improve the situation.11.1 Discourse ParsingMost text analysis tasks focus on only properties of each single sentence or clause.However, those sentences and clauses are not arbitrarily put together, the way theyare constructed and related in fact carries a lot of important information that couldnot be discovered from their segregated parts. And capturing the relations betweensentences and clauses can help us better understand the entire text.Consider the following two examples:• “While the pound has attempted to stabilize, currency analysts say it is incritical condition.”• “The pound has attempted to stabilize, currency analysts say it is in criticalcondition.”The first sentence is extracted from an Wall Street Journal article in our unla-beled dataset that will be further introduced in Section 4.1. And the second sen-tence just takes away the first conjunction “While” from the first sentence, whichdoes not impair the understanding of the first clause. Most readers will find thefirst sentence easy to understand, while the second sentence confusing, since therelation between pound’s stabilization and critical condition is not clear. Normallyauthors always try to construct their text in a coherent and logical way so that it iseasy to interpret and understand. So uncovering the the coherence structure under-neath the text is very helpful for understanding it, and is the foundation of discourseanalysis.A multi-sentential discourse parser takes a document as input, and returns itsdiscourse structure that shows how clauses and sentences are related in the doc-ument, via the use of various discourse relations. For instance, the very popularcorpus - Rhetorical Structure Theory Discourse Treebank (RST-DT) [5] groups dif-ferent types of discourse relations between sentences and clauses into 18 classes,as listed in Table 1.1.As an illustration, Figure 1.1 shows the discourse structure of the first sentence“While the pound has attempted to stabilize, currency analysts say it is in criti-cal condition.” produced by CODRA[16] - one of the state-of-the-art discourseparsers used in this thesis. 
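Before walking through the figure, here is a minimal, hypothetical encoding of the same tree in code; the class and field names are illustrative only and are not part of CODRA or the SR-parser.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class EDU:
    """Elementary discourse unit: a clause-like leaf span."""
    span_id: int
    text: str

@dataclass
class RelationNode:
    """Internal node joining two discourse units by a coherence relation,
    with the nucleus as the central unit and the satellite as peripheral."""
    relation: str
    nucleus: Union["RelationNode", EDU]
    satellite: Union["RelationNode", EDU]

edu1 = EDU(1, "While the pound has attempted to stabilize,")
edu2 = EDU(2, "currency analysts say")
edu3 = EDU(3, "it is in critical condition.")

# Attribution links EDU 3 (nucleus) with EDU 2 (satellite); Contrast then links
# the combined span [2, 3] (nucleus) with EDU 1 (satellite), as in Figure 1.1.
attribution = RelationNode("Attribution", nucleus=edu3, satellite=edu2)
tree = RelationNode("Contrast", nucleus=attribution, satellite=edu1)
```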
As shown in the figure, this discourse tree has three2Elaboration Joint AttributionSame-Unit Contrast ExplanationBackground Cause TemporalEnablement Comparison EvaluationTopic-Comment Condition TextualOrganizationTopic-Change Manner-Means SummaryTable 1.1: 18 Discourse Relations in RST-DT Datasetleaves that correspond to contiguous atomic text spans, called elementary discourseunits (EDUs). EDUs are clause-like units that serve as building blocks [38]. Ad-jacent EDUs are then related by coherence relations (e.g., Attribution, Contrast),thereby forming larger units (represented by internal nodes), which in turn are alsolinked by coherence relations. Discourse units linked by a relation are further dis-tinguished based on their relative importance in the text: the nucleus being thecentral part, whereas satellites are peripheral ones. For example, in Figure 1.1,“Attribution” is a relation between a nucleus (EDU 3) and a satellite (EDU 2), and“Contrast” is a relation between a nucleus (EDU [2, 3]) and a satellite (EDU 1).From this discourse tree we can see clearly that the clause “While the pound has at-tempted to stabilize” contrasts with the clause “currency analysts say it is in criticalcondition”.A better understanding of such relations between clauses and sentences is veryhelpful to many NLP tasks such as text understanding [2], machine translationevaluation [11], sentiment analysis [9] [3], text summarization [10], etc. For in-stance, when we try to analyze the sentiment of the sentence mentioned above -“While the pound has attempted to stabilize, currency analysts say it is in criticalcondition.”, if we only look at the properties of each word without considering thediscourse structure, we might regard the word “stabilize” to be positive and theword “critical” to be negative, and when we sum it up, it will be hard to tell whattype of sentiment this sentence is carrying in total. But if we incorporate its dis-course structure and relations shown in Figure 1.1, we will find that the two clausescontrast with each other and the second half is the nucleus. So the negative word inthe second clause should be given a higher weight when calculating the sentiment3of the entire sentences, and get a more accurate result.[ currency analysts say ] 2 [ it is in critical condition . ] 3  [ While the pound has attempted to stabilize , ] 1  ATTRIBUTION CONTRAST [2, 3] 2 3 1 Figure 1.1: Discourse Tree of the Example Sentence1.2 MotivationSince a better understanding of discourse structures and relations can bring greatbenefits to so many NLP tasks, a lot of work has been done recently to improvethe performance of discourse parsing [15] [8] [16]. However, all these works suf-fer from the same problem of lacking training data for certain discourse relations,which prevents them from achieving a better result on those relations.It has been observed that the performance across the discourse relations variessignificantly[3]. Meanwhile, different discourse relations are usually unevenly dis-tributed in a dataset, and some of them occur much less frequently than other rela-tions. We call the former the infrequent relations. 
For example, in the RST-DT cor-pus [5] which contains 385 documents, the frequency of “Elaboration” is 31.04%,while the frequency of “Summary” is only 0.88%.In another benchmark corpus the Penn Discourse Treebank (PDTB) 2.0 [25],which contains about 2400 documents with discourse relations labeled for eachpair of adjacent sentences, the relation “Conjunction” occurs 8759 times throughthe entire corpus, while the relations “Exception” and “Pragmatic concession” onlyappear 17 and 12 times respectively [12].It has been noticed that poor performance of one discourse relation may belinked to underfitting, i.e., a lack of training data [16]. And given that the perfor-4mance of most discourse parsers depends on the availability of training data, thekey question here is whether underfitting affects the infrequent relations more thanthe frequent ones. In Section 4.2, we will explicitly show that parsing performanceof relations is correlated with the frequency of the relations.Clearly, every discourse relation, infrequent or not, would benefit from theavailability of more high-quality training data. However, creating such high-qualitylabeled data takes much time and effort to manually annotate documents with theirdiscourse structures and relations. The question here is whether the infrequent re-lations are worthy of the extra effort required. It turns out that many infrequentrelations actually play important roles in various NLP tasks. For example, the“Comparison” relation from RST-DT is known to indicate disagreement in a con-versation [14] [2]. Moreover, the “Instantiation” relation from PDTB is regarded asan important feature for sentence specificity prediction [19]. Since these infrequentrelations are very important to many NLP tasks as mentioned above, it is clearlyworth extra efforts to acquire more training instances for infrequent relations.1.3 Approach and ContributionsThe main objective of this thesis is to explore how to mitigate the underfittingproblem for infrequent relations - without manually creating labeled data for thoserelations. In particular, we aim to exploit the availability of a much larger amount ofunlabeled data, with the help of a small amount of existing labeled data. That is, weneed to adopt a semi-supervised learning algorithm and apply it on our discourseparsing problem here.So the first step of our approach is to apply existing discourse parsers to theunlabeled data to generate more instances of infrequent relations, which are thenused to re-train the existing parsers. Such co-training approaches have proved tobe effective in solving similar problems in natural language processing [19] andinformation retrieval [4].There is, however, a fatal flaw relying on co-training alone. If existing dis-course parsers are poor in determining infrequent relations, the extra (re-)traininginstances of infrequent relations created from unlabeled data may not be of highquality. Indeed, adding poor quality re-training instances would exacerbate the un-5derfitting problem of infrequent relations. The second step of our approach is toapply a filtering step to the instances created from unlabeled data. The intentionbehind the filtering step is to enrich the re-training - that is, to select only the “highquality” instances to be used for re-training.The workflow of our enrichment approach will be further described in Section3.1, when it is applied to two discourse parsers, P1 and P2. 
The two parsers areinitially trained on labelled data and then are applied to unlabelled data to generatenew high-quality Training Examples for further re-training.Our experiments have shown that the performance one a discourse relation isreleted to its frequency in the dataset, and our empirical results show that agreementscore filtering can boost the performance of infrequent relations considerably, andthe confidence score of the SR-parser can also be used as a fast approximation ofthe agreement score. So far our results show that our data enrichment frameworkis not effective for frequent relations, but filtering on the node level with differentthreshold for infrequent and frequent relations has shown to be helpful to improvethis situation. We believe that with more unlabeled documents and a more precisesliding threshold for every relation, the performance can continue to improve.The specific contributions of our thesis are as follow:• We explore one form of enrichment based on the notion of agreement scorebetween two discourse parsers. Inspired by the theory on the success ofensembling for general classification [6], we choose two very different dis-course parsers, namely the CKY-like CODRA parser by [16] and the Shift-Reduce (SR) parser by [15]. While Section 3.2 will give more details onwhy these two parsers are chosen, the key is that the parsers are based onvery different algorithms and feature sets for discourse parsing. Our agree-ment score is based on the F-score measure for comparing discourse treesas proposed in [22]. Only the discourse relation instances in discourse treeswith high-enough agreement scores pass through the filter for re-trainingpurposes. Chapter 4 will show that such enrichment with agreement scoreimproves the performance of infrequent relations.• We explore another form of enrichment based on just the confidence scoreof the SR-parser. The rationale is that while the CODRA parser is generally6more accurate than the SR-parser, the SR-parser is two orders of magnitudefaster. If a high-enough threshold on the confidence score of the SR-parseris used for enrichment, Chapter 4 will investigate whether the confidencescore is a good approximation of the agreement score. If this approach issuccessful, an even larger number of unlabeled documents can be parsedrapidly to be used for re-training.The advantages of our approach include:• It is effective to boost the performance of infrequent relations.• It makes good use of unlabeled documents and does not require any extramanual labeling.• It takes advantage of two very different discourse parsers, and combinesthem together to reach better performance.• The confidence score of the SR-parser is a good approximation of agreementscore for filtering, so a larger number of unlabeled documents can be parsedrapidly to be used for re-training.Disadvantages of our approach can be:• It is not very effective on frequent discourse relations so far, while they con-sist of most of the testing instances, the total performance does not have asignificant boost.• It might require the two discourse parsers used in the co-training algorithmto be very different in order to achieve a good result, thus in the future ifbetter discourse parsers appear we can not just take any two of them to feedinto our framework, and results can be parser specific.1.4 OutlineIn Chapter 2, we will introduce existing discourse parsers, both those earlier dis-course parsers and state-of-the-art discourse parsers, including the two parsers usedin our framework. 
We will also introduce how other researchers have tried to enrich7training examples for discourse parsing, and provide more background informationand applications of co-training algorithm. In Chapter 3, we will describe the de-tails of our approach, including the workflow and enrichment process, the choicesof the two parsers, filtering at different granularity, and different ways to set thethreshold. Chapter 4 will show various experiments we have performed and theresults and analysis. And in the end, Chapter 5 will summarize the contributionsof this thesis and discuss future work.8Chapter 2Related WorkIn this chapter, we discuss some related work that has inspired our approach, orprovided us the tools and information we needed to conduct our experiments. Sec-tion 2.1 introduces several existing discourse parsers, including the two that will beused in our experiments. Section 2.2 explores how other researchers tried to tacklethe training data sparsity problem in discourse parsing, both for implicit relationclassification and infrequent relation classification. Section 2.3 provides a briefdescription of the Co-training algorithm and its application in natural languageprocessing and related areas.2.1 Existing Discourse ParsersIn the early stage of discourse parsing research, (Marcu, 1999) [21] used machinelearning techniques to build a shift-reduce discourse parser, which relies on de-cision tree classifiers to learn the rules from training data. To learn the shift-reduce actions, the discourse parser encodes five types of features: lexical (e.g,discourse cues), shallow-syntactic, similarity, operational (previous n shift-reduceoperations) and discourse sub-structural features. Though its performance is notcomparable to recent parsers, this work has inspired many recent machine learningapproaches in discourse parsing.In 2003, (Soricut et al., 2003) [27] developed the SPADE system that comeswith probabilistic models for sentence-level discourse parsing. Their segmentation9and parsing models are based on lexicosyntactic patterns (features) extracted fromthe lexicalized syntactic tree of a sentence. The discourse parser uses an optimalparsing algorithm to find the most probable rhetorical tree structure for a sentence.SPADE was trained and tested on the RST-DT corpus. This work, by showingempirically the connection between syntax and discourse at the sentence level, hasgreatly influenced all major contributions in this area ever since. However, it is lim-ited in several ways. First, SPADE does not produce a full-text (document-level)parse. Second, its parsing model makes an independence assumption between thelabel and the structure of a discourse tree constituent, and it ignores the sequentialand the hierarchical dependencies between the constituents. Third, it relies onlyon lexico-syntactic features, and it follows a generative approach to estimate themodel parameters.In 2010, (Hernault et al., 2010) [13] introduced the HILDA system that is basedon Support Vector Machines (SVMs). It feeds the lexical and syntactic featuresused in SPADE plus more context to its segmenter, which is a binary SVM classi-fier. While for the discourse parser, SVM classifiers are applied iteratively, two at atime, one used to decide which adjacent unit to merge, the other used to choose themost reasonable relation label between the selected units. 
They report improvedperformance in discourse parsing on the RST-DT corpus.On the other hand, (Subba et al., 2009) [29] proposes a shift-reduce parser thatuses Inductive Logic Programming (ILP) to learn first-order logic rules from a largeset of features for relation labeling, including the rich compositional semanticsfrom a semantic parser. This work shows that compositional semantics with otherfeatures are helpful to improve relation classification performance.However, both HILDA and the ILP-based approach mentioned above have sev-eral limitations. First, they do not differentiate between intra-sentential parsing andmulti-sentential parsing, and use a single uniform model in both scenarios. Sec-ond, they take a greedy (sub-optimal) approach to construct a discourse tree. Third,they disregard sequential dependencies between discourse tree constituents, whichhas been recently shown to be critical by [7]. Furthermore, HILDA considers thestructure and the labels of a discourse tree separately.Recent works [16] [15] have overcome these constraints and improved theperformance and efficiency of discourse parser. (Joty et al.) [17] [16] proposed10a Cocke-Kasami-Younger(CKY)-like discourse parser which tries to build a dis-course tree by applying an optimal parsing algorithm to the probabilities of all theconstituents inferred from two conditional random fields (CRFs) jointly: a linearchain dynamic-CRF for intra-sentential parsing, and a uni-variate graphical modelfor multi-sentential parsing. It combines the results returned by the two parsersto build the final discourse tree. A log of features are used to improve the classi-fier, including ngrams, lexical chains, dominance set, contextual and sub-structurefeatures, etc. The application of CRFs to discourse parsing problem has showedto improve the parsing performance at both intra and multi sentential level. How-ever, its inefficiency in terms of both speed and space makes it impractical in largeapplications.Based on Joty’s idea [17], (Feng et al., 2014) [8] proposed a linear time dis-course parser. They made several modifications in order to reduce the complexity:greedy bottom-up parsing procedure that allows linear time parsing; usage of thelinear chain CRF in both intra and multi-sentential parsing; separated modeling ofstructure and relation; novel idea of post-editing which does a second pass pars-ing to incorporate information from upper-level discourse constituents, etc. Theyalso adopted additional features that are not used in [17], which help them achievebetter accuracy.On the other hand, [15] proposed a representation learning approach for dis-course parsing which formalizes discourse tree building process as a sequence ofdecision problems by using a transition-based shift-reduce parser. It jointly learnsa linear transformation from unigrams to lower dimensional latent space represen-tation and a SVM decision classifier in this space to make shift-reduce decisions.We have reproduced the result of this parser on the RST-DT dataset, and the resultshows that it does have a great advantage over [17] and [8] in terms of efficiency,while it is the other way around concerning the performance.Furthermore, these existing parsers all suffer from the same problem of trainingdata sparsity, as it takes too much time and effort to manually annotate documentswith their discourse structures and relations. 
So in the next section, we will inves-tigate existing works in enriching training data for discourse parsing.112.2 Training Data Expansion in Discourse ParsingThe training data sparsity problem impacts several aspects of discourse parsing. Inthis section, we first introduce the one for parsing implicit relations. Experiences inexpanding training data for implicit relations have inspired us to tackle the secondproblem, training data enrichment for infrequent relations — the key issue in thisthesis.2.2.1 Training Data Expansion for Implicit RelationsA key distinction in discourse parsing is between explicit and implicit relations.The former are signaled by a cue phrase like “because” while the latter are not andconsequentially are more difficult to identify. Several studies have been conductedto tackle the problem of classifying implicit relations which do not have manyexplicit features and examples. (Zhou et al., 2010) [31] presents a method to predictthe missing connective based on a language model trained on an unlabeled corpus.The predicted connective is then used as a feature to classify the implicit relation.(Mckeown et al., 2013) [24] tackles the feature sparsity problem by aggregatingimplicit relations into larger groups. (Lan et al., 2013) [18] combines different datathrough multi-task learning. The method performs implicit and explicit relationclassification in PDTB framework as two tasks and relies on multi-task learning toobtain higher performance.[20] proposes a multi-task neural networks that combines RST-DT, PDTB andunlabeled data together through multi-task learning process, and gets performanceimprovements on implicit relations, though they only apply their scheme on thefour coarse top-level relation types. Their scheme is based on retrieving moretraining instances from unlabeled data through cue phrases. This approach of usingexplicit examples to predict implicit examples has been shown to produce mixedresults [28]. Moreover, [16] has shown that there are many more features beyondcue phrases that are useful for discourse parsing.Though training data expansion for implicit relations are different from that forinfrequent relations, we can still get a lot of insights from it about what may ormay not be effective in producing more useful training data.122.2.2 Training Data Expansion for Infrequent Relations[12] proposes a feature vector extension approach to improve classification of in-frequent discourse relations. The approach is based on word co-occurrence. Theypropose the method that first computes the co-occurrence between features usingunlabeled data and use that information to extend the feature vectors during train-ing and testing, thereby reducing the sparseness in test feature vectors. Partly be-cause a simple discourse parser was used, their approach is shown to produce onlyminimal improvements in performance.Unlike [20] and [12], we aim to exploit more advanced parsers with higher per-formance, and also keep the finer-granularity of the relations, especially focusingon the infrequent relations.2.3 Co-trainingCo-training is a semi-supervised learning technique first introduced by [4], with itsapplication in helping the search engine better classify whether a webpage is an“academic course home page”. It requires two views of the data and assumes thateach example is described using two different feature sets that provide different,complementary information about the instance. 
Ideally, the two views are condi-tionally independent (i.e., the two feature sets of each instance are conditionallyindependent given the class) and each view is sufficient (i.e., the class of an in-stance can be accurately predicted from each view alone). Co-training first learnsa separate classifier for each view using any labeled examples. The most confidentpredictions of each classifier on the unlabeled data are then used to iteratively con-struct additional labeled training data. After one thousand iterations, the classifierreaches very high accuracy with a very small amount of initial labeled web pagesas training examples.Similar co-training efforts have been found to be effective in many NLP prob-lems when only a small amount of labeled data is available. For example, [30]proposes a co-training approach for cross-lingual sentiment classification, whichleverages an available English corpus for Chinese sentiment classification by usingthe English corpus as training data. Machine translation services are used for elim-inating the language gap between the training set and test set, and English features13and Chinese features are considered as two independent views of the classificationproblem.While [19] applies co-training on predicting sentence specificity. To train theirsemi-supervised model for sentence specificity, they use a repurposed corpus of bi-nary annotations of specific and general sentences drawn from Wall Street Journalarticles originally annotated for discourse analysis, and then make use of unlabeleddata from New York Times and Wall Street Journal articles (no overlap betweenthem and the labeled examples and the testing data) for co-training.However, there’s a fatal flaw relying on co-training alone, as we have previ-ously discussed in Section 1.3. If existing discourse parsers are poor in determininginfrequent relations, the extra (re-)training instances of infrequent relations createdfrom unlabeled data may not be of high quality, and might exacerbate the under-fitting problem of infrequent relations. So in the next section, we will describeour approach that adopts the idea of co-training algorithm with a filtering step, andcombine the advantages of recent discourse parsers to select the “high quality” in-stances for re-training. And this is what we mean by enrichment — different fromsimply expanding the training set with more data, we also control the quality ofnew training instances through filtering out the unconfident ones.14Chapter 3Enrichment Approach3.1 WorkflowThe general workflow of our enrichment approach is shown in Figure 3.1, whenit is applied to two discourse parsers, P1 and P2. First we use the labeled data toprovide initial training of the two parsers. Then each parser is used to produce adiscourse tree for each unlabeled document. After that, we apply a filtering step toselect those “high quality” discourse trees, which are added to the original labeleddata to form the “enriched training data” to re-train the two parsers.3.2 Selection of Discourse ParsersIn our approach, the first parser we pick is the CODRA parser [16], which appliesa CKY parsing algorithm to probabilities inferred from two Conditional RandomFields for both intra-sentential and multi-sentential parsing. We pick the CODRAparser because of its optimal CKY parsing algorithm and its accuracy. The secondparser we pick is the SR-parser [15], which transforms the surface features into alatent space that facilitates RST discourse parsing. 
The main advantage of the SR-parser is that it can train and parse documents in almost linear time (regarding thedocument length), while the CODRA parser needs cubic time. Our choice of thetwo parsers is partly based on the fact that they rely on very different algorithmsand feature sets, which is desired by the co-training algorithm. Although another15P2	P1	 Unlabelled New training examples D-Tree 1 D-Tree 2 Filter Parse Parse Labelled Ini,al	Training Ini,al	Training Re-train Re-train Figure 3.1: Workflow of Our Enrichment Approachdiscourse parser [8] also delivers state-of-the-art performance, its approach andfeatures are very similar to CODRA’s, so we only wanted to select one of them.And due to the fact that Feng’s parser is not publicly available and our existingexperience on CODRA, we picked CODRA in our approach. Another reason ofour choice on the SR-parser is that discourse parsing of documents in general canbe slow in both training and parsing. Thus, the SR-parser is attractive in allowingus to explore the tradeoffs between accuracy and efficiency.3.3 Enrichment ProcessA co-training algorithm alone is not sufficient for the enrichment process, sinceboth the CODRA parser and the SR-parser perform poorly for infrequent relations.The extra (re-)training instances of infrequent relations created from unlabeled datamay not be of high quality. The key idea is to enrich the re-training by selectingonly the “high quality” instances. In this thesis we investigate two forms of enrich-ment, based on the agreement score between the two parsers, and the confidence16score given by each parser individually.To produce the agreement score between the two parsers, we use both parsersto parse every unlabeled document. Then we treat the parse tree produced by theCODRA parser as the ground truth, and the one produced by the SR-parser astesting, and use the F-score for comparing discourse trees proposed in [22] as theagreement score. Finally, if the agreement score passes a preset threshold, theunlabeled document is regarded as reliable and the discourse tree is added to enrichre-training.The second form of enrichment examined in this thesis is based on using theconfidence score of each parser individually. Instead of using both parsers to parsethe same document and compute the agreement score, we use the confidence scoregiven by only one parser when it produces a discourse tree for one document asthe criteria to filter new discourse trees added to re-train this discourse parser. Theadvantage of using the confidence score produced by only one parser as an approx-imation of the agreement score between the two parsers is to reduce the amount ofdiscourse parsing needed to produce and select new training instances, especiallywhen one of the parser is much faster than the other one.The SR-parser does not provide a confidence score for a discourse tree gener-ated for a document directly. It generates a discourse tree by performing a set ofactions. More specifically, each action creates a node in the tree by combining twotext spans and by selecting a discourse relation for the pair. Since each action ischosen with a certain confidence score (which technically is the distance betweenthe chosen action and the hyperplane, provided by the underlying Linear SVC al-gorithm), we use the average confidence of the actions performed to create the treeas the confidence score of the entire tree. 
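To make the document-level filtering step concrete, the following sketch summarizes the two forms of enrichment just described. The parser wrappers and their methods (parse, actions, labeled_spans) are hypothetical names introduced for illustration; neither CODRA nor the SR-parser exposes exactly this interface.

```python
# Minimal sketch of document-level filtering for training-data enrichment.
# `codra` and `sr_parser` are assumed to be wrappers around the two parsers;
# their methods are illustrative, not the parsers' real APIs.

def tree_fscore(reference, predicted):
    """Agreement score: a constituent-level F-score in the spirit of [22].
    `labeled_spans()` is a hypothetical helper returning a set of
    (start_edu, end_edu, relation) triples for a discourse tree."""
    ref, pred = set(reference.labeled_spans()), set(predicted.labeled_spans())
    if not ref or not pred:
        return 0.0
    overlap = len(ref & pred)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def tree_confidence(sr_tree):
    """Tree-level confidence of the SR-parser: the average margin (distance to
    the Linear SVC hyperplane) of the shift-reduce actions that built the tree."""
    margins = [action.margin for action in sr_tree.actions]
    return sum(margins) / len(margins)

def enrich(unlabeled_docs, codra, sr_parser, threshold, use_agreement=True):
    """Select discourse trees of unlabeled documents that pass the filter;
    these are added to the labeled data before re-training the parsers."""
    selected = []
    for doc in unlabeled_docs:
        sr_tree = sr_parser.parse(doc)
        if use_agreement:
            # Treat CODRA's tree as reference and the SR-parser's as prediction.
            score = tree_fscore(reference=codra.parse(doc), predicted=sr_tree)
        else:
            # Faster approximation: skip CODRA and use the SR-parser's confidence.
            score = tree_confidence(sr_tree)
        if score >= threshold:
            selected.append((doc, sr_tree))
    return selected
```

Note that in the agreement-score form both parsers must parse every unlabeled document, while in the confidence-score form only the much faster SR-parser does, which is what makes the approximation attractive.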
If this approach is successful, an evenlarger number of unlabeled documents can be parsed rapidly for re-training.As for the CODRA parser, it provides confidence score both for a relation labelat one node and for the structure of an entire discourse tree. This give us moreflexibility to choose which score to use and at which level to filter the data. Similarto the SR-parser, we can use the confidence score for the entire tree to filter newtraining instances at the document. Also, if we look at the confidence score for thelabel at one node, we can try to filter new training examples at a finer granularityas described in the next section.173.4 Filtering at Finer GranularityThe filtering process described above is performed at the document level. Thatis, we calculate an agreement/confidence score for an entire discourse tree of adocument, and if the score passes the threshold, this entire discourse tree alongwith every node on it will be added to the new training instances, even thoughsome nodes on this tree may have low scores. In this case, these nodes with lowscores are very likely to harm the performance of the discourse parser when addedto its new training instances.In order to reduce such noise brought by the low-score nodes in a high-scoretree, we seek to filter at a finer granularity — the node level. That is, we com-pute the confidence/agreement score for each specific relation label of a node, andcompare it to the threshold. If the confidence score of one node passes the thresh-old, this node will be added to the new training set, otherwise the node will bediscarded, no matter if the entire discourse tree has a high score.For example, as shown below in Figure 3.2, the threshold is set to 0.5 for bothcases. In Figure 3.2(a), the confidence score of the discourse tree passes the thresh-old, every node on this tree will be added to the new training instances, even thoughthe confidence scores of some nodes are below the threshold. While in Figure3.2(b), each node’s confidence score is compared to the threshold, and only nodeswith confidence score higher than the threshold will be added to new training in-stances. So from the same discourse tree, the nodes we select to add to new traininginstances can be different under different filtering granularity.In our approach, we will use both types of filtering under different situations.The advantage of filtering at a finer granularity is obvious, this way we can pick thehigh quality training instances more precisely, avoiding some “noisy nodes” thathide within a high quality discourse trees. However, it is not always possible tobreak a discourse tree and add only some of its unconnected pieces for re-training.For examples, the SR-parser will need the entire tree structure to do the trainingfor document-level parsing. So we will apply node level filtering for the CODRAparser and document level filtering for the SR-parser.Since node level filtering is possible for CODRA, we can have a more preciseway to control the threshold. That is, we can set different thresholds for different18Threshold:		0.5 Filter New	Training	Data	 But he added: “Some people use the purchasers’ index as a leading indicator, some use it as a coincident indicator. 
But the thing it’s supposed to measure  - manufacturing strength - it missed altogether last month.” Attribution: 0.9 Contrast: 0.3 Contrast: 0.4 Same-Unit  Elaboration: 0.8 Confidence	Score	Of	This	Tree:		0.7 Filtering	at	Document	Level (a) At Document LevelThreshold:		0.5 Filter New	Training	Data	 But he added: “Some people use the purchasers’ index as a leading indicator, some use it as a coincident indicator. But the thing it’s supposed to measure  - manufacturing strength - it missed altogether last month.” Attribution: 0.9 Contrast: 0.3 Contrast: 0.4 Same-Unit  Elaboration: 0.8 Filtering	at	Node	Level (b) At Node LevelFigure 3.2: Filter at Different Granularitytypes of nodes. More specifically, when the parser labels a node with one rela-tion, depending on what relation it is, we can use different thresholds to determinewhether this node can be added to the new training instances. More discussionabout why we want to set different thresholds for different relations and how to setit can be found in Section 4.7. But generally, we need to be more strict with addingnew training instances of frequent relations, while less strict with adding thoseof infrequent relations. A simple approach is to divide all the relations into two19groups, frequent relations and infrequent relations, according to their frequency inthe gold standard dataset. Then we can use one threshold for the frequent relations,and a different threshold for those infrequent relations. If more precise control ofthe threshold is desired, a sliding threshold for every different relation can also beapplied to node-level filtering.20Chapter 4Empirical Evaluation4.1 DatasetsIn this thesis, we use the RST-DT dataset as the gold standard labeled data. It con-sists of 385 documents selected from Penn Treebank [23], which are all originallyarticles from the Wall Street Journal. Those 385 documents in the RST-DT datasetare divided into two fixed groups: the training set consisting of 347 documents,and the test set 38 documents. For results reported in this thesis, we used those347 documents as the initial training set. The remaining 38 documents made upthe test set used to evaluate the performance of the parser, which is re-trained usingthe enriched dataset.For the unlabeled documents, we used 2000 Wall Street Journal articles fromthe Penn Treebank dataset [23]. In other words, the gold standard dataset andunlabeled dataset are from the same source; but there is no document belonging toboth.In discourse parsing, there are various performance measurements, such as onthe structure (i.e., hierarchical spans) and the labels (i.e., nuclearity and relationclassification). The results reported here focuses on relation classification. Toevaluate the parsing performance based on the gold standard, we use the standardF-score measure, which is the harmonic mean of precision and recall [1]. Morespecifically, we use the F-score measure for comparing discourse trees, as proposedin [22].214.2 The Underfitting Hypothesis: Performance vsFrequencyAs for the discourse relations, we examine all the 18 coarse-grained relations in-troduced in Section 1.1. Figure 4.1 shows the most frequent and the least frequentfive relations in all the 385 documents in the RST-DT dataset. We can see that themost frequent relations can be two order of magnitude higher in frequencies thanthose of the infrequent ones. 
For example, the “Elaboration” relation makes upover 31% of all the nodes in the entire dataset, while the “Topic Change” relationaccounts for less than 0.5%.0	5	10	15	20	25	30	35	Elabora-on	Joint	A3ribu-on	Same-Unit	Contrast	 …	Condi-on	TextualOrganiza-on	Topic-Change	Manner-Means	Summary	Frequency(%)	Figure 4.1: Distribution of the Most Frequent and the Least Frequent 5 Rela-tions in RST-DTGiven the large disparity in relation frequencies, we next examine whether in-frequent relations suffer from worse performance than the frequent relations, i.e.,the underfitting hypothesis of a lack in training data of the infrequent relations.Here we used the 347 documents to train the SR-parser, and then tested the parseron the 38 documents. Figure 4.2 shows the performance of each relation (i.e.,F-score) versus its frequency. We can see that for each relation, its performancehas high correlation with its frequency. Indeed, the Pearson correlation coefficientis 0.87, validating the underfitting hypothesis. This suggests that it would be a22reasonable approach to boost the performance of infrequent relations by enrichingtheir training instances.0	10	20	30	40	50	60	70	80	90	0	 1000	 2000	 3000	 4000	 5000	 6000	 7000	 8000	Performance(%)	Number	of	Instances	Performance	v.s.	Frequency	Pearson	Correla+on	=	0.87290512	Figure 4.2: Performance versus Frequency for Each Relation4.3 Effect of Enrichment on Infrequent RelationsThe first form of enrichment examined below is based on the agreement scorebetween the two parsers, as discussed in the previous section. Table 4.1 belowshows the improvements on the F-scores from the SR-parser of the top-8 infrequentrelations, based on a threshold of 0.5 in the filtering step. The different columns ofthe table show an increasing number of unlabeled documents used in enrichment,from 500 documents to 2000 documents. Figure 4.3 shows the relative F-scoreimprovements across all the 18 relations, ranked from left to right in ascendingorder of frequency. As a specific example, the F-score of “Topic Change” improves5.88% with 500 documents, and 13.15% with 2,000 documents.As shown in the table and the figure, there is a positive effect on performanceby enrichment based on the agreement score. The larger the number of unlabeleddocuments used, the higher is the gain in performance for the top-8 infrequentrelations. The exact magnitude of the gain varies.23Relation 500 1000 1500 2000Summary 2.13 2.80 3.91 5.16Manner-Means 16.62 21.13 21.61 22.08Topic-Change 5.88 7.21 12.88 13.15TextualOrganization 1.42 3.31 7.49 8.14Condition 3.91 8.69 12.44 18.55Comparison 3.19 6.06 6.95 10.42Evaluation 2.83 4.76 8.09 10.98Topic-Comment 2.69 4.55 6.73 9.48Table 4.1: Relative F-scores Improvements (%) on the Top-8 Infrequent Re-lations-10	-5	0	5	10	15	20	25	Summary	Manner-Means	Topic-Change	TextualOrganiza>on	Condi>on	Comparison	Evalua>on	Topic-Comment	Enablement	Cause	Temporal	Background	Explana>on	Contrast	Same-Unit	AGribu>on	Joint	Elabora>on	Rela>ve	F-score	improvements	(%) Rela>ons	(Ranked	by	frequency,	from	infrequent	to	frequent) Rela>ve	F-score	improvements	on	different	rela>ons	#500	#1000	#1500	#2000	Figure 4.3: Relative F-score Improvements on Different RelationsSo far we have described data enrichment in terms of the number of unlabeleddocuments. The more detailed analysis is to examine the actual number of traininginstances created from the unlabeled documents for each relation. 
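A small sketch of this per-relation bookkeeping is given below; the tree accessor is a hypothetical name, not part of either parser's API.

```python
from collections import Counter

def relation_counts(trees):
    """Tally relation labels over the internal nodes of a set of discourse trees.
    `internal_nodes()` is a hypothetical accessor yielding nodes that carry a
    `.relation` attribute."""
    counts = Counter()
    for tree in trees:
        counts.update(node.relation for node in tree.internal_nodes())
    return counts

def enrichment_percentage(original_trees, added_trees):
    """Per-relation increase in training instances, expressed as a percentage
    of the counts in the original labeled training set (as in Figure 4.4)."""
    original = relation_counts(original_trees)
    added = relation_counts(added_trees)
    return {rel: 100.0 * added[rel] / count
            for rel, count in original.items() if count > 0}
```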
Figure 4.4 showsthe actual number of training instances added for each relation, represented as apercentage relative to the frequency of the instances in the original training dataset.For example, for the “Condition” relation, there is a 35% increase in the actual24number of instances with 500 documents, and this figure jumps to over 150% with2,000 documents. With these additional training instances, the gain in F-scorefor the “Condition” relation is 18.55% from Table 4.1. For the “Topic Change”relation, it is a pleasant surprise that there is a relative F-score improvement of13.15% based on about 50% more training instances.0	50	100	150	200	250	300	350	400	450	Summary	Manner-Means	Topic-Change	TextualOrganiza@on	Condi@on	Topic-Comment	Evalua@on	Comparison	Enablement	Temporal	Cause	Background	Explana@on	Contrast	Same-Unit	AIribu@on	Joint	Elabora@on	Percentage	increased	(%)	Actual	number	of	training	instances	enriched	#500	#1000	#1500	#2000	Figure 4.4: Actual Number of Training Instances Enriched (%)The reader may wonder with 2000 more unlabeled documents, why there isonly a modest increase in training instances for some of the infrequent relations.This increase of course depends on the filtering threshold. One temptation basedon Table 4.1 is to lower the threshold to admit more training instances. This leadsus to one of the most striking features of Figure 4.3 on how the relations are sepa-rated into two clusters. While there are improvements for the infrequent relations,there is no gain, or even small negative impact, on the frequent relations. Thisphenomenon clearly shows that co-training without filtering can be harmful to per-formance. The filtering step is essential to guard against adding “false positive”instances for re-training. If the filtering threshold is set too low, then the frequentrelations may suffer. On the other hand, if the filtering threshold is set too high,then only few training instances will be added to benefit the infrequent relations.254.4 The Impact of the Filtering ThresholdThe results presented so far are based on a filtering threshold of 0.5. To examinethe impact of the filtering threshold on performance, we vary the threshold. Figure4.5(a) shows how the relative F-score improvement changes with a filtering thresh-old from 0.3 to 0.7 aggregated across all the 18 relations. The results shown in thefigure are based on all the instances in the entire dataset. In other words, the perfor-mance of the frequent relations, due to their much higher frequencies, completelydominates the performance of the infrequent ones. Thus, Figure 4.5(b) shows acorresponding graph aggregated across only the top-8 infrequent relations.Compared with the filtering threshold of 0.5 shown previously, there is furtherimprovement when the threshold is raised to 0.6 and 0.7. Particularly from Figure4.5(b), there is considerable improvement across the top-8 infrequent relations.Interestingly, the peak performance gain occurs with the threshold of 0.6 – not 0.7.This shows that when the threshold is raised from 0.6 to 0.7, the reduction in thenumber of documents passing through the filter hurts the gain in performance.The reader may wonder whether this kind of performance improvements willcontinue to grow under the effective threshold with more unlabeled resources addedin. To explore the answer to this question, we employ the New York Times textcorpus [26] by adding a small subset of its documents to our existing unlabeleddocuments. 
Then we conduct the same experiment with the expanded unlabeledresources, and the result in Figure 4.6 shows that the performance will continue toimprove at a lower rate and finally tend to stabilize.Next let us examine the situation when the filtering threshold is reduced from0.5 to 0.4 and 0.3. Aggregated across all the 18 relations, Figure 4.5(a) clearlyshows that there is performance loss. Consistent with the performance loss shownin Figure 4.3 for the frequent relations, this is the situation when the extra train-ing instances passing through the filter introduce too much noise and hurt overallperformance. Interestingly, Figure 4.5(b) shows that there is always a positive per-formance gain for the top-8 infrequent relations, regardless of whether the filteringthreshold is 0.3 or 0.7. This suggests that infrequent relations and frequent rela-tions may need different threshold. We will follow up on this heuristic in Section4.7.26-8.0000		-6.0000		-4.0000		-2.0000		0.0000		2.0000		4.0000		6.0000		0	 500	 1000	 1500	 2000	Rela/ve	F-score	improvements	(%)	Number	of	increased	training	documents	Rela/ve	F-score	improvements	with	different	filtering	threshold	0.7	0.6	0.5	0.4	0.3	(a) Across All the 18 Relations0	2	4	6	8	10	12	14	0	 500	 1000	 1500	 2000	Rela-ve	F-score	improvements	(%) Number	of	increased	training	documents		Rela-ve	F-score	improvements	of	top-8	infrequent	rela-ons	with	different	filtering	threshold		0.7	0.6	0.5	0.4	0.3	(b) Across the Top-8 Infrequent RelationsFigure 4.5: Changes in Relative F-score with Varying Filtering AgreementScore Threshold270.0000		2.0000		4.0000		6.0000		8.0000		10.0000		12.0000		14.0000		16.0000		0	 500	 1000	 1500	 2000	 4000	 6000	 8000	Rela.ve	F-score	improvements	(%) Number	of	increased	training	documents		Rela.ve	F-score	improvements	of	top-8	infrequent	rela.ons	with	more	unlabeled	documents	0.7	0.6	0.5	Figure 4.6: The Impact of More Unlabeled Resources4.5 Using the Confidence Score to Approximate theAgreement ScoreAs discussed in Section 3, we explore a second form of enrichment. The agreementscore reported so far requires the use of both the CODRA parser and the SR-parser.The former takes cubic time and the latter takes linear time. The idea here is toassess whether the confidence score generated from the faster SR-parser can beused to approximate the agreement score. If this approach is successful, an evenlarger number of unlabeled documents can be parsed rapidly to be used for re-training.The first step of the assessment is to calculate the correlation between theagreement score and the confidence score of the SR-parser. As shown in Figure4.7, which plots the correlation for all the 2,000 unlabeled documents, there is aweak correlation between the two scores. While the overall correlation is 0.36, it ispromising to see that when the confidence score becomes higher (e.g., greater than1.5), the correlation with the agreement score becomes stronger. It is also impor-tant to note that there is a significant drop in the number of documents passing theconfidence score threshold of 2.Corresponding to the two graphs in Figure 6, the two graphs in Figure 8 showthe performance change using the confidence score of the SR-parser with varying280	0.2	0.4	0.6	0.8	1	1.2	0	 0.5	 1	 1.5	 2	 2.5	Agreement	score Average	Distance	(Confidence	score) Agreement	v.s.	Confidence Correla'on:	0.35676029	Figure 4.7: Agreement Score vs Confidence Scorefiltering threshold. 
Figure 4.8(a) shows how the relative F-score changes with afiltering threshold from 0.5 to 2 aggregated across all the 18 relations. Like inFigure 4.5(a) before, the performance of the frequent relations, due to their su-perior frequencies, completely dominates the performance of the infrequent ones.Thus, Figure 4.8(b) shows a corresponding graph aggregated across only the top-8infrequent relations.In Figure 4.8(b), the peak performance gain occurs when the confidence scorethreshold is 1.5. Even when the confidence score is lowered to 1.0, the perfor-mance gain is still reasonable with 2,000 documents. But somewhat surprisingly,the performance gain drops significantly when the confidence score threshold israised to 2. This can be explained by looking more closely back at Figure 4.7.The confidence score threshold of 2 is too restrictive and very few unlabeled doc-uments satisfy it; hence, the actual number of additional documents admitted forre-training is significantly reduced.A first glance of Figure 4.8(a) seems to suggest that using the confidence scoreof the SR-parser is ineffective. The best performance gain across all the 18 relationsis barely above 1%, which is smaller than the corresponding gain in Figure 4.5(a).This ineffectiveness is completely due to the behavior of the frequent relations.However, Figure 4.8(b) paints a rather different picture. For the top-8 infrequent29-12	-10	-8	-6	-4	-2	0	2	0	 500	 1000	 1500	 2000	Rela.ve	F-score	improvements	(%) Number	of	documents	parsed	for	enrichment Rela.ve	F-score	improvements	with	different	filter	threshold	of	confidence	score	0.5	1	1.5	2	(a) On All 18 Relations-4	-2	0	2	4	6	8	10	12	0	 500	 1000	 1500	 2000	Rela.ve	F-score	improvements	(%)	Number	of	documents	parsed	for	enrichment	Rela.ve	F-score	improvements	with	different	filter	threshold	of	confidence	score	on	top8	infrequent	rela.ons	0.5	1	1.5	2	(b) On Top-8 Infrequent RelationsFigure 4.8: Overall F-score Improvements with Different Enriched DataQuality via Confidence Score30relations, there is a peak performance gain of about 10% with 2,000 documents.This gain is almost as good as the peak performance gain shown in Figure 4.5(b)with 2,000 documents. Given that the SR-parser is significantly faster than theCODRA parser, it is promising to use the confidence score of the SR-parser toapproximate the agreement score, so that a larger number of unlabeled documentscan be used for enrichment.4.6 Adding Enriched Training Instances in an IterativeMannerThe results shown so far are based on one round of re-training. As shown in Figure1, data enrichment can be done iteratively. The table below shows the relative F-score improvement on the top-8 infrequent relations when enrichment is done inincrements of 500 documents. Here we process 500 unlabeled documents, re-trainthe SR-parser with the documents passing through the filter, then process the nextbatch of 500 documents, and so on.# of documents Basic Iterative (batches of 500 documents)1000 4.05 4.611500 6.65 7.472000 7.90 8.95Table 4.2: Relative F-scores Improvements (%) on the Top-8 Infrequent Re-lationsThe results shown in the table used the confidence score of 1 as the filteringthreshold. The first column is precisely the curve in Figure 4.8(b) for the confi-dence score of 1. The first row in the table, for example, shows that doing re-training twice (500 documents each time) boosts the performance when comparedwith re-training done once at the end. 
4.7 Filtering at a Finer Granularity

All the filtering experiments above are done at the document level. That is, we calculate an agreement or confidence score for the entire discourse tree of a document, and if that score passes the threshold, the whole discourse tree, along with every node in it, is added to the new training instances, even though some nodes in the tree may have low scores. In this section, we therefore explore the idea of filtering at a finer granularity, i.e., at the node level. Because of the different mechanisms of the two parsers in our framework, we picked CODRA for this experiment: with CODRA it is straightforward to filter discourse structures at the node level and to train a new model on partial discourse structures, whereas we could not find a direct way to do this with the SR-parser.

In this experiment, we performed both doc-level filtering and node-level filtering under the same setting: we use the confidence score of CODRA itself to filter new candidate training examples, the threshold is set to 0.5, and 500 unlabeled documents are used. Doc-level filtering works as described above; for node-level filtering, every node with a confidence score above the threshold is added to the new training set used to retrain CODRA, regardless of whether the confidence score of the document's whole discourse tree passes the threshold. Results of the two experiments are shown in Table 4.3. Filtering at the node level has an advantage over filtering at the doc level for most discourse relations. It is also noteworthy that frequent relations are generally unharmed by node-level filtering, unlike doc-level filtering.

With this finer-grained control, we can also do more with the filtering threshold. Since each node's score is compared against a threshold to decide whether that node is added to the new training set, we can set different thresholds for different types of relations. How to choose these per-relation thresholds is still to be explored, but a small experiment with two different thresholds, one for infrequent and one for frequent relations, already shows a small performance increase. With more carefully chosen thresholds for different relations, greater improvements can be expected from such a varying threshold; a sketch of node-level, per-relation filtering is given after Table 4.3.

Relation               Doc-level    Node-level
Summary                    4.265         6.811
Cause                      1.827         1.965
Manner-Means               8.677        12.581
Temporal                  -0.296        -0.246
Topic-Change               1.201         1.801
Background                -0.209         0.105
TextualOrganization        5.669         7.122
Explanation               -0.317        -0.106
Condition                  6.656         7.488
Contrast                  -0.066         0.131
Comparison                 4.527         4.527
Same-Unit                 -0.109         0.145
Evaluation                 1.696         1.993
Attribution               -0.120         0.052
Topic-Comment              1.360         2.039
Joint                     -0.211        -0.015
Enablement                 2.999         3.314
Elaboration               -0.058         0.014

Table 4.3: Relative F-score Improvements (%) at Different Filtering Granularities
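The sketch below illustrates one way node-level filtering with per-relation thresholds could be organized. The tree representation, field names, and threshold values are assumptions made for the example only; they do not reflect CODRA's internal data structures or any tuned settings.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """A node of an automatically parsed discourse tree (illustrative only)."""
    relation: Optional[str]           # e.g. "Elaboration"; None for leaf EDUs
    confidence: float = 0.0           # parser's confidence for this node
    children: List["Node"] = field(default_factory=list)

# Per-relation thresholds. One plausible assignment, motivated by the earlier
# observation that frequent relations are more easily hurt by noisy additions,
# is to make their thresholds stricter; the values here are purely illustrative.
DEFAULT_THRESHOLD = 0.5
THRESHOLDS = {"Elaboration": 0.8, "Attribution": 0.8, "Joint": 0.8,
              "Manner-Means": 0.4, "Summary": 0.4, "Condition": 0.4}

def select_nodes(tree: Node) -> List[Node]:
    """Node-level filtering: collect every internal node whose confidence
    passes its relation-specific threshold, regardless of whether the
    document-level score passes any threshold."""
    selected = []
    stack = [tree]
    while stack:
        node = stack.pop()
        if node.relation is not None:
            threshold = THRESHOLDS.get(node.relation, DEFAULT_THRESHOLD)
            if node.confidence >= threshold:
                selected.append(node)
        stack.extend(node.children)
    return selected
```

The selected partial structures would then be added to the training set used to retrain CODRA, exactly as in the node-level experiment above but with a relation-dependent rather than a single global threshold.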
Chapter 5

Conclusion

As the number of applications of discourse parsing in NLP is constantly growing, any improvement in discourse parsing performance can have considerable impact. In this thesis, we first validate the underfitting hypothesis, i.e., the less frequent a relation is in the training data, the poorer the performance on that relation. This is a phenomenon that applies to most discourse parsers. One solution is, of course, to create more labeled data, ideally for all the relations. However, given the resources required for manually creating labeled data for discourse parsing, we explore in this thesis a training data enrichment framework that relies on co-training of the CODRA parser and the SR-parser on unlabeled documents. We also investigate using both the agreement score and the confidence score of the SR-parser to filter away "low quality" documents, whose presence in the re-training can hurt performance. Our empirical results show that agreement-score filtering can boost the performance of infrequent relations considerably. They also show that, for infrequent relations, the confidence score of the SR-parser can be used as a fast approximation of the agreement score.

So far our results show that our data enrichment framework is not effective for frequent relations. In ongoing work, we are studying how to augment the framework to boost the performance of the frequent relations as well; the varying threshold of Section 4.7 may be a promising direction. In the future, we plan to apply our framework to enrich training data for discourse structure and nuclearity analysis, and also to apply it to other discourse datasets labeled in different ways (e.g., PDTB).