UBC Theses and Dissertations

A semi-joint neural model for sentence level discourse parsing and sentiment analysis Nejat, Bita 2017

A Semi-joint Neural Model for Sentence Level Discourse Parsing and Sentiment Analysis

by

Bita Nejat

B. Science, The University of British Columbia, 2014

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

August 2017

© Bita Nejat, 2017

Abstract

Discourse Parsing and Sentiment Analysis are two fundamental tasks in Natural Language Processing that have been shown to be mutually beneficial. In this work, we design and compare two Neural Based models for jointly learning both tasks. In the proposed approach, we first create a vector representation for all the segments in the input sentence. Next, we apply three different Recursive Neural Net models: one for discourse structure prediction, one for discourse relation prediction and one for sentiment analysis. Finally, we combine these Neural Nets in two different joint models: Multi-tasking and Pre-training. Our results on two standard corpora indicate that both methods result in improvements in each task but Multi-tasking has a bigger impact than Pre-training.

Lay Summary

In Natural Language Processing, gathering and processing human-labeled data requires a considerable amount of time, money and resources. As a result of not having an abundance of such data, learning complex NLP tasks is a challenge. Therefore, being able to transfer and apply the knowledge learned in one task to another relevant task can be very beneficial.

In this work, we study two fundamental and closely related NLP tasks, Discourse Parsing and Sentiment Analysis, and explore two ways in which knowledge learned in one task could be transferred to the other task.
Our research confirms that the knowledge-sharing between these two tasks helps boost the performance of each one individually.

Preface

This dissertation is an original intellectual product of the author, Bita Nejat. I conducted all the experiments and wrote the manuscript. Dr. Raymond Ng and Dr. Giuseppe Carenini were the supervisory authors of this project and were involved throughout the project in concept formation and manuscript editing. This thesis is a paper-based thesis.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
  1.1 Discourse Parsing
  1.2 Sentiment Analysis
  1.3 Motivation
  1.4 Approach and Contributions
  1.5 Outline
2 Related Work
  2.1 Discourse Parsing and Sentiment Analysis
  2.2 Learning Text Embeddings
  2.3 Joint models
3 Corpora
4 Joint Models
  4.1 Learning Text Embeddings
  4.2 Neural Net Models
  4.3 Individual Models
  4.4 Joint Models
5 Training and Evaluating The Models
6 Discussion
  6.1 Comparison to previous Sentiment Analyzers
  6.2 Comparison to previous Discourse Parsers
7 Future Work
  7.1 Improving Corpora
  7.2 More Data
  7.3 Recursive Neural Nets coupled with Recurrent Neural Nets
  7.4 Document Level Discourse Parsing and Sentiment Analysis
  7.5 Simultaneous Pre-training and Multi-tasking
8 Conclusions
Bibliography

List of Tables

Table 3.1 Distribution of RST-DT relations
Table 5.1 Discourse Parsing results based on manual discourse segmentation
Table 5.2 Contrastive Relation Prediction results under different training settings
Table 5.3 Sentiment Analysis over Discourse Tree
Table 7.1 CODRA’s Discourse Parsing results at sentence-level and document-level, based on manual and automatic segmentation

List of Figures

Figure 1.1 The Discourse Tree of a sentence from Sentiment Treebank dataset
Figure 1.2 The Sentiment annotation (over Discourse Tree structure) of a sentence from Sentiment Treebank dataset
Figure 2.1 Three architectures for modelling text with multi-task learning. (Figure adapted from [18])
Figure 2.2 Architecture of Multi-task Neural Networks for Discourse Relation Classification (Figure adapted from [19])
Figure 2.3 Illustration of the architecture of the basic model. (Figure adapted from [29])
Figure 3.1 Distribution of sentiment labels over Sentiment Treebank sentences at all nodes of created Discourse Trees
Figure 4.1 The Sentiment Neural Compressor
Figure 4.2 The Discourse Neural Compressor
Figure 4.3 The Discourse Structure Neural Net
Figure 4.5 Multi-tasking Network
Figure 4.4 Multi-tasking
Figure 4.6 Using the weights of one network as a form of pre-training for another network

Acknowledgments

I would like to offer my wholehearted gratitude to everyone who has inspired or supported my work during my master's studies.

Special thanks to my supervisors, Dr. Giuseppe Carenini and Dr. Raymond Ng. The past 3 years have been extremely challenging for me and you have been the most supportive, helpful and considerate supervisors that I could wish for. I would like to thank Julieta Martinez and Issam Laradji for their insightful comments on Neural Net libraries and Machine Learning techniques. Finally, a big thank you to my family for their unconditional love and support.

Chapter 1
Introduction

With the rapid growth of the amount of text data on the Internet, the need for methods to analyze these texts has grown as well. This thesis focuses on studying two fundamental NLP tasks, Discourse Parsing and Sentiment Analysis.
The importance of these tasks and their wide applications (e.g., [10], [31]) have generated much interest in studying both, but no method yet comes close to human performance in solving them.

It has been known that they are mutually beneficial, meaning that Discourse information could be used to improve Sentiment Analysis and likewise, knowing the sentiment of text spans within a sentence or document can help with improving Discourse Parsing. Our project builds on these previous findings and borrows ideas from Transfer learning to create a joint model that improves both Discourse Parsing and Sentiment Analysis.

1.1 Discourse Parsing

“Clauses and sentences rarely stand on their own in an actual discourse; rather, the relationship between them carries important information that allows the discourse to express a meaning as a whole beyond the sum of its individual parts. Discourse analysis seeks to uncover this coherent structure.” [13]

Discourse Parsing is the task of building a hierarchical tree structure over a sentence or a document, the leaves of which are clauses (also called Elementary Discourse Units, or EDUs) and whose nodes correspond to the juxtaposition of their children’s text spans.

Figure 1.1: The Discourse Tree of a sentence from Sentiment Treebank dataset

A Discourse Tree is represented as a set of constituents R[i, m, j] where i ≤ m < j. R in this representation refers to the rhetorical relation that holds between the Discourse Unit containing EDUs i to m and the one containing EDUs m+1 to j.

The relation R also specifies nuclearity.
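As a rough illustration of the constituent notation R[i, m, j], the following Python sketch stores one constituent and checks the constraint i ≤ m < j. It is not taken from the thesis: the class name `Constituent`, the 1-based EDU numbering and the example label are assumptions made for this sketch, and nuclearity is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constituent:
    """One constituent R[i, m, j] of a Discourse Tree (EDU indices assumed 1-based)."""
    relation: str  # rhetorical relation label, e.g. "Contrast"
    i: int         # first EDU of the left Discourse Unit
    m: int         # last EDU of the left Discourse Unit
    j: int         # last EDU of the right Discourse Unit

    def __post_init__(self):
        # The definition requires i <= m < j, so both Discourse Units are non-empty.
        if not (self.i <= self.m < self.j):
            raise ValueError(
                f"need i <= m < j, got i={self.i}, m={self.m}, j={self.j}")

    def left_span(self):
        """EDUs i..m (inclusive)."""
        return (self.i, self.m)

    def right_span(self):
        """EDUs m+1..j (inclusive)."""
        return (self.m + 1, self.j)


# A two-EDU sentence whose two units are joined by a single relation.
c = Constituent("Contrast", i=1, m=1, j=2)
```

A whole Discourse Tree would then simply be a collection of such constituents, one per internal node.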
Nuclei are the core parts of the relation and Satellites are the supportive ones. R can take one of the following forms:

• Satellite-Nucleus (SN): First Discourse Unit is Satellite and second Discourse Unit is Nucleus.
• Nucleus-Satellite (NS): First Discourse Unit is Nucleus and second Discourse Unit is Satellite.
• Nucleus-Nucleus (NN): Both Discourse Units are Nuclei.

In this approach, relation identification and nuclearity assignment are done simultaneously. Figure 1.1 shows the Discourse Tree of a sample sentence. In this sentence, the Discourse Unit “There are slow and repetitive parts,” holds a “Contrast” relationship with “but it has just enough spice to keep it interesting.”. Furthermore, we can see that the former Discourse Unit is the Satellite of the relation and the latter is the Nucleus.

Discourse Parsing is a critical task in NLP because previous work has shown that information contained in the resulting Discourse Tree can benefit many other NLP tasks including but not restricted to automatic summarization (e.g., [10], [23], [20]), machine translation (e.g., [25], [11]) and question answering (e.g., [34]).

In contrast to traditional syntactic and semantic parsing, Discourse Parsing can generate structures that cover not only a single sentence but also multi-sentential text. However, the focus of this thesis is on sentence level Discourse Parsing, leaving the study of extensions to multi-sentential text as future work.

1.2 Sentiment Analysis

Given a piece of text, the task of Sentiment Analysis is to assign to it a label representing the contextual polarity of the text. Consider the following two examples:

• “The whole cast looks to be having so much fun”
• “It’s Robert Duvall.”

The sentiment for the first phrase is probably “Very Positive”, while the sentiment label for the second phrase is probably “Neutral”.
Analyzing the overall polarity of a sentence is a challenging task due to the ambiguities that can be introduced by combinations of words and phrases. For example, in the movie review excerpt shown in Figure 1.2, the phrase “There are slow and repetitive parts” has a negative sentiment. However, when it is combined with the positive phrase “but it has just enough spice to keep it interesting”, it results in an overall positive sentence.

Figure 1.2: The Sentiment annotation (over Discourse Tree structure) of a sentence from Sentiment Treebank dataset

With a wide range of applications, including Review Analysis, Poll predictions and Recommender Systems [8], Sentiment Analysis has been one of the most actively studied areas of NLP, yet highly accurate Sentiment Analysis of a domain independent piece of text remains an ongoing research problem.

1.3 Motivation

It has been suggested that the information extracted from Discourse Trees can help with Sentiment Analysis [2] and likewise, knowing the sentiment of two pieces of text might help with the identification of discourse relationships between them [16]. For instance, taking the sentence in Figure 1.1 as an example, knowing that the two text spans “There are slow and repetitive parts” and “but it has just enough spice to keep it interesting” are in a Contrast relationship to each other also signals that the sentiment of the two text spans is less likely to be of the same type.
Likewise, knowing that the sentiment of the former text span is “very negative”, while the sentiment of the latter text span is “very positive”, helps to narrow down the choice of discourse relation between these two text spans to the Contrastive group, which contains the relations Contrast, Comparison, Antithesis, Antithesis-e, Consequence-s, Concession and Problem-Solution.

To the best of our knowledge there is no previous work that learns both of these tasks in a joint model, using deep learning architectures.

1.4 Approach and Contributions

The main contribution of this thesis is to address this gap by investigating how the two tasks can benefit from each other at the sentence level within a deep learning joint model. More specific contributions include:

(i) The generation of embeddings for each task, obtained by compressing generic skip-thought vectors [15].
(ii) The development of three independent recursive neural nets: two for the key sub-tasks of discourse parsing, namely structure prediction and relation prediction; the third net for sentiment prediction.
(iii) The design and experimental comparison of two alternative neural joint models, Multi-tasking and Pre-training, that have been shown to be effective in previous work for combining other tasks in NLP ([6], [7], [18]).

Our results indicate that a joint model performs better than individual models in either of the tasks, with Multi-tasking outperforming Pre-training.

1.5 Outline

Chapter 2 describes the background and previous work done on Discourse Parsing, Sentiment Analysis and ways of jointly training multiple tasks. In Chapter 3, we talk about the two datasets that we have used to train and test our models. We also discuss their properties and any preprocessing work needed to prepare them for our specific tasks. Chapter 4 discusses our framework and all of its subparts in detail. We describe how we learn the embeddings for text spans, our individual Neural Net models and finally our joint models.
We then present the training and evaluation process in Chapter 5. Chapter 7 talks about some possible future avenues to improve the results or extend the work to other, more complicated tasks. Finally, we wrap up the thesis in Chapter 8.

Chapter 2
Related Work

In this chapter we discuss related projects and their advantages and disadvantages. The related work is divided into three sections that address the three main areas of our work. The first section explores previous work done on Discourse Parsing and Sentiment Analysis, which are the two tasks we want to solve. The second section describes previous work on distributed vector representations for sentences and text spans. Since we do not hand pick features, having meaningful vector representations for text spans is key to obtaining more accurate results. The third section explores previous work on joint models, focusing on two techniques: Pre-training and Multi-tasking. In particular, we describe how Multi-tasking and Pre-training affect the performance of the tasks that are jointly trained.

2.1 Discourse Parsing and Sentiment Analysis

Traditionally, Discourse Parsing and Sentiment Analysis have been approached by applying machine learning methods with predetermined, engineered features that were carefully chosen by studying the properties of the text.

Examples of effective sentence level and document level Discourse Parsers include CODRA [13] and the parser of [9]. These parsers use organizational, structural, contextual, lexical and N-gram features to represent Discourse Units and apply graphical models for learning and inference (i.e., Conditional Random Fields). The performance of these parsers critically depends on a careful selection of informative and relevant features, something that is instead performed automatically in the neural models we propose in this thesis.

[27], [28] and [30] approach Sentiment Analysis using carefully engineered features as well as polarity rules.
The choice of features also plays a key role in the high performance of these models.

Yet, with the rapid advancements of Neural Nets in complex areas such as vision and speech understanding, there has been increased interest in applying them to different NLP tasks. Socher et al. [33] approached the problem of Sentiment Analysis by recursively assigning sentiment labels to the nodes of a binarized syntactic parse tree over a sentence. At each non-leaf node, the Sentiment Neural Net first creates a distributed embedding for the node using the embeddings of its two children and then assigns a sentiment label to that node. Their approach achieved state-of-the-art results. In our work, we borrow the same idea of Recursive Neural Nets to learn the Sentiment labels. However, the structure over which we learn the Sentiment labels is the Discourse Tree of the sentence as opposed to the syntactic parse tree, with the goal of testing if Sentiment Analysis can benefit directly from discourse information within a neural joint model.

Motivated by Socher’s success on Sentiment Analysis, Li et al. [17] approached the problem of Discourse Parsing by recursively building the Discourse Tree using two Neural Nets. A Structure Neural Net decides whether two nodes should be connected in the Discourse Tree or not. If two nodes are determined to be connected by the Structure Neural Net, a Relation Neural Net then decides what rhetorical relation should hold between the two nodes. Their approach also yields promising results. In terms of representation, the recursive structure of a Discourse Tree is used to learn the embedding of each non-leaf node from its children. For leaf nodes (EDUs), the representation is learned recursively using the syntactic parse tree of the node. One problem with their work is that it is unclear how they combine the labeled Discourse Structure Tree with the unlabeled syntactic parse trees to learn the vector representations for the text spans.

Bhatia et al.
[2] trained a Recursive Neural Network for Sentiment Analysis over a Discourse Tree and showed that the information extracted from the Discourse Tree can be helpful for determining the Sentiment at document level. In their work, however, they did not attempt to learn a distributed representation for the sub-document units. To represent EDUs, they used bag-of-words features. For our work, we not only apply a Recurrent Neural Net approach to learn embeddings for the EDUs, but we also jointly learn models for the two tasks, instead of simply feeding a pre-computed discourse structure into a neural model for sentiment.

2.2 Learning Text Embeddings

Learning text embeddings is a fundamental step in using Neural Nets for NLP tasks. An embedding is a fixed dimensional representation of the data (text) without the use of handpicked features. As words are the building blocks of text, previous studies have created fixed dimensional vector representations for words [26] that capture the semantic and syntactic meaning of the words. However, creating meaningful fixed dimensional vector representations for text spans is an ongoing challenge.

Both Socher et al. [33] and Li et al. [17] learn the embedding of a text span in a recursive manner, given a binary tree over the text span with leaves being the words. The words are initialized with random vector representations and the embedding of a parent is computed from the embeddings of its two children using a non-linear projection.
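The recursive composition described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of [33] or [17]: the tanh non-linearity, the toy dimension `D`, the random initialization range and the names `compose` and `embed` are all assumptions, and no training or weight update is shown.

```python
import math
import random

D = 4  # toy embedding size (assumption; real models use larger vectors)
rng = random.Random(0)

# Projection from the concatenated children (2*D values) to the parent (D values).
W = [[rng.uniform(-0.1, 0.1) for _ in range(2 * D)] for _ in range(D)]
b = [0.0] * D


def compose(left, right):
    """Parent embedding = tanh(W [left; right] + b)."""
    children = left + right  # list concatenation of the two child vectors
    return [math.tanh(sum(w * x for w, x in zip(row, children)) + bias)
            for row, bias in zip(W, b)]


def embed(tree, word_vectors):
    """Embed a binary tree given as nested 2-tuples with word strings at the leaves."""
    if isinstance(tree, str):
        return word_vectors[tree]
    left, right = tree
    return compose(embed(left, word_vectors), embed(right, word_vectors))


words = ["slow", "and", "repetitive"]
# Words start from random vectors, as in the description above.
word_vectors = {w: [rng.uniform(-0.1, 0.1) for _ in range(D)] for w in words}
span_vector = embed(("slow", ("and", "repetitive")), word_vectors)
```

Because every parent has the same dimensionality as its children, the same `compose` function can be applied at every level of the tree, which is what makes the representation recursive.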
The embedding is then used for training the task under study (Sentiment Analysis and Discourse Parsing respectively) and updated according to how useful it was for the task.

Recently, Recurrent Neural Nets (RNNs) and their variant, Long Short-Term Memories (LSTMs), have become a more popular alternative for learning the embedding of a sentence ([15] and [29]).

In [15], an encoder RNN encodes a sentence into a fixed vector representation that is then used by a decoder RNN to predict the following and preceding sentences; both the decoder and encoder RNNs are then updated based on how good the predictions were. Once training is done, the encoder RNN can be used on its own to create an embedding for any text span. In their work, Kiros et al. [15] applied skip-thought vectors to Sentiment Treebank sentences to see if the representations learned could directly be used for determining the sentiment of a sentence. Their results showed that representing a sentence with the skip-thought vectors without taking its structure into consideration would not improve the performance beyond the results Socher et al. [33] had achieved. In this project, we have used the encoder RNN to represent our EDUs but we further compress the resulting embeddings with a neural based compressor to limit the number of parameters.

2.3 Joint models

When training a neural model, the weights are usually initialized with random numbers taken from a uniform distribution. However, the authors of [7] argue that Pre-training a neural model initializes the network with better weights, which helps prevent the network from getting stuck in local minima, results in better generalization and can enhance the performance of the model. This general idea has been successfully applied in several scenarios (e.g., [5], [32]). For example, Chung et al. [5] used auto-encoders as a Pre-training mechanism and showed that Pre-training can lead to better performance compared to the same model with no Pre-training.
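The contrast between random initialization and Pre-training can be illustrated with a small Python sketch. It is only a schematic under stated assumptions: the layer shape, the uniform range and the helper names `random_init` and `init_from_pretrained` are hypothetical, and the "source" weights merely stand in for a network that would really be trained first (for example an auto-encoder).

```python
import copy
import random


def random_init(rows, cols, scale=0.1, seed=None):
    """Default scheme: weights drawn from a uniform distribution."""
    rng = random.Random(seed)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def init_from_pretrained(source_weights):
    """Pre-training scheme: start from a deep copy of another model's weights.

    The copy keeps later updates to the new model from changing the source model.
    """
    return copy.deepcopy(source_weights)


# Stand-in for weights obtained by training a first model (assumption: same shape).
source_weights = random_init(3, 6, seed=1)

baseline_weights = random_init(3, 6, seed=2)                # no Pre-training
pretrained_weights = init_from_pretrained(source_weights)   # with Pre-training

# Simulate one fine-tuning update; the source model must stay untouched.
pretrained_weights[0][0] += 0.5
```

The only difference between the two schemes is the starting point of training; the architecture and the subsequent training procedure stay the same.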
In our work, we use the trained weights of one neural model (e.g., sentiment) as the initialization for another task (e.g., discourse structure) to see if the features learned for one can be helpful for the other.

Neural Multi-tasking was originally proposed by [6], who experimented with the technique using deep convolutional neural networks. In essence, the basic idea is that a network is alternately trained with instances for different tasks, so that the network is learning to perform all these tasks jointly. In [6] a model is trained to perform a variety of predictions on a given sentence, including part-of-speech tags, chunks, named entity tags, semantic roles, semantically similar words and the likelihood that the sentence makes sense using a language model. They showed that multitasking using a neural net structure can improve the generalization of the shared tasks and result in better performance. Following up on this initial success, many researchers have applied the neural multi-tasking strategy to several tasks, including very recent work in vision [14] and NLP (e.g., text classification [18] and the classification of implicit discourse relations [19]).

Liu et al. [18] showed that a multi-tasking system can improve the performance of a task with the help of other related tasks. The goal is to learn representations for phrases, text spans and sentences using Recurrent Neural Nets (RNNs) through supervised training of four related tasks. They proposed three multi-task models as shown in Figure 2.1. In the first model (Model-I in the Figure), there is only one, shared RNN for all tasks (but producing different outputs).
In the second model (Model-II in the Figure), each task has its own RNN but each RNN shares the hidden representations at each time step with the other RNNs at the same time step. In the last model (Model-III in the Figure), each task has its own RNN but each RNN is connected to a mutually shared RNN to share hidden representations at each time step.

Experimental results on each of the three Multi-task models showed significant improvements compared to the individual model. The amount of improvement varied among tasks and datasets. The first model resulted in an average of 2% improvement, while the second model resulted in an average of 2.3% improvement. The average improvement for the third model with added Pre-training and fine tuning was 2.8%.

In their work [19], Liu et al. studied implicit discourse relation classification using Multi-task learning of four related tasks. As can be seen from Figure 2.2, their Neural based model consists of a Convolutional layer that compresses the argument pairs of different tasks into low-dimensional vector representations. Each task owns a unique representation and a shared representation connecting all tasks. The two are then concatenated and mapped into a task specific representation.
They then attach additional surface-level features and the resulting vector representation is fed to each task’s Neural Net.

Figure 2.1: Three architectures for modelling text with multi-task learning. (Figure adapted from [18])

Figure 2.2: Architecture of Multi-task Neural Networks for Discourse Relation Classification (Figure adapted from [19])

Their Multi-tasking model is trained over four different tasks:

• Implicit PDTB Discourse Relation Classification using Penn Discourse Treebank
• Explicit PDTB Discourse Relation Classification using Penn Discourse Treebank
• RST-DT Discourse Relation Classification using RST-DT
• Connective Word Classification using New York Times Corpus

Their experimental results show that a multi-task model achieves significant improvements over individual models. The amount of improvement differs among relations, ranging from a minimum of 2% to a maximum of 16% improvement in classifying each implicit relation.

[29] have also used multitasking and Deep Neural architectures for Semantic Dependency Parsing. In its basic form, as shown in Figure 2.3, their model included a layer of bi-directional Long Short-Term Memories (BiLSTM) for representing the sentences, followed by two layers of Deep Neural Networks to predict dependency relations in a parse tree. In this work, they explored two multitask learning approaches. In the first approach, parameters of the BiLSTM part of the model were shared among the tasks. In the second approach, higher-order structures were used to predict the graphs jointly. Their work on a jointly trained multitask system showed statistically significant improvements over an individual model. However, the improvement was rather small, from 87.4% to 88%.

Figure 2.3: Illustration of the architecture of the basic model.
(Figure adapted from [29])

In all these projects we notice that Multi-tasking benefits all the tasks to which it was applied, but we also observe that the benefit varies from task to task and model to model, with some tasks getting more benefit out of multi-tasking than others.

Chapter 3
Corpora

For the task of Discourse Parsing, we use RST-DT ([3], [4]). This dataset contains 385 documents along with their fully labeled Discourse Trees. The annotation is based on the Rhetorical Structure Theory (RST), a popular theory of discourse originally proposed in [21]. All the documents in RST-DT were chosen from Wall Street Journal news articles taken from the Penn Treebank corpus [24]. Since we are focusing only on sentence-level discourse parsing, the documents as well as their Discourse Trees were first preprocessed to extract the sentences and sentence-level Discourse Trees. The sentence-level Discourse Trees were extracted from the document-level Discourse Tree by finding the sub-tree that exactly spans over the sentence. This resulted in a dataset of 6846 sentences with well-formed Discourse Trees, out of which 2239 sentences had only one EDU. Since sentences with only one EDU have trivial Discourse Trees, these sentences were excluded from our dataset, leaving a total of 4607 sentences.

For the task of Sentiment Analysis, we use the Sentiment Treebank [33]. This dataset consists of 11855 sentences along with their syntactic parse trees labeled with sentiment labels at each node. For this work, since our models label sentiment over a Discourse Tree, we had to preprocess the dataset in the following way. For each sentence in the Sentiment Treebank dataset, a Discourse Tree was created using [13].
Next, for each node of the discourse tree, a sentiment label was extracted from the corresponding labeled syntactic tree by finding a subtree that exactly (or almost exactly) matches the text span represented by the node in the discourse tree. Exact match was not possible when the syntactic and the discourse structures were not fully aligned, which happened in 31.9% of the instances. In this case, an approximation of the sentiment was computed by considering the sentiment of the two closest subsuming and subsumed syntactic sub-trees.

Relation         Percentage (%)
elaboration      33.29
attribution      23.00
same-unit        10.53
joint             5.62
enablement        4.21
background        4.13
contrast          3.97
cause             3.48
explanation       2.44
temporal          2.38
condition         1.91
comparison        1.52
manner-means      1.43
evaluation        1.10
summary           0.78
topic-comment     0.12
topic-change      0.03

Table 3.1: Distribution of RST-DT relations

Both datasets were highly unbalanced across different classes. In the case of RST-DT, the discourse relations outlined in [21] were further grouped under 16 classes (also outlined in [21]). Table 3.1 shows the distribution of each of these 16 classes of relations across RST-DT at each node of the sentence level discourse trees for sentences with more than one EDU. Notice that after adding the appropriate nuclearity labels (explained in Section 1.1) to these sets, we get a total of 41 different sets of relations since some of the relations can only take one of the three forms of “-NS”, “-SN” or “-NN”. From this table we can see that some of the relations are very infrequent and some hardly ever appear at the sentence level.

Figure 3.1 shows the distribution of sentiment labels of Sentiment Treebank sentences at all levels of the Discourse Tree created over them as described above. As we can see from the figure, the majority of text spans in Sentiment Treebank are “neutral”, followed by “positive” and “negative” labels.
“very negative” and “very positive” labels are much less frequent than the others.

Figure 3.1: Distribution of sentiment labels over Sentiment Treebank sentences at all nodes of created Discourse Trees

Chapter 4
Joint Models

Our framework consists of three main parts. Given a segmented sentence, the first step is to create meaningful vector representations for all the EDUs. This is discussed in the first section. Next, we devise three different Recursive Neural Net models, each designed for one of discourse structure prediction, discourse relation prediction and sentiment analysis. In Section 4.2, we discuss the structure of these Neural Nets in detail. Finally, we join these Neural Nets in two different ways: Multi-tasking and Pre-training. The final section of this chapter talks about these two ways of joining the Neural Nets.

4.1 Learning Text Embeddings

One of the most challenging aspects of designing effective Neural Nets is to have meaningful representations for the inputs. Since we refrain from hand picking features, and choose to feed text spans consisting of multiple words to the Neural Nets as our inputs, it is very important to come up with vector representations that are generalizable but also meaningful for the two tasks that we approach.

Figure 4.1: The Sentiment Neural Compressor

Initially, we considered directly applying the Skip-thought framework [15] to each text span to get generic vector representations for them, since the original Skip-thought vectors were shown in [15] to be useful for many NLP tasks. However, given the size of our datasets (only in the thousands of instances), it was clear that using 4800-dimensional Skip-thought vectors would have created an
Based on this observation, in order to simultaneously reduce the dimensionality and to produce vectors that are meaningful for our tasks, we devised a compression mechanism that takes in the Skip-thought produced vectors and compresses them using a Neural Net. Figures 4.1 and 4.2 show the structure of these compressors for our two different tasks.

The sentiment neural compressor (Figure 4.1) takes as input the Skip-thought produced vector representations for all phrases of the Sentiment Treebank in the training set. For example, consider a phrase $i$ with Skip-thought produced vector $P_i \in \mathbb{R}^{4800}$. The Sentiment Neural Compressor learns the compressed vector $P'_i \in \mathbb{R}^{d}$ through

$$P'_i = f(W P_i) \tag{4.1}$$

where $f$ is a non-linear activation function such as relu and $W \in \mathbb{R}^{d \times 4800}$ is the matrix of weights. This Neural Net uses the sentiment of phrase $i$ for supervised learning of the weights.

Figure 4.2: The Discourse Neural Compressor

Similarly, the Discourse Parsing neural compressor (Figure 4.2) takes the Skip-thought produced vector representations for two EDUs $e_i$, $e_j$ that are connected in their Discourse Tree and learns the compressed vectors $e'_i$ and $e'_j$, each with $d$ dimensions, where

$$e'_i = f(W_1 e_i), \qquad e'_j = f(W_1 e_j) \tag{4.2}$$

where $f$ is again a non-linear activation function such as relu and $W_1 \in \mathbb{R}^{d \times 4800}$ is the matrix of weights. Note that the same set of weights is used for both EDUs because we are looking for a unique set of weights to compress an EDU.

4.2 Neural Net Models

Following [33]'s idea of Sentiment Analysis using recursive Neural Nets, we designed three Recursive Neural Nets, one for each task of Discourse Structure prediction, Discourse Relation prediction and Sentiment Analysis. All three Neural Nets are classifiers.

Figure 4.3: The Discourse Structure Neural Net

The Structure Neural Net takes in the compressed vector representation ($\in \mathbb{R}^{d}$) for two Discourse Units and learns whether they will be connected in the Discourse Tree (Figure 4.3).
In this process, it also learns the vector representation for the parent of these two children. So for a parent $p$ with children $c_l$ and $c_r$, the vector representation for the parent is obtained by

$$p = f(W_{str}[c_l, c_r] + b_{str}) \tag{4.3}$$

where $[c_l, c_r]$ denotes the concatenation of the children's vectors, $f$ is a non-linear activation function, $W_{str} \in \mathbb{R}^{d \times 2d}$ is the weight matrix and $b_{str} \in \mathbb{R}^{d}$ is the bias vector.

The Relation Neural Net takes as input the compressed vector representation for two Discourse Units that are determined to be connected in the Discourse Tree and learns the relation label for the parent node. The Relation Neural Net is the same in structure as the Structure Neural Net in Figure 4.3.

The Sentiment Neural Net takes as input the compressed vector representation for two Discourse Units that are determined to be connected in the Discourse Tree and learns the sentiment label for the parent node. This Neural Net also shares the same structure as the one in Figure 4.3.

4.3 Individual Models

Before joining the models using pre-training or multi-tasking, each task is trained individually as a baseline. Algorithm 1 describes the training process for an individual model (before joining), which is a standard 10-fold cross validation with the addition of a Neural Compression step.

Algorithm 1 Training an individual Neural Net
for i ← 1 to 10 do
    train_set ← load the training set for fold i
    test_set ← load the test set for fold i
    train_set ← Train Neural Compressor on train_set, and compress its vectors
    test_set ← Compress test_set vectors using the trained Neural Compressor
    Train the recursive Neural Net on train_set
    Test the recursive Neural Net on test_set
end for

At test time, for the task of Sentiment Analysis, given a sentence with its Discourse Structure tree, each node of the discourse tree is labeled with a sentiment label representing the sentiment of the text span the node corresponds to. However, for the task of Discourse Parsing, given a sentence, the discourse tree needs to be created and labeled with discourse relations.
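Creating the tree amounts to a search over the binary bracketings of the EDU sequence, which a bottom-up dynamic program can do efficiently. The following is a minimal sketch of such a CKY-style search, not the parser of [13]. It assumes a hypothetical `score(i, k, j)` function that returns the log-probability of joining the span of EDUs [i, k) with the span [k, j); in the models described here such a score would come from the Structure Neural Net. The names `best_tree` and `toy_score`, and the simplification that a score depends only on the span boundaries rather than on how each sub-span was built, are assumptions made for illustration.

```python
def best_tree(n, score):
    """Best binary tree over EDUs 0..n-1 as (log_prob, nested tuple).

    score(i, k, j) is a hypothetical callable giving the log-probability of
    joining span [i, k) with span [k, j); it stands in for a trained model.
    """
    if n < 1:
        raise ValueError("need at least one EDU")
    # best[(i, j)] holds the best (log_prob, tree) covering EDUs i..j-1.
    best = {(i, i + 1): (0.0, i) for i in range(n)}
    for width in range(2, n + 1):            # grow spans bottom-up
        for i in range(n - width + 1):
            j = i + width
            options = []
            for k in range(i + 1, j):        # try every split point
                left_lp, left_tree = best[(i, k)]
                right_lp, right_tree = best[(k, j)]
                total = left_lp + right_lp + score(i, k, j)
                options.append((total, (left_tree, right_tree)))
            best[(i, j)] = max(options, key=lambda option: option[0])
    return best[(0, n)]


def toy_score(i, k, j):
    # Toy assumption: attaching a single EDU on the left costs nothing,
    # any other split is penalised, so right-branching trees win.
    return 0.0 if k == i + 1 else -1.0


log_prob, tree = best_tree(4, toy_score)
print(log_prob, tree)
```

The table `best` has one entry per span and each entry considers every split point, so the search is cubic in the number of EDUs, which is cheap at the sentence level where sentences contain only a handful of EDUs.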
To build the most probable tree, a CKY-like bottom-up parsing algorithm that uses dynamic programming to compute the most likely parses is applied [13].

4.4 Joint Models

Our hypothesis in creating a joint model is that the accuracy of prediction obtained in a joint design would be higher than the accuracy of prediction coming from independent Neural Nets applied to each task. We explore two ways of creating a joint model. For both approaches, we train three neural nets (Discourse Structure, Discourse Relation and Sentiment Neural Nets) that interact with one another for improved training. The input to the Structure net is all possible pairs of text spans that can be connected in a Discourse Tree. The input to the Relation and Sentiment nets is the pairs of text spans that are determined to be connected by the Structure net.

Inspired by Multi-tasking [6], our goal is to find a representation for the input that will benefit all the tasks that need to be solved. Since the first layer in a Neural Net learns relevant features from the input embedding, in this approach the first layer is shared between the three Neural Nets and training is achieved in a stochastic manner by looping over the three tasks. As shown in Figure 4.4, at each time step, one of the tasks is selected along with a random training example for that task. Afterwards, the neural net corresponding to this task is updated by taking a gradient step with respect to the chosen example. The end product of this design is a joint input representation that could benefit both Sentiment Analysis and Discourse Parsing.

Figure 4.4: Multi-tasking

Figure 4.5: Multi-tasking Network

Inspired by Pre-training Neural Nets [7], in this approach we study how the parameters of one Neural Net after training can be used as a form of initialization for the network applied to the other task.
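The core mechanism, reusing the trained parameters of one network to initialize another, can be sketched as follows. The sketch assumes that only the composition layer (the $W$ and $b$ of Equation 4.3) is handed over, because the output layers of the three networks have different numbers of classes; the text does not spell out which layers are copied, so this choice, the class `PairNet` and the toy data are illustrative assumptions. It uses NumPy rather than the TensorFlow implementation mentioned in Chapter 5.

```python
import numpy as np


class PairNet:
    """Classifier over a pair of child vectors, with a composition layer as in Eq. 4.3."""

    def __init__(self, d, n_classes, rng):
        self.W = rng.normal(0.0, 0.1, (d, 2 * d))      # composition weights
        self.b = np.zeros(d)                           # composition bias
        self.U = rng.normal(0.0, 0.1, (n_classes, d))  # task-specific output layer

    def forward(self, left, right):
        x = np.concatenate([left, right])
        h = np.maximum(0.0, self.W @ x + self.b)       # relu(W [c_l, c_r] + b)
        z = self.U @ h
        p = np.exp(z - z.max())                        # numerically stable softmax
        return x, h, p / p.sum()

    def sgd_step(self, left, right, label, lr=0.1):
        """One gradient step on the cross-entropy loss of a single example."""
        x, h, p = self.forward(left, right)
        dz = p.copy()
        dz[label] -= 1.0                               # gradient w.r.t. the logits
        dh = (self.U.T @ dz) * (h > 0)                 # back-propagate through relu
        self.U -= lr * np.outer(dz, h)
        self.W -= lr * np.outer(dh, x)
        self.b -= lr * dh

    def init_from(self, other):
        """Copy the composition layer only; the output layer keeps its own shape."""
        self.W = other.W.copy()
        self.b = other.b.copy()


rng = np.random.default_rng(0)
d = 4
structure = PairNet(d, 2, rng)    # connected / not connected
relation = PairNet(d, 41, rng)    # 41 relation labels
left, right = rng.normal(size=d), rng.normal(size=d)

for _ in range(20):               # stand-in for fully training the first net
    structure.sgd_step(left, right, label=1)
relation.init_from(structure)     # hand the trained layer over
relation.sgd_step(left, right, label=3)
```

Copying the arrays, rather than sharing them, keeps the two networks independent after the hand-off: further training of the second network does not change the first one.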
As shown in Figure 4.6, in this setting, we first fully train the Discourse Structure Neural Net, then the weights from this trained net are used to initialize the Discourse Relation Neural Net and once this net is fully trained, its weights are used to initialize the weights of the Discourse Structure Neural Net again. After another round of training the Discourse Structure Neural Net, its weights are used to initialize the Sentiment Neural Net. After training the Sentiment Neural Net, its weights are again used to initialize the Structure Neural Net. We experimented with 2, 3 and 10 iterations using 10-fold cross validation on the datasets and achieved the best results with 3 iterations, which appears to be a good compromise between accuracy and training time. Algorithm 2 describes the training process for the Pre-training setting. Notice that in this setting, we need both Sentiment and Discourse Neural Compressors, where each one is trained once on its relevant set of data before entering the training loop.

Figure 4.6: Using the weights of one network as a form of pre-training for another network

Algorithm 2 Training Model in Pre-training setting
for i ← 1 to 10 do
    sentiment_train_set ← load the sentiment training set for fold i
    sentiment_test_set ← load the sentiment test set for fold i
    discourse_train_set ← load the discourse training set for fold i
    discourse_test_set ← load the discourse test set for fold i
    sentiment_train_set ← Sentiment Neural Compressor.train(sentiment_train_set)   (compresses its vectors as well)
    sentiment_test_set ← Sentiment Neural Compressor.compress(sentiment_test_set)
    discourse_train_set ← Discourse Neural Compressor.train(discourse_train_set)   (compresses its vectors as well)
    discourse_test_set ← Discourse Neural Compressor.compress(discourse_test_set)
    (in the loop below, each net is initialized from the weights of the net trained just before it)
    for j ← 1 to pre_train_itr do
        Structure Neural Net.train(discourse_train_set)
        Relation Neural Net.train(discourse_train_set)
        Structure Neural Net.train(discourse_train_set)
        Sentiment Neural Net.train(sentiment_train_set)
    end for
    Discourse Neural Net.test(discourse_test_set)
    Sentiment Neural Net.test(sentiment_test_set)
end for

Chapter 5

Training and Evaluating The Models

All the neural models presented in this project were implemented using the TensorFlow Python package [1]. We minimize the cross-entropy error using the Adam optimizer and L2-regularization on the set of weights. For the individual models (before joining), we use 200 training epochs and a batch size of 100.

We evaluate our models using 10-fold cross validation on the Sentiment Treebank and on RST-DT. All the experiments were based on manual Discourse Segmentation. In Table 5.1 and Table 5.3, a star indicates statistical significance with a p-value less than 0.05. Significance is measured for the joint model against the model before joining.

For the task of Discourse Parsing, the three predictions are: whether two discourse units should be connected (Span), what relation holds between them (Relation) and which one is the nucleus (Nuclearity). For these three sub-tasks, the metrics used to evaluate the model are Precision, Recall and F score as proposed by Marcu [22]. Since we are using manual discourse segmentation, Precision, Recall and F score are the same and so we only show the F score.

The results for Discourse Parsing are shown in Table 5.1.
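As a side note on the metrics above, the claim that Precision, Recall and F score coincide under manual segmentation follows from a counting argument: any two binary trees over the same n EDUs have the same number of internal nodes (n - 1), so the number of predicted spans equals the number of gold spans and the two ratios share a denominator. A small sketch of this, assuming trees encoded as nested tuples and counting only internal spans (a simplification of the metric in [22]; the function names are hypothetical):

```python
def internal_spans(tree, start=0):
    """Return (end, spans) where spans is the set of (start, end) EDU ranges
    covered by the internal nodes of a nested-tuple binary tree."""
    if not isinstance(tree, tuple):          # a leaf is a single EDU
        return start + 1, set()
    left, right = tree
    mid, left_spans = internal_spans(left, start)
    end, right_spans = internal_spans(right, mid)
    return end, left_spans | right_spans | {(start, end)}


def span_prf(gold, predicted):
    """Precision, recall and F score of predicted spans against gold spans."""
    _, gold_spans = internal_spans(gold)
    _, pred_spans = internal_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    if correct == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Two different trees over the same four EDUs (same segmentation).
gold = (("e1", "e2"), ("e3", "e4"))
predicted = ("e1", ("e2", ("e3", "e4")))
precision, recall, f_score = span_prf(gold, predicted)
print(precision, recall, f_score)
```

With automatic segmentation the two trees can have different numbers of EDUs, the denominators differ, and the three values no longer have to agree.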
We have used the 41 relations outlined in [21] for training and evaluation of the Relation prediction. From the results, we see some improvement on Discourse Structure prediction when we are using a joint model, but the improvement is statistically significant only for the Nuclearity and Relation predictions.

Approach                             Span     Nuclearity   Relation
Discourse Parser (Before Joining)    93.37    73.38        57.05
Joined Model, Pre-training           94.35    74.92        58.82
Joined Model, Multi-tasking          94.31    75.91*       60.91*

Table 5.1: Discourse Parsing results based on manual discourse segmentation

Relation      Individual   Pre-training   Multi-tasking
Comparison    18.97        20.87          27.08
Contrast      15.19        17.74          20.83
Cause          7.6          8.11           8.61
Average       13.92        15.57          18.84

Table 5.2: Contrastive Relation Prediction results under different training settings

The improvements on the Relation predictions were mainly on the Contrastive set ([2]), specifically the classes of Contrast, Comparison and Cause relations as defined in [21]. The results for each of these relations under the different training settings are shown in Table 5.2. Notice that the accuracies may seem low, but because we train over 41 classes of relations, a uniformly random prediction would be correct only about 2.4% of the time (1/41). Among the contrastive relations, Problem-Solution did not improve because this relation is hardly seen at the sentence level. These results support our hypothesis that knowing the sentiment of the two Discourse Units that are connected in a discourse tree can help with the identification of the discourse relation that holds between them.

For the task of Sentiment Analysis, the results are shown in Table 5.3. To train the model, we use the five classes of sentiment ({very negative, negative, neutral, positive, very positive}) used in [33]. We measure the accuracy of prediction in two different settings. In the fine grained setting we compute the accuracy of exact match across five classes. In the Positive/Negative setting, if the prediction and the target had the same sign, they were considered equal (footnote 1).
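The two accuracy settings can be made concrete with a short sketch. It assumes the five classes are encoded as indices 0 to 4 and that “neutral” has sign 0, so a neutral target is matched only by a neutral prediction; the text does not state how neutral spans are handled, so that treatment and the function names are assumptions.

```python
LABELS = ["very negative", "negative", "neutral", "positive", "very positive"]


def sign(label_index):
    """Map a class index 0..4 to -1, 0 or +1 (assumption: neutral maps to 0)."""
    centered = label_index - 2
    return (centered > 0) - (centered < 0)


def fine_grained_accuracy(predicted, gold):
    """Fraction of nodes whose predicted class matches the gold class exactly."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def polarity_accuracy(predicted, gold):
    """Fraction of nodes whose predicted class has the same sign as the gold class."""
    return sum(sign(p) == sign(g) for p, g in zip(predicted, gold)) / len(gold)


gold = [0, 1, 2, 3, 4]
predicted = [1, 1, 2, 4, 2]
print(fine_grained_accuracy(predicted, gold))  # 2 of 5 exact matches
print(polarity_accuracy(predicted, gold))      # 4 of 5 sign matches
```

In this toy example, confusing “very negative” with “negative” or “positive” with “very positive” counts as an error only in the fine grained setting, which is why the second number is higher.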
The large difference in accuracy between these two settings signals that distinguishing between very positive and positive, and between very negative and negative, is very hard.

The sentiment results shown in Table 5.3 are also consistent with our hypothesis. When jointly trained with Discourse Parsing through Multi-tasking, we get better performance on labeling nodes of the Discourse Tree with sentiment labels than with an individual sentiment analyzer applied to a Discourse Tree. With Pre-training the gain appears only in the Positive/Negative setting and is not statistically significant.

Interestingly, if we compare the two joint models across both tasks, it appears that Multi-tasking does better than Pre-training in all cases except for discourse structure. A possible explanation is that by transferring weights from one network to another (as done in Pre-training), the neural net starts learning with a possibly better initialization of the weights. However, Multi-tasking performs joint learning at the finer granularity of single training instances, and so an improvement in learning one task immediately affects the next.

All results in Tables 5.1 and 5.3 were obtained by setting the dimension d of the compressed vectors to 100. Experimentally, we found that the performance of the model was rather stable for d ∈ {1200, 600, 300, 100} and was substantially lower for d ∈ {50, 25}.

In terms of actual runtime, Pre-training and the individual models are an order of magnitude faster than the Multi-tasking model. This is because even though they require a larger number of epochs to converge (200 for individual, vs 6 for Multi-tasking), they can be trained in parallel.

Notice that training and testing of the networks is done on Sentiment Treebank for sentiment analysis and on RST-DT for discourse parsing. The Discourse parser of [13] was run on Sentiment Treebank to get the sentiment annotation at the granularity required for the joint model with discourse.
However, having a gold dataset of sentiment labels corresponding to discourse units could further improve the results.

Footnote 1: Notice that this is different from training a classifier for binary classification, which is a much easier task (see [2]).

                                       Fine grained        Positive/Negative
Approach                               All      Root       All      Root
Sentiment Analyzer (Before Joining)    43.37    40.6       52.86    51.27
Joined Model, Pre-training             42.46    40.36      53.82    53.15
Joined Model, Multi-tasking            45.49*   44.82*     55.52*   54.72*

Table 5.3: Sentiment Analysis over Discourse Tree

Chapter 6

Discussion

Several differences between this work and previous approaches make direct comparisons challenging and possibly not very informative. In this section, we highlight and explain the differences between our work and the two most recent Sentiment Analyzers and Discourse Parsers.

6.1 Comparison to previous Sentiment Analyzers

Socher et al. ([33]) use syntactic trees, as opposed to discourse trees, as the recursive structure for training. Due to this underlying structural difference, we cannot compare our “All”-level results with theirs. For the “Root” level, which represents the sentiment prediction for the whole sentence, Socher et al. report 45.7% fine-grained sentiment accuracy compared to 44.82% for our Multi-tasking model. This difference is unlikely to be significant, while the sentiment annotation of syntactic structure is definitely more costly than annotation at the EDU level, because a syntactic parse tree of a sentence has considerably more nodes than the sentence's discourse tree.

Bhatia et al. ([2]) focus on document level Sentiment Analysis, using bag-of-words features for EDUs. While our work can be extended to document level sentiment analysis in the future, the model we use learns the distributed representations of the EDUs, which will remain a key difference between our work and that of Bhatia et al. Furthermore, Bhatia et al. train a binary model while assuming the discourse tree as given.
In our approach, we train a 5-class model while jointly learning the discourse tree.

6.2 Comparison to previous Discourse Parsers

Since our work focuses on sentence-level Discourse Parsing, we cannot compare with Li et al. ([17]) because they only report overall results without differentiating between sentence level and document level.

Our model is outperformed by CODRA ([13]), which achieves better performance on sentence level Discourse Parsing. While we believe that with more training data, as has been shown with other NLP tasks, we would eventually outperform CODRA, the primary goal of our work is not to beat the state of the art on each single task, but to show how the two tasks of Discourse Parsing and Sentiment Analysis can be jointly performed in a neural model.

Chapter 7

Future Work

From incorporating more data, to more complicated models and even benefiting from other relevant tasks, there is much that can be done to improve the results and to increase the generalization of these models. Below we discuss some possible avenues for future work.

7.1 Improving Corpora

In this work, we used a pre-existing Discourse Parser to parse the sentences of Sentiment Treebank into Discourse Trees and then tried to label the nodes of the produced discourse trees with sentiment labels through a matching process. Pre-existing discourse parsers are not perfect and introduce some error. Furthermore, the matching process described in Chapter 3 is also error prone. The best way to address this is either to use crowd-sourcing for producing correct discourse trees and sentiment labels, or to use multiple pre-existing discourse parsers to produce high quality discourse trees (see [12]) and follow it up with crowd-sourcing to label each node with more accurate sentiment labels.

7.2 More Data

The amount of data greatly affects the performance of Neural Nets. With more data, networks can better learn the patterns, and the results are more reliable.
In this project, the number of training instances for the Structure Neural Net was 8639, for the Relation Neural Net 7892 and for the Sentiment Neural Net 5381, while the number of parameters that need to be trained for each network is around 20,000 (for d = 100, the composition matrix $W_{str} \in \mathbb{R}^{d \times 2d}$ of Equation 4.3 alone has 100 × 200 = 20,000 entries). Using methods that can augment the datasets [12] could help with the issue of small dataset size.

                Sentence Level                                   Document Level
Metric          Manual Segmentation    Automatic Segmentation    Automatic Segmentation
Span            95.4                   80.1                      83.84
Nuclearity      88.6                   75.2                      68.90
Relation        78.9                   66.8                      55.87

Table 7.1: CODRA's Discourse Parsing results at sentence-level and document-level, based on manual and automatic segmentation

7.3 Recursive Neural Nets coupled with Recurrent Neural Nets

Granted more data, one could learn the representation for the EDUs using a variant of Recurrent Neural Nets that is jointly trained with the Recursive model. This solution can eliminate the need for a Neural Compressor applied to Skip-thought vectors, because text embeddings for EDUs can be learned directly in any desired dimension. A similar model is described by Peng et al. [29] and shown to be helpful in learning task-specific meaningful representations.

7.4 Document Level Discourse Parsing and Sentiment Analysis

It is relatively easier to perform sentence-level discourse parsing and sentiment analysis than the same tasks performed on multiple sentences or at document level. As an example, the results achieved by CODRA [13] (shown in Table 7.1) indicate that document-level discourse parsing is much harder than sentence level discourse parsing. Using neural nets, a similar behaviour may be present, but more data, better vector representations of text spans and experimenting with more complicated models can help minimize the drop.

It would also be interesting to observe how the performance of Sentiment Analysis would be affected at the Document level using Multi-tasking Neural Networks.
Just as determining the sentiment of sentences is harder than determining the sentiment of words, determining the sentiment of documents is harder than determining the sentiment of sentences. Bhatia et al. [2] reported an 84.1% accuracy on document level binary Sentiment Analysis. However, in their work, they used a vector constructed from bag-of-words features to represent the EDUs. As possible future work, one could look at fine-grained (5 class) document level Sentiment Analysis and the effects of learning the text span embeddings when scaled to documents.

7.5 Simultaneous Pre-training and Multi-tasking

As another future avenue, one could combine a form of pre-training with Multi-tasking. Under that setting, (possibly unsupervised) pre-training can be used to initialize the weights in a better way, which can then be followed by a loop of multi-tasking for each task to benefit from the other tasks' features.

Chapter 8

Conclusions

Discourse Parsing and Sentiment Analysis are two fundamental NLP tasks that have been shown to be mutually beneficial. Evidence from previous work indicates that information extracted from Discourse Trees can help with Sentiment Analysis and likewise, knowing the sentiment of two pieces of text can help with identification of discourse relationships between them. In this thesis, we show how synergies between these two tasks can be exploited in a joint neural model. The first challenge entailed learning meaningful vector representations for text spans that are the inputs for the two tasks.
Since the dimension of vanilla Skip-thought vectors is too high compared to the size of our corpora, in order to simultaneously reduce the dimensionality and to produce vectors that are meaningful for our tasks, we devised task-specific neural compressors that take in Skip-thought vectors and produce much lower dimensional vectors.

Next, we designed three independent Recursive Neural Net classifiers: one for Discourse Structure prediction, one for Discourse Relation prediction and one for Sentiment Analysis. After that, we explored two ways of creating joint models from these three networks: Pre-training and Multi-tasking. Our experimental results show that such models do capture synergies among the three tasks, with the Multi-tasking approach being the most successful, confirming that latent Discourse features can help boost the performance of a neural sentiment analyzer and that latent Sentiment features can help with identifying contrastive relations between text spans.

In the short term, we plan to verify how syntactic information could be explicitly leveraged in the three task-specific networks as well as in the joint models. Then, our investigation will move from making predictions about a single sentence to the much more challenging task of dealing with multi-sentential text, which will likely require not only more complex models, but also models with scalable time performance in both learning and inference. Next, we intend to study how pre-training and multi-tasking could both be exploited simultaneously in the same model, something that to the best of our knowledge has not been tried before. Finally, as another avenue for future research, we plan to explore how sentiment analysis and discourse parsing could be modeled jointly with text summarization, since these three tasks can arguably inform each other and therefore benefit from joint neural models similar to the ones described in this thesis.

Bibliography

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.

[2] P. Bhatia, Y. Ji, and J. Eisenstein. Better document-level sentiment analysis from RST discourse parsing. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), September 2015.

[3] L. Carlson and D. Marcu. Discourse tagging reference manual. ISI Technical Report ISI-TR-545, 54:56, 2001.

[4] L. Carlson, M. E. Okurowski, and D. Marcu. RST discourse treebank. Linguistic Data Consortium, University of Pennsylvania, 2002.

[5] Y.-A. Chung, H.-T. Lin, and S.-W. Yang. Cost-aware pre-training for multiclass cost-sensitive deep learning. arXiv preprint arXiv:1511.09337, 2015.

[6] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM, 2008.

[7] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.

[8] S. Faridani. Using canonical correlation analysis for generalized sentiment analysis, product recommendation and search. In Proceedings of the fifth ACM conference on Recommender systems, pages 355–358. ACM, 2011.

[9] V. W. Feng and G. Hirst. A linear-time bottom-up discourse parser with constraints and post-editing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 511–521, Baltimore, Maryland, June 2014. Association for Computational Linguistics.

[10] S. Gerani, Y. Mehdad, G. Carenini, R. T. Ng, and B. Nejat. Abstractive summarization of product reviews using discourse structure. In EMNLP, pages 1602–1613, 2014.

[11] F. Guzmán, S. R. Joty, L. Màrquez, and P. Nakov. Using discourse structure improves machine translation evaluation. In ACL (1), pages 687–698, 2014.

[12] K. Jiang, G. Carenini, and R. T. Ng. Training data enrichment for infrequent discourse relations.

[13] S. Joty, G. Carenini, and R. T. Ng. CODRA: A novel discriminative framework for rhetorical analysis. Computational Linguistics, 2015.

[14] T. Kaneko, K. Hiramatsu, and K. Kashino. Adaptive visual feedback generation for facial expression improvement with multi-task deep neural networks. In Proceedings of the 2016 ACM on Multimedia Conference, pages 327–331. ACM, 2016.

[15] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302, 2015.

[16] A. Lazaridou, I. Titov, and C. Sporleder. A Bayesian model for joint unsupervised induction of sentiment, aspect and discourse representations. In ACL (1), pages 1630–1639, 2013.

[17] J. Li, R. Li, and E. H. Hovy. Recursive deep models for discourse parsing. In EMNLP, pages 2061–2069, 2014.

[18] P. Liu, X. Qiu, and X. Huang. Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101, 2016.

[19] Y. Liu, S. Li, X. Zhang, and Z. Sui. Implicit discourse relation classification via multi-task neural networks. arXiv preprint arXiv:1603.02776, 2016.

[20] A. Louis, A. Joshi, and A. Nenkova. Discourse indicators for content selection in summarization. In Proceedings of the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 147–156. Association for Computational Linguistics, 2010.

[21] W. C. Mann and S. A. Thompson. Rhetorical structure theory: Toward a functional theory of text organization. Text-Interdisciplinary Journal for the Study of Discourse, 8(3):243–281, 1988.

[22] D. Marcu. The theory and practice of discourse parsing and summarization. MIT Press, 2000.

[23] D. Marcu and K. Knight. Discourse parsing and summarization, May 11 2001. US Patent App. 09/854,301.

[24] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

[25] T. Meyer and A. Popescu-Belis. Using sense-labeled discourse connectives for statistical machine translation. In Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra), pages 129–138. Association for Computational Linguistics, 2012.

[26] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

[27] T. Nakagawa, K. Inui, and S. Kurohashi. Dependency tree-based sentiment classification using CRFs with hidden variables. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 786–794. Association for Computational Linguistics, 2010.

[28] B. Pang, L. Lee, et al. Opinion mining and sentiment analysis. Foundations and Trends® in Information Retrieval, 2(1–2):1–135, 2008.

[29] H. Peng, S. Thomson, and N. A. Smith. Deep multitask learning for semantic dependency parsing. arXiv preprint arXiv:1704.06855, 2017.

[30] V. Rentoumi, S. Petrakis, M. Klenner, G. A. Vouros, and V. Karkaletsis. United we stand: Improving sentiment analysis by joining machine learning and rule based methods. In LREC, 2010.

[31] S. Rosenthal, A. Ritter, P. Nakov, and V. Stoyanov. SemEval-2014 task 9: Sentiment analysis in Twitter. In Proceedings of the 8th international workshop on semantic evaluation (SemEval 2014), pages 73–80. Dublin, Ireland, 2014.

[32] S. Z. Seyyedsalehi and S. A. Seyyedsalehi. A fast and efficient pre-training method based on layer-by-layer maximum discrimination for deep neural networks. Neurocomputing, 168:669–680, 2015.

[33] R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, C. Potts, et al. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), pages 1631–1642, 2013.

[34] S. Verberne, L. Boves, N. Oostdijk, and P.-A. Coppen. Evaluating discourse-based answer extraction for why-question answering. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 735–736. ACM, 2007.

