UBC Theses and Dissertations

Extractive summarization of long documents by combining global and local context Xiao, Wen 2019

Extractive Summarization of Long Documents by Combining Global and Local Context

by

Wen Xiao

B.Sc., University of Toronto, 2017

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

August 2019

© Wen Xiao, 2019

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Extractive Summarization of Long Documents by Combining Global and Local Context

submitted by Wen Xiao in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Examining Committee:

Giuseppe Carenini, Computer Science
Supervisor

Raymond Ng, Computer Science
Second Reader

Abstract

In this thesis, we propose a novel neural single-document extractive summarization model for long documents, incorporating both the global context of the whole document and the local context within the current topic. We evaluate the model on two datasets of scientific papers, Pubmed and arXiv, where it outperforms previous work, both extractive and abstractive models, on ROUGE-1 and ROUGE-2 scores. We also show that, consistently with our goal, the benefits of our method become stronger as we apply it to longer documents. Besides, we also show that when the topic segment information is not explicitly provided, if we apply a pre-trained topic segmentation model that splits documents into sections, our model is still competitive with state-of-the-art models.

Lay Summary

The goal of this work is to automatically select sentences to form an extractive summary for a given long document, ideally with section information, like scientific papers.
Our main idea is that to decide whether a sentence is representative for a document, there are three important factors to be considered: the sentence itself, the local context within the same section of that sentence, and the global context, i.e. what the document is talking about as a whole. To realize our idea, we build a model mainly applying the recurrent neural network and a technique called LSTM-Minus, which has been used in other domains. By the results of our empirical experiments, we show that we have achieved the state-of-the-art on the scientific papers datasets, and the benefits of our method become stronger as we apply it to longer documents.

Preface

This dissertation is original, independent work by the author, W. Xiao. A compressed version of this dissertation has been accepted to be presented at EMNLP-IJCNLP 2019 (2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing).

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
2 Related Work
  2.1 Traditional Summarization
  2.2 Neural Extractive Summarization
  2.3 Datasets for long documents
  2.4 Neural Abstractive summarization
  2.5 LSTM-Minus
  2.6 Topic Segmentation
  2.7 Statistical Significance test in NLP
3 Extractive Summarization Model for Long Documents
  3.1 Sentence Encoder
  3.2 Document Encoder
  3.3 Decoder
4 Experiments
  4.1 General Experiment Settings
    4.1.1 Training
    4.1.2 Extractive Label Generation
    4.1.3 Implementation Details
    4.1.4 Models for Comparison
  4.2 Experiments on Scientific Paper Datasets
    4.2.1 Results and analysis
    4.2.2 Ablation Study
  4.3 Experiment on Bigpatent - Long documents without topic segment information
5 Conclusion and Future work
Bibliography

List of Tables

Table 2.1 Comparison of news datasets
Table 4.1 Results on the arXiv dataset
Table 4.2 Results on the Pubmed dataset
Table 4.3 Percentage relative improvement of our model
Table 4.4 Ablation study on Pubmed
Table 4.5 Ablation study on arXiv
Table 4.6 Results on the Bigpatent-A dataset

List of Figures

Figure 2.1 Extractors compared in [19]
Figure 2.2 Extractors compared in [19]
Figure 2.3 The structure of Cohan's model of discourse-aware abstractive summarization [8]
Figure 2.4 LSTM-Minus in dependency parsing
Figure 2.5 The supervised topic segmentation model proposed in [21]
Figure 2.6 The approximate randomization statistical significance test [31]
Figure 3.1 Structure of our model
Figure 3.2 Details of local context representation
Figure 4.1 Results with different length on Pubmed and arXiv
Figure 4.2 Relative position plot
Figure 4.3 Results with different length on Bigpatent-A

Acknowledgments

I am grateful for the funding received towards my Master program from Huawei.

First and foremost, I would like to thank my supervisor, Prof. Giuseppe Carenini, for his patience, encouragement and helpful guidance. He is always willing to help me and share his ideas with me whenever I am stuck. Then I would like to thank my second reader, Prof. Raymond Ng, for his time reading my thesis and providing me useful suggestions.

It is my fortune to gratefully acknowledge the spiritual support of my best friends, Yuekun Gao, Jiayi Wu and Siqi Mei; even though they are not here with me physically, they always cheer me up and encourage me during my hard time.

My great appreciation goes to all my family members for their love and support all the time.

Next, I must express my very profound gratitude to my parents for showing faith in me and giving me liberty to choose what I desired, and how lucky I am to have the best parents in the world. This accomplishment would not have been possible without their unfailing support and continued encouragement.
There is another family member I would also like to thank, and he is my pet Little White. He has accompanied me for almost ten years, and I can always feel better after playing with him, although he can not talk.

Finally, I owe thanks to a very special person, my husband, Zhenan Fan. It is he who always stands by my side, supporting me and making me more confident when I feel tired, disappointed and lost. Without him, I would not have had the courage to embark on this journey in the first place.

Chapter 1
Introduction

Single-document summarization is the task of generating a short summary for a given document.(IV) Ideally, the generated summaries should be fluent and coherent, and should faithfully maintain the most important information in the source document. This is a very challenging task, because it arguably requires an in-depth understanding of the source document, and current automatic solutions are still far from human performance [1].(I)

Single-document summarization can be either extractive or abstractive. Extractive methods typically pick sentences directly from the original document based on their importance, and form the summary as an aggregate of these sentences. Usually, summaries generated in this way perform better on fluency and grammar, but they may contain much redundancy and lack coherence across sentences. In contrast, abstractive methods attempt to mimic what humans do by first extracting content from the source document and then producing new sentences that aggregate and organize the extracted information. Since the sentences are generated from scratch, they tend to perform relatively worse on fluency and grammar.
Furthermore, while abstractive summaries are typically less redundant, they may end up including misleading or even utterly false statements, because the methods to extract and aggregate information from the source document are still rather noisy.

In this thesis, we focus on extracting informative sentences from a given document (without dealing with redundancy), especially when the document is relatively long (e.g., scientific articles).

[Footnote I: Sentence underlining and Roman numbering will be explained in the result sub-section.]

Most recent works on neural extractive summarization have been rather successful in generating summaries of short news documents (around 650 words/document) [26] by applying neural Seq2Seq models [4]. However, when it comes to long documents, the models tend to struggle with longer sequences because at each decoding step, the decoder needs to learn to construct a context vector capturing relevant information from all the tokens in the source sequence [35].

Long documents typically cover multiple topics. In general, the longer a document is, the more topics are discussed. As a matter of fact, when humans write long documents they organize them in chapters, sections, etc. Scientific papers are an example of longer documents, and they follow a standard discourse structure describing the problem, methodology, experiments/results, and finally conclusions [38].

To the best of our knowledge, only one previous work in extractive summarization has explicitly leveraged the section information to guide the generation of summaries [9]. However, the only information about sections that is fed into their sentence classifier is a categorical feature with values like Highlight, Abstract, Introduction, Results / Discussion / Analysis, Method, Conclusion, all else, depending on which actual section the sentence appears in.
In contrast, in order to exploit section information, we propose to capture a distributed representation of both the global (the whole document) and the local context (e.g., the section/topic) when deciding if a sentence should be included in the summary.(II)

The main contributions of this thesis are as follows:

• In order to capture the local context, we are the first to apply LSTM-minus to text summarization. LSTM-minus is a method for learning embeddings of text spans, which has achieved good performance in dependency parsing [43], in constituency parsing [10], as well as in discourse parsing [24]. With respect to more traditional methods for capturing local context, which rely on hierarchical structures, LSTM-minus produces simpler models, i.e. with fewer parameters, which are therefore faster to train and less prone to overfitting.

• We test our method on the Pubmed and arXiv datasets and results appear to support our goal of effectively summarizing long documents.(III) In particular, while overall we outperform the baseline and previous approaches only by a narrow margin on both datasets, the benefits of our method become much stronger as we apply it to longer documents.

• In order to evaluate our approach, we have created oracle labels for both Pubmed and arXiv [8], by applying a greedy oracle labeling algorithm.(V) These two datasets annotated with extractive labels will be made public.

• When the topic segment information is not available, we first apply a topic segmentation model to split the documents into sections. We have shown that we can achieve a competitive result even using a pre-trained topic segmentation model trained on a completely different corpus.(VI)

In Chapter 2, we will introduce some related work on summarization models, datasets, as well as techniques we use in this thesis. The details of our own model will be described in Chapter 3. The implementation details and the experiments with empirical evidence are discussed in Chapter 4.
The last chapter presents our conclusion and possible future work.

Chapter 2
Related Work

2.1 Traditional Summarization

Before neural models, enabled by large datasets, became the most successful approach in summarization, researchers had applied more traditional techniques, like probabilistic models and graph-based methods. Mihalcea and Tarau [25] were the first to introduce graph theory and corresponding algorithms to the NLP area (including the summarization area). They propose to build a graph of text, where the nodes of the graph are text elements (e.g., the sentences). Then a node ranking technique can be applied to extract the most important information. They test their method on two tasks, keyword extraction and sentence extraction, i.e. extractive summarization. At the same time, Erkan et al. [15] propose a similar graph-based method for multi-document summarization. Later on, Garg et al. [16] improve the sentence extraction model on meeting transcripts by first clustering the sentences, and then building a graph similar to that of [25] over clusters instead of single sentences. The reason is that meeting transcripts usually consist of incomplete, ill-formed sentences with high redundancy, and also contain chit-chat, which has nothing to do with the main topic; by using clusters, they can avoid including such sentences. Besides, they use the cosine similarity of two sentences rather than directly counting the common words, and show that this leads to better performance. Tixier et al. [39] propose a submodular based summarization model.
They first extract keywords by building a graph of words and selecting important words [40], then select sentences containing high-scoring keywords, and finally build the extractive summary by maximizing a submodular function within a certain budget. Unlike the ranking score used in TextRank, which tends to give nodes with more important connections higher scores in the graph without any cohesiveness consideration, they hypothesize that important words are more likely to be found among the influential spreaders, which are the nodes that not only have many important connections but also lie in dense substructures of these connections. Thus they use CoreRank [40] as the score of each word, because it can capture the spreading influence of a node.

2.2 Neural Extractive Summarization

Benefiting from the success of neural sequence models in other NLP tasks, Cheng and Lapata [4] propose a novel approach to single-document extractive summarization based on neural networks and continuous sentence features, which outperforms traditional methods on the DailyMail dataset. In particular, they develop a general encoder-decoder architecture, where a CNN is used as sentence encoder, a uni-directional LSTM as document encoder, with another uni-directional LSTM as decoder. Besides, they extend their structure to an abstractive stage, by changing the sequence-labeling decoder to an attention-based word generator, but the result shows that the abstractive method is not as promising as the extractive model. To decrease the number of parameters while maintaining the accuracy, Nallapati et al. [27] present SummaRuNNer, a simple RNN-based sequence classifier without decoder, outperforming or matching the model of [4]. They take content, salience, novelty, and position of each sentence into consideration when deciding if a sentence should be included in the extractive summary. Yet, they do not capture any aspect of the topical structure, as we do in this thesis.
So their approach would arguably suffer when applied to long documents, likely containing multiple and diverse topics.

While SummaRuNNer is tested only on news, Kedzie and McKeown [19] carry out a comprehensive set of experiments with deep learning models of extractive summarization across different domains, i.e. news, personal stories, meetings, and medical articles, as well as across different neural architectures, in order to better understand the general pros and cons of different design choices. They compare different sentence encoders (CNN, RNN, and Average Word Embedding) and sentence extractors. The extractors they compare are shown in Figures 2.1 and 2.2:

[Figure 2.1: Extractors compared in [19]. Model (a) is a simple RNN model; model (b) is an attention-based encoder-decoder model.]

[Figure 2.2: Extractors compared in [19]. Model (c) is the extractor proposed in [4]; model (d) is the extractor proposed in [27].]

• Extractor (a): a simple bidirectional RNN model. First, the sentence embeddings are encoded through a bidirectional RNN, and then the forward and backward hidden states for each sentence are concatenated and passed through a Multi-Layer Perceptron with a logistic sigmoid. The final output is the probability that the sentence is selected.

• Extractor (b): an attention-based encoder-decoder model. The sentence embeddings are first encoded through a bidirectional RNN, as in extractor (a), and then there is a bidirectional decoder with attention mechanism, in which the outputs of the decoder are the query vectors attending to the encoder. After that, the attended encoder output and the decoder output are concatenated and passed through a Multi-Layer Perceptron with a logistic layer. The final output is also the probability that the sentence is selected.

• Extractor (c): this extractor is first proposed in [4], and is an auto-regressive encoder-decoder structure.
The sentences are encoded through an RNN encoder, with the last hidden state of the encoder as the initial hidden state of the decoder. The sentence embeddings are fed into the decoder and then passed to a Multi-Layer Perceptron with a logistic layer. The result is, again, the probability that the sentence is part of the extractive summary.

• Extractor (d): this extractor is proposed in [27]. The first step is also to encode sentence embeddings through an RNN encoder. Then a document representation is obtained by averaging the RNN outputs, and a summary representation is obtained by taking the weighted sum of the RNN outputs up to the current step, with the weights being the extraction probabilities of the corresponding sentences. The decision is made based on the RNN output, the document representation, the summary representation, as well as the position of the sentence in the document.

Datasets       # docs    avg. doc. length (# words)    avg. summary length (# words)
CNN            92K       656                           43
Daily Mail     219K      693                           52
NY Times       655K      530                           38
PubMed         133K      3016                          203
arXiv          215K      4938                          220
Bigpatent      1,341K    3573                          117
Bigpatent-A    193K      3521                          110

Table 2.1: Comparison of news datasets, scientific paper datasets and the recently proposed patent dataset [8][36]

They find that non-auto-regressive sentence extraction performs at least as well as auto-regressive extraction in all domains, where by auto-regressive sentence extraction they mean that the previous predictions are used to inform future predictions. Furthermore, they find that the Average Word Embedding sentence encoder works at least as well as encoders based on CNN and RNN. In light of these findings, our model is not auto-regressive and uses the Average Word Embedding encoder.

2.3 Datasets for long documents

[12] provide a comprehensive overview of the current datasets for summarization. Noticeably, most of the larger-scale summarization datasets consist of relatively short documents (less than 1000 words/document), like CNN/DailyMail [26] and New York Times [32].
One exception is [8], which recently introduced two large-scale datasets of long and structured scientific papers obtained from arXiv and PubMed. These two new datasets contain much longer documents than all the news datasets (see Table 2.1) and are therefore ideal test-beds for the method we present in this thesis. Recently, a new dataset for summarization, Bigpatent, was proposed in [36]. The dataset consists of 1.3 million U.S. patent documents collected from Google Patents Public Datasets, and the authors use the patent's abstract as the summary and the corresponding description as the input document. Based on their result, they claim that when compared with other datasets (news and scientific papers), the summaries in their dataset contain a richer discourse structure with more repeating entities, and the salient content is evenly distributed throughout the input. Besides, the documents tend to be as long as in the Pubmed dataset, while the summaries tend to be shorter, showing the higher compression ratio of this new dataset. There are 9 different categories in this dataset, and in this thesis, we only do the experiment on the first category (Bigpatent-A).

The main problem for experiments on this dataset is that there is no section information as in scientific papers. To solve the problem, we apply a pre-trained topic segmentation model to split the whole documents into sections. We will introduce the previous works on topic segmentation in Sec 2.6.

2.4 Neural Abstractive summarization

Recently, more researchers have started to use neural networks to generate abstractive summaries, especially after the large CNN/DailyMail corpus was introduced to the summarization community by Nallapati et al. [26] (it was originally used for the task of passage-based question answering [18]). They propose an encoder-decoder RNN model with attention, as well as several variants with neural tricks, like using a pointer-generator switch and/or hierarchical structure. See et al.
[34] apply the same idea as the pointer-generator switch, and they propose the pointer-generator model for summarization, in which a generation probability determines whether to generate words from the vocabulary or to copy words from the source document. They show that they significantly outperform the abstractive state-of-the-art result at that time. However, the model works only for relatively short documents, for which the need for summarization is limited. As See et al. mention in their paper, they truncate the articles to 400 tokens.

While most current neural abstractive summarization models focus on summarizing relatively short news articles, a few researchers have started to investigate the summarization of longer documents by exploiting their natural structure. Celikyilmaz et al. [3] present an encoder-decoder architecture to address the challenges of representing a long document for abstractive summarization. The encoding task is divided across several collaborating agents, each responsible for a subsection of text through a multi-layer LSTM with word attention. Generally speaking, however, their model seems overly complicated when it comes to the extractive summarization task, where word attention is arguably much less critical. So, we do not consider this model further in this thesis.

[Figure 2.3: The structure of Cohan's model of discourse-aware abstractive summarization [8]]

Cohan et al. [8] also propose a model for abstractive summarization taking the structure of documents into consideration with a hierarchical approach, and test it on longer documents with section information, i.e. scientific papers. In particular, they apply a hierarchical encoder at the word and section levels. Then, in the decoding step, they combine the word attention and section attention to obtain a context vector at each state. Finally, a score is computed based on the context vector, and the word is chosen as the one with highest score.
The model is shown in Figure 2.3.

This approach to capturing discourse structure is however quite limited, both in general and especially when considering its application to extractive summarization. First, their hierarchical method has a large number of parameters and is therefore slow to train and likely prone to overfitting (footnote 1). Secondly, it does not take the global context of the whole document into account, which is arguably critical in extractive methods when deciding on the salience of a sentence (or even a word). The extractive summarizer we present in this thesis does not suffer from these limitations, because it adopts the parameter-lean LSTM-minus method and explicitly models the global context.

[Footnote 1: To address this, they only process the first 2000 words of each document, by setting a hard threshold in their implementation, and therefore losing information.]

2.5 LSTM-Minus

[Figure 2.4: LSTM-Minus in dependency parsing.]

The LSTM-Minus method is first proposed in [43] as a novel way to learn sentence segment embeddings for graph-based dependency parsing, i.e. estimating the most likely dependency tree given an input sentence. For each dependency pair, they divide a sentence into three segments (prefix, infix and suffix), and LSTM-Minus is used to represent each segment. They apply a single LSTM to the whole sentence and use the difference between two hidden states, $h_j - h_i$, to represent the segment from word $w_i$ to word $w_j$, as shown in Figure 2.4. This enables their model to learn segment embeddings from information both outside and inside the segments, enhancing their model's ability to access sentence-level information. The intuition behind the method is that each hidden vector $h_t$ can capture useful information before and including the word $v_t$.

Shortly after, [10] use the same method on the task of constituency parsing, as the representation of a sentence span, extending the original uni-directional LSTM-Minus to the bi-directional case.
More recently, inspired by the success of LSTM-Minus in both dependency and constituency parsing, [24] extend the technique to discourse parsing. They propose a two-stage model consisting of an intra-sentential parser and a multi-sentential parser, learning contextually informed representations of constituents with LSTM-Minus, at the sentence and document level, respectively.

Similarly, in this thesis, when deciding if a sentence should be included in the summary, the local context of that sentence is captured by applying LSTM-Minus at the document level, to represent the sub-sequence of sentences of the document (i.e., the topic/section) the target sentence belongs to.

2.6 Topic Segmentation

Topic segmentation is the task of dividing a document into segments, such that the sentences within each segment are topically cohesive, while the cut-off points mark a change of topic. This provides a basic structure of documents that can be useful for summarization.

The traditional topic segmentation models are mostly unsupervised, due to the lack of large-scale labeled data. Riedl and Biemann [30] employ a method based on the topics assigned by the Bayesian inference method of LDA. They define a coherence score between pairs of sentences, and identify segment boundaries by large score drops between pairs of adjacent sentences. Another noteworthy approach is GRAPHSEG [17], an unsupervised graph-based approach, which builds a semantic relatedness graph, where nodes represent sentences and edges are created between semantically related sentence pairs. The topic segments are then obtained by finding maximal cliques in the relatedness graph. This is arguably better because instead of approximating the meaning of the sentence with its topics as in [30], it explicitly leverages the semantic relatedness between sentences.

Koshorek et al.
[21] introduce a large-scale natural dataset, the WIKI-727K dataset, which is extracted from Wikipedia with sections as the topic segments, and propose a supervised hierarchical neural network model to solve the problem, as shown in Figure 2.5. For a given document $d = (s_1, s_2, ..., s_n)$, the output of the model would be $(y_1, y_2, ..., y_n)$, where $y_i$ indicates the probability that sentence $s_i$ is the end of a topic segment. After that, the final decision, whether a sentence should be the boundary of two topics, is made by setting a threshold $t$ on the value of $y_i$: essentially, only the sentences with probability higher than $t$ will be categorized as topic boundaries. In a series of experiments, they show that their model outperforms other unsupervised models on natural datasets (except for one synthesized automatically [6]), and that it generalizes well to unseen natural text.

[Figure 2.5: The supervised topic segmentation model proposed in [21]]

In this thesis, when topic segment information is not available, we will use the pre-trained model proposed in [21] to first split the whole long documents into sections.

2.7 Statistical Significance test in NLP

[Figure 2.6: The approximate randomization statistical significance test. [31]]

Riezler and Maxwell [31] study some deficiencies in the discriminatory ability of Machine Translation evaluation metrics (NIST, BLEU, F1) and the accuracy of statistical significance tests. In particular, they show an example in which, if the difference between two systems in BLEU is as small as 0.3%, then the confidence levels are assessed as 70%, in which case the two systems can not be considered to have a significant difference.
This highlights the fact that if the differences between the results of multiple systems are small, then a statistical significance test is critically needed. Based on their experiments, they show that approximate randomization [28] estimates the p-value more conservatively than the popular bootstrap test [14], which increases the likelihood of type-I error for the latter. Based on their findings, Kedzie et al. [19] choose approximate randomization as the method for statistical significance testing.

Thus, following their work, we use the approximate randomization statistical significance test on all the results shown in this thesis. The algorithm to compute the p-value is shown in Figure 2.6.

Chapter 3
Extractive Summarization Model for Long Documents

In this thesis, we propose an extractive model for long documents, incorporating local and global context information, motivated by the natural topic-oriented structure of human-written long documents. The architecture of our model is shown in Figure 3.1. Each sentence is visited sequentially in the original document order, and a corresponding confidence score is computed expressing whether the sentence should be included in the extractive summary (footnote 1). Our model comprises three components: the sentence encoder, the document encoder and the sentence classifier.

3.1 Sentence Encoder

The goal of the sentence encoder is to map sequences of word embeddings to a fixed-length vector (see bottom center of Figure 3.1). There are several common methods to embed sentences. For extractive summarization, RNNs were used in [27], CNNs in [4], and Average Word Embedding in [19]. [19] experiment with all three methods, and conclude that Word Embedding Averaging is as good as or better than either RNNs or CNNs for sentence embedding across different domains and summarizer architectures.
So in this work, we use the Average Word Embedding as our sentence encoder, by which a sentence embedding is simply the average of its word embeddings, i.e.

\[ se = \frac{1}{n} \sum_{w_0}^{w_n} \mathrm{emb}(w_i), \quad se \in \mathbb{R}^{d_{emb}}. \]

Besides, we also tried the popular pre-trained BERT sentence embedding [13], but initial results were rather poor, so we did not pursue this possibility any further.

¹ We do not deal with redundancy in this thesis.

Figure 3.1: The structure of our model. $se_i$ and $sr_i$ represent the sentence embedding and sentence representation of sentence $i$, respectively. The binary decision of whether the sentence should be included in the summary is based on the sentence itself (A), the whole document (B) and the current topic (C). The document representation is simply the concatenation of the last hidden states of the forward and backward RNNs, while the topic segment representation is computed by applying LSTM-Minus, as detailed in Figure 3.2.

Figure 3.2: Detail of C. The topic segment representation is computed by applying LSTM-Minus. The RNN in the red rectangle is the Document Encoder, the same as the one in the red rectangle in Figure 3.1.

3.2 Document Encoder

At the document level, a bi-directional recurrent neural network [33] is often used to encode all the sentences sequentially forward and backward, and such models have achieved remarkable success in machine translation [2]. As units, we selected gated recurrent units (GRU) [5], in light of the favorable results shown in [7]. The GRU is represented in a standard fashion, with $r$, $z$, $n$ representing the reset, update, and new gates, respectively:

\[ r_t = \sigma(W_{ir} se_t + b_{ir} + W_{hr} h_{(t-1)} + b_{hr}) \]
\[ z_t = \sigma(W_{iz} se_t + b_{iz} + W_{hz} h_{(t-1)} + b_{hz}) \]
\[ n_t = \tanh(W_{in} se_t + b_{in} + r_t \odot (W_{hn} h_{(t-1)} + b_{hn})) \]
\[ h_t = (1 - z_t) \odot n_t + z_t \odot h_{(t-1)} \]

The output of the bi-directional GRU for each sentence $t$ comprises two hidden states, $h^f_t \in \mathbb{R}^{d_{hid}}$ and $h^b_t \in \mathbb{R}^{d_{hid}}$, the forward and backward hidden states, respectively.

A. Sentence representation. As shown in Figure 3.1(A), for each sentence $t$, the sentence representation is the concatenation of the forward and backward hidden states of that sentence:

\[ sr_t = (h^f_t : h^b_t), \quad sr_t \in \mathbb{R}^{d_{hid} * 2} \]

In this way, the sentence representation not only represents the current sentence, but also partially covers contextual information both before and after this sentence.

B. Document representation. The document representation provides global information on the whole document. It is computed as the concatenation of the final states of the forward and backward GRU, labeled as B in Figure 3.1 [22]:

\[ d = (h^f_n : h^b_0), \quad d \in \mathbb{R}^{d_{hid} * 2} \]

C. Topic segment representation. To capture the local context of each sentence, namely the information of the topic segment that the sentence falls into, we apply the LSTM-Minus method², a method for learning embeddings of text spans. LSTM-Minus is shown in detail in Figure 3.2: each topic segment is represented as the difference between the hidden states at the start and at the end of that topic. As illustrated in Figure 3.2, the representation for section 2 of the sample document (containing three sections and eight sentences overall) can be computed as $[f_5 - f_2, b_3 - b_6]$, where $f_5$, $f_2$ are the forward hidden states of sentences 5 and 2, respectively, while $b_3$, $b_6$ are the backward hidden states of sentences 3 and 6, respectively.

² In the original paper, LSTMs were used as the recurrent unit. Although we use GRUs here, for consistency with previous work we still call the method LSTM-Minus.

In general, the topic segment representation $l_t$ for segment $t$ is computed as:

\[ f_t = h^f_{end_t} - h^f_{start_t - 1}, \quad f_t \in \mathbb{R}^{d_{hid}} \]
\[ b_t = h^b_{start_t} - h^b_{end_t + 1}, \quad b_t \in \mathbb{R}^{d_{hid}} \]
\[ l_t = (f_t : b_t), \quad l_t \in \mathbb{R}^{d_{hid} * 2} \]

where $start_t$ and $end_t$ are the indices of the first and last sentence of topic $t$, and $f_t$ and $b_t$ denote the forward and backward topic segment representations, respectively. The final representation of topic $t$ is the concatenation of the forward and backward representations, $l_t$.
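The LSTM-Minus computation above can be illustrated with a small sketch. This is not the thesis implementation: the function name `segment_representations`, the use of plain Python lists in place of GRU hidden states, and the 1-based inclusive segment boundaries are all assumptions made for illustration. Zero vectors are added at both ends so that the indices `start - 1` and `end + 1` are always valid.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def subtract(a: Sequence[float], b: Sequence[float]) -> Vector:
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def segment_representations(
    forward: Sequence[Sequence[float]],
    backward: Sequence[Sequence[float]],
    segments: Sequence[Tuple[int, int]],
) -> List[Vector]:
    """Return l_t = (f_t : b_t) for every segment (hypothetical helper).

    forward[i - 1] and backward[i - 1] are the hidden states of sentence i.
    Each segment is a 1-based, inclusive (start, end) pair of sentence indices.
    """
    dim = len(forward[0])
    zero = [0.0] * dim
    # Pad with a zero vector at position 0 and at position n + 1.
    fwd = [zero] + [list(h) for h in forward] + [zero]
    bwd = [zero] + [list(h) for h in backward] + [zero]
    reps = []
    for start, end in segments:
        f_t = subtract(fwd[end], fwd[start - 1])   # h^f_end - h^f_(start-1)
        b_t = subtract(bwd[start], bwd[end + 1])   # h^b_start - h^b_(end+1)
        reps.append(f_t + b_t)  # list concatenation plays the role of (f_t : b_t)
    return reps


# Toy document: eight sentences, three sections, one-dimensional hidden states.
toy_forward = [[float(i)] for i in range(1, 9)]
toy_backward = [[float(10 * i)] for i in range(1, 9)]
toy_segments = [(1, 2), (3, 5), (6, 8)]
toy_reps = segment_representations(toy_forward, toy_backward, toy_segments)
print(toy_reps[1])  # section 2: [f5 - f2, b3 - b6] = [3.0, -30.0]
```

In this toy setting, section 2 covers sentences 3 to 5, so its representation is built from $f_5 - f_2$ and $b_3 - b_6$, matching the worked example in the text.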
To obtain $f_t$ and $b_t$, we subtract the GRU hidden vectors at $start_t$ and $end_t$ as above, and we pad the hidden states with zero vectors at both the beginning and the end, to ensure that the indices cannot go out of bounds. The intuition behind this process is that the GRUs can keep useful previous information in their memory cell by exploiting the reset, update, and new gates to decide how to utilize and update the memory of previous information. In this way, we can represent the contextual information within each topic segment for all the sentences in that segment.

3.3 Decoder

Once we have obtained a representation for the sentence, for its topic segment (i.e., local context) and for the document (i.e., global context), these three factors are combined to make a final prediction $p_i$ on whether the sentence should be included in the summary. We consider two ways in which these three factors can be combined.

Concatenation. We can simply concatenate the vectors of these three factors as

\[ input_i = (d : l_t : sr_i), \quad input_i \in \mathbb{R}^{d_{hid} * 6} \]

where sentence $i$ is part of topic $t$, and $input_i$ is the representation of sentence $i$ with topic segment information and global context information.

Attentive context. As local context and global context are both contextual information for the given sentence, we use an attention mechanism to decide the weight of each context vector:

\[ score^d_i = v^T \tanh(W_a (d : sr_i)) \]
\[ score^l_i = v^T \tanh(W_a (l_t : sr_i)) \]
\[ weight^d_i = \frac{score^d_i}{score^d_i + score^l_i} \]
\[ weight^l_i = \frac{score^l_i}{score^d_i + score^l_i} \]
\[ context_i = weight^d_i * d + weight^l_i * l_t \]
\[ input_i = (sr_i : context_i), \quad input_i \in \mathbb{R}^{d_{hid} * 4} \]

where $context_i$ is the weighted context vector of sentence $i$, and sentence $i$ is assumed to be in topic $t$.

Finally, a multi-layer perceptron (MLP) followed by a sigmoid activation function produces the confidence score for selecting each sentence:

\[ h_i = \mathrm{Dropout}(\mathrm{ReLU}(W_{mlp} \, input_i + b_{mlp})) \]
\[ p_i = \sigma(W_h h_i + b_h). \]

Chapter 4
Experiments

To validate our method, we set up experiments on the two scientific paper datasets (arXiv
and PubMed). With ROUGE and METEOR scores as automatic evaluation metrics, we compare with previous work, both abstractive and extractive. Besides, we also run a series of experiments on an additional dataset, Bigpatent, in which the documents do not contain section information.

4.1 General Experiment Settings

In this section, we introduce the details and general settings of our experiments on all the datasets.

4.1.1 Training

We minimize the weighted negative log-likelihood during training, where the weight is computed as $w_{pos} = \frac{\#negative}{\#positive}$, to address the problem of highly imbalanced data (typical in extractive summarization):

\[ L = -\sum_{d=1}^{N} \sum_{i=1}^{N_d} \left( w_{pos} * y_i \log p(y_i \mid W, b) + (1 - y_i) \log p(y_i \mid W, b) \right) \]

where $y_i$ represents the ground-truth label of sentence $i$, with $y_i = 1$ meaning that sentence $i$ is in the gold-standard extractive summary.

4.1.2 Extractive Label Generation

In the Pubmed and arXiv datasets, the extractive summaries are missing. So we follow the work of [19] on extractive summary labeling, constructing gold label sequences by greedily optimizing ROUGE-1 on the gold-standard abstracts, which are available for each article.¹ The pseudocode is shown in Algorithm 1.

Algorithm 1 Extractive label generation

    function LABELGENERATION(reference, sentences, lengthLimit)
        hyp = ""
        wc = 0
        picked = []
        highest_r1 = 0
        while wc ≤ lengthLimit do
            sid = −1
            for i in range(len(sentences)) do
                score = ROUGE(hyp + sentences[i], reference)
                if score > highest_r1 then
                    highest_r1 = score
                    sid = i
                end if
            end for
            if sid != −1 then
                picked.append(sid)
                hyp = hyp + sentences[sid]
                wc += NumberOfWords(sentences[sid])
            else
                break
            end if
        end while
        return picked
    end function

¹ For this, we use a popular Python implementation of the ROUGE score to build the oracle. Code can be found here.

4.1.3 Implementation Details

We train our model using the Adam optimizer [20] with learning rate 0.0001. We use mini-batches with a batch size of 32 documents, and the size of the GRU hidden states is 300.
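As an implementation-level illustration of the training objective in Section 4.1.1, the weighted negative log-likelihood can be sketched as below. This is a minimal sketch under stated assumptions, not the thesis code: it assumes that `probs` holds the predicted probability that each sentence belongs to the summary, so that the negative term uses `log(1 - p)` as in standard binary cross-entropy, and it computes the weight over whatever set of labels it is given. The names `positive_weight` and `weighted_nll` are hypothetical.

```python
import math
from typing import Sequence


def positive_weight(labels: Sequence[int]) -> float:
    """w_pos = #negative / #positive over the given labels (assumed scope)."""
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    return negatives / positives


def weighted_nll(probs: Sequence[float], labels: Sequence[int], w_pos: float) -> float:
    """Weighted negative log-likelihood for binary sentence labels.

    probs[i] is the predicted probability that sentence i is in the summary
    (an assumption about how p(y_i | W, b) is parameterized).
    """
    loss = 0.0
    for p, y in zip(probs, labels):
        if y == 1:
            loss -= w_pos * math.log(p)       # up-weighted positive sentences
        else:
            loss -= math.log(1.0 - p)         # negative sentences
    return loss


labels = [1, 0, 0, 0]                 # one summary sentence out of four
w_pos = positive_weight(labels)       # 3 negatives / 1 positive = 3.0
print(weighted_nll([0.5, 0.5, 0.5, 0.5], labels, w_pos))
```

With this weighting, the single positive sentence contributes as much to the loss as the three negative sentences together, which is the intended effect of balancing the two classes.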
The pre-trained word embeddings we use are GloVe [29] with dimension 300, trained on Wikipedia and Gigaword. The vocabulary size of our model is 50000, and the dropout rate we use in the experiments is 0.3. All the above parameters were set based on [19] without any fine-tuning. Again following [19], we train each model for 30 epochs, and the best model is selected with early stopping on the validation set according to the Rouge-2 F-score.

4.1.4 Models for Comparison

We perform a systematic comparison with previous work in extractive summarization. For completeness, we also compare with recent neural abstractive approaches. In all the experiments, we use the same train/val/test split.

• Traditional extractive summarization models: SumBasic [41], LSA [37], and LexRank [15] (only available on the scientific paper datasets)
• Neural abstractive summarization models: Attn-Seq2Seq [26], Pntr-Gen-Seq2Seq [34] and Discourse-aware [8] (only available on the scientific paper datasets)
• Neural extractive summarization models: Cheng&Lapata [4] and SummaRuNNer [27]. Based on [19], we use the Average Word Encoder as sentence encoder for both models, instead of the CNN and RNN sentence encoders that were originally used in the two systems, respectively.²
• Baseline: Similar to our model, but without local context and global context, i.e. the input to the MLP is the sentence representation only.
• Lead: Given a length limit of k words for the summary, Lead returns the first k words of the source document.
• Oracle: uses the gold-standard extractive labels, generated based on ROUGE (Sec. 4.1.2).

² Aiming for a fair and reproducible comparison, we re-implemented the models by borrowing the extractor classes from [19]. The source code can be found online.

4.2 Experiments on Scientific Paper Datasets

In this section, we show the results of the experiments on the two scientific paper datasets, Pubmed and arXiv.

4.2.1 Results and Analysis

| Model | Rouge-1 | Rouge-2 | Rouge-L | Meteor |
|---|---|---|---|---|
| SumBasic* | 29.47 | 6.95 | 26.30 | - |
| LSA* | 29.91 | 7.42 | 25.67 | - |
| LexRank* | 33.85 | 10.73 | 28.99 | - |
| Attn-Seq2Seq* | 29.30 | 6.00 | 25.56 | - |
| Pntr-Gen-Seq2Seq* | 32.06 | 9.04 | 25.16 | - |
| Discourse-aware* | 35.80 | 11.05 | 31.80 | - |
| Baseline | 42.91 | 16.65 | 28.53 | 21.35 |
| Cheng & Lapata | 42.24 | 15.97 | 27.88 | 20.97 |
| SummaRuNNer | 42.81 | 16.52 | 28.23 | 21.35 |
| Ours-attentive context | 43.58 | 17.37 | 29.30 | 21.71 |
| Ours-concat | 43.62 | 17.36 | 29.14 | 21.78 |
| Lead | 33.66 | 8.94 | 22.19 | 16.45 |
| Oracle | 53.88 | 23.05 | 34.90 | 24.11 |

Table 4.1: Results on the arXiv dataset. For models marked with *, we report results from [8]. Models are traditional extractive in the first block (SumBasic to LexRank), neural abstractive in the second block (Attn-Seq2Seq to Discourse-aware), and neural extractive in the third block (Baseline to Ours-concat). The Oracle (last row) corresponds to using the ground-truth labels, obtained (for training) by the greedy algorithm, see Section 4.1.2. Results that are not significantly different from the best system are in bold.

| Model | Rouge-1 | Rouge-2 | Rouge-L | Meteor |
|---|---|---|---|---|
| SumBasic* | 37.15 | 11.36 | 33.43 | - |
| LSA* | 33.89 | 9.93 | 29.70 | - |
| LexRank* | 39.19 | 13.89 | 34.59 | - |
| Attn-Seq2Seq* | 31.55 | 8.52 | 27.38 | - |
| Pntr-Gen-Seq2Seq* | 35.86 | 10.22 | 29.69 | - |
| Discourse-aware* | 38.93 | 15.37 | 35.21 | - |
| Baseline | 44.29 | 19.17 | 30.89 | 20.56 |
| Cheng & Lapata | 43.89 | 18.53 | 30.17 | 20.34 |
| SummaRuNNer | 43.89 | 18.78 | 30.36 | 20.42 |
| Ours-attentive context | 44.81 | 19.74 | 31.48 | 20.83 |
| Ours-concat | 44.85 | 19.70 | 31.43 | 20.83 |
| Lead | 35.63 | 12.28 | 25.17 | 16.19 |
| Oracle | 55.05 | 27.48 | 38.66 | 23.60 |

Table 4.2: Results on the Pubmed dataset. For models marked with *, we report results from [8]. See the caption of Table 4.1 above for details on the compared models. Results that are not significantly different from the best system are in bold.

For evaluation, we follow the same procedure as in [19]. Summaries are generated by selecting the top-ranked sentences by model probability $p(y_i \mid W, b)$, until the length limit is met or exceeded. Based on the average length of abstracts in these two datasets, we set the length limit to 200 words. We use ROUGE scores³ [23] and METEOR scores [11] between the model results and the ground-truth abstractive summaries as evaluation metrics. The unigram and bigram overlaps (ROUGE-1,2) are intended to measure informativeness, while the longest common subsequence (ROUGE-L) captures fluency to some extent [4]. METEOR was first proposed to evaluate translation systems; it scores machine translation hypotheses by aligning them to one or more reference translations. Alignments are based on exact, stem, synonym, and paraphrase matches between words and phrases. METEOR has been reported to correlate well with human judgments in independent evaluations [11].

³ We use a modified version of rouge papier, a Python wrapper of ROUGE-1.5.5. The command line is 'Perl ROUGE-1.5.5 -e data -a -n 2 -r 1000 -f A -z SPL config file'.

The performance of all models on arXiv and Pubmed is shown in Table 4.1 and Table 4.2, respectively.
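Since several of the score differences in these tables are small, significance testing matters for reading them. The approximate randomization test of Section 2.7 can be sketched as follows. This is a minimal illustration, not the thesis code: it assumes paired per-document scores for two systems, uses the absolute difference of mean scores as the test statistic, and estimates the p-value as (c + 1) / (R + 1), where c counts the shuffles whose difference is at least as large as the observed one. The function names, the default of 1000 trials and the fixed seed are hypothetical choices.

```python
import random
from typing import Sequence


def approximate_randomization(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    trials: int = 1000,
    seed: int = 0,
) -> float:
    """Two-sided paired approximate randomization test (illustrative sketch)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same documents")
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    at_least_as_extreme = 0
    for _ in range(trials):
        sum_a = 0.0
        sum_b = 0.0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the two outputs are exchangeable,
            # so each pair is swapped with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            sum_a += a
            sum_b += b
        if abs(sum_a - sum_b) / n >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (trials + 1)


def bonferroni_significant(p_value: float, alpha: float, num_comparisons: int) -> bool:
    """Bonferroni correction: compare p against alpha / number of comparisons."""
    return p_value < alpha / num_comparisons


# Hypothetical per-document scores for two systems on 20 documents.
system_a = [0.5 + 0.01 * i for i in range(20)]
system_b = [score - 0.2 for score in system_a]
p = approximate_randomization(system_a, system_b)
print(p, bonferroni_significant(p, alpha=0.01, num_comparisons=4))
```

The Bonferroni helper simply tightens the per-comparison threshold; with four comparisons at an overall level of 0.01, each individual test must reach p < 0.0025.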
We use approximate randomization as the statistical significance test [31], with the Bonferroni correction for the multiple comparison problem, at a significance level of 0.01 (p < 0.01).

As we can see from these tables, on both datasets the neural extractive models outperform the traditional extractive models on informativeness (ROUGE-1,2) by a wide margin, but results are mixed on ROUGE-L. Presumably, this is due to the neural training process, which relies on a gold standard based on ROUGE-1. Exploring other training schemes and/or a combination of traditional and neural approaches is left as future work. Similarly, the neural extractive models also dominate the neural abstractive models on ROUGE-1,2, but these abstractive models tend to have the highest ROUGE-L scores, possibly because they are trained directly on gold-standard abstract summaries.

Compared with other neural extractive models, our models (both with the attentive context and the concatenation decoder) perform better on all three ROUGE metrics as well as on METEOR. In particular, the improvements over the Baseline model show that the local and global contextual information does help to identify the most important sentences. Interestingly, the Baseline model alone already achieves slightly better performance than previous work, possibly because the auto-regressive approach used in those models is even more detrimental for long documents.

| Dataset-versus | Rouge-1 (%) | Rouge-2 (%) | Rouge-L (%) | Meteor (%) |
|---|---|---|---|---|
| arXiv-SR | +1.9 | +5.1 | +3.2 | +2.0 |
| arXiv-BSL | +1.7 | +4.3 | +2.1 | +2.0 |
| Pubmed-SR | +2.2 | +4.9 | +3.5 | +2.0 |
| Pubmed-BSL | +1.3 | +2.8 | +1.7 | +1.3 |
| Macro avg-SR | +2.0 | +5.0 | +3.4 | +2.0 |
| Macro avg-BSL | +1.5 | +3.5 | +1.9 | +1.7 |

Table 4.3: Percentage relative improvement of our model, when compared with the SummaRuNNer (SR) and Baseline (BSL) models on both datasets (first and second block). The third block shows the macro average relative improvement across the two datasets.
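The relative improvements in Table 4.3 are percentage changes of our score over a reference model's score. The short sketch below recomputes the arXiv-SR row from the scores in Table 4.1. The pairing of Ours-concat with SummaRuNNer is an inference from the numbers (that variant reproduces the reported row), not something the text states explicitly, and the helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(ours: float, reference: float) -> float:
    """Percentage improvement of `ours` relative to `reference`."""
    return 100.0 * (ours - reference) / reference


# Scores copied from Table 4.1 (arXiv).
arxiv_ours_concat = {"Rouge-1": 43.62, "Rouge-2": 17.36, "Rouge-L": 29.14, "Meteor": 21.78}
arxiv_summarunner = {"Rouge-1": 42.81, "Rouge-2": 16.52, "Rouge-L": 28.23, "Meteor": 21.35}

arxiv_sr_gains = {
    metric: round(relative_improvement(arxiv_ours_concat[metric], arxiv_summarunner[metric]), 1)
    for metric in arxiv_ours_concat
}
print(arxiv_sr_gains)
# {'Rouge-1': 1.9, 'Rouge-2': 5.1, 'Rouge-L': 3.2, 'Meteor': 2.0}
```

For example, the Rouge-2 gain is 100 × (17.36 − 16.52) / 16.52 ≈ 5.1%, which shows why a gain of less than one absolute point still amounts to a sizeable relative improvement on this metric.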
The details of these key comparisons are given in Table 4.3, which shows the percentage relative improvements of our model over the Baseline and SummaRuNNer on both datasets, as well as their macro averages.

Figure 4.1 shows the most important result of our analysis: the benefits of our method, explicitly designed to capture global and local context for dealing with longer documents, do actually become stronger as we apply it to longer documents. As can be seen in the figure, the performance gain of our model with respect to its closest neural competitor is more pronounced for documents with 3000 or more words.

Figure 4.1: A comparison between our model, SummaRuNNer and Oracle when applied to documents of increasing length. Top left: ROUGE-1 on the Pubmed dataset; top right: ROUGE-2 on the Pubmed dataset; bottom left: ROUGE-1 on the arXiv dataset; bottom right: ROUGE-2 on the arXiv dataset.

Finally, the result of Lead (Tables 4.1 and 4.2) shows that scientific papers have less position bias than news; i.e., the first sentences of these papers are not a good choice to form an extractive summary.

Figure 4.2 shows the relative position of our predicted sentences, the oracle sentences and the section borders in the documents, with the documents uniformly sampled from the highest ROUGE score (left) to the lowest ROUGE score (right). The interesting result is that our method prefers the first and last sentences of the first and last sections, which are most likely the Introduction and Conclusion of the scientific papers, respectively, even though the model has no explicit information on positions.

Figure 4.2: The relative position in documents of our predicted sentences, the oracle sentences, and the section borders. The documents are sampled uniformly from the highest ROUGE score (left) to the lowest ROUGE score (right). The upper figure shows the position distribution for Pubmed, and the lower one shows the position distribution for arXiv.

As a teaser for the potential and challenges that still face our approach, its output (i.e., the extracted sentences) when applied to this thesis is underlined, and the order in which the sentences are extracted is marked with Roman numerals. They are all located in the Introduction chapter, distributed from the introductory paragraph to the contribution part. The three most confident sentences are the ones stating the motivation, explaining the intuition, and describing the experiments. If we increase the length limit to the number of words in our abstract, three more sentences are extracted, which do seem to provide useful complementary information. Not surprisingly, some redundancy is present, as dealing explicitly with redundancy is not a goal of our current proposal and is left as future work.

| Model | Rouge-1 | Rouge-2 | Rouge-L |
|---|---|---|---|
| Baseline | 44.29 | 19.17 | 30.89 |
| Baseline+local | 44.85 | 19.77 | 31.51 |
| Baseline+global | 44.06 | 18.83 | 30.53 |
| Baseline+global+local (concat) | 44.85 | 19.70 | 31.43 |

Table 4.4: Ablation study on the Pubmed dataset. Baseline is the model with sentence representation only, Baseline+local is the model with sentence and local topic information, Baseline+global is the model with sentence and global document information, and the last one is the full model with the concatenation decoder. Results that are not significantly different from the best system are in bold.

| Model | Rouge-1 | Rouge-2 | Rouge-L |
|---|---|---|---|
| Baseline | 42.91 | 16.65 | 28.53 |
| Baseline+local | 43.57 | 17.35 | 29.29 |
| Baseline+global | 42.90 | 16.58 | 28.36 |
| Baseline+global+local (concat) | 43.62 | 17.36 | 29.14 |

Table 4.5: Ablation study on the arXiv dataset. The model descriptions are the same as in Table 4.4. Results that are not significantly different from the best system are in bold.

4.2.2 Ablation Study

To investigate the influence of each part of our model, we perform an ablation study on the concatenation decoder with the ROUGE scores as evaluation metrics. The results are shown in Tables 4.4 and 4.5.
As in Section 4.2.1, we use approximate randomization as the statistical significance test [31], with the Bonferroni correction for the multiple comparison problem, at a significance level of 0.01 (p < 0.01).

From these tables, we can see that performance improves significantly with the additional topic information, for both the Baseline and the Baseline+global models. This indicates that the topic information is relevant to deciding whether a sentence should be part of the summary, and that the LSTM-Minus method helps to capture such information. Adding the global information, however, does not always improve performance; on the contrary, performance is even worse when the global information is added to the Baseline model. This might be because the global information we use is the same for all the sentences in a document, which might not be useful for distinguishing the summary sentences from the other sentences. To solve this problem, a more sentence-specific global representation needs to be explored, which we leave as future work.

4.3 Experiment on Bigpatent - Long Documents without Topic Segment Information

| Model | Rouge-1 | Rouge-2 | Rouge-L | Meteor |
|---|---|---|---|---|
| Baseline | 35.44 | 10.79 | 23.95 | 15.08 |
| Baseline + local | 35.62 | 10.86 | 24.05 | 15.19 |
| Baseline + global | 35.75 | 10.95 | 24.06 | 15.26 |
| Cheng & Lapata | 35.77 | 10.86 | 24.07 | 15.22 |
| SummaRuNNer | 35.79 | 10.94 | 24.04 | 15.25 |
| Ours-attentive context | 35.62 | 10.82 | 23.96 | 15.17 |
| Ours-concat | 35.62 | 10.84 | 24.02 | 15.16 |
| Lead | 31.27 | 8.64 | 21.58 | 12.93 |
| Oracle | 45.92 | 16.32 | 29.95 | 18.51 |

Table 4.6: Results on the Bigpatent-A dataset.

Documents in the Bigpatent corpus, unlike scientific papers, have no natural topic segment information, so in this case we first need to employ a topic segmentation method to split the documents into sections. In essence, this experiment is a preliminary exploration of whether our method can still deliver useful results when topic segmentation is done automatically.
The topic segmentation method we use in this experiment is the pre-trained model proposed by [21], trained on the WIKI-727K corpus. The dataset we use is BigPatent-A (Human Necessities), a subset of the new dataset proposed in [36]. For all the models, we set the length limit of the generated summaries to 100 words, based on the average length of the ground-truth summaries.

All the results on this dataset are shown in Table 4.6. No model significantly outperforms the others. Figure 4.3 shows the ROUGE scores for documents of different lengths. Although in this experiment the performance of our model does not improve relative to the others as the documents become longer, our model is still competitive with the state-of-the-art extractive summarization model. One possible reason is that the pre-trained topic segmentation model does not work well on this dataset, since it was trained on Wikipedia, which is quite different from patents. For instance, there is an obvious distinction between sections in Wikipedia, while patent documents tend to cover topics that are not sufficiently distinct. Furthermore, since we do not have ground-truth topic segment information for this dataset, we cannot evaluate how accurate the output of the topic segmentation model is. We leave the study of this issue for future work.

Figure 4.3: A comparison between our model, SummaRuNNer and Oracle when applied to documents of increasing length on the Bigpatent-A dataset.

Chapter 5
Conclusion and Future Work

In this thesis, we propose a novel extractive summarization model especially designed for long documents, which incorporates the local context within each topic along with the global context of the whole document. Our approach is based on the observation that when humans write long documents, they tend to include multiple topics and organize them in a structured way.
Technically, our proposal integrates recent findings on neural extractive summarization in a parameter-lean and modular architecture.

Our main contribution is that we apply the LSTM-Minus method to extractive summarization, as a way to generate the representation of a topic segment, e.g. a section in a scientific paper. This technique has been successfully used in graph-based dependency parsing [43], constituency parsing [10] and discourse parsing [24] as a representation of a text span. In this thesis, we show that it can be effectively applied to extractive summarization.

We evaluate our model and compare it with previous work in both extractive and abstractive summarization on two large scientific paper datasets, Pubmed and arXiv, which contain documents that are much longer than those in previously used corpora, and which come with natural topic segment information (sections). Our model not only achieves state-of-the-art results on these two datasets, but, in an additional experiment in which we consider documents of increasing length, it also becomes more competitive for longer documents. Besides, although we do not explicitly use the position information of each sentence and section, the results show that our model prefers the sentences located at the beginning or the end of the first and last sections, which are most likely the introduction and conclusion, respectively. We also performed an ablation study to test the effect of each module in our proposed model, and the results suggest that the local context alone can improve the baseline model significantly, while adding the global context does not have an obvious effect.

We also test our model on a new dataset, Bigpatent, with long documents but without natural topic segment information. In this case, we first apply a pre-trained topic segmentation model to split the documents, and then apply our model to generate extractive summaries.
Despite the fact that the topic segmentation model may be inaccurate, our summarizer still achieves results competitive with the state-of-the-art model. However, when the length of the documents increases, we do not find the same benefits as in the scientific paper datasets.

For future work, we will try to deal with redundancy in our generated summaries; one possible way is to have a summary representation storing the information of the summary at each timestep [27]. After that, it could be beneficial to integrate explicit features, like sentence position and salience, into our neural approach. As another avenue for future work, we will also explore how to apply hierarchical models that can leverage the section information, like Hierarchical Attention Networks (HAN) [44] or hierarchical transformers [42]. More generally, we plan to combine traditional and neural models, as suggested by our results.

Furthermore, we would like to explore more sophisticated document structures, like discourse trees, instead of rough topic segments. Besides, we would also like to explore ways to combine pre-trained multi-task language models (like BERT [13] or XLNet [45]) with the natural structure of documents to generate extractive summaries.

In the longer term, we will study how extractive and abstractive techniques can be integrated. Initially, the output of an extractive system could be fed into an abstractive one, training the two jointly. Then, we would consider a finer-grained integration, where the combination of abstractive and extractive techniques is tailored to the particular source document.

Bibliography

[1] M. Allahyari, S. A. Pouriyeh, M. Assefi, S. Safaei, E. D. Trippe, J. B. Gutierrez, and K. Kochut. Text summarization techniques: A brief survey. CoRR, abs/1707.02268, 2017.
[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2015.
[3] A. Celikyilmaz, A. Bosselut, X. He, and Y. Choi.
Deep communicating agents for abstractive summarization. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1662–1675, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi:10.18653/v1/N18-1150.
[4] J. Cheng and M. Lapata. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 484–494, Berlin, Germany, Aug. 2016. Association for Computational Linguistics. doi:10.18653/v1/P16-1046.
[5] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar, Oct. 2014. Association for Computational Linguistics. doi:10.3115/v1/W14-4012.
[6] F. Y. Y. Choi. Advances in domain independent linear text segmentation. In 1st Meeting of the North American Chapter of the Association for Computational Linguistics, 2000.
[7] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.
[8] A. Cohan, F. Dernoncourt, D. S. Kim, T. Bui, S. Kim, W. Chang, and N. Goharian. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 615–621, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi:10.18653/v1/N18-2097.
[9] E. Collins, I. Augenstein, and S. Riedel.
A supervised approach to extractive summarisation of scientific papers. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 195–205, Vancouver, Canada, Aug. 2017. Association for Computational Linguistics. doi:10.18653/v1/K17-1021.
[10] J. Cross and L. Huang. Span-based constituency parsing with a structure-label system and provably optimal dynamic oracles. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1–11, Austin, Texas, Nov. 2016. Association for Computational Linguistics. doi:10.18653/v1/D16-1001.
[11] M. Denkowski and A. Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380, Baltimore, Maryland, USA, June 2014. Association for Computational Linguistics. doi:10.3115/v1/W14-3348.
[12] F. Dernoncourt, M. Ghassemi, and W. Chang. A repository of corpora for summarization. In LREC, 2018.
[13] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi:10.18653/v1/N19-1423.
[14] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Number 57 in Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, Boca Raton, Florida, USA, 1993.
[15] G. Erkan and D. R. Radev. Lexrank: Graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res., 22(1):457–479, Dec. 2004. ISSN 1076-9757.
[16] N. Garg, B. Favre, K. Reidhammer, and D. Hakkani-Tür.
Clusterrank: A graph based method for meeting summarization. 2009.
[17] G. Glavaš, F. Nanni, and S. P. Ponzetto. Unsupervised text segmentation using semantic relatedness graphs. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, pages 125–130, Berlin, Germany, Aug. 2016. Association for Computational Linguistics. doi:10.18653/v1/S16-2016.
[18] K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. Teaching machines to read and comprehend. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 1693–1701, Cambridge, MA, USA, 2015. MIT Press.
[19] C. Kedzie, K. McKeown, and H. Daume III. Content selection in deep learning models of summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1818–1828. Association for Computational Linguistics, 2018.
[20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2015.
[21] O. Koshorek, A. Cohen, N. Mor, M. Rotman, and J. Berant. Text segmentation as a supervised learning task. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 469–473, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi:10.18653/v1/N18-2075.
[22] W. Li, X. Xiao, Y. Lyu, and Y. Wang. Improving neural abstractive document summarization with explicit information selection modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1787–1796, Brussels, Belgium, Oct.-Nov. 2018. Association for Computational Linguistics.
[23] C.-Y. Lin and E. H. Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics.
In HLT-NAACL, 2003.
[24] Y. Liu and M. Lapata. Learning contextually informed representations for linear-time discourse parsing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1289–1298, Copenhagen, Denmark, Sept. 2017. Association for Computational Linguistics. doi:10.18653/v1/D17-1133.
[25] R. Mihalcea and P. Tarau. Textrank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004.
[26] R. Nallapati, B. Zhou, C. dos Santos, Ç. Gulçehre, and B. Xiang. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany, Aug. 2016. Association for Computational Linguistics. doi:10.18653/v1/K16-1028.
[27] R. Nallapati, F. Zhai, and B. Zhou. Summarunner: A recurrent neural network based sequence model for extractive summarization of documents. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI'17, pages 3075–3081. AAAI Press, 2017.
[28] E. W. Noreen. Computer intensive methods for testing hypotheses. An introduction. 1989.
[29] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[30] M. Riedl and C. Biemann. TopicTiling: A text segmentation algorithm based on LDA. In Proceedings of ACL 2012 Student Research Workshop, pages 37–42, Jeju Island, Korea, July 2012. Association for Computational Linguistics.
[31] S. Riezler and J. T. Maxwell. On some pitfalls in automatic evaluation and significance testing for MT.
In Proceedings of the ACL Workshop onIntrinsic and Extrinsic Evaluation Measures for Machine Translation and/orSummarization, pages 57–64, Ann Arbor, Michigan, June 2005. Associationfor Computational Linguistics. URL → pages ix, 14, 25, 29[32] E. Sandhaus. The New York Times Annotated Corpus, 2008. → page 8[33] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. Trans.Sig. Proc., 45(11):2673–2681, Nov. 1997. ISSN 1053-587X.doi:10.1109/78.650093. URL → page17[34] A. See, P. J. Liu, and C. D. Manning. Get to the point: Summarization withpointer-generator networks. In Proceedings of the 55th Annual Meeting ofthe Association for Computational Linguistics (Volume 1: Long Papers),pages 1073–1083, Vancouver, Canada, July 2017. Association forComputational Linguistics. doi:10.18653/v1/P17-1099. URL → pages 9, 23[35] L. Shao, S. Gouws, D. Britz, A. Goldie, B. Strope, and R. Kurzweil.Generating long and diverse responses with neural conversation models.CoRR, abs/1701.03185, 2017. URL →page 2[36] E. Sharma, C. Li, and L. Wang. BIGPATENT: A large-scale dataset forabstractive and coherent summarization. CoRR, abs/1906.03741, 2019.URL → pages 8, 3038[37] J. Steinberger and K. Jezek. Using latent semantic analysis in textsummarization and summary evaluation. 2004. → page 23[38] F. Suppe. The structure of a scientific paper. Philosophy of Science, 65(3):381–405, 1998. ISSN 00318248, 1539767X. URL → page 2[39] A. Tixier, P. Meladianos, and M. Vazirgiannis. Combining graph degeneracyand submodularity for unsupervised extractive summarization. InProceedings of the Workshop on New Frontiers in Summarization, pages48–58. Association for Computational Linguistics, 2017. URL → page 4[40] A. J.-P. Tixier, F. D. Malliaros, and M. Vazirgiannis. A graphdegeneracy-based approach to keyword extraction. In J. Su, X. Carreras, andK. Duh, editors, EMNLP, pages 1860–1870. The Association forComputational Linguistics, 2016. ISBN 978-1-945626-25-8. URL →pages 4, 5[41] L. 
Vanderwende, H. Suzuki, C. Brockett, and A. Nenkova. Beyondsumbasic: Task-focused summarization with sentence simplification andlexical expansion. Inf. Process. Manage., 43:1606–1618, 11 2007.doi:10.1016/j.ipm.2007.01.023. → page 23[42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,L. Kaiser, and I. Polosukhin. Attention is all you need. CoRR,abs/1706.03762, 2017. URL → page 33[43] W. Wang and B. Chang. Graph-based dependency parsing with bidirectionallstm. In Proceedings of the 54th Annual Meeting of the Association forComputational Linguistics (Volume 1: Long Papers), pages 2306–2315,Berlin, Germany, Aug. 2016. Association for Computational Linguistics.doi:10.18653/v1/P16-1218. URL → pages 2, 11, 32[44] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy. Hierarchicalattention networks for document classification. pages 1480–1489, 01 2016.doi:10.18653/v1/N16-1174. → page 33[45] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le.Xlnet: Generalized autoregressive pretraining for language understanding.39CoRR, abs/1906.08237, 2019. URL →page 3340

