Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Summarization of partial email threads : silver standards and bayesian surprise Johnson, Jordon Kent 2018

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


24-ubc_2018_may_johnson_jordon.pdf [ 807.85kB ]
JSON: 24-1.0365780.json
JSON-LD: 24-1.0365780-ld.json
RDF/XML (Pretty): 24-1.0365780-rdf.xml
RDF/JSON: 24-1.0365780-rdf.json
Turtle: 24-1.0365780-turtle.txt
N-Triples: 24-1.0365780-rdf-ntriples.txt
Original Record: 24-1.0365780-source.json
Full Text

Full Text

Summarization of Partial EmailThreadsSilver Standards and Bayesian SurprisebyJordon Kent JohnsonB.Sc., The University of British Columbia, 2013BCS, The University of British Columbia, 2015A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCEinThe Faculty of Graduate and Postdoctoral Studies(Computer Science)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)April 2018c© Jordon Kent Johnson 2018AbstractWe define and motivate the problem of summarizing partial email threads.This problem introduces the challenge of generating reference summariesfor these partial threads when extractive human annotation is only avail-able for the threads as a whole, since gold standard annotation intendedto summarize a completed email thread may not always be equally appli-cable to each of its partial threads, particularly when the human-selectedsentences are not uniformly distributed within the threads. We propose aframework for generating these reference summaries with arbitrary lengthin an oracular1 manner by exploiting existing gold standard summaries forcompleted email threads. We also propose and evaluate two sentence scoringfunctions that can be used in this “silver standard” framework, and we aremaking the resulting datasets publicly available2. In addition, we apply arecent unsupervised method based on Bayesian Surprise that incorporatesbackground knowledge to partial thread summarization, extend that methodwith conversational features, and modify the mechanism by which it handlesinformation redundancy. Experiments with our partial thread summarizersindicate comparable or improved performance relative to a state-of-the-artunsupervised full thread summarizer baseline in most cases; and we haveidentified areas in which potential vulnerabilities in our methods can beavoided or accounted for. Furthermore, our results suggest that the po-tential benefits of background knowledge to partial thread summarizationshould be further investigated with larger datasets.1“Oracular” is defined here as having access to “special” information that is normallyunavailable to a system and using that information to ensure improved output.2 SummaryPast research on summarizing email threads has always focused on “com-pleted” email threads; but the common user experience is with in-progressemail threads, and so email thread summarizers must be able to summarizethese “partial threads” as well. In order to develop useful partial threadsummarizers, we need summaries that we know are good (called “gold stan-dards”) for comparison; but these reference summaries currently only existfor completed email threads. We have developed a means to take existinggold standard summaries and use them to generate similar summaries forpartial threads (called “silver standards”). Also, since partial threads canbe short, we want to be able to leverage background knowledge when sum-marizing them; so we developed simple partial thread summarizers that doexactly that. We have found that our partial thread summarizers work wellin most cases.iiiPrefaceA significant portion of this work has been published [15] and has thus beenadapted here with additional discussion as appropriate. I was the primaryauthor of that publication; Vaden Masrani wrote the Related Work sectionof that paper, which is included here with minor modifications. GiuseppeCarenini and Raymond Ng provided substantive edits of the paper drafts.Since the publication of that paper, the algorithm for generating silverstandard summaries was refactored into a framework that allowed the useof any sentence scoring function, so long as that function made use of boththe partial email thread and the existing extractive gold standard summaryin scoring candidate sentences. An additional sentence scoring function wasdevised, as was a means to automatically evaluate these functions usingthe human annotations as a reference. Once the candidate sentence scoringfunctions were evaluated, the top-performing function was used to generatesilver standard summaries, which were in turn used to re-evaluate our can-didate summarizers. Since a concern in our previous work was the size ofthe dataset used, all of these evaluations were also performed using a seconddataset.Vaden Masrani’s programming contribution was chiefly to create a databaseof email threads and a parser interface that permitted access to that database.He also implemented some preliminary sentence scoring functions that werelater discarded in favour of the methods presented here. I performed theremainder of the programming work.ivTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixList of Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 Silver Standards . . . . . . . . . . . . . . . . . . . . . . . . . . 83.1 Distribution of Summary Sentences . . . . . . . . . . . . . . 83.2 The Silver Standard Framework . . . . . . . . . . . . . . . . 103.3 Sentence Scoring Function 1: Graph Centrality and TopicProminence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133.4 Sentence Scoring Function 2: Jensen-Shannon Divergence . . 154 Partial Thread Summarization . . . . . . . . . . . . . . . . . 174.1 Bayesian Surprise . . . . . . . . . . . . . . . . . . . . . . . . 184.2 Conversational Features . . . . . . . . . . . . . . . . . . . . . 194.3 Surprise Decay . . . . . . . . . . . . . . . . . . . . . . . . . . 20vTable of Contents5 Implementation Details . . . . . . . . . . . . . . . . . . . . . . 215.1 Languages and Packages . . . . . . . . . . . . . . . . . . . . 215.2 Modifications to Topic Segmentation Software . . . . . . . . 226 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276.1 Annotated Corpora . . . . . . . . . . . . . . . . . . . . . . . 276.1.1 Loza et al. . . . . . . . . . . . . . . . . . . . . . . . . 276.1.2 BC3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286.2 Background Corpora . . . . . . . . . . . . . . . . . . . . . . 286.2.1 Enron email dataset . . . . . . . . . . . . . . . . . . . 296.2.2 W3C email dataset . . . . . . . . . . . . . . . . . . . 296.3 Notes on Pre-processing . . . . . . . . . . . . . . . . . . . . . 297 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317.1 Sentence Scoring Function Evaluation . . . . . . . . . . . . . 317.2 Summarizer Evaluation . . . . . . . . . . . . . . . . . . . . . 357.2.1 Evaluation over Full Threads . . . . . . . . . . . . . . 367.2.2 Preliminary Evaluation over Partial Threads . . . . . 377.2.3 Subsequent Evaluation over Partial Threads . . . . . 398 Conclusions and Future Work . . . . . . . . . . . . . . . . . . 45Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48viList of Tables7.1 ROUGE-1 precision scores for silver standard sentence scor-ing function evaluation experiments. The only two pairs ofvalues that are not statistically significantly different (p <0.05) within their dataset and evaluation scenario are bolded. 347.2 ROUGE-1 mean F-scores for full threads as compared to goldstandard summaries. . . . . . . . . . . . . . . . . . . . . . . . 367.3 Length (in words) of the partial threads in the Loza et al.dataset used to define the quartile bins. . . . . . . . . . . . . 377.4 ROUGE-1 mean F-scores over partial threads in the Loza etal. dataset (binned into quartiles by length in words) as com-pared to silver standard summaries (with the TPM sentencescoring function). Values are given for both summary lengths(20% and 30% of PT length). Bolded ROUGE scores arethe highest for their quartile and summary length category.P-values are given for the comparisons between BSE-d andCWS; underlined p-values indicate at least marginal signifi-cance (p<0.1). . . . . . . . . . . . . . . . . . . . . . . . . . . 387.5 Number of partial threads in each dataset with the givenlength (in emails). Note that in the case of the Loza et al.dataset, the first email in one of the threads consisted of onlyan email attachment (i.e. it contained no sentences), and sothat partial thread was not included in the evaluation. . . . . 407.6 Partial thread summarizer evaluation results for the Loza etal. dataset, using JS divergence-based silver standard sum-maries as reference. P-values are italicized, and those indi-cating at least marginal statistical significance (p < 0.1) areunderlined. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407.7 Partial thread summarizer evaluation results for the BC3dataset, using JS divergence-based silver standard summariesas reference. P-values are italicized, and those indicating atleast marginal statistical significance (p < 0.1) are underlined. 41viiList of Tables7.8 Examples of sentences in the BC3 dataset selected by BayesianSurprise-based methods that were not selected by the silverstandard framework or the CWS baseline, along with cate-gories of “surprising” tokens (bolded) found in each sentence. 44viiiList of Figures1.1 A representation of an email thread (or “full thread”) as the fi-nal iteration in a series of partial threads, each one a snapshotof the thread at a different point in its development. The par-tial thread summarization problem is to generate summariesfor each of the partial threads rather than for only the fullthread. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23.1 Distribution of EGS sentences in full threads in the Loza et al.dataset [21]. Each vertical column of dots represents a thread,with each dot representing a sentence at its relative positionwithin the thread (beginning at 0, ending at 1). Non-EGSsentences are black dots, while EGS sentences are red circles;and larger circles indicate that a human annotator consideredthose sentences more important. The threads are sorted indescending order of the relative position of the highest-rankedsentences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93.2 Distribution of EGS sentences in full threads in the BC3dataset [33]. Each vertical column of dots represents a thread,with each dot representing a sentence at its relative positionwithin the thread (beginning at 0, ending at 1). Non-EGSsentences are black dots, while EGS sentences (which are notranked in this dataset) are red circles. The threads are sortedin descending order of the relative position of the last EGSsentence in the thread. . . . . . . . . . . . . . . . . . . . . . . 9ixList of Figures5.1 FQG representations of a simple email thread with two partic-ipants taking regular turns, but that has no quoted material.On the left is the expected representation; in the absence ofquoted material, the connections between fragments (which,in this case, are entire emails) should be determined by thetemporal order of the emails and by the fact that the par-ticipants respond to each other. On the right is the graphproduced by the original version of the FQG extraction al-gorithm; in the absence of quoted material, the emails weretreated as disconnected fragments. . . . . . . . . . . . . . . . 235.2 Examples of possible quotation structures within a single email. 245.3 The recursive approach to generating links between fragmentsfor arbitrary nested quotation structures. A single email isdepicted here. At each step, the fragment being visited islinked to the lowest-depth fragments in the quotation struc-tures immediately above and/or below it. The result is aportion of the FQG for the email thread. . . . . . . . . . . . . 257.1 Sentence scoring function performance: ROUGE-1 precisionvs. the number of EGS sentences (out of n) in the in-progresssummary. For each scoring function, the upper and lowercurves indicate performance at different levels of oracularity.Since random sentence selection is not oracular, it only hasone curve. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34xList of AcronymsBS Bayesian SurpriseBSE Extended Bayesian SurpriseCWS ClueWordSummarizer-d with surprise decay (eg. BS-d, BSE-d)EGS Extractive gold standardFQG Fragment quotation graphFT Full email threadJS, JSD Jensen-Shannon divergenceKL, KLD Kullback-Leibler divergenceMMR Maximal marginal relevancePT Partial email threadTPM Topic-PageRank-MMRxiAcknowledgementsI would like to thank my supervisors, Giuseppe Carenini and RaymondNg, for their insights and guidance throughout this work. I would like tothank Vaden Masrani for his thoughtful, cheerful and energetic assistancethroughout the early part of this work. I would also like to thank HyejuJang for her feedback in the later stages of this work.This research was funded in part under a grant from the Yahoo FacultyResearch and Engagement Program (FREP).xiiDedicationTo my wife, Jennifer; and our children, David, Andrew, Naomi and Marcus;who are my greatest happiness and my truest legacy.xiiiChapter 1IntroductionEmails are an ongoing focus of NLP research for a number of reasons. First,despite their early advent relative to other forms of electronic communica-tion, emails are still widely used around the world. Second, email threadscan take on an “asynchronous” conversation structure, where any partic-ipant can respond to any part of the conversation at any time. Similarconversation structures can be found in blogs, forums, and on social mediaplatforms; therefore, insights gained from studying email conversations mayalso be applicable to these other media.Users are experiencing an increasing flow of emails; they frequently par-ticipate in multiple email conversations at once, and these conversationscan span days, weeks, or even months. In addition, new participants can bebrought into conversations already in progress. Finally, users are increas-ingly turning to mobile devices with small screens to meet their computingneeds. In the face of all these factors, there has been a growing interest inthe email summarization task : given an email thread with multiple partic-ipants, provide a summary of the contents of the thread. Such summariesshould contain the key information in a thread and thus free a user joining athread in progress from having to comb through its entire contents; they canalso serve as memory aids to users returning to a thread [33]. In order to ac-commodate smaller screen sizes, these summaries should be of user-definedlength; for example, a user may want to incrementally view additional im-portant information from a thread, and so the ability to incrementally addto a summary on demand should be available.Summarization is not a new area of research; there has been extensivework on the summarization problem [30], including work on what can becalled the full thread summarization problem, where a single summary is gen-erated for a complete, archived email thread [5]. However, email threads aredynamic document collections. While users may wish to revisit completedthreads occasionally, the bulk of the user experience is with in-progress, orpartial threads (PT).Email threads may develop in different ways, including the following:• Important questions may be answered or new questions asked1Chapter 1. Introduction• Proposals may be made, modified, and accepted or rejected• The focus of the conversation may shift from one topic to anotherIn any of these cases, the content of a summary of a partial thread mayneed to change over time as emails come in; therefore, we are interested inthe partial thread summarization problem, where we generate a successionof summaries, each summarizing the thread at different moments in time.More formally, for each email Ei in a given email thread {E1...Ei...En}, wewish to generate a summary for the corresponding partial thread {E1...Ei}.The concept of partial threads is also depicted in Figure 1.1.Figure 1.1: A representation of an email thread (or “full thread”) as thefinal iteration in a series of partial threads, each one a snapshot of the threadat a different point in its development. The partial thread summarizationproblem is to generate summaries for each of the partial threads rather thanfor only the full thread.A partial thread summary will provide a summary of the thread so far,including the new email; it is intended to benefit users that may have forgot-ten the content of the preceding emails in the thread (or may be new to thethread) and need a quick refresh, possibly on the relatively small screen ofa mobile device. Additionally, as mentioned previously, a user may want toincrementally “extend” a partial thread summary in order to get more infor-mation about the thread; and so we also investigate the ability to generate2Chapter 1. Introductionsummaries of arbitrary length. The partial thread summarization problemis thus different from the update summarization problem previously studiedfor news in the Text Analysis Conferences [7]; the update summarizationproblem, applied to email threads, would provide a summary of only theincoming email with the assumption that the user knows and remembersthe content of the preceding emails.The new NLP task of summarizing partial email threads is challenging,not only because new algorithms may need to be developed in order toaccount for factors in email thread development such as those listed above,but also with respect to evaluating the generated summaries. While thereare publicly available datasets [21, 33] that provide gold standard summariesfor completed email threads, none to our knowledge provides such summariesfor partial threads; such annotation by humans would be prohibitive, as itwould require a summary for each partial thread (i.e., each email) in thecorpus. So, a new challenge we face in the evaluation of PT summariesis due to the dearth of human annotations; more specifically, given goldstandard human annotations of a thread as a whole, how do we generatereference summaries of each PT against which to compare automaticallygenerated extractive summaries?Another challenge in partial thread summarization is related to the factthat most (if not all) current summarization techniques for full thread sum-marization rely on an analysis of only the input thread to decide what sen-tences should be included in the summary. However, since partial threadscan be rather short, we hypothesize that the identification of the most in-formative sentences would benefit from examining the larger informationalcontext in which the partial thread was generated (eg. all the emails gener-ated in an organization). We test this hypothesis by applying and extendinga recent summarization method based on Bayesian Surprise that leveragessuch background information for partial thread summarization.Given the novelty of the partial thread summarization task (and thecorresponding dearth of annotated data), our goal is to establish sets ofbaselines to enable future research; therefore, in this thesis we focus on in-vestigating simple unsupervised extractive approaches, where the summaryis a subset of the sentences in the source partial thread, and leave supervisedand abstractive approaches for future work.The main contributions of this thesis are as follows:• We propose a framework for exploiting existing extractive gold stan-dard (EGS) summaries of full threads to automatically generate orac-ular “silver standard” partial thread summaries of arbitrary length, as3Chapter 1. Introductiondiscussed in section 3. We also propose and evaluate two candidatesentence scoring functions for use in this framework. Further, we arereleasing these silver standard summaries for the datasets that wereused in this work.• For partial thread summary generation, we propose an unsupervisedmethod that considers not only the input thread, but also backgroundknowledge synthesized from a large number of other email threads. Inparticular, we developed a summarization method based on BayesianSurprise [20] which takes into account conversational features of thepartial thread, as discussed in chapter 4.• Using our silver standard summaries with the widely used ROUGEtoolkit [18], we carry out experiments to compare the summaries gen-erated by Bayesian Surprise-based methods with a state-of-the-art ex-tractive technique developed for full thread summarization that doesnot take into account background information.Our experiments have indicated that our partial thread summarizers ex-hibit comparable or improved performance relative to the state-of-the-artbaseline in most cases; and we have identified potential areas of vulnerabil-ity for our summarizers and proposed ways to account for them in futurework. Our results also indicate that future research in partial thread sum-marization is warranted with larger datasets.4Chapter 2Related WorkTo generate partial thread (PT) summaries we propose an unsupervised ex-tractive approach. Although to the best of our knowledge no one has studiedPT summarization directly, there has been extensive work done in extrac-tive summarization in general, as well as work done on email summarizationspecifically.Supervised methods have been proposed which turn the extractive sum-marization task into a binary classification problem where sentences arelabeled “in/out” using standard machine learning classifiers [27, 32]. Varia-tions of this approach include adding sentence compression and using inte-ger linear programming to evaluate candidate summaries and select the bestones [2]. Sentence classification assumes sentences are independent from oneanother; and so to capture dependencies between sentences, the extractivesummarization problem has also been recast as a sequence labeling problemusing hidden Markov models and conditional random fields [9, 14, 31].The weakness of supervised approaches is the reliance on human-annotatedlabeled data, which is often time-consuming and difficult to acquire due toprivacy concerns. This thesis, therefore, focuses on unsupervised extractivetechniques which do not require labeled data. Another benefit of unsuper-vised methods is that they can generate features for supervised methods,meaning improvements in unsupervised techniques can directly benefit su-pervised systems.Many unsupervised extractive summarization methods have been pro-posed for generic documents, as well as for conversations. Some make useof textual features such as lexical chains, cue words (“In conclusion”, “Tosummarize”, etc.) or conversation structure to select the most informativesentences [1, 6, 11]. Others make use of methods including topic model-ing [29], latent semantic analysis [17] or rhetorical parsing [12]. While ourpartial thread summarizers take an entirely different approach (as discussedin Chapter 4), our algorithm for generating silver standard summaries of par-tial threads incorporates a topic modeling framework that, in turn, makesuse of lexical chains and conversational structure.There is also a large class of summarization methods which build graphs5Chapter 2. Related Workwith textual units (words, sentences, paragraphs, etc) as vertices and usesimilarity measures between the text units to form the edge weights. Oncea full graph is created, an extractive summary is generated by using a cen-trality measure to select “central” nodes from a cluster and concatenatingthem to form a summary. Two popular systems are LexRank and Tex-tRank, which both use a variant of the PageRank algorithm [8, 23, 24].Graph-based methods are popular because of their simplicity and ease ofimplementation, and their performance has been shown to be competitivewith other methods. One of the functions designed for our silver standardframework incorporates graph-based sentence scoring.Another important aspect of summarization is the need to ensure thatnewly-added sentences are not only informative but also convey new infor-mation with respect to sentences already in the summary. One popularmethod known as Maximum Marginal Relevance (MMR) builds a summaryby selecting sentences which maximize relevance while minimizing redun-dancy with previously selected sentences [4]. While one of our silver standardsentence scoring functions uses MMR, the Bayesian Surprise-based summa-rizers described in Chapter 4 have their own built-in means of handlinginformation redundancy.There has also been work on the task of unsupervised email summariza-tion specifically. Carenini et al. [6] proposed the use of “fragment quotationgraphs” (FQGs) to summarize asynchronous conversations. FQGs use thefact that a given email often contains quoted material from previous emails.These quotations, or “fragments”, can then be used to create fine-grainedrepresentations of the underlying structure of a given email thread, allowinga set of particularly informative clue words to be identified. In this thesis,we use FQGs in one of our sentence scoring functions for silver standardsummary generation; we use clue words as a conversational feature in ourextended PT summarizer; and we use a summarizer based on clue words asa baseline in our evaluations.A common (yet key) limitation of current approaches to full-thread sum-marization (including [6]) is that they consider only the input thread inthe summarization process; in contrast, a user’s email history (or that ofthe user’s organization) can provide valuable background knowledge. Thesummarizers we propose in this paper, based on work by Annie Louis [20],address this limitation by taking into account background knowledge syn-thesized from a large number of other email threads, which we argue can beespecially beneficial to PT summarization, since short partial threads maynot provide much ground for sentence selection on their own.Finally, some work has been done in the area of automatically evaluat-6Chapter 2. Related Working document summaries in the absence of human-annotated gold standards.Louis and Nenkova [19], looking at the multi-document news summarizationtask, compared human rankings of candidate summaries with rankings au-tomatically generated by a number of metric functions. The function thatcorrelated most highly with human judgements was JS divergence, a sym-metric variant of KL divergence. We use JS divergence in one of our sentencescoring functions for silver standard summary generation.7Chapter 3Silver StandardsIn order to automatically evaluate partial thread (PT) summaries (i.e. withmetrics such as ROUGE), human-generated extractive gold standard (EGS)summaries are needed for comparison. However, because prior research hasfocused exclusively on full thread summarization, all publicly available emailcorpora of which I am aware only provide human-annotated EGS summariesfor each email thread as a whole [21, 33]. Acquiring human-annotated ref-erence summaries of every partial thread in addition to those of the fullemail threads is prohibitive because there would need to be summaries ofthe partial thread {E1..Ei} for each email Ei in each thread. However, afeasible alternative explored in this thesis is to automatically generate refer-ence summaries that make use of the existing gold standard in an oracularfashion; such reference summaries are termed “silver standards.”Given a partial thread PT and a gold standard summary EGS of thecorresponding full thread, an intuitive solution might be to simply use thegold standard sentences present in the partial thread (i.e. EGS ∩ PT ) asthe silver standard. In this chapter I discuss potential problems with thatapproach as well as a framework for silver standard summary generationthat accounts for those problems. I also introduce two methods for sentencescoring that can be used in the silver standard generation framework.3.1 Distribution of Summary SentencesThe distribution of EGS sentences across emails in a thread cannot be as-sumed to be uniform in all (or even most) cases; indeed, this is not thecase in the two annotated datasets used in this work (described further insections 6.1.1 and 6.1.2). As shown in Figure 3.1, while many threads inthe first dataset have highly ranked EGS sentences in the first part of theconversation, others have important EGS sentences in the middle or even atthe end of the conversation. As shown in Figure 3.2, the second dataset alsodisplays stark deviations from uniformity in EGS sentence distribution.Variations in EGS sentence distribution become a concern when generat-ing silver standard PT summaries. In some cases, there may not be enough83.1. Distribution of Summary SentencesFigure 3.1: Distribution of EGS sentences in full threads in the Loza et al.dataset [21]. Each vertical column of dots represents a thread, with each dotrepresenting a sentence at its relative position within the thread (beginningat 0, ending at 1). Non-EGS sentences are black dots, while EGS sentencesare red circles; and larger circles indicate that a human annotator consideredthose sentences more important. The threads are sorted in descending orderof the relative position of the highest-ranked sentences.Figure 3.2: Distribution of EGS sentences in full threads in the BC3dataset [33]. Each vertical column of dots represents a thread, with each dotrepresenting a sentence at its relative position within the thread (beginningat 0, ending at 1). Non-EGS sentences are black dots, while EGS sentences(which are not ranked in this dataset) are red circles. The threads are sortedin descending order of the relative position of the last EGS sentence in thethread.EGS sentences in a given PT to form a silver standard summary of the de-sired length; in extreme cases, the PT may have no EGS sentences at all. Inother cases, there may be too many EGS sentences in a PT to fit into thesilver standard; and not all datasets rank EGS sentences by importance aspart of the annotation. In other words, unless exactly the desired number ofEGS sentences are present in each PT, some sentence selection is necessary;93.2. The Silver Standard Frameworkand this issue is exacerbated when generating silver standard summaries ofarbitrary length. Our silver standard generation framework handles all ofthese possibilities as described in the next section.3.2 The Silver Standard FrameworkWhile the silver standard framework (shown in Algorithm 1) is fairly simple,it is also powerful in that it iteratively generates silver standard summariesof a partial thread PT with arbitrary length using any suitable sentencescoring function. Furthermore, it does so in an oracular fashion; that is,it references an existing gold standard EGS in selecting sentences for thesummary.The framework uses sentence scoring functions that take the followingas inputs:• a candidate sentence, ci, from the list of candidate sentences C• the summary constructed so far, S• the partial thread, PT• the existing gold standard for the entire thread, EGSIn each iteration, the framework selects the “best” sentence from C (i.e.the sentence with the optimal score) by some comparison to PT and/orEGS; it then removes it from C and adds it to S. The process iteratesuntil the summary reaches the desired length. What may change over timeis what comprises the list of candidate sentences. Initially, C consists of allthe gold standard sentences in the partial thread (i.e. EGS∩PT ); this is thefirst way in which the framework is oracular, since it “seeds” the summaryby drawing from the gold standard directly. If the point is reached whereall of the gold standard sentences in the partial thread have been includedin the summary, C is then re-initialized with all remaining partial threadsentences (i.e. PT\EGS).As mentioned in the previous section, there is no guarantee that a givenpartial thread will have any gold standard sentences in it at all; withoutsome additional consideration, the silver standard framework would not beoracular in those cases. To that end, the sentence scoring function shouldalso be oracular; in other words, the scoring function should reference thegold standard as well as the partial thread in determining the importanceof the sentence. This second mechanism of referencing the gold standard103.2. The Silver Standard FrameworkAlgorithm 1: The Silver Standard Framework.Result: Sm, the silver standard summary for the partial thread upto email m1 - Let EGS = {egs1...egsj ...egsk} be the set of gold standard sentences2 - Let PTm = {pt1...pti...ptn} be the set of sentences in the partialthread up to email m3 - Let EGSPTm = {egsPT1 ...egsPTi ...egsPTn } = (EGS ∩ PTm)4 - Let score(ci, Sm, EGS, PTm) be a scoring function for sentence cithat considers both the partial thread and the existing gold standard5 - Let len be the desired length of Sm6 Sm ← {}7 C ← EGSPTm8 while |S| <min(len, |EGSPTm |) do9 if EGS has annotated sentence ranking then10 - score each ci ∈ C using annotated ranking11 end12 else13 - score each ci ∈ C using score(ci, Sm, EGS, PTm)14 end15 - remove highest scoring ci from C and add it to Sm16 end17 C ← PTm\EGS18 while |Sm| < len do19 - score each ci ∈ C using score(ci, Sm, EGS, PTm)20 - remove highest scoring ci from C and add it to Sm21 end22 return Sm113.2. The Silver Standard Frameworkensures oracular behaviour in all cases and enhances it in cases where goldstandard sentences are present in the partial thread.There may be a concern that enforcing oracular behaviour for partialthreads without gold standard sentences may result in lower-quality silverstandard summaries. While that would depend largely on the scoring func-tion used, in general there are three cases to consider.Let us take as an example a thread discussing two topics: cats and dogs.In the first case, the distribution of these topics in the partial thread is verysimilar to that over the entire thread (so, for example, the relative frequencyof cat-related sentences in the partial thread is roughly the same as theirfrequency in the full thread). In that case, we would not expect the influenceof the gold standard to negatively impact the quality of the silver standardsummary to a significant extent, since the content of the “future” part ofthe thread is similarly distributed to that of the partial thread itself.In the second case, the partial thread is very dissimilar to the laterportions of the thread; for example, the partial thread may be entirely aboutcats, while the rest of the thread (and thus the entire gold standard) isabout dogs. In that case, we might expect the oracular portion of thesentence scoring function to score all of the sentences in the partial threadequally poorly, and so the overall effect in choosing the “best” next summarysentence should be minimal.The final case lies between these two extremes; for example, the partialthread could be mostly about cats with a few unimportant dog-related sen-tences, and the rest of the thread could be mostly about dogs with some fewreferences to cats. In that case, the oracular part of the sentence scoringfunction might ignore partial thread sentences about cats in favour of thosediscussing dogs. In order to deal with this case, it would be important tobalance the sentence scoring function such that the oracular effect is feltwithout rendering irrelevant the content of the partial thread itself. Sincethe purpose of this thesis is to introduce the partial thread summarizationproblem, these questions of measuring and fine-tuning the oracular effectare left for future work.It should be noted that the framework, as described here, uses greedyscoring functions. While such functions are not necessarily globally optimal,they have a number of advantages for the partial thread summarization task.First, for the purposes of research, greedy scoring functions can provide use-ful baselines. Second, they are often simpler to implement and have betterruntime performance than globally optimal functions. Finally (and perhapsmost importantly), greedy summarizers can have much more intuitive re-sults for the end user. Consider a user who has obtained a preliminary123.3. Sentence Scoring Function 1: Graph Centrality and Topic Prominencepartial thread summary and requests additional information (i.e. a slightlylarger summary). A non-greedy method will generate a new summary fromscratch, which may confuse the user if some of the sentences in the initialsummary are no longer included; whereas a greedy approach will simplyperform another iteration and add the next “best” sentence to the existingsummary. Given the benefits of summaries obtained using greedy summa-rizers, it is also valuable to have silver standards that are greedily generatedto facilitate their evaluation.3.3 Sentence Scoring Function 1: GraphCentrality and Topic ProminenceThe first silver standard sentence scoring function, called Topic-PageRank-MMR (TPM), was developed as part of this thesis and published recently [15].It uses graph-based sentence scoring, which has been applied extensively forsummarization as discussed in chapter 2. Both the graph-based aspect ofthe algorithm and its redundancy minimization mechanism rely on wordembeddings trained using large email corpora (described in section 6.2).This sentence scoring function also makes use of topic modeling. Weexpect the discussions in email threads to be topically coherent (though, forboth individual emails and threads, multiple topics may be covered). Thetopical coherence of a sentence with both the partial thread and the goldstandard are thus related to that sentence’s importance in the discussion inthe context of the partial thread as well as the thread as a whole. The topicmodeling system we use exploits conversational structure.Finally, to maximize sentence importance while minimizing redundancy,the scoring function uses maximal marginal relevance (MMR) [4]. For agiven candidate sentence ci for inclusion in a summary S, its MMR score(which is used as the overall sentence score) isTPM(ci) = λI(ci)− (1− λ)Sim(ci, S) (3.1)where I(s) is an importance function, and Sim(s, S) is a similarity functioncomparing s to the sentences of S. Due to the small size of the availabledatasets, it was not deemed feasible to set aside data for parameter tuning;so for this work λ was set to 0.5.The importance function used here incorporates graph centrality andtopic segmentation. We first define PRPT (ci) as the PageRank score of ciin the fully-connected graph whose vertices are the sentences of the par-tial thread. We choose PageRank over LexRank in order to incorporate133.3. Sentence Scoring Function 1: Graph Centrality and Topic Prominencetopic modeling designed for conversational data. For each sentence in PT,a vector representation is obtained by averaging 100-dimensional Word2Vecembeddings of its words [25, 26]. The edge weights are then set to the cosinesimilarity of the vector representations of the relevant sentences.We then define T (ci) as the topic of candidate sentence ci; in this work,we apply a topic segmentation method that uses fragment quotation graphsto represent conversational structure and that has been shown to work wellon asynchronous conversations [16]. We then define PromPT (T (ci)), or theprominence of T (ci) within the partial thread, as the fraction of partialthread sentences that have that topic; so if a partial thread containing fivesentences has a total of three whose topic is T (ci), then PromPT (T (ci)) is0.6. Similarly, PromEGS(T (ci)) is the prominence of T (ci) within the goldstandard summary. Together, the two prominence scores form the overalltopic prominence score of ci:Prom(T (ci)) =12(PromPT (T (ci)) + PromEGS(T (ci)))(3.2)Putting graph centrality and topic prominence together, we obtain ourimportance function:I(ci) =12(PRPT (ci) + Prom(T (ci)))(3.3)It is worth noting that the PageRank score takes values in [0,1], as do bothof the prominence scores. The weights in equations 3.2 and 3.3 are set tomatch the simplifying assumption that the centrality of a sentence in itspartial thread is as important as the overall prominence of its topic, andthat the prominence of a topic within a partial thread is as important asits prominence within the larger context of the full thread (as representedby the gold standard summary). Taken together, the importance functiontakes values in [0,1], which is appropriate for MMR. These weights can befine-tuned as desired; as stated earlier, such efforts are left for future work.The similarity function Sim(ci, S) in equation 3.1 is the maximum cosinesimilarity of the candidate sentence ci and the sentences of the in-progresssummary S, using the aggregated Word2Vec representations described forthe PageRank score.143.4. Sentence Scoring Function 2: Jensen-Shannon Divergence3.4 Sentence Scoring Function 2:Jensen-Shannon DivergenceThe Topic-PR-MMR silver standard algorithm brings together a numberof intuitive concepts: semantic centrality, topic prominence, and novelty ofinformation relative to an existing in-progress summary. However, this isby no means an exhaustive list of intuitive concepts that could be appliedto the problem of generating silver standard summaries for partial emailthreads. In this section, I discuss another sentence scoring function thatcompares the word distributions of candidate summaries with those of thecorresponding partial threads and gold standard summaries.Recent work by Louis and Nenkova [19] explores automatically evaluat-ing summaries when no human-annotated reference summaries are available;in other words, is there a metric that correlates well with human rankings ofsummaries that can be applied without comparing these summaries to a goldstandard? Louis and Nenkova compared a number of candidate metrics, andthe top performer was the Jensen-Shannon (JS) divergence.The JS divergence is derived from the Kullback–Leibler (KL) divergence,which compares two discrete probability distributions, A and B, as follows:KL(A,B) = −∑iA(i)logB(i)A(i)(3.4)Note that identical distributions will have zero divergence; in other words,more similar distributions will have lower KL divergence values. Since we cantreat word frequency distributions within collections of text as probabilitydistributions, this metric can be used to compare such text collections.One drawback of the KL divergence is that it is not symmetric; givendistributions A and B, KL(A,B) and KL(B,A) are not necessarily equal.The JS divergence provides a symmetric and smoothed alternative:JS(A,B) =12KL(A,M) +12KL(B,M) (3.5)whereM =12(A+B) (3.6)As with the KL divergence, more similar distributions will have lower JS di-vergence values. Louis and Nenkova were able to “score” a given summaryby taking the JS divergence of its word distribution and that of the corre-sponding source document(s); the summary rankings were then obtained bycomparing the summary scores.153.4. Sentence Scoring Function 2: Jensen-Shannon DivergenceSince summary rankings obtained using JS divergence values were foundto correlate fairly well with human rankings, it may be the case that using asimilar approach to rank sentences (and then selecting the “best” sentence)results in good silver standard summaries. Therefore, I have designed andimplemented a simple scoring function (JSD) based on JS divergence:JSD(ci) = λJS({ci ∪ S}, PT ) + (1− λ)JS({ci ∪ S}, EGS) (3.7)where, as noted previously in section 3.2, ci is the candidate sentence, S isthe summary so far, PT is the partial thread and EGS is the gold standardsummary of the full thread. The function scores ci by using JS divergence tocompare the word distribution of a candidate summary (including ci) withthose of both the partial thread and the gold standard; it is thus greedy, notwith respect to the candidate sentence itself, but with the summary thatwould result if ci were added.Note that the second term ensures that the sentence scoring functionis oracular, and the λ parameter allows us to fine-tune the strength of theoracular effect. For the purposes of this work, we set λ to 0.75 for the sakeof consistency with the Topic-PR-MMR method, since the overall oracularweight of its importance function was set to 0.25, as described in the previoussection (recall that Equation 3.3 sets the overall topic prominence weight to0.5, and Equation 3.2 sets the weight of the oracular topic prominence to0.5 of the overall topic prominence).Unlike the Topic-PageRank-MMR method (and the Bayesian Surprise-based methods described in Chapter 4), there is no explicit mechanism tomanage redundant information; rather, this management is inherent in theJS divergence itself. The word distributions being compared are those ofthe partial thread and the candidate summary generated by adding thecandidate sentence to the summary in-progress summary already selected;thus one would expect that the most effective candidate sentence would bethe one that best fills the discrepancies between the partial thread and thein-progress summary.16Chapter 4Partial ThreadSummarizationThus far, this thesis has discussed ways to automatically generate referencesummaries that could be used to evaluate summaries of partial email threads;however, the whole point of generating these reference summaries is to fa-ciliate research on the more fundamental problem of generating the partialthread summaries themselves. Since partial threads are earlier “snapshots”of full threads, it is expected that some of the known challenges in emailsummarization will be applicable to PT summarization as well.Summarizing email threads poses special challenges due to their con-versational structure, which can vary significantly between threads. Thesechallenges have already been met with some success. For example, Careniniet al. [6] exploit conversational structure derived from quoted content inemails to create fragment quotation graphs (FQGs); these graphs can beused to derive clue words, which are words found in some email fragment aswell as in at least one of that fragment’s parents or children in the FQG.Their ClueWordSummarizer (CWS) scores sentences based on the numberof clue words they contain; and it was found to perform well in full threadsummarization.There are additional challenges associated with email summarization be-cause their length can vary significantly between threads; and this issue isexacerbated with partial threads, which can become very short. One mightreasonably expect that the structure of shorter partial threads would beless complex than that of their longer counterparts; and so even specially-designed summarizers such as CWS may struggle with shorter partial threads.In the most extreme case—that of a single email—there cannot be any cluewords, and so CWS would need to resort to some other default behaviourin such cases.This example can be used to reveal a larger issue: previous work onunsupervised email summarization (as discussed in Chapter 2) takes as inputonly the thread to be summarized, meaning that existing summarizers haverelatively little with which to work when dealing with shorter partial threads.174.1. Bayesian SurpriseThis being the case, it might be advantageous to find additional informationthat can be leveraged in partial thread summarization. For example, it hasbeen shown that background knowledge can be effectively taken into accountin the summarization process by applying the idea of Bayesian Surprise [20].The Bayesian Surprise method is based on the intuition that, given acollection of background knowledge (such as the email history of a user ororganization), the most “surprising” new information is the most significantfor inclusion in a summary.Presumably, while background knowledge should be useful for summa-rization in general as an additional source of information from which toinfer salience, it should be especially useful for summarizing relatively shortpartial email threads. For this reason, the partial thread summarizationmethod discussed in this chapter is based on Bayesian Surprise; however,we also extend the existing technique to consider conversational features andincorporates a less harsh redundancy management mechanism.4.1 Bayesian SurpriseLet H be some hypothesis about a background corpus that is representedby a multinomial distribution over word unigrams. The prior probability ofH is a Dirichlet distribution:P (H) = Dir(α1, ...αV ) (4.1)where αi is the count of word i in the background corpus, and V is the sizeof the background corpus vocabulary.Suppose word wi appears ni times in the PT being summarized. We canthen obtain the posteriorP (H|wi) = Dir(α1, ..., αi + ni, ...αV ) (4.2)The Bayesian Surprise score for wi due to the PT is then the KL divergencebetween P (H|wi) and P (H).To obtain the Bayesian Surprise score of a sentence, one simply aggre-gates the scores of its words; and the sentence with the highest score is addedto the summary. In order to minimize redundancy during summarization inthe method’s original proposal [20], once a sentence is added to a summary,the Bayesian Surprise scores of its words are set to zero. The process isrepeated until the desired summary length has been reached.One possible concern with a Bayesian Surprise-based approach lies withits initial assumption; namely, that importance is directly correlated with184.2. Conversational Featuressurprise over word counts. For example, consider a corporation such asEnron (a reasonable example, since this work uses data derived from theEnron email corpus). One would assume that many email threads in sucha corporation would be related to meetings, whether planning them or dis-cussing them after the fact. As a result, one could expect that words suchas “meet” and “meeting” would be much less surprising than words used bypeople in email threads about other domains; therefore, these words wouldbe much less impactful for a Bayesian Surprise-based summarizer, even incases where they are highly relevant to the target email thread.While it is correct that commonly-used words are less surprising even incases where they may be important, it is also critical to note that such wordsrarely appear in isolation. Continuing with the example of email threadsabout meetings, important sentences would also likely include informationabout the subject of a meeting (eg. projections for quarterly earnings) or itsparticipants (eg. a meeting with a government oversight committee). It isreasonable to expect that the relative rarity of these words will still increasethe overall surprise of the important sentences; indeed, this may contributeto the effectiveness of Bayesian Surprise-based methods as reported by Louis.Unfortunately, taking the rarity of certain words to its extreme raisesanother potential concern: namely, that unwanted content (such as typosor random insertions like “By the way, my cat just had kittens!”) couldartificially inflate the surprise of an unimportant sentence. This concernhas arguably more merit; and while we do not expect the presence of ran-dom content to be a commonplace occurrence, especially in more formalemail communications, it is possible for rare tokens such as acronyms toartificially inflate the novelty of meaningful but less important sentences. Inother words, we acknowledge that novelty and importance are not perfectlycorrelated, and so the presence of rare but less-important terms could stillbe a source of error in summarizers based on Bayesian Surprise. The re-lated problem of spelling or other typographical errors may be dealt withsomewhat effectively by applying probabilistic pre-processing methods thatreplace tokens as needed, though doing so would present its own challenges;such efforts are left for future work.4.2 Conversational FeaturesAs discussed in section 2, conversational features have proved useful in sum-marizing asynchronous conversations such as email threads. We have ex-tended the Bayesian Surprise method to include a number of these con-194.3. Surprise Decayversational features as additional concentration parameters in the Dirichletdistributions. In order to maintain consistency with the original BayesianSurprise method, we limit our extensions to features that can be expressedas counts of word wi; specifically, we use the number of times wi was used:• by the creator of the thread (whether in the initial email or afterwards)• by the dominant participant in the thread (who may or may not bethe thread creator)• in emails where it also appears in the email subject line• as a clue wordThe prior for the extended Bayesian Surprise method then becomesP (H) = Dir(α1..V , β1..V , γ1..V , δ1..V , 1..V ) (4.3)where α1..V are the original concentration parameters, and β, γ, δ,  are thecorresponding feature counts. The posterior for word wi is obtained byupdating αi, βi, γi, δi, and i with the appropriate feature counts.4.3 Surprise DecayAs discussed in section 4.1, once a sentence containing a word is added to thesummary, the Bayesian Surprise score of that word is set to zero in order tominimize redundancy. While this accomplishes that goal, it may impact themeasured importance of words in the larger context of the PT too harshly. Inorder to mitigate this effect, we propose an alternative we call surprise decay,where each time a sentence is added to the summary, the Bayesian Surprisescores of its words are multiplied by some decay factor < 1. Intuitively, thiscorresponds to making these words “less surprising”, rather than removingthe surprise entirely; this allows salient words to continue to contribute to theoverall surprise of sentences in an increasingly limited way as the summaryis generated. The simplest decay factor would be a constant df ∈ [0, 1),resulting in exponential decay of a given word’s Bayesian Surprise score;this is the approach we take in this work.A more advanced decay factor may be a function of both the length ofthe partial thread and the desired summary length, so that the redundancyoversight would be more harsh for shorter summaries but more lenient forlonger partial threads (where significant words have more opportunities tobe repeated); but we leave such explorations for future work.20Chapter 5Implementation Details5.1 Languages and PackagesAll original programming for this thesis was performed in Python 2.7.6.In addition to the default Python installation, the following packages wereused:• gensim - training and applying Word2Vec models• networkx - building sentence graphs and calculating PageRank scores• nltk - word tokenization, an English stopword list, and a Porter stem-mer• numpy and scipy - various numerical and statistical applications• pandas, sqlalchemy and sshtunnel - interacting with the database ofemail threads• xml and json - reading and creating data filesThe other software package used was a topic segmenter developed atUBC [16], consisting of the following stages:1. extracting a fragment quotation graph from the email thread and out-putting all of its branched paths to plain text files2. performing the LCSeg algorithm [10] on the FQG paths to find topicsegments3. building a matrix, M , where each element Mij is the number of timessentences i and j appear in the same topic segment4. performing a min-cut clustering algorithm on M to find an optimalnumber of topic clusters and their constituent sentences5. outputting a list of the email thread sentences tagged with a topic ID215.2. Modifications to Topic Segmentation SoftwareStages 1, 3, and 5 were originally written in Perl; stage 2 was a simplePerl script that called a precompiled software package; and stage 4 waswritten in Matlab.5.2 Modifications to Topic SegmentationSoftwareThe topic segmentation software was originally written for use on emailthreads; it was later modified for use with blog conversations. This work wasperformed using the original version for email threads. During the courseof this work, some bugs were found and fixed, and some improvements weremade.The first (and most significant issue) in the topic segmentation softwarewas related to how the fragment quotation graphs are extracted from theemail threads; more specifically, it concerned the circumstances under whichemail fragments were connected to each other in the graph. The primarypurpose of the software was to construct a graph based on quoted material inemails, and all of the email threads in the dataset used for its developmenthad quoted material (as will be discussed in section 6.1.2); therefore, theemail fragments could be connected using only quoted material, and therewas apparently no need to create graph edges between fragments based onconcerns such as temporal email order or matching senders and recipients.Our published work, however, involved datasets without quoted material inthe email bodies (see sections 6.1.1 and 6.2.1); as a result, email threadswhose FQGs are simple chains were instead represented by disconnectednodes, as depicted in Figure 5.1.Rather than modify the existing Perl implementation, the decision wasmade to re-implement the FQG extraction algorithm in Python and to han-dle these issues in the process.As with the original implementation, the primary factor in linking emailfragments is the presence of quoted material. The original approach tolinking fragments in a FQG was to generate all possible links and thenremove redundant links. This work instead uses a recursive approach thatchecks for existing paths between fragments, so that a removal step is notnecessary. Note that no claim is made that the runtime performance of thisapproach is superior; however, it is arguably more elegant conceptually. Therecursive approach allows links to be drawn in the presence of arbitrarily-arranged nested quotes.When generating fragment quotation graphs, it is assumed that a new225.2. Modifications to Topic Segmentation SoftwareFigure 5.1: FQG representations of a simple email thread with two partic-ipants taking regular turns, but that has no quoted material. On the leftis the expected representation; in the absence of quoted material, the con-nections between fragments (which, in this case, are entire emails) shouldbe determined by the temporal order of the emails and by the fact thatthe participants respond to each other. On the right is the graph producedby the original version of the FQG extraction algorithm; in the absence ofquoted material, the emails were treated as disconnected fragments.fragment can refer to quoted material immediately above or below it (seeFigure 5.2(a)). This may be seen as the result of uncertainty in quotingstyles across users; some users will quote material and then respond below,while others will make a comment with the quote below it as a reference. Ifquoted material is included as part of a new quote, then a nested quotationstructure forms; and users with different quoting styles may produce a nestedstructure that exhibits a mixture of the styles. Such a scenario is depictedin Figure 5.2(b).In order to handle any nested quotation structure, a simplifying assump-tion is made that a new (i.e. non-quoted) fragment is related directly to theleast-nested fragments within the quotation structures immediately aboveand/or below it. Links within the nested quotation structure are assignedrecursively, with the currently least-nested fragment being treated as a non-quoted fragment in the next recursive call. This recursive approach is de-picted in Figure 5.3. At each step, links are only added between two nodesif there is not already a path between them, so no redundant links are gen-erated.In the absence of quoted material, the other factors considered for ex-tracting conversational structure from email threads are the senders/recipients235.2. Modifications to Topic Segmentation Software(a) A simple quotation struc-ture without nested quotes(b) Nested quotationstructures mixing replyposition conventionsFigure 5.2: Examples of possible quotation structures within a single email.of each email, similarities in the subject lines of the emails, and the temporalordering of the emails. Since email threads, like other asynchronous onlineconversations, allow any participant to comment on any part of the conver-sation at any time, exploiting relationships between senders and recipientsis more likely to result in a correct FQG structure.Ideally, each email will have within its metadata a unique ID, and anyreplies to that email will include that ID in a “reply-to” field; unfortunately,this is not guaranteed to be the case. In the absence of such identifiers, andin the absence of quoted material, replies to emails may still be inferred bymatching the senders and recipients of the emails in a manner that does notbreak temporal ordering. More specifically, a participant in the conversationcan only reply to an email they have received, and it is likely that they areresponding to an email they received relatively recently. Therefore, in thiswork it is assumed that an email Enew is a reply to email Eprevious if Epreviousis the most recent email such that the sender of Enew is in its recipientlist. Note that this approach will not work for mailing lists, where therecipient for all emails is the list itself. Also note that user information canbe represented differently in different address books; for example, a personnamed Samantha Tucker could be listed in many different ways, such as“Sam Tucker”, “S Tucker”, “Samantha”, “Tucker, Samantha”, and so on.In this work, simple tokenization of participant identifiers is performed, andany match between any token of a sender and recipient are taken to mean245.2. Modifications to Topic Segmentation SoftwareFigure 5.3: The recursive approach to generating links between fragmentsfor arbitrary nested quotation structures. A single email is depicted here. Ateach step, the fragment being visited is linked to the lowest-depth fragmentsin the quotation structures immediately above and/or below it. The resultis a portion of the FQG for the email thread.they are the same individual. While this would fail in cases where peopleshare names (such as David Johnson and Naomi Johnson), it is an adequatefirst-order approach for a non-trivial problem.It can be assumed that, if some email has subject line X, and if anotheremail’s subject line is some variant of “Re: X” or “Fwd: X”, that the latteremail is a child of the former. However, since email composition conventionsvary in that some will have a single copy of “Re: ” or “Fwd: ”, while otherswill have multiple copies of the substring, the proper relationships may notbe clear. Therefore, for some email with subject line “Re: X” or “Fwd: X”,the most recent email whose subject line is X after having had any “Re: ”or “Fwd: ” tags stripped away is considered to be the parent of that email.As before, we take this as a satisfactory first-order approach and leave morenuanced approaches for future work.While the temporal order of emails is arguably the least relevant factorin determining FQG structure, it does still play a role; trivially, one cannotcomment on a part of the conversation that has not yet happened. Inaddition, one might directly reply to an early part of an email conversationafter having read it in its entirety; in such a case, their comments may wellbe informed by more recent parts of the conversation as well, suggestingthat a link to more recent emails may also be appropriate. To that end,255.2. Modifications to Topic Segmentation Softwaretemporal order is added to the structure by linking a given fragment to thefragments of the email that preceded it.It is important to note that these additional means of creating linksbetween email fragments are only taken into account in the absence of linksfrom quoted material; this is to prevent saturation of the FQG structurewith less “meaningful” links. A possible variant of the FQG structure mayinvolve including all of these links but weighting them according to therelative importance of each type of link; we leave such investigations forfuture work.Another issue that arose during the re-implementation of the FQG ex-traction software had to do with “curated quotes”; namely, that users willsometimes take quoted material and then remove or add content within thequote itself. A simple example would be a user taking only a small quotefrom a longer email and then including tokens like “[snip]” or “...” in thequote to indicate the presence of additional (but less relevant) material. De-termining the significance (or lack thereof) of the curated portions of quotedmaterial is not a trivial problem; rather than using a simple token, a usercould inject a summary of (or other commentary on) the material that hadbeen removed from the quoted material, which could provide useful insightsinto summarization and other tasks. We leave such explorations for futurework, however, and we take the simplifying assumption that curated por-tions of quoted material are to be ignored.Ignoring curated portions of quotes is fairly straightforward. Givensome quoted fragment F = {s1, ..., sn} and the non-quoted fragment F ′ ={s′1, ..., s′m} from which it was taken, we find the first sentence si in F thatis also in F ′ as well as the last sentence sj in F that is also in F ′, and weconclude that the parent fragment of F consists of all sentences in F ′ fromsi to sj . Any sentences in F before si or after sj are assumed to be artefactsof quote curation, as are any additional or missing sentences between si andsj .26Chapter 6DatasetsThis work makes use of data both with and without annotations. Theannotated datasets, discussed in section 6.1, are used by the candidate sum-marizers (described in chapter 4) as well as by the sentence scoring functions(described in sections 3.3 and 3.4) that are used to generate silver standardsummaries for summarizer evaluation. The datasets without annotations,discussed in section 6.2, are much larger and serve a two-fold purpose. First,they are used to train the Word2Vec word embeddings used in the Topic-PR-MMR sentence scoring function. Second, subsets of these datasets are usedto provide background knowledge for the Bayesian Surprise summarizers.6.1 Annotated Corpora6.1.1 Loza et al.The first annotated email dataset used in this work is the “corporate thread”subset of the corpus produced by Loza et al. [21], which was derived fromthe Enron email dataset. The data consists of 62 email threads (from which282 partial threads can be extracted) containing a total of 345 emails and1654 sentences. Each thread was manually annotated by two humans withabstractive and extractive summaries, as well as keyphrases; thus we havetwo sets of annotations per thread. Each summary was permitted to containup to five sentences. Both the chosen summary sentences and the keyphraseswere ranked. This work focuses on extractive summarization, so only theextractive annotations were used. The keyphrases were not used here, be-cause it is not expected that most gold standard annotations will includekeyphrases.In this dataset, the text of the email bodies did not include quotedmaterial; therefore, the FQG extraction algorithm was only able to deriveconversational structure from the additional linking strategies described insection 5.2. Since the formatting of sender and recipient names does notalways seem to be consistent across emails, correct matching was not guar-anteed; however, in the majority of email conversations in this dataset, the276.2. Background Corporaconversational structure seems to follow a single, non-branching thread (withmost of the other cases branching only at the end of the thread), reducingthe overall effect of the FQG on the topic segmentation process (and, there-fore, on the silver standard summaries generated using the Topic-PR-MMRmethod).6.1.2 BC3The second annotated email dataset used in this work is the British ColumbiaConversation Corpus (BC3) created at UBC [33], which was derived fromthe W3C email dataset. The data consists of 40 email threads (from which219 partial threads can be extracted) containing a total of 259 emails and1641 sentences (including quoted material, and after signature removal asdiscussed below). Many of the threads appear to be related to the activitiesof a working group devoted to accessibility features (such as screen reading)for online content. Each thread was manually annotated by three humanswith abstractive and extractive summaries.Unlike the Loza et al. dataset, the sentences in the BC3 extractive sum-maries were not ranked, and no constraints seem to have been placed on thenumber of sentences that could be included in these extractive summaries.In addition, the BC3 dataset annotation includes feature labels that werenot used in this work, such as speech acts, subjectivity, and meta sentences(i.e. sentences that refer to the corresponding email thread itself). As withthe keyphrases in the Loza et al. dataset, it is not expected that these fea-tures would have annotations in email corpora in general; since a goal of thiswork is to present a system that is as generally applicable to email corporaas possible, we do not use these feature labels.Also unlike the Loza et al. corpus, quoted material is found throughoutthe BC3 dataset, resulting in more complex conversational structures. Toillustrate the difference, it is worth noting that the average number of pathsin the FQG structure for BC3 conversations is 3.6 (with a maximum of 10),while for the Loza et al. dataset it is only 1.6 (with a maximum of 5).6.2 Background CorporaThe corpora without annotations are available as large collections of indi-vidual emails; they are not organized by default into their respective emailthreads. As stated previously, these corpora served two purposes: to trainWord2Vec word embeddings, and to serve as background knowledge for theBayesian Surprise-based summarizers. For the first purpose, conversational286.3. Notes on Pre-processingfeatures are not required, and so the full corpora could be used. For the sec-ond purpose, however, the background knowledge being extracted includesconversational features, and so subsets of the corpora were used where theemail conversations had been reconstructed.6.2.1 Enron email datasetSince the Loza et al. annotated dataset was derived from the Enron emailcorpus, it was appropriate to use the Enron corpus (consisting of ∼500kemails) for background knowledge as well. A publicly available collection of70k Enron email conversations was also used [13]; of these threads, ∼43k hadthe metadata required (sender, recipient(s) and subject line in all emails) inorder to extract the desired conversational features.6.2.2 W3C email datasetThe BC3 corpus was derived from W3C emails, and so the correspondingbackground corpus was derived from the same source. As described in Ul-rich et al. [33], the mailing list subset consists of nearly 200k emails, andapproximately 23.6k email threads had been identified by TREC partici-pants “based on reply-to relations and subject overlap”. Of this mailing listsubset, approximately ∼23.1k threads were used in our work for conversa-tional feature extraction.6.3 Notes on Pre-processingIt seems to be common practice when training word embedding models orcalculating word distributions to represent the least frequently seen wordswith a single “unknown” token; this has the benefits of reducing the overallsize of the vocabulary (sometimes drastically) and preventing errors without-of-vocabulary words in unseen data. However, this would also renderthese words less “surprising”; and since the primary factor in our BayesianSurprise-based summarizers is the relative novelty of words, we have nottaken this step.We have, however, set out some constraints on what constitutes a “valid”word or sentence in order to reduce noise in our word embedding models andin the background feature counts of our summarizers. The word constraintsare as follows:• contains at least one letter (This allows us to ignore numbers andpunctuation-only tokens such as “???”. However, since some terms296.3. Notes on Pre-processingmix letters and numbers, such as Intel’s i7 series of processors, wechose not to require tokens to consist only of letters.)• not a stopword (as per the nltk stopwords list)• does not begin with an apostrophe (This handles cases where a wordtokenizer may split a contraction into two tokens)• has length of at most 20 characters (This includes most words whileignoring most of the long strings of seemingly random characters thatfrequently appear in the raw email files.)Sentences are constrained to consist of more than one word, since it is rarelythe case that single-word sentences would be of the utmost importance.Some other minor formatting constraints were also put in place to ignorelines of text representing file attachments.One feature of the BC3 dataset that had to be taken into considerationis that the participants’ signatures were included as part of the correspond-ing email body. (This was not the case for the Loza et al. dataset, wherethe signatures were labeled separately.) These signatures were in some casesquite lengthy, including the participants’ contact information, websites, andin some cases favourite quotes from popular and historical figures. In ad-dition, in some cases these signatures contained boilerplate messages suchas notifications that the email had been scanned by some antivirus softwareor legal notifications that the email was for intended recipients only. Thelongest such signature found in the BC3 dataset was seventeen lines longand thus comprised as many “sentences” in its email. These signatures weremanually removed from the non-quoted material in the BC3 email threadsduring pre-processing. They were not removed from quoted material in or-der to save time, since the new FQG extraction implementation discussedin section 5.2 would treat them as curated material and thus ignore them.30Chapter 7ExperimentsIn our published work [15], we used the Topic-PR-MMR silver standardsummaries to evaluate the Bayesian Surprise-based summarizers; these re-sults are also given and discussed here (see section 7.2). However, there ismore than one possible sentence scoring function that can be used in thesilver standard framework, and it is useful to be able to determine the rel-ative quality of these scoring functions. In the following sections, I discussthe means by which the two sentence scoring functions presented previouslywere evaluated, and I also discuss the experiments evaluating the BayesianSurprise-based summarizers over both full and partial threads.Throughout these evaluations, the ROUGE-1 metric is used, which mea-sures unigram overlap. While a number of other metrics are also availablein the ROUGE toolkit, it is not uncommon for only one or two of the met-rics to be used; and since this thesis primarily involves initial forays intoa novel task, the use of arguably the most fundamental ROUGE metric isappropriate.The ROUGE toolkit options were set such that stemming was performedbut that synonyms were not used and stopwords were not removed, con-sistent with previous evaluations of summarization based on Bayesian Sur-prise [20]. For statistical significance tests we use ANOVA followed by pairedt-tests with Bonferroni corrections.7.1 Sentence Scoring Function EvaluationConsider some partial email thread PT and extractive gold standard EGS,and let EGSPT be the subset of the gold standard sentences found withinPT . The silver standard framework was designed such that, if possible,all of EGSPT is included in the silver standard summary; or, in the casethat EGSPT is too large for the desired silver standard length, that somedesired subset of EGSPT is selected. Therefore, for summary lengths upto the length of EGSPT , it is not possible for a silver standard to be ofworse quality than the naive approach of simply using EGSPT . However,silver standard summaries can differ in quality depending on which sentences317.1. Sentence Scoring Function Evaluationthey select, whether it be in the case of selecting gold standard sentences forshorter silver standards or selecting other partial thread sentences for longersilver standards. Therefore, it is of value to be able to evaluate candidatesentence scoring functions.The goal of a sentence scoring function in the silver standard frameworkis similar to that of a corresponding function in a summarizer: to facilitatethe selection of sentences that humans would select in the same circum-stances. Therefore, if we have access to human-annotated gold standardsummaries, then there should be a way for us to use the existing gold stan-dard to evaluate the silver standard sentence scoring functions. In otherwords, we ask the question: how well does a given sentence scoring functionchoose a gold standard sentence?Answering this question is not trivial; two factors have been identifiedthat may impact the performance of sentence scoring functions. The firstis the amount of gold standard data available to the function; since thesescoring functions are oracular, it stands to reason that the number of sen-tences available to the functions as gold standard annotations changes the“oracularity” of the functions and may thus affect their performance. Thesecond factor is the size of the summary that has been generated so far.Both of the scoring functions used in this thesis are influenced by the in-progress summary; the Topic-PageRank-MMR method uses MMR based onthe maximum similarity between candidate sentences and in-progress sum-mary sentences, and the JS divergence method uses prospective summariesthat include candidate sentences in its comparisons of word distributions.In order to address these concerns, five scenarios have been selected torepresent edge cases for the purpose of estimating performance bounds forsentence scoring functions. Full email threads and extractive gold standardsummaries are used. In each of the scenarios, a subset of the EGS is taken tobe the in-progress summary, and the sentence scoring functions are taskedwith selecting another EGS sentence (or one very like it). What is variedacross the five scenarios are the number of EGS sentences comprising thein-progress summary and the number of EGS sentences made available tothe scoring functions as being part of the EGS.The first two scenarios (labeled held-out) are similar to frequently-used“hold-one-out” evaluation methods. For each sentence in the EGS, the sen-tence is removed, and the candidate sentence scoring functions are taskedto try and retrieve it. In both of these scenarios, the size of the in-progresssummary is maximized. In the first scenario, the removed sentence is notincluded in the EGS supplied to the scoring functions; this minimizes theoracularity of the scoring functions. In the second scenario, the removed sen-327.1. Sentence Scoring Function Evaluationtence is included in the EGS supplied to the scoring functions, maximizingtheir oracularity.The next two scenarios (labeled 1-EGS) attempt to minimize the size ofthe in-progress summary. For each sentence in the EGS, only that sentenceconstitutes the in-progress summary, and the scoring functions are used totry and retrieve one of the others. As before, these two scenarios provideextremes of oracular information to the scoring functions. In the first, noneof the other EGS sentences are identified as such (in other words, the scoringfunctions are only aware of the single EGS sentence that comprises the in-progress summary); and in the second, the entire EGS is made known tothe scoring functions.The final scenario (labeled 0-EGS) has no in-progress summary; thescoring functions are choosing their silver standards’ first sentence. In addi-tion, the entire EGS is made known to the scoring functions. This scenariomost closely simulates that of a partial thread with no EGS sentences in it.Note that this scenario has no corresponding case of minimal oracularity;that would require an empty EGS to be provided to the scoring functions(i.e. there is no gold standard at all), which is not a case for which the silverstandard framework was designed.For each full thread in each of the two annotated datasets, each of thefirst four scenarios was enacted for each of their EGS sentences. Giventhat the final scenario has no single sentence to distinguish it, it could onlybe enacted once for each thread. As stated previously, ROUGE-1 scoreswere obtained comparing the sentence chosen by each function to the EGSsentences not used in the in-progress summary. Since the goal is to selecta single sentence out of a set of one or more EGS sentences, precision is amore useful metric for this evaluation than recall or F-score. For a baseline, arandom sentence selection function was used. Since each candidate sentenceis equally likely in the absence of any other information, the “random”baseline value was obtained by evaluating each of the candidate sentenceswith ROUGE and then simply averaging the results. The results are givenin Table 7.1 and depicted in Figure 7.1.The first (and most obvious) conclusion to be drawn is that the sentencescoring function based on JS divergence is the top performer in all casesand for both datasets. The Topic-PageRank-MMR method was used in ourpreviously published work; however, given these evaluation results, we usethe JS divergence-based silver standards for our more recent summarizerevaluations.The performance drop across all scoring functions and over both datasetsis to be expected. Given a thread and its corresponding EGS (which must337.1. Sentence Scoring Function Evaluationheld-out 1-EGS 0-EGSOracularity Method BC3 Loza BC3 Loza BC3 LozaMinimum TPM 0.118 0.122 0.805 0.483 – –JSD 0.208 0.308 0.849 0.719 – –– random 0.158 0.199 0.741 0.483 0.713 0.537Maximum TPM 0.113 0.124 0.815 0.475 0.880 0.770JSD 0.482 0.717 0.932 0.841 0.958 0.925Table 7.1: ROUGE-1 precision scores for silver standard sentence scoringfunction evaluation experiments. The only two pairs of values that arenot statistically significantly different (p < 0.05) within their dataset andevaluation scenario are bolded.(a) BC3 dataset (b) Loza et al. datasetFigure 7.1: Sentence scoring function performance: ROUGE-1 precisionvs. the number of EGS sentences (out of n) in the in-progress summary.For each scoring function, the upper and lower curves indicate performanceat different levels of oracularity. Since random sentence selection is notoracular, it only has one shorter than the thread itself), each EGS sentence added to the summarydecreases the relative number of available EGS sentences more quickly thanthe relative number of available thread sentences; thus the ratio of remainingEGS sentences to remaining thread sentences decreases, which also resultsin a decreased probability of selecting an EGS sentence in the next iteration(all other things being equal). The apparent exception to this behaviour isthe seeming increase in performance of the random selection method fromthe 0-EGS scenario to the 1-EGS scenario; however, the standard deviationof the 0-EGS precision values is > 0.3, and so the difference between the347.2. Summarizer Evaluationaverage precision values is within a reasonable error bound.The amount of material available for oracular use does not appear tomake a significant difference for the Topic-PageRank-MMR model. It maybe that topic prominence is too coarse-grained a feature to be used on itsown for this task, but that it could be used in combination with otherfeatures to add weight to important sentences. It may also be the casethat future improvements in topic segmentation will result in more accuratetopic prominence scores. However, we must also consider the weight of thetopic prominence to the overall sentence score. As stated previously (seeEquations 3.2 and 3.3), the oracular topic prominence in the TPM scoringfunction has an overall weight of 0.25 (out of 1) in calculating a sentence’simportance score; however, that importance score is also balanced againstthe MMR value (see Equation 3.1, where λ was set to 0.5).It is likely that the single largest contributing factor to the relatively poorperformance of the TPM scoring function in this evaluation is the strengthgiven to the MMR score. As implemented, it weights sentence importanceand novelty equally, which is an intuitively reasonable initial assumption;however, the transition to even a single sentence in the in-progress summaryseems to correlate with a drastic performance drop relative to random sen-tence selection in the evaluation over the Loza et al. dataset. It is possiblethat weakening the MMR component of the TPM scoring function—eitherby increasing the value of λ or by changing the MMR similarity function touse the average rather than the max for its aggregation—will improve TPMperformance overall and increase the relative effects of the other componentsof the TPM function, such as the EGS topic prominence. As stated previ-ously, such fine-tuning efforts are left to future work with larger datasets.7.2 Summarizer EvaluationFor our preliminary (and recently published [15]) summarizer evaluation,we generated candidate summaries for each full thread in the Loza et al.dataset as well as for its corresponding partial threads. First, we generatedsummaries using both of the Bayesian Surprise methods (BS and BSE)discussed in sections 4.1 and 4.2. We then generated additional summariesfor each method using the exponential surprise decay (-d) discussed in sec-tion 4.3 with df = 0.5. We also use a ClueWordSummarizer (CWS) im-plementation as a baseline. Of course, for partial threads consisting onlyof a single email, there can be no clue words; and so the default behaviourchosen for our CWS implementation in this case is to select the first k sen-357.2. Summarizer Evaluationtences of the email, in temporal order. We evaluated the partial threadsummaries using silver standards obtained with the Topic-PageRank-MMR(TPM) sentence scoring function.For our more recent evaluations, we re-evaluate the partial thread sum-marizers using silver standards based on JS divergence (JSD), and we expandour evaluation to include summaries generated from the BC3 dataset.Throughout the summarizer evaluations, we constrain the reference andcandidate summaries to be of the same length (in words), and this constraintis enforced by truncation. Since all of the summarizers in this thesis aregreedy, only the least significant sentence in each summary is truncated. Wethen compare the silver standards and candidate summaries using ROUGE-1F-scores.7.2.1 Evaluation over Full ThreadsInitially, we evaluated the system summaries over the full threads against thehuman-annotated EGS. The system summaries were truncated to the length(in words) of the corresponding EGS. As an additional baseline we useda simple PageRank-based summarizer (PR-MMR) that scores sentencesusing the same sentence graphs as the TPM sentence scoring function andemploys MMR to minimize redundancy. The results for this evaluation overfull threads are given in Table 7.2.Method Full threadsBS 0.573BS-d 0.582BSE 0.566BSE-d 0.573CWS 0.598PR-MMR 0.509Table 7.2: ROUGE-1 mean F-scores for full threads as compared to goldstandard summaries.The results of this experiment over full threads suggest that the BayesianSurprise-based methods perform nearly as well as the clue words-based sum-marizer, and that they all significantly outperform the PR-MMR baseline(p<0.005). In addition, there appears to be some benefit to the more grad-ual redundancy handling provided by surprise decay, though the differencesin these cases do not appear to be significant.367.2. Summarizer Evaluation7.2.2 Preliminary Evaluation over Partial ThreadsThis section describes our preliminary partial thread summarization eval-uations and presents our results and corresponding discussion, as recentlypublished [15]. The next section does the same for our more recent summa-rizer evaluations over partial threads.To evaluate the summarizers over partial threads, we generated one silverstandard summary per annotator and per partial thread using the silverstandard framework. In our initial published evaluation, we used the TPMsentence scoring function. The silver standard and system summaries foreach PT in the Loza et al. dataset were truncated to a fraction of the PTlength (in words). Since the silver standard algorithm generates summariesof arbitrary length, we evaluated the summarizers at both 20% and 30% ofthe PT length.The hypothesis behind our use of Bayesian Surprise-based methods isthat they should work particularly well for PT summarization, because PTscan be rather short, and the identification of the most informative sentenceswould benefit from examining a larger informational context. To test thishypothesis we sorted the 282 PTs being summarized by length and binnedthem into quartiles (see Table 7.3). Since BSE-d is the Bayesian Surprise-based method incorporating all of our extensions, we focused our statisticalanalysis on comparing it to the CWS baseline. The results of this prelimi-nary evaluation are given in Table 7.4.min 25% median 75% maxLength 22 104 197 329 1236Table 7.3: Length (in words) of the partial threads in the Loza et al. datasetused to define the quartile bins.We observe a number of trends in Table 7.4 from the experiments overPTs; however, only some cases exhibit at least marginal significance. Thismay be due in part to limited sample size; and so we argue that furtherwork in applying background knowledge to PT summarization over largerdatasets is warranted.Over the shorter PTs (i.e. first and second quantiles) and at both sum-mary lengths, we observe a trend favoring our hypothesis, namely thatBayesian Surprise-based methods seem to perform better than CWS; forexample, the performance improvement of BSE-d over CWS is at leastmarginally significant (p<0.1) for the second quartile at both summarylengths. Conversely, for the longest PTs (i.e., the fourth quantile), we see377.2. Summarizer Evaluation30% PT length BS BS-d BSE BSE-d CWS pQ1 0.666 0.643 0.632 0.622 0.582 0.310Q2 0.558 0.576 0.560 0.571 0.503 0.041Q3 0.552 0.565 0.540 0.535 0.568 0.088Q4 0.504 0.516 0.510 0.510 0.548 0.011all PTs 0.570 0.575 0.560 0.559 0.550 0.51920% PT length BS BS-d BSE BSE-d CWS pQ1 0.600 0.577 0.558 0.557 0.512 0.402Q2 0.504 0.513 0.495 0.494 0.424 0.078Q3 0.476 0.470 0.469 0.469 0.467 0.933Q4 0.435 0.441 0.439 0.448 0.452 0.808all PTs 0.504 0.500 0.490 0.493 0.464 0.114Table 7.4: ROUGE-1 mean F-scores over partial threads in the Loza etal. dataset (binned into quartiles by length in words) as compared to sil-ver standard summaries (with the TPM sentence scoring function). Valuesare given for both summary lengths (20% and 30% of PT length). BoldedROUGE scores are the highest for their quartile and summary length cat-egory. P-values are given for the comparisons between BSE-d and CWS;underlined p-values indicate at least marginal significance (p<0.1).that the effectiveness of clue words is more fully realized, allowing CWS tooutperform the Bayesian Surprise-based summarizers. While this differenceis significant (p<0.05) for summaries of 30% PT length, it is not significantat 20% PT length; this suggests that Bayesian Surprise-based summarizersmay be more robust against changes in PT summary length than CWS.Surprisingly, the conversational features used to extend the BayesianSurprise method do not seem to have improved summarizer performance.It may be that treating these features as equivalent to word counts is inap-propriate for this task, in which case some other means of extracting thesefeatures as background knowledge should be devised. Alternatively, the in-clusion of additional features, such as the number of times a word is used inthe first sentence of each email in the thread, may improve the performanceof the extended Bayesian Surprise summarizer.As with the full threads, the inclusion of surprise decay seems to providesome benefit, though it appears to hamper the summarizers for the shortestPTs; this trend can be seen at 30% PT length, where BS-d outperforms BSin all quartiles except the first. This suggests that applying surprise decayfactors derived from PT length and desired summary length may improve387.2. Summarizer Evaluationoverall performance; we leave this endeavor for future work.7.2.3 Subsequent Evaluation over Partial ThreadsOur subsequent partial thread summarization evaluation experiments re-tained some aspects of our original evaluation. We generated one silverstandard summary per annotator and per partial thread; the silver stan-dard and system summaries were truncated to a fraction of the PT length;and evaluations were performed at both 20% and 30% PT length.However, there are also some differences in these more recent evaluationexperiments. The first is the use of JSD-based silver standard summaries,due to the results of our sentence scoring function evaluation experiments.The second is the inclusion of the BC3 dataset; and the third is a differ-ent method of binning the partial threads. Binning by length in words isa reasonable and intuitive approach, since we are considering the amountof content available to summarizers; however, in the case of conversations(asynchronous or otherwise) it is also reasonable to consider length in termsof the number of turns taken by the participants, which in our case is thenumber of emails in the partial thread.Consider two partial threads, one of which consists of a single but verylong email, while the other consists of several shorter emails. They may be ofcomparable length in words or sentences; but since there are no clue words inthe single lengthy email, the CWS baseline will treat it differently than thepartial thread with multiple emails. Therefore, while there is intuitive valuein binning by length in words as was done in our preliminary evaluation, itis perhaps more appropriate for our task to bin by length in emails.We have chosen five bins in order to keep the bin sizes similar and appliedthem to both datasets; the number of PTs in each bin are given in Table 7.5.The terms “shorter” and “longer” are relative by nature; for these evalua-tions we will set PT length bins 1-3 as being “shorter” PTs and the 4 and5+ bins as comprising the “longer” PTs. A more nuanced definition (forexample, one that depends on the PT length relative to the length of theentire thread) may be beneficial in future work.The results of our PT summarization evaluation experiments on the twodatasets are given in Tables 7.6 and 7.7.Our evaluation over the Loza et al. dataset (see Table 7.6) suggests bothsimilar and different performance trends as compared to our preliminaryexperiments with the TPM-based silver standards. As before, for the shorterpartial threads, the Bayesian Surprise-based methods seem to outperformthe CWS baseline; indeed, the difference is statistically significant for the397.2. Summarizer EvaluationPT length (emails) 1 2 3 4 5+Loza et al. 61 62 62 50 47BC3 40 40 40 39 60Table 7.5: Number of partial threads in each dataset with the given length(in emails). Note that in the case of the Loza et al. dataset, the first emailin one of the threads consisted of only an email attachment (i.e. it containedno sentences), and so that partial thread was not included in the evaluation.Summaries truncated to 30% PT lengthPT length (emails) BS BS-d BSE BSE-d CWS p1 0.627 0.624 0.615 0.608 0.579 0.5202 0.632 0.621 0.616 0.620 0.603 0.6463 0.615 0.614 0.630 0.635 0.579 0.0304 0.609 0.633 0.626 0.634 0.615 0.3655+ 0.602 0.626 0.612 0.611 0.614 0.912Summaries truncated to 20% PT lengthPT length (emails) BS BS-d BSE BSE-d CWS p1 0.545 0.544 0.510 0.513 0.520 0.8892 0.547 0.531 0.521 0.521 0.502 0.6753 0.539 0.531 0.555 0.562 0.434 0.000154 0.515 0.529 0.545 0.560 0.491 0.0115+ 0.515 0.522 0.533 0.527 0.493 0.283Table 7.6: Partial thread summarizer evaluation results for the Loza et al.dataset, using JS divergence-based silver standard summaries as reference.P-values are italicized, and those indicating at least marginal statisticalsignificance (p < 0.1) are underlined.third bin in both the 20% and 30% PT length categories. Additionally, thereare a number of cases where the surprise decay appears to be of benefit,though the original Bayesian Surprise summarizer without decay shows thetop performance for the shortest partial threads, as it did in our preliminaryevaluation.A marked difference in the new results is in the apparent influence ofthe conversational features on the performance of the Bayesian Surprise-based methods. Whereas there was no prior indication of any benefit fromthese conversational features—and in fact the BSE-d model performed sig-nificantly worse than the CWS baseline in some cases—our BSE-d modelis now the top-performing model in bins 3 and 4 in both summary length407.2. Summarizer Evaluationcategories for the Loza et al. dataset. Furthermore, in this evaluation theBSE-d model appears to outperform the CWS baseline fairly consistently,though the difference is only statistically significant for bin 3.Summaries truncated to 30% PT lengthPT length (emails) BS BS-d BSE BSE-d CWS p1 0.587 0.595 0.546 0.547 0.582 0.4762 0.479 0.502 0.493 0.497 0.591 0.0633 0.503 0.522 0.503 0.508 0.557 0.2334 0.554 0.573 0.545 0.548 0.553 0.8805+ 0.538 0.548 0.566 0.576 0.585 0.616Summaries truncated to 20% PT lengthPT length (emails) BS BS-d BSE BSE-d CWS p1 0.522 0.528 0.476 0.476 0.528 0.3392 0.384 0.391 0.430 0.430 0.517 0.1693 0.397 0.399 0.400 0.401 0.473 0.2004 0.448 0.451 0.473 0.472 0.469 0.9545+ 0.487 0.486 0.552 0.552 0.498 0.065Table 7.7: Partial thread summarizer evaluation results for the BC3 dataset,using JS divergence-based silver standard summaries as reference. P-valuesare italicized, and those indicating at least marginal statistical significance(p < 0.1) are underlined.Our evaluation over the BC3 dataset (see Table 7.7) seems to tell arather different story. The CWS baseline seems to fairly consistently outper-form the Bayesian Surprise-based summarizers, even for the shorter partialthreads; the one exception seems to be the 5+ bin at 20% PT length, whereour BSE-d method marginally outperforms the baseline (p < 0.1). Othertrends, such as the potential benefit of surprise decay, seem to be reflectedin this evaluation as well.Two factors have been identified that can help explain the stark contrastbetween the evaluation over the BC3 dataset and that over the Loza et al.The first is the influence of quoted material in the datasets. As stated previ-ously, the Loza et al. dataset has no quoted material; therefore, the threadsin that dataset have very simple FQG structures, where each fragment con-sists of an entire email, rather than a part of an email that had been selectedas significant by a thread participant. Therefore, any non-stopwords foundanywhere in an email can potentially be clue words, which diminishes to anextent the significance of any given clue word. In other words, the email417.2. Summarizer Evaluationthreads in the Loza et al. dataset can be thought of as the worst-case sce-nario for a CWS-based summarizer, and so we would expect its relativeperformance to improve on email threads with more complex conversationalstructures, where thread participants purposefully quote selected material.The second factor was discussed in sections 4.1 and 6.3: terms thatare extremely rare (and thus surprising) can artificially inflate the appar-ent importance of less-relevant sentences. That this factor played a rolein the evaluation results is evidenced by a manual inspection of the sum-maries generated by the silver standard framework and the partial threadsummarizers.The silver standard and candidate summaries for the partial threads inbin 2 (where the performance differences were closest to statistical signifi-cance) were collected, and these partial threads were filtered according tothe following criteria:• All of the Bayesian Surprise-based methods selected some candidatesentence ci• The CWS baseline did not select ci• None of the silver standards corresponding to any of the annotatorsincluded ciFourteen of the forty partial threads in bin 2 matched all of these criteria.A visual inspection revealed that almost all of the cases matching thesecriteria involved unusual words, most of which were acronyms, as shown inTable 7.2.3.While our approach was to minimize the amount of pre-processing per-formed that might impact the novelty of tokens in the partial threads,the results of the BC3 evaluation indicate that some carefully-applied pre-processing is warranted in future work. Another approach that may proveeffective would be to choose as a background corpus the email history of aparticular user or relevant group of users, rather than a single corpus forall users within an organization; that way, the acronyms and other specialterms frequently used by that user or group would become less surprisingrelative to those used elsewhere.If the Loza et al. dataset can be said to represent the worst-case sce-nario for the CWS baseline, it may not be unreasonable to suppose thatthe BC3 dataset represents something similar for Bayesian Surprise-basedmethods. If that is indeed the case, then the fact that the differences be-tween the Bayesian Surprise-based summarizers and the CWS baseline arenot statistically significant (p > 0.05) is reassuring.427.2. Summarizer EvaluationFinally, in light of these new results, we repeat our earlier assertion thatfurther research in partial thread summarization is warranted, using largerdatasets.437.2. Summarizer EvaluationSentence CategoryQuoting Keith Instone, named entityLike all CHIs, there will be many competing activities,so the sooner we can set this up the more likely I canattend.acronymplease note that this meeting would be on at the sametime as the ATIA meeting in Orlando, Florida.acronymNashville, mid october locationMLK Day is never better than in Savannah! acronym,locationI would guess that most people who develop web pages areamateurs (in the original sense of the word: from amoreor amour: an activity done out of love.)non-EnglishFurthermore, I expect the ER IG to be instrumental indefining and refining the exact deliverables of the WGgroup.acronymsI hope it works out because the WAI’s influence on theoverall W3C and the strength-through-diversity we’reachieving is a great model for just about everything andit’s starting to bear fruit.acronymsThe problem is funnier (after people get over the incon-venience of learning about it) when a screen reader uses afinnish or swedish synthesiser but provides its cues andprompts in english...spellingNote that the page is extremely table rich (a euphasismof sorts)spelling- they may have had a TBI (traumatic brain injury) orhave a Learning Disability.acronymChris Lilley, Brian Stell and others have been dis-cussing the rash of irate, get me off this list mesages thelistserv has received, lately.variousTable 7.8: Examples of sentences in the BC3 dataset selected by BayesianSurprise-based methods that were not selected by the silver standard frame-work or the CWS baseline, along with categories of “surprising” tokens(bolded) found in each sentence.44Chapter 8Conclusions and FutureWorkIn this work, we have defined and motivated the partial thread summa-rization problem. We have proposed a framework that uses gold standardsummaries of complete threads in order to build oracular silver standardextractive summaries of arbitrary length for partial email threads. We havedevised a means to evaluate sentence scoring functions to be used in the sil-ver standard framework, and we have found that our scoring function basedon JS divergence significantly outperforms the Topic-PageRank-MMR func-tion used in our previously published work.We have applied an unsupervised summarization method that incorpo-rates background knowledge to partial thread summarization, extended itwith conversational features, and modified the mechanism by which it han-dles redundancy. We have observed trends in our experimental results thatsuggest the following:• Bayesian Surprise-based methods exhibit comparable or improved per-formance relative to the state-of-the-art unsupervised baseline in mostcases, including both shorter and longer partial email threads• Extending Bayesian Surprise-based methods with conversational fea-tures can improve performance• Softening the redundancy management mechanism of Bayesian Surprise-based summarizers can improve performance• Concerns about errors due to novel (but not inherently important)tokens are justified in some cases, which implies a need for some carein selecting and/or pre-processing the background corpus used for thetaskAlthough in our experiments we did not find consistently significant im-provements using Bayesian Surprise-based methods on partial threads overthe baseline, we argue that the observed trends support our assertion that45Chapter 8. Conclusions and Future Workthe potential benefit of background knowledge to PT summarization (andemail summarization in general) should be further investigated with largerdatasets.There are multiple directions of future work, many of which have beenmentioned elsewhere in this thesis. While an obvious direction is the con-tinued development of extractive PT summarization algorithms (eg. by ap-plying recent summarization techniques such as ILP [28] or neural network-based summarizers [3]), another is the abstractive summarization of partialthreads, following for instance the recent work of Mehdad et al. [22]Another direction of future work is the application of the silver standardframework to other domains. Closely related domains, such as other asyn-chronous conversations (discussion forums, blog conversations, social media,etc.), are the most obvious area of direct application of the framework; how-ever, there are other domains where some human annotation is available butreference summaries for different portions of the source document(s) are de-sired. For example, the abstract of an academic paper could be thought ofas an abstractive summary; and so the silver standard framework could beadapted to summarize certain sections of interest within the paper given theabstract.Future approaches to summarizer evaluation using silver standards mayinclude employing multiple scoring functions concurrently. Just as we preferto use multiple human annotators to examine inter-annotator agreementand ensure high quality annotations, making use of multiple high-qualitysentence scoring functions in an ensemble method may further improve thequality of silver standards.Future work may also include finding additional ways to incorporatebackground knowledge into email summarization. For example, backgroundknowledge need not be applied in isolation; Bayesian Surprise scores maybe used in tandem with other features to develop summarizers that aremore robust against changes in document length. Additional ways to adaptthe Bayesian Surprise algorithm may also prove beneficial. For example,the surprise redundancy mechanism may be altered to depend on whether atoken has been seen in the most recent prior email, making the surprise scorealso depend on an additional, more fine-grained layer of prior knowledge.An advantage to the study of partial thread summarization is that itmay reveal whether current summarization techniques perform differentlyon in-progress threads than on complete, archived ones. For example, ifa summarizer uses features that may depend on the entire email thread(eg. the relative positions of sentences in the thread, completed dialog acts,etc.), then those features may have a different significance when applied to46Chapter 8. Conclusions and Future Workpartial threads than they do for complete threads. To further the study ofpartial thread summarization, another direction of future work is a thoroughcategorization of the differences between full and partial threads, as well asdifferences between partial threads at different stages of development. Suchdifferences may be found, for example, in lexical and topic diversity, as wellas dialog act initiation and/or completion.47Bibliography[1] Regina Barzilay and Michael Elhadad. Using lexical chains for textsummarization. Advances in automatic text summarization, pages 111–121, 1999.[2] Taylor Berg-Kirkpatrick, Dan Gillick, and Dan Klein. Jointly learningto extract and compress. In Proceedings of the 49th Annual Meetingof the Association for Computational Linguistics: Human LanguageTechnologies-Volume 1, pages 481–490. Association for ComputationalLinguistics, 2011.[3] Ziqiang Cao, Furu Wei, Li Dong, Sujian Li, and Ming Zhou. Rankingwith recursive neural networks and its application to multi-documentsummarization. In AAAI, pages 2153–2159, 2015.[4] Jaime Carbonell and Jade Goldstein. The use of mmr, diversity-basedreranking for reordering documents and producing summaries. In Pro-ceedings of the 21st annual international ACM SIGIR conference on Re-search and development in information retrieval, pages 335–336. ACM,1998.[5] Giuseppe Carenini, Gabriel Murray, and Raymond Ng. Methods formining and summarizing text conversations. Synthesis Lectures on DataManagement, 3(3):1–130, 2011.[6] Giuseppe Carenini, Raymond T Ng, and Xiaodong Zhou. Summarizingemails with conversational cohesion and subjectivity. In ACL, volume 8,pages 353–361, 2008.[7] Hoa Trang Dang and Karolina Owczarzak. Overview of the tac 2008update summarization task. In Proceedings of text analysis conference,pages 1–16, 2008.[8] Gu¨nes Erkan and Dragomir R Radev. Lexrank: Graph-based lexicalcentrality as salience in text summarization. Journal of Artificial In-telligence Research, 22:457–479, 2004.48Bibliography[9] Pascale Fung, Grace Ngai, and Chi-Shun Cheung. Combining optimalclustering and hidden markov models for extractive summarization. InProceedings of the ACL 2003 workshop on Multilingual summarizationand question answering-Volume 12, pages 21–28. Association for Com-putational Linguistics, 2003.[10] Michel Galley, Kathleen R McKeown, Eric Fosler-Lussier, and HongyanJing. Discourse segmentation of multi-party conversation. In Proceed-ings of the 41st Annual Meeting of the Association for ComputationalLinguistics, 2003.[11] Jun Hatori, Akiko Murakami, and Junichi Tsujii. Multi-topical discus-sion summarization using structured lexical chains and cue words. InInternational Conference on Intelligent Text Processing and Computa-tional Linguistics, pages 313–327. Springer, 2011.[12] Tsutomu Hirao, Yasuhisa Yoshida, Masaaki Nishino, Norihito Yasuda,and Masaaki Nagata. Single-document summarization as a tree knap-sack problem. In EMNLP, volume 13, pages 1515–1520, 2013.[13] Emily Jamison and Iryna Gurevych. Headerless, quoteless, but nothopeless? using pairwise email classification to disentangle emailthreads. In RANLP, pages 327–335, 2013.[14] Wei Jin, Shafiq Joty, Giuseppe Carenini, and Raymond Ng. Detectinginformative blog comments using tree structured conditional randomfields. NW-NLP, Microsoft Research, Redmond, 2012.[15] Jordon Johnson, Vaden Masrani, Giuseppe Carenini, and RaymondNg. Generating and evaluating summaries for partial email threads:Conversational bayesian surprise and silver standards. In Proceedingsof the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages263–272, 2017.[16] Shafiq Joty, Giuseppe Carenini, and Raymond T Ng. Topic segmenta-tion and labeling in asynchronous conversations. Journal of ArtificialIntelligence Research, 47:521–573, 2013.[17] Kirill Kireyev. Using latent semantic analysis for extractive summa-rization. In Proceedings of text analysis conference, volume 2008, 2008.[18] Chin-Yew Lin. Rouge: A package for automatic evaluation of sum-maries. Text Summarization Branches Out, 2004.49Bibliography[19] Annie Louis and Ani Nenkova. Automatically assessing machine sum-mary content without a gold standard. Computational Linguistics,39(2):267–300, 2013.[20] Annie P Louis. A bayesian method to incorporate background knowl-edge during automatic text summarization. Association for Computa-tional Linguistics, 2014.[21] Vanessa Loza, Shibamouli Lahiri, Rada Mihalcea, and Po-Hsiang Lai.Building a dataset for summarization and keyword extraction fromemails. In LREC, pages 2441–2446, 2014.[22] Yashar Mehdad, Giuseppe Carenini, and Raymond T Ng. Abstractivesummarization of spoken and written conversations based on phrasalqueries. In Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1: Long Papers), volume 1,pages 1220–1230, 2014.[23] Rada Mihalcea and Dragomir Radev. Graph-based natural languageprocessing and information retrieval. Cambridge University Press, 2011.[24] Rada Mihalcea and Paul Tarau. Textrank: Bringing order into texts.Association for Computational Linguistics, 2004.[25] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Effi-cient estimation of word representations in vector space. arXiv preprintarXiv:1301.3781, 2013.[26] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and JeffDean. Distributed representations of words and phrases and their com-positionality. In Advances in neural information processing systems,pages 3111–3119, 2013.[27] Gabriel Murray and Giuseppe Carenini. Summarizing spoken and writ-ten conversations. In Proceedings of the Conference on Empirical Meth-ods in Natural Language Processing, pages 773–782. Association forComputational Linguistics, 2008.[28] Gabriel Murray, Giuseppe Carenini, and Raymond Ng. Generating andvalidating abstracts of meeting conversations: a user study. In Proceed-ings of the 6th International Natural Language Generation Conference,pages 105–113. Association for Computational Linguistics, 2010.50Bibliography[29] NK Nagwani. Summarizing large text collection using topic modelingand clustering based on mapreduce framework. Journal of Big Data,2(1):1, 2015.[30] Ani Nenkova, Kathleen McKeown, et al. Automatic summarization.Foundations and Trends R© in Information Retrieval, 5(2–3):103–233,2011.[31] Tatsuro Oya and Giuseppe Carenini. Extractive summarization anddialogue act modeling on email threads: An integrated probabilisticapproach. In 15th Annual Meeting of the Special Interest Group onDiscourse and Dialogue, page 133, 2014.[32] Owen Rambow, Lokesh Shrestha, John Chen, and Chirsty Lauridsen.Summarizing email threads. In Proceedings of HLT-NAACL 2004:Short Papers, pages 105–108. Association for Computational Linguis-tics, 2004.[33] Jan Ulrich, Gabriel Murray, and Giuseppe Carenini. A publicly avail-able annotated corpus for supervised email summarization. In Proc. ofaaai email-2008 workshop, chicago, usa, 2008.51


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items