UBC Theses and Dissertations


Domain adaptation for summarizing conversations Sandu, Oana 2011

Full Text


Domain Adaptation for Summarizing Conversations

by Oana Sandu
B. Science, McGill University, 2008

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Computer Science)

The University Of British Columbia (Vancouver)
April 2011
© Oana Sandu, 2011

Abstract

The goal of summarization in natural language processing is to create abridged and informative versions of documents. A popular approach is supervised extractive summarization: given a training source corpus of documents with sentences labeled with their informativeness, train a model to select sentences from a target document and produce an extract. Conversational text is challenging to summarize because it is less formal, its structure depends on the modality or domain, and few annotated corpora exist. We use a labeled corpus of meeting transcripts as the source, and attempt to summarize a different target domain, threaded emails. We study two domain adaptation scenarios: a supervised scenario in which some labeled target domain data is available for training, and an unsupervised scenario with only unlabeled data in the target and labeled data available in a related but different domain. We implement several recent domain adaptation algorithms and perform a comparative study of their performance. We also compare the effectiveness of using a small set of conversation-specific features with a large set of raw lexical and syntactic features in domain adaptation. We report significant improvements of the algorithms over their baselines. Our results show that in the supervised case, given the amount of email data available and the set of features specific to conversations, training directly in-domain and ignoring the out-of-domain data is best.
With only the more domain-specific lexical features, though overall performance is lower, domain adaptation can effectively leverage the lexical features to improve in both the supervised and unsupervised scenarios.

Preface

This work is informed by past research by Gabriel Murray and Giuseppe Carenini in summarizing conversational data. Sections 2.4 Summarization of conversational data, 4.1 Data, 4.2.1 Conversational features, and 4.2.2 Lexico-syntactic features draw from their previous work. I wrote the rest of the manuscript, and also researched and implemented domain adaptation methods, conducted the experiments, and drew conclusions. A version of chapters 4 and 5 has been published: Oana Sandu, Giuseppe Carenini, Gabriel Murray, and Raymond Ng. 2010. Domain adaptation to summarize human conversations. In Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing (DANLP 2010), pages 16-22, 2010. I conducted the experiments, and the paper was written in conjunction with the other authors.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1  Introduction
2  Summarization Background
   2.1  Summarization problem
        2.1.1  Summary of a single or multiple documents
        2.1.2  Generic or query-based summary
   2.2  Extractive and abstractive summarization
        2.2.1  Comparison of extractive and abstractive summarization approaches
        2.2.2  Sentence selection
   2.3  Evaluation of summarization
        2.3.1  General classification metrics
        2.3.2  Metrics specific to summarization
   2.4  Summarization of conversational data
        2.4.1  Nature and challenges of conversations
        2.4.2  Meetings summarization
        2.4.3  Emails summarization
        2.4.4  Cross-modality summarization
   2.5  Conclusion
3  Domain Adaptation Background
   3.1  Domain adaptation problem
   3.2  Supervised methods
        3.2.1  Instance weighting
        3.2.2  Adding prediction as an input feature
        3.2.3  Using source data for a prior
        3.2.4  Maximum entropy genre adaptation model
        3.2.5  Feature copying method (easyadapt)
   3.3  Unsupervised and semi-supervised methods
        3.3.1  Baseline transfer
        3.3.2  Self-training, co-training, and boosting methods
        3.3.3  Structural correspondence learning (SCL)
        3.3.4  Easyadapt with unlabeled data
   3.4  Summary
4  Extractive Summarization with Domain Adaptation
   4.1  Data
        4.1.1  AMI corpus
        4.1.2  W3C corpus
        4.1.3  BC3 corpus
        4.1.4  Enron corpus
   4.2  Features
        4.2.1  Conversational features
        4.2.2  Lexico-syntactic features
        4.2.3  Feature selection
        4.2.4  Combining conversational and lexical features
   4.3  Classification approach
   4.4  Evaluation metrics
   4.5  Domain adaptation methods
        4.5.1  Supervised baseline indomain
        4.5.2  Unsupervised baseline transfer
        4.5.3  Merge
        4.5.4  Ensemble
        4.5.5  Pred
        4.5.6  Easyadapt
        4.5.7  Easyadapt++
        4.5.8  Selftrain
        4.5.9  Original SCL
        4.5.10  SCL with projected features only
        4.5.11  Setting SCL parameters
   4.6  Summary
5  Experiments
   5.1  Distance between domains
        5.1.1  Experiment to differentiate between domains
        5.1.2  Discussion
   5.2  Comparison between domain adaptation methods
        5.2.1  Features used
        5.2.2  Supervised and unsupervised scenarios
   5.3  Results
        5.3.1  Supervised domain adaptation results
        5.3.2  Unsupervised domain adaptation results
        5.3.3  ROC curve plots
   5.4  Discussion
        5.4.1  Supervised domain adaptation
        5.4.2  Unsupervised domain adaptation
        5.4.3  Effectiveness of domain adaptation
   5.5  Conclusion from the experiments
6  Further Analysis and Future Work
   6.1  Amount of labeled data
        6.1.1  Dependence on the amount of source data
        6.1.2  Dependence on the amount of target data
   6.2  Future work
        6.2.1  Using classifiers trained on different feature sets
        6.2.2  Weighting data
        6.2.3  Semi-supervised SCL
   6.3  Summary
7  Conclusion
Bibliography

List of Tables

Table 2.1  Example of an abstractive and an extractive summary (originally in [23])
Table 2.2  Confusion matrix
Table 3.1  Capitalization error by Maximum Entropy Markov Model (MEMM) with and without in-domain training data
Table 3.2  Results for adaptation with MEGAM
Table 3.3  Part of Speech (POS) tagging results with Structural Correspondence Learning (SCL)
Table 3.4  Sentiment analysis results with SCL from [8]
Table 4.1  Conversational features as proposed in [33]
Table 4.2  Lexical feature types
Table 5.1  Distance estimate between domains and within a domain

List of Figures

Figure 3.1  Co-training pseudocode
Figure 3.2  Self-training pseudocode
Figure 4.1  Effect of varying the number of features in SCL
Figure 4.2  Effect of varying number of pivots in SCL on accuracy and time
Figure 5.1  Results for supervised domain adaptation
Figure 5.2  Supervised scenario auROC with the conversational features
Figure 5.3  Supervised scenario auROC with the lexical features
Figure 5.4  Supervised scenario auROC with the merged set of features
Figure 5.5  Comparing supervised performance with the different feature sets
Figure 5.6  Results for unsupervised domain adaptation
Figure 5.7  Unsupervised scenario auROC with the conversational features
Figure 5.8  Unsupervised scenario auROC with the lexical features
Figure 5.9  Unsupervised scenario auROC with the merged set of features
Figure 5.10  Comparing unsupervised performance with the different feature sets
Figure 5.11  ROC curves of domain adaptation methods with the conversational features
Figure 5.12  ROC curves of domain adaptation methods with the lexical features
Figure 5.13  Baselines and best domain adaptation methods with conversational features
Figure 5.14  Baselines and best domain adaptation methods with lexical features
Figure 6.1  Domain adaptation performance vs. amount of source data
Figure 6.2  Domain adaptation performance vs. amount of target data

Glossary

DA  dialogue act
NLP  Natural Language Processing
MEMM  Maximum Entropy Markov Model
MEGAM  Maximum Entropy Genre Adaptation Model
SCL  Structural Correspondence Learning
ASO  Alternating Structure Optimization
POS  Part of Speech
WSJ  Wall Street Journal
TF-IDF  Term Frequency - Inverse Document Frequency
MMR  Maximal Marginal Relevance
LSA  Latent Semantic Analysis
ROC  Receiver-Operator Characteristic
SVD  Singular Value Decomposition
ROUGE  Recall-Oriented Understudy for Gisting Evaluation
EM  Expectation-Maximization algorithm

Acknowledgments

For inspiring me, professors Giuseppe Carenini, Gabriel Murray, and Raymond Ng, as well as the students in the UBC summarization group, Shafiq, Nicholas, Hammad, Shama, and in the laboratory of computational intelligence as a whole. NSERC and BIN for funding me to tackle challenging problems.
My fellow graduate students who have become friends, Sancho, Abigail, Doug, April, Ben, Patrick, Kevin, Cody, Egor, and all those who show up on Fridays for decompressing. My family for their unwavering support despite our distance. I dedicate this thesis to my maternal grandmother Maria Ignat, 1930-2011, for always speaking her truth.

Chapter 1  Introduction

Conversations, or multi-party interactions through speech or text, are present everywhere in people's personal and professional lives. The modern shift from face-to-face conversations to conversations increasingly mediated by technology has resulted in many different types of conversational data, such as emails, meeting recordings, phone calls, instant messages, chats, and online forums. The natural language processing community, which has extensively studied certain problems on written monologue data such as books and newsprint, is currently tackling some of them on conversational data. Since dialogue can involve several participants and be less coherent, less fluent, and more fragmented than monologue, conversations pose new challenges.

This thesis investigates the summarization of human conversations. The ability to automate this task can help with tracking, retrieving, and understanding both informal interactions and those within a corporation. The summarization of dialogue data is challenging because a real-life conversation can comprise a long sequence of exchanges that may be synchronous or asynchronous, and may span different modalities.

The general problem of text summarization can be described according to different aspects, as we outline in Chapter 2. A distinction crucial to our work is whether a system can be trained with a corpus of examples which have sample summaries produced by humans, i.e., labeled data, or only with naturally occurring conversations, i.e., unlabeled data. Because conversational data raises privacy concerns, corpora of real-world data are rare.
Conversational modalities, which we also refer to as domains, vary in the amount of labeled data available, and for most there is no large standard corpus.

In our approach to summarization, we seek to use machine learning to select informative sentences from a conversation that we then concatenate to obtain an extractive summary of the conversation. We incorporate the available labeled data, and supplement it with unlabeled data that is generally more plentiful. We hypothesize that labeled data in one modality can be useful in training a summarizer for another modality, though different modalities vary widely in terms of their structural and lexical attributes. The challenges of training with data from outside the domain of interest are the subject of a recent avenue of study in machine learning called domain adaptation. Domain adaptation aims to use labeled data in a well-studied source domain and a limited amount of labeled data from a different target domain to train a supervised model that performs well in the target domain.

In this thesis, we investigate several domain adaptation algorithms for the purpose of using the AMI corpus of labeled data in the source domain of meetings to improve summarization in the target domain of email threads, where we use the small labeled BC3 corpus, a subset of the large unlabeled W3C corpus. We implement baselines of using data from only one domain, simple domain adaptation techniques, and state-of-the-art domain adaptation algorithms which we tuned for our problem. As only some of the approaches can incorporate unlabeled data, we consider separate scenarios for comparing performance, depending on what type of data is available in the target domain.
We also investigate domain adaptation using a small set of focused, general conversational features which have proven useful for determining sentence informativeness in the email and meetings modalities [33] in supervised extractive summarization, and compare it with a large set of simple lexical features.

From the results of our comparative studies, we glean several important insights. In the supervised scenario, where in-domain labeled data is available for training, the domain adaptation algorithms we tested perform no better than the baseline of using the in-domain data, demonstrating that in this scenario domain adaptation is not required. However, in the semi-supervised scenario with no labeled data to train on in the target domain, domain adaptation is helpful in determining sentence informativeness in the target by leveraging available labeled data from the source domain and unlabeled data. In particular, the structural correspondence learning algorithm yields a large and significant improvement in extractive summarization of email threads. This also indicates that domain adaptation can be useful for summarizing conversations in domains that have been less well studied than meetings and email. As for the usefulness of features for training an adapted classifier, we observe that conversational features are more appropriate than lexical features for the supervised problem, whereas in semi-supervised adaptation, lexical features are much more helpful.

We start this thesis in Chapter 2 by describing the summarization problem and surveying past research in methods and evaluation for general summarization and for conversations. We follow this with a survey of domain adaptation strategies in Chapter 3. In Chapter 4 we describe our setting for extractive summarization of conversational data with domain adaptation, including the data sets, features, and domain adaptation methods used. We then detail our experiments and their results in Chapter 5.
A discussion of the results and suggested future work follows in Chapter 6, before our conclusion in Chapter 7.

Chapter 2  Summarization Background

Summarization in natural language processing is the task of taking an input document and creating an abridged version that preserves the important points of the original. This chapter provides a background on automatic summarization systems. We first characterize summarization along different dimensions: single versus multiple documents as input, generic summarization versus query-based, supervised versus unsupervised selection of sentences, and extractive versus abstractive summary output. We focus the survey on work related to our approach to summarization, and outline previous work on summarizing conversations. In addition, we discuss metrics used to evaluate extractive summarization.

2.1  Summarization problem

Given today's information overload, it can be very useful to automatically compress the content of documents. Many of us are exposed daily to a large amount of text from different sources, from documentation and reports which we need to read and understand for our job, email discussions involving several participants with different viewpoints, to online news articles and blog posts, and web page hits from our ubiquitous search engine queries. The objective of managing this data has motivated both academic research into summarization and the use of natural language processing in many practical applications. For a general definition of the summarization problem, [23] suggests:

    Text summarization is the process of distilling the most important information from a text to produce an abridged version for a particular task and user.

Several approaches to automatic summarization have been researched, varying with the type of documents and the purpose of the summaries. We will categorize summarization according to different dimensions.
We briefly define differences in the inputs to summarizers, then we focus on the distinction between extractive and abstractive summarization, and finally we outline supervised versus unsupervised sentence selection strategies.

2.1.1  Summary of a single or multiple documents

Recent research in summarization has focused on either creating a summary of a single document, or creating a summary of multiple related documents. Multi-document summarization is more complex since, for instance, the summarizer also has to reorder the important points in the documents to present them in a logical sequence, and avoid repeating points common to several documents. Summarizing email threads shares some of the complications of the multi-document summarization problem.

2.1.2  Generic or query-based summary

In generic document summarization, the input to the summarizer is a text, and the output is a summary presenting the information in the text in a compressed manner. In contrast, in query-based summarization, the user specifies a particular information need, for example by entering a set of keywords to query a search engine. The output of a query-based summarizer must not only be a good summary of the text, but also be tailored to the query [23]. In this work, we focus on generic summarization.

2.2  Extractive and abstractive summarization

An extractive summary of a text contains sentences found in the original text, whereas an abstractive summary is composed of sentences that reformulate information found in the text. Table 2.1 illustrates these two types of summarization output. We reproduce Abraham Lincoln's famous Gettysburg Address, along with an example of an abstractive summary and an extractive summary of this monologue. The manually generated summaries in the example were taken from [23] and modified to be of comparable length.
Now that we have defined these two approaches, we describe the differences between extractive and abstractive summarization, before focusing on content selection in the extractive approach as it is more relevant to our domain adaptation approach.

2.2.1  Comparison of extractive and abstractive summarization approaches

In the extractive approach, sentences from the original text are selected and concatenated into a summary, which is meant to present the most important information from the text. Jurafsky [23] identifies the following three steps in creating an extractive summary:

1. content selection: selecting sentences from the original text to use in the summary
2. information ordering: arranging the selected sentences and structuring the summary
3. sentence realization: cleaning up the ordered sentences to form a coherent summary

Extractive summarization of text has been studied more extensively than abstractive summarization. Its most processing-intensive step, content selection, usually involves extracting features from sentences and using machine learning to rate their informativeness. The next steps in single-document extractive summarization can be quite straightforward: many summarizers perform no reordering or cleaning up of selected sentences and present them as they were in the original text. The information ordering component is more important in multi-document summarizers. Its goals are to reduce redundancy and to order sentences in a coherent way. In sentence realization, the content can be made more readable either by parsing the sentences and applying rules for which parts to remove, or by supervised learning on human-written summaries. For clarity, the sentence realization step can also resolve pronouns and ensure that the first and subsequent mentions of an entity are resolved [23].

Table 2.1: Example of an abstractive and an extractive summary (originally in [23])

Original document:
Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate – we can not consecrate – we can not hallow – this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us – that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion – that we here highly resolve that these dead shall not have died in vain – that this nation, under God, shall have a new birth of freedom – and that government of the people, by the people, for the people, shall not perish from the earth.

Extractive summary:
Four score and seven years ago our fathers brought forth upon this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation can long endure. We have come to dedicate a portion of that field. From these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion - that government of the people, by the people, for the people, shall not perish from the earth.

Abstractive summary:
This speech by Abraham Lincoln commemorates soldiers who have died in the Battle of Gettysburg. It addresses the reason why they gave their lives, and the meaning of their sacrifice. He reminds us that America was founded on principles of freedom and equality, and that the civil war is a fight to maintain those values.

One disadvantage of extractive summarization is that if the relevant sentences are simply copied and pasted, the resulting summary can seem incoherent to the reader. For instance, in the extractive summary of the Gettysburg Address in Table 2.1, what the "field" and the "honored dead" refer to is not explicit from the context.

Abstractive summaries can be of higher quality, as can be observed from comparing the abstractive and extractive summaries in Table 2.1. However, abstractive summarization is more complicated to automate and, as a drawback, may require applying additional domain-specific knowledge. The goal of abstractive summarization is to create summaries that emulate human-written summaries. An abstractive system would extract information from the document, represent it in an internal structure, and then draw inferences and compress knowledge scattered across several input sentences. A natural language generation component is then required to convert the processed internal representation into a textual summary. Spärck Jones [45] describes abstractive summarization as a different three-step process:

1. interpretation of the source text into a representation
2. transformation to summarize the representation
3. generation of the text of the summary

In abstractive summarization, the interpretation and transformation steps are parallel to content selection in extractive summarization, though more complex. Abstractive systems are often tailored to a specific domain to better recognize the relevant information in the input.
For instance, if the system were intended to create summaries of corporate meetings in which action items are important, the summarizer could first identify action item dialogue acts (DAs) in a meeting, then select sentences from it that support each particular item, and finally summarize them into a single sentence.

Abstractive and extractive summarization are not mutually exclusive: sentence realization in extractive summarization can be considered a simple form of abstraction, and abstractive summarization can use the sentence informativeness output of the content selection component of an upstream extractive summarizer. As extractive summarization is more developed and more amenable to domain adaptation, we focus on the extractive approach in the rest of this work. We nevertheless recognize that abstractive summarization research holds a lot of promise for practical applications.

2.2.2  Sentence selection

A key question in summarization, and in particular in the extractive approach, is how to rank the importance of sentences in the document so that a subset can be selected for the summary. An important goal of this thesis is to find a good algorithm for categorizing the sentences in a conversation as salient or not in order to perform content selection. The following is an overview of previous work in unsupervised and supervised content selection.

Unsupervised content selection

One approach to sentence selection, which we refer to as unsupervised since it needs no human labeling, is to rank sentences in a given text by word statistics. An early example of the statistical approach is Luhn's [29] summarizer for literature articles, which counts frequencies of terms after the stopwords are removed; recall that stopwords are common English words that don't add information content, e.g., "the".
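The frequency-based ranking just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the summarizer of [29]: the stopword list, the tokenizer, the choice of averaging term frequencies over a sentence's content words, and the function names `tokenize`, `score_sentences`, and `extract` are all hypothetical choices made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use much longer lists).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that", "on", "for"}


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score_sentences(sentences):
    """Score each sentence by the average document frequency of its content words."""
    # Content words are the tokens left after stopword removal.
    content = [[w for w in tokenize(s) if w not in STOPWORDS] for s in sentences]
    # Term frequencies are counted over the whole document.
    freq = Counter(w for words in content for w in words)
    scores = []
    for words in content:
        # Sentences with no content words get a score of zero.
        scores.append(sum(freq[w] for w in words) / len(words) if words else 0.0)
    return scores


def extract(sentences, k):
    """Return the k highest-scoring sentences, kept in their original order."""
    scores = score_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Averaging rather than summing the frequencies is one way to avoid favouring long sentences; other normalizations are equally plausible. Calling `extract(sentences, k)` on a list of sentence strings returns a k-sentence extract.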
The unsupervised content selection process can be achieved by computing a topic signature of the document as the words that are most salient, then scoring sentences based on the overlap of their words with the topic signature. An example of a successful summarizer that uses word frequency as a measure of saliency is SumBasic [40]. However, even after removing stopwords, the words most relevant for summarizing a document are not just those common in the document, but those that are markedly more frequent in the document than across the entire corpus. Hence, finer measures for scoring words for summarization are Term Frequency - Inverse Document Frequency (TF-IDF) and log-likelihood ratio. Another algorithm for unsupervised content selection is to compute the distance between pairs of sentences based on their terms and select the sentences that are most central, as they are close to many other sentences in the text. More details on these and other approaches to determine sentence saliency can be found in [23].

A problem with using word scores in a sentence to rank sentences is that it may result in selecting several sentences that overlap in content. To counter this redundancy, a popular, more context-aware way of scoring sentences is Maximal Marginal Relevance (MMR). MMR seeks to maximize a linear combination of the relevance of a passage to a query and the novelty of the passage, where novelty is determined by low similarity to the passages already selected [10].

Supervised content selection

In supervised sentence selection, a machine learning model is trained on a set of documents that are labeled for summarization by humans, and the performance of the system can be evaluated by comparing its output to the human gold-standard summaries on the testing set. Since the summarizer of Kupiec et al. [26], which used a supervised classifier to determine relevance of sentences, many summarizers have followed a supervised approach.
Unlike unsupervised content selection, this requires additional effort in annotation. Usually, the training part of the corpus needs to have sentences annotated with a binary label of informativeness for summarization. In the case where the corpus contains natural human-written abstracts, e.g., a set of conference proceedings, annotation can be used to map sentences from the abstract to sentences in the text. The classifier uses various features extracted from each sentence, for instance sentence length, position, cue phrases, and word informativeness with respect to the topic signature [23]. The classifier then classifies sentences in unseen documents as relevant or not, and the relevant sentences are concatenated into an extractive summary. Note that this approach doesn’t reduce redundancy between the selected sentences, so a post-processing step is needed.

2.3  Evaluation of summarization

For evaluating an automatic summary, one can use extrinsic or intrinsic measures. An extrinsic evaluation would measure how good a summary is for a specific user task, for example whether reading the summary of a document returned as a search result helps the user determine whether it matches their information need. Intrinsic metrics are more task-independent, and because our interest is in creating general as opposed to query-driven summaries, we consider task-independent metrics in the evaluation. A common intrinsic way to evaluate summarization performance is to measure how much the output summary of the automatic summarizer matches the content of one or several human-created gold-standard summaries for the text.

Table 2.2: Confusion matrix

                     actual positive   actual negative
predicted positive   TP                FP
predicted negative   FN                TN

2.3.1  General classification metrics

Statistical measures like precision, recall, and specificity are commonly used to measure the performance of two-class classifiers given a ground truth.
They can be computed from the counts listed in the confusion matrix (Table 2.2). For the counts, each datum is assigned to one of four categories, depending on whether it is positive or negative in the ground truth, and whether the classifier predicted it as positive or negative.

Precision

Precision, also called positive predictive value, is the fraction of the data predicted positive that are actually positive:

\[ \text{precision} = \frac{TP}{TP + FP} \tag{2.1} \]

Recall

Recall measures, out of the actual positives, how many were correctly predicted as positive:

\[ \text{recall} = \frac{TP}{TP + FN} \tag{2.2} \]

Specificity

Specificity measures, out of the actual negatives, how many were correctly predicted as negative:

\[ \text{specificity} = \frac{TN}{TN + FP} \tag{2.3} \]

Area under the Receiver Operating Characteristic (auROC)

Receiver Operating Characteristic (ROC) graphs are plots of recall on the y-axis versus 1 - specificity on the x-axis. A classifier that only outputs the positive or negative class on the data corresponds to a single point in ROC space. Probabilistic classifiers, however, output a probability or score. To assign a positive or negative label to a point, its score is compared to a given threshold. Hence, for such classifiers, the values of precision, recall, and specificity depend on the threshold chosen. When this threshold is varied from 0 to 1, several points are obtained in ROC space for one classifier. To compare classifiers, we can compare the shape of a curve through the set of ROC points of each classifier, or compute the area under the ROC curve (auROC) to obtain a score for each classifier that does not depend on the choice of threshold. We compute the auROC as a metric in comparing classifiers following the procedure in [17].

2.3.2  Metrics specific to summarization

Recall-Oriented Understudy for Gisting Evaluation (ROUGE)

Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a commonly used intrinsic evaluation metric for summarization.
It was introduced by Lin and Hovy [28], and inspired by BLEU, an evaluation procedure embraced by the machine translation community and based on n-gram overlap between the candidate summary to evaluate and reference gold-standard summaries. BLEU averages the precision of n-grams of varying lengths. The BLEU n-gram precision is given below, where Count_clip(ngram) refers to the maximum number of co-occurring n-grams between the reference and candidate summaries:

\[ p_n = \frac{\sum_{S \in \text{CandidateSummaries}} \sum_{\text{ngram} \in S} \text{Count}_{\text{clip}}(\text{ngram})}{\sum_{S \in \text{CandidateSummaries}} \sum_{\text{ngram} \in S} \text{Count}(\text{ngram})} \tag{2.4} \]

On the other hand, ROUGE fixes the length n of the n-grams. For example, ROUGE-2 involves counting the bigrams that match between the reference and candidate:

\[ \text{ROUGE-2} = \frac{\sum_{S \in \text{ReferenceSummaries}} \sum_{\text{bigram} \in S} \text{Count}_{\text{match}}(\text{bigram})}{\sum_{S \in \text{ReferenceSummaries}} \sum_{\text{bigram} \in S} \text{Count}(\text{bigram})} \tag{2.5} \]

Note that ROUGE measures recall of the n-grams in the reference summaries, whereas BLEU measures the precision of the n-grams reported in the candidate summaries. In the machine translation community BLEU is generally accepted, though for extractive summarization recall is more relevant, i.e., how much information is present in the summary. The field of summarization has the added problem that summaries are quite subjective; summaries created by different people can have little overlap in sentences extracted from the text. Hence, the ROUGE score can be heavily dependent on the sentences selected in the reference.

Pyramid method

Ani Nenkova [3] identifies several problems with using a single human summary and a metric such as ROUGE:

1. different people can choose very different sentences for an extractive summary, hence using a single gold standard summary is limiting
2. different sentences in the text can be semantically equivalent, and this overlap should be considered when scoring

As a solution, Ani Nenkova [3] proposed the pyramid method for scoring, which assumes several human-generated summaries for each document.
The gold standard is composed from a set of human summaries that are merged as a pyramid of Summary Content Units (SCUs). Each summary content unit is a unit of meaning with a weight given by how many of the human summaries it matches. Note that the annotators need not pick specific sentences for an extractive reference summary since summaries are compared in semantic terms. The pyramid method is a manual evaluation method, where humans label the SCUs in both the reference summaries and the candidate summaries. Ani Nenkova [3] investigated how the number of summaries used affects the score on a set of documents from the Document Understanding Conference 2003 and found that five or more reference summaries are needed for the score of the candidate to be independent of the set of reference summaries chosen.

Weighted recall

Weighted recall is a metric used by Murray [35] for the AMI corpus, where multiple annotators link DAs in a meeting to sentences in a human-written abstractive summary for the meeting, and there can be a many-to-many mapping between DAs and abstract sentences.

\[ \text{weighted recall} = \frac{\sum_{i=1}^{M} \sum_{j=1}^{N} L(s_i, a_j)}{\sum_{i=1}^{O} \sum_{j=1}^{N} L(s_i, a_j)} \tag{2.6} \]

Here, L(s_i, a_j) is the number of links for a sentence s_i in an automatic extractive summary according to annotator a_j, M is the total number of DAs in the automatic summary, N is the number of annotators, and O is the total number of DAs in the meeting.

2.4  Summarization of conversational data

2.4.1  Nature and challenges of conversations

We use conversation to refer to an interaction between several participants who take turns making statements. For our purposes, we process conversational data as text rather than as audio input. Both the structure of the conversation and the data available can vary with the modality.
For instance, a spoken conversation can include the duration of each word, an email can contain a subject line and direct quotation of another participant, and a Twitter discussion can contain tags or direct replies. Also, depending on the intended audience, the summary may be required to have a specific format: a terse statement of the main issue, a report of how a decision was reached, or a rating of the positive or negative sentiment or of the different opinions expressed about an issue of interest being queried.

There are several differences between conversations and monologue that are relevant to summarization. For one, conversational data is less formal and less coherent, especially in spoken discussions, which often include pauses, disfluencies, ungrammatical sentences, and, if transcribed by Automatic Speech Recognition (ASR), high error rates. The structure and turn-taking of conversations can be very different from text documents, and the informativeness of sentences and utterances can differ between participants. To determine topics, structure, and opinions in conversations, some extensions of traditional text techniques are required, and the same is true of summarization.

A further obstacle to traditional supervised summarization is that large, openly available corpora are not as common for human conversations as they are for more traditional document types. For instance, in the news domain, the Wall Street Journal (WSJ) corpus has been widely used in research. In this section, we review previous work in summarizing conversations, particularly previous research in summarizing meetings and emails, as it is relevant to this thesis. We also note that although most conversational summarizers are specific to one domain, our aim is to summarize conversations in multiple modalities.

2.4.2  Meetings summarization

An early summarizer for meetings is the meeting browser of Finke et al. [18]. The browser interface presents the n top sentences from a meeting.
These are selected by the system through an adapted MMR algorithm that scores utterances based on lexical features and iteratively adds the most informative utterances to the summary set. Their system was evaluated extrinsically through a small-scale user study. After reading a short summary of a conversation transcript from the Switchboard telephone conversation corpus, users could correctly categorize the conversation into a topic with 92.8% accuracy. In a related task of answering key questions about the information discussed, the study found that user performance depended on the size of the summary, with longer summaries being more helpful. The meeting browser summarizer, inspired by text summarization techniques, gives some evidence for the usefulness of extractive meeting summaries.

Zechner [50] implemented DIASUMM, a speech summarizer, and applied it to several different genres, e.g., phone dialogue, news shows, and project group meetings. The algorithm is also based on MMR with TF-IDF as term weights, and in addition adds speech-specific components to identify disfluencies, sentence boundaries, and question-answer pairs. To evaluate the system’s performance, the average accuracy score for the summary produced was computed by comparing the system output with a gold standard from several human annotators. When run alongside two baselines, one purely based on MMR and the other that simply extracted the beginning of each of the segments in the conversation, DIASUMM significantly improved over the two on the telephone calls and group meetings genres but not on news. This system is hence more appropriate to the specifics of informal conversations.

Murray et al. [36] compared unsupervised approaches to summarization that use MMR and Latent Semantic Analysis (LSA) with supervised approaches using lexical features and prosodic features.
In their evaluation, they found that LSA and MMR outperformed the feature-based approaches according to the ROUGE-1 and ROUGE-2 metrics. However, in an extrinsic evaluation of meeting summaries comparing summarizers that use prosodic features with purely text-based techniques, Murray et al. [37] found that humans prefer summaries produced by the approach that includes prosodic features, motivating conversation-specific summarization research.

2.4.3  Emails summarization

Email summarization is also of practical importance, and most research efforts have been extractive. Some work has focused on summarizing single email messages, like that of Muresan et al. [32], who used supervised learning to label the salience of sentences in a message by extracting noun phrases. Relevant to this thesis is the summarization of email threads, which can be viewed as multi-party conversations.

For thread summarization, Rambow et al. [42] used three types of features for sentences: basic features like sentence length that consider the thread as a single document, thread-based features like the position of a message in the thread and of sentences in the message, and email conversation-specific features like number of recipients. They found that using the email-specific and thread-specific features in addition to the basic features improves the precision, recall, and F-measure of classifying sentences into relevant or not relevant. However, the gold standard they used in computing these metrics also had an effect: the improvement varies depending on which annotator’s labeling is used in the scoring, and combining annotations from two labelers actually hurt performance.

Lam’s system [27] uses a hybrid of single and multi-document summarization to create a more useful summary of incoming emails by also summarizing the preceding messages in the same conversational thread.
As their target domain is corporate email, the system also extracts entities from the text representing names and dates. An extrinsic evaluation of the system used a user study to investigate the usefulness of the summaries in tasks such as an individual’s mailbox cleanup, triage, and calendaring. They concluded that the users did not find the entity extraction useful, and that presenting context from the previous messages improved user satisfaction. However, when including context, the length of the summary increased with the length of the thread, and users sometimes reported the size of the summary to be unmanageable.

Email conversations are less amenable to summarization systems developed for documents, since they are less formal, less cohesive, and differ in structure. For summarizing email conversations, Carenini et al. [11] reconstruct the logical structure of a thread through a fragment quotation graph, where nodes are message fragments and directed edges are identified from explicit quotation or implicit reference between fragments. A clue word score is computed for each sentence, where clue words are words repeated between neighboring fragments in the graph. The top sentences are extracted in the summary, following the intuition that in email conversations, references between fragments and local lexical cohesion are informative for summarization. When comparing the clue word score-based summarization algorithm to a centroid-based multi-document summarization system that considers the global rather than local importance of sentences, for a summary length of 15% of the input, the clue word score improves the mean precision, recall, and F-measure (a metric combining precision and recall).

Wan and McKeown [47] also created a summarizer for ongoing email discussion.
They made a number of assumptions about the input to their system: that the discussion is about making a decision, that the thread is focused on one main task and does not veer off into other issues, and that the main issue is present in the first message of the thread. Their work is hence focused on determining the main issue in a thread through Singular Value Decomposition (SVD) and word vector techniques. The replies to the first message are used to compute a comparison vector, and then the sentence in the initial message closest in cosine similarity to the vector is selected as the main issue sentence. Following the main sentence in the initial email, the first sentence in each of the replies is added to the summary. In an evaluation of whether the issue identified matches a gold standard, they showed that the issue found by the summarizer beats the precision and recall of a baseline of selecting the first sentence of the initial message. However, Wan and McKeown’s system is limited in its applicability to non-decision-making email, and limited by its assumptions.

2.4.4  Cross-modality summarization

Murray and Carenini [33] pave the way to cross-domain conversation summarization by considering a common view of both meetings and emails as conversations composed of turns between multiple participants. For the extractive supervised summarizer ConverSumm, Murray and Carenini represent both types of data with a set of general conversational features for the purpose of supervised summarization. The features for each sentence take into account the specificity of the terms used to the current turn and participant, the length and position of the sentence in the turn, and the context of the conversation before and after the sentence. Without making any assumptions specific to emails or meetings, they achieve performance competitive with modality-specific systems.
When ConverSumm was applied to meetings, its auROC and weighted F-measure scores were not significantly different from those of a system that uses prosodic and meeting-specific features [38]. On emails, ConverSumm reaches an auROC of 0.75, significantly better than the 0.64 of the Rambow system, though pyramid precision scores are not statistically different according to a paired t-test.

2.5  Conclusion

Summarizing conversations poses more challenges than general text summarization. Although less well studied than for monologue, summarization approaches specific to one type of conversation have been researched, and most have been extractive. However, for portability across types of discussion and to new modalities of conversations that are arising on the web, domain-independent summarization is an important avenue of research.

Our approach to summarization for the rest of the thesis is to create an extractive summary of a single document representing a human conversation. We take the supervised approach: a conversation is divided into sentences, each of which has a set of extracted features and an output label indicating its relevance to a summary, and machine learning techniques are used to train on the data and select sentences for inclusion in the summary. To evaluate the summaries generated, we compute the auROC of the classifier output versus the gold-standard annotation by humans.

Chapter 3

Domain Adaptation Background

In this chapter, we survey previous work in domain adaptation. We first describe the domain adaptation problem and its supervised versus unsupervised and semi-supervised variants. We then present specific domain adaptation approaches and discuss their performance in various application domains. Our specific implementation of domain adaptation methods for our summarization task is described in Chapter 4.
3.1  Domain adaptation problem

Domain adaptation is necessary when the data available for training in a target domain is not sufficient for satisfactory performance, but there is plenty of data from a source domain, and the source and target domains have related but different distributions. The goal of domain adaptation is to integrate the available out-of-domain data with some target domain-specific information, whether it be labeled or unlabeled target data. Through domain adaptation, we hope to overcome the difference in distribution between the two domains to improve on training directly on the available training data. We define the different types of data following the notation in [22]:

• labeled source data $D_{s,l} = \{(x_i^{s,l}, y_i^{s,l})\}_{i=1}^{N_{s,l}}$
• labeled target data $D_{t,l} = \{(x_i^{t,l}, y_i^{t,l})\}_{i=1}^{N_{t,l}}$
• unlabeled target data $D_{t,u} = \{x_i^{t,u}\}_{i=1}^{N_{t,u}}$
• unlabeled source data $D_{s,u} = \{x_i^{s,u}\}_{i=1}^{N_{s,u}}$

Depending on the labeled and unlabeled data available, there are three different domain adaptation scenarios:

• supervised case: $D_{s,l}$ and $D_{t,l}$ available
• unsupervised case: $D_{t,u}$ and $D_{s,l}$ available, also possibly $D_{s,u}$
• semi-supervised case: $D_{t,l}$, $D_{t,u}$, and $D_{s,l}$ available, also possibly $D_{s,u}$

Initial domain adaptation methods were proposed for the supervised case, and many have been successful. The unsupervised scenario is common since new domains often have only unlabeled data available, but it is more difficult to obtain good performance without labeled target data. When a small amount of data is labeled in the target domain, semi-supervised domain adaptation can be done and is possibly more effective than in the unsupervised scenario.

Domain adaptation methods can roughly be categorized into instance-based algorithms, which determine how important the labeled instances are in making predictions on the test data, and feature-based algorithms, which modify the feature space with derived features to transfer information between domains.
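The three scenarios above differ only in which of the four data sets are available. As a small illustration, the sketch below maps a set of availability flags to a scenario name. The `AvailableData` container and the `adaptation_scenario` function are hypothetical names introduced here, and treating any combination of labeled and unlabeled target data as semi-supervised is one reading of the definitions rather than a rule stated in [22].

```python
from dataclasses import dataclass


@dataclass
class AvailableData:
    """Availability flags for the four data sets (hypothetical container)."""
    source_labeled: bool             # D_{s,l}
    target_labeled: bool             # D_{t,l}
    target_unlabeled: bool           # D_{t,u}
    source_unlabeled: bool = False   # D_{s,u}, optional in every scenario


def adaptation_scenario(data: AvailableData) -> str:
    """Return the scenario name implied by the available data sets."""
    if not data.source_labeled:
        # All three scenarios in the text assume labeled source data.
        raise ValueError("labeled source data is required")
    if data.target_labeled and data.target_unlabeled:
        return "semi-supervised"
    if data.target_labeled:
        return "supervised"
    if data.target_unlabeled:
        return "unsupervised"
    raise ValueError("no target data available")
```

For example, labeled meetings together with only unlabeled email threads would fall under the unsupervised case.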
3.2  Supervised methods

Many domain adaptation methods have been proposed for the supervised case, where an amount of labeled data in the source domain is used with a usually smaller amount of labeled data in the target domain. The effectiveness of supervised domain adaptation methods can vary with the similarity between the two domains, the task and features used to represent the data, and the relative amounts of source and target labeled data.

A baseline for the supervised approach is not to perform domain adaptation at all and train only on the in-domain data from the target domain. We will call this baseline indomain. When there is enough labeled data from the same distribution as the target data to give good performance on the task at hand, training in-domain is hard to beat by using out-of-domain data. For the supervised scenario, this thesis will explore whether domain adaptation methods leveraging the source labeled data can improve over training in-domain.

3.2.1  Instance weighting

Jiang and Zhai’s instance weighting framework [21] is an instance-based domain adaptation approach which integrates three intuitions: that misleading labeled source instances should be removed, that target data should be weighted more than source data, and that target unlabeled data can be labeled with a classifier trained on the source and added to the labeled training set. The domain adaptation problem is modeled as a complex objective function with multiple parts, in order to optimize the cost assigned to each different instance (refer to [21] for the formula). The objective contains several hyperparameters for trading off the contribution of the different data, and these are set heuristically.
The instance weighting framework was applied to different tasks [21]:

• Part of Speech (POS) tagging with the source domain as WSJ from the Penn Treebank, and biomedical as the target, specifically the Oncology section of PennBioIE
• Named Entity Type classification with the source domain as news and target as blogs and telephone conversations (note that this is the same data as used by Daume III and Marcu [15])
• personalized spam filtering using the KDD 2006 challenge email data set with general training emails as the source and specific users’ mailboxes as the target

To implement the first intuition of removing misleading labeled source instances, in [21] a classifier is trained on the labeled target and tested on the source. Instances in the source that are misclassified compared to their true label are ranked in increasing order of their prediction confidence in order to discard the top k from the training set. To test this strategy, one experiment varied the number k of misclassified source instances to be removed, but without including any labeled target data. The maximum k, which removes all the misclassified instances, yielded the largest improvement, except in the Named Entity Type classification on weblogs, where it actually decreased accuracy (refer to Table 1 in [21]).

For the second intuition of setting the relative contribution of data from the two domains, Jiang and Zhai [21] weight the target data a times more than the source data, and report accuracy results for several different values of a. In almost all their results, a = 5 yields accuracies one percentage point higher than a = 1, and the marginal improvements from increasing the contribution of the target beyond a = 5 are small (refer to Table 2 in [21]).
For the third intuition of self-training to assign labels to some of the unlabeled target data and add it to the training set, experiments by Jiang and Zhai [21] found that self-training improves accuracy when no labeled target data is used, but not when it is used. They also tried assigning greater weight to the just-labeled target instances than to the labeled source instances, although this was found ineffective. The algorithm used is also highly parameterized and requires tuning. We did not implement instance weighting, in part because [8] claims that instance weighting works best for feature spaces that have few, continuous dimensions, whereas we deal with high-dimensional feature spaces, and most of the features we use are discrete.

3.2.2  Adding prediction as an input feature

A simple feature-based domain adaptation method called pred is to train a predictor on the source data, run it on the target data, and then use its predictions on each instance as additional features for a target-trained model. An initial successful use of this technique was introduced in Florian et al. [19], who used outputs of classifiers among the input features to a multilingual named entity detection and tracking system. With most training data from English, and little in-domain data from Arabic and Chinese, their implementation performed well for all three languages when compared to others in the Automatic Content Extraction challenge. A similar approach was used by [7] to augment labeled target data for sentiment analysis with predictions based on an unsupervised Structural Correspondence Learning classifier.

We apply pred for our task since it is a natural way of injecting information into the target by simply adding a feature to the target data. We also expect it to do at least as well as training in-domain, although when one feature is added to data with fewer training instances than features, the extra feature may not significantly influence the learned model.
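To make the pred procedure concrete, the following is a minimal sketch on toy data. It assumes a simple centroid-based linear scorer as a stand-in for whatever classifier is actually used; the helper names `train_centroid_scorer` and `add_pred_feature` and the toy feature vectors are hypothetical and introduced only for illustration.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def _mean(rows: Sequence[Vector], d: int) -> float:
    """Mean of dimension d over a non-empty list of vectors."""
    return sum(r[d] for r in rows) / len(rows)


def train_centroid_scorer(X: Sequence[Vector], y: Sequence[int]) -> Callable[[Vector], float]:
    """Toy predictor: linear score along the direction between class centroids.

    The score is positive when x falls on the positive-centroid side of the
    midpoint between the centroids. Both classes must be present in y.
    """
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    dims = range(len(X[0]))
    w = [_mean(pos, d) - _mean(neg, d) for d in dims]
    b = -0.5 * sum(w[d] * (_mean(pos, d) + _mean(neg, d)) for d in dims)
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b


def add_pred_feature(X: Sequence[Vector], source_scorer: Callable[[Vector], float]) -> List[Vector]:
    """Append the source model's prediction to each target instance."""
    return [list(x) + [source_scorer(x)] for x in X]


# Toy source (e.g., meetings) and target (e.g., emails) data, assumed for illustration.
X_src = [[1.0, 0.0], [0.9, 0.2], [0.1, 1.0], [0.0, 0.8]]
y_src = [1, 1, 0, 0]
X_tgt = [[0.8, 0.1], [0.2, 0.9]]
y_tgt = [1, 0]

source_scorer = train_centroid_scorer(X_src, y_src)      # step 1: train on source
X_tgt_aug = add_pred_feature(X_tgt, source_scorer)       # step 2: predict on target
target_scorer = train_centroid_scorer(X_tgt_aug, y_tgt)  # step 3: train on augmented target
```

At test time, an unseen target instance must be augmented with `add_pred_feature` in the same way before it is passed to `target_scorer`.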
3.2.3  Using source data for a prior

Another relatively old approach is to use the source data as a prior for the model learned on the target data. Algorithms integrating such priors have been helpful in adaptation for language modeling and parsing ([4], [43]). To illustrate a specific application using a prior, in [12] the source data is used to find optimal parameter values of a maximum entropy model on that data. These values are then set as means for the Gaussian prior of the model that is trained on the target data.

The task of Chelba and Acero [12] is automatic capitalization, which they treat as a sequence labeling problem and for which they implement a Maximum Entropy Markov Model (MEMM) capitalizer. The MEMM is originally trained on out-of-domain Wall Street Journal (WSJ) data. The two target data sets are Broadcast News data, one from CNN and one from ABC. Table 3.1 lists results, gleaned from [12], comparing the unadapted MEMM with domain adaptation using labeled data from either the target news domain or the other news domain. There is an improvement in capitalization when using a prior for domain adaptation versus simply using a WSJ-trained model with no adaptation, for a relative reduction in capitalization error of 20% to 25% when using the model adapted on the proper in-domain training set.

Table 3.1: Capitalization error by MEMM with and without in-domain training data

Target training data \ Test data   ABC   CNN
none                               1.8   2.2
ABC                                1.4   1.7
CNN                                2.4   1.4

Table 3.2: Results for adaptation with MEGAM (% accuracy)

         MType   MTag   Recap(ABC)   Recap(CNN)   average
merge    84.9    80.9   96.4         95           89.3
Prior    87.9    85.1   97.9         95.9         91.7
MEGAM    92.1    88.2   98.1         96.8         93.9

3.2.4  Maximum entropy genre adaptation model

Daume’s MEGAM model is more theoretically founded than Chelba’s Prior model. Similar to the prior method, Daume’s MEGAM model [15] trains a discriminative MEMM.
It generalizes the maximum entropy model into a Maximum Entropy Genre Adaptation Model (MEGAM), adding indicator variables for whether an instance is generated by a source, target, or general distribution as hyperparameters in the model. When trained on labeled data from the source and target, MEGAM learns through the hyperparameters which instances to consider general and which domain-specific. Unlike prior, MEGAM treats the source and target domain symmetrically. The optimal values of the model parameters are learned through conditional expectation maximization, an algorithm with several iterations of expectation and maximization steps. Then the trained model is applied on the testing data, using the mixture of the general and target models learned. Because each iteration involves maximum entropy optimization to learn the parameters, the model is expensive to train.

Daume tested MEGAM on three tasks (see [15]): Mention Type classification (MType), Mention Tagging (MTag), and the text recapitalization task (Recap) to which Chelba and Acero had applied their prior model. In Table 3.2, we reproduce accuracy results obtained in [15] to compare domain adaptation with MEGAM, prior, and the baseline merge of simply combining the source and target data sets. MEGAM outperforms prior and the baseline on all three tasks. One drawback of using MEGAM for domain adaptation is that the model is more complicated and time-intensive than competing domain adaptation methods like easyadapt.

3.2.5  Feature copying method (easyadapt)

A simpler method, easyadapt, which achieves performance similar to prior and MEGAM, was successfully applied to a variety of Natural Language Processing (NLP) sequence labeling problems, such as named entity recognition, shallow parsing, and POS tagging. easyadapt augments the feature space by making three versions of each original feature: a source-specific version, a target-specific version, and a general version.
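The feature augmentation just described can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `easyadapt` and the layout of the augmented vector as three consecutive blocks in the order (general, source, target) are assumptions made here for the example.

```python
from typing import List


def easyadapt(x: List[float], domain: str) -> List[float]:
    """Triplicate the feature vector x into (general, source, target) blocks.

    The block belonging to the other domain is filled with zeros.
    """
    zeros = [0.0] * len(x)
    if domain == "source":
        return x + x + zeros      # general = x, source = x, target = 0
    if domain == "target":
        return x + zeros + x      # general = x, source = 0, target = x
    raise ValueError("domain must be 'source' or 'target'")


# A two-feature instance becomes a six-feature instance.
print(easyadapt([1.0, 2.0], "source"))  # [1.0, 2.0, 1.0, 2.0, 0.0, 0.0]
print(easyadapt([1.0, 2.0], "target"))  # [1.0, 2.0, 0.0, 0.0, 1.0, 2.0]
```

Any standard classifier can then be trained on the union of the augmented source and target instances; the shared general block lets it learn weights common to both domains, while the domain-specific blocks capture where they differ.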
For instances from the source data, the original feature values are copied into the source and general versions of the features, whereas the target versions are set to 0; and conversely for the target data. The mapping function Φ(x) from the original features to the easyadapt feature space is the following, where the three components are the general, source-specific, and target-specific versions, in that order:

\[ \Phi(x) = \begin{cases} \langle x, x, 0 \rangle & \text{if } x \in \text{source} \\ \langle x, 0, x \rangle & \text{if } x \in \text{target} \end{cases} \]

Daume found easyadapt to be much easier to train than MEGAM and prior while achieving similar performance [16]. An interesting conclusion of these experiments is that this method works well when the domains can be distinguished by the examined features, and less well for very similar domains, e.g., between subdomains of the Brown Corpus. In applications of domain adaptation, easyadapt works best when the labeled data available in the target is sufficient on its own to learn a good model [8], and hence we expect it to work well in the supervised scenario with labeled data in both the source and the target.

3.3  Unsupervised and semi-supervised methods

3.3.1  Baseline transfer

We use transfer to denote the baseline of training only on the labeled source data, without using any information from the target data. Transfer learning works best when the target domain is very similar to the source domain. Past research has shown a linear correlation between the loss from transfer and the degree of difference between the two domains [8].

3.3.2  Self-training, co-training, and boosting methods

Boosting

Boosting is a method for taking classifiers that perform weakly but better than random guessing, and finding a way to combine them that will result in a strong classifier. The boosting algorithm AdaBoost [20], which iteratively adapts the learner to improve performance on hard instances in the training set, has been shown to be effective, though it is dependent on the input weak learner and the training data.
TrAdaBoost [13] is an adaptation of AdaBoost to transfer learning: given a small amount of data with the same distribution as the target and a large amount of different-distribution data, it iteratively sets the contributions of the training data so as to give higher weights to data whose distribution is similar to the target. In this thesis, we use separate classifiers and combine their predictions directly on the test set. For future work, iterative optimization of the learner over training instances seems promising for domain adaptation.

Co-training

Co-training was first introduced in NLP by [9] and applied to classifying web pages into categories based on the text of the page and on the anchor text description of links pointing to the web page. When the data can be represented in this way by two different views, or sets of features, and the two views are conditionally independent of each other given the class, each view can be used to train a weak classifier on a small labeled set of data. Each classifier then bootstraps the other by labeling the points it is most confident about and giving them as labeled input to the other classifier. To effectively leverage the unlabeled data, co-training assumes consistency between the two views, so that for any given instance the two classifiers based on different feature sets would agree on the output label. For their experiments, Blum and Mitchell [9] use a set of web pages from computer science departments, hand-label them into categories, and analyze the prediction of only one of the target labels, the one corresponding to the homepage of a course. They compare co-training on a small number of labeled instances and a larger number of unlabeled instances with the baseline of supervised training on the small labeled set.

Figure 3.1: Co-training pseudocode

The two classifiers are Naive Bayes classifiers, one based on the bag-of-words representation of the words on the page, and the other based on the words in the anchor text.
Once co-trained, the two classifiers are evaluated individually on the test set, and their combination is also evaluated by multiplying the probabilistic predictions of each one. For this limited experiment, the authors find that the co-trained classifiers beat their supervised counterparts, and that co-training reduces the error rate of the combined classifier from 11.1% to 5%. Figure 3.1 shows the co-training algorithm as it appears in [1]. Variations are possible, and one of interest to us is that Blum and Mitchell [9] found that making predictions on a pool P, a small subset of the unlabeled data U, gave better accuracy than making predictions on the full set U.

Nigam and Ghani [41] compare co-training with the Expectation-Maximization algorithm (EM), listing the assumptions that each of the two algorithms makes on the data, and how robust each is to violations of these assumptions. Different variations of the algorithms are compared:

• Co-training, which uses two classifiers, one per feature set, and incrementally adds the unlabeled instance with the highest-confidence prediction to the labeled set before retraining the classifiers on the augmented set.
• EM, which doesn’t use a feature split and iteratively makes predictions on the entire unlabeled set to use them in training a model for the next iteration.
• Self-training, which is incremental like co-training, though without splitting the feature set.
• Co-EM, which is a variant of EM that uses two classifiers, each trained on a separate set of features.

Nigam and Ghani [41] concluded that co-training works best when the feature space can be split into two subsets of features that are independent given the class, and that when this is not the case, self-training is more appropriate. Even when the split is not known a priori, manufacturing a split of the features is beneficial, though less so than a natural split of the features into two sets.
The authors also argue in favour of co-training and self-training that, because of their incremental approach, they are less prone than EM to getting stuck in local minima.

Self-training

Self-training is similar to co-training, and simpler since a single classifier is used. The most cited self-training algorithm is the one due to [49]. Figure 3.2 lists pseudocode for this algorithm as it appears in [1]. This algorithm is iterative: on a set of labeled data L_0, a model c is trained and its predictions on unlabeled data are added to the labeled set L before repeating the process. The stopping criterion can vary; for instance, the algorithm can be run for a fixed number of iterations or until convergence. Similarly, the number of hard labels created at each iteration can be fixed, can include a set proportion of labels of each class, or can depend on the confidence of the predictions. For the version of self-training we implemented, refer to Section 4.5.

Figure 3.2: Self-training pseudocode

In [30] and [31], McClosky applies self-training to parsing in a novel way, since previous efforts found self-training to be unsuccessful for parsing. Their goal is to train a parser for the target domain of Brown using only unlabeled target data and labeled (WSJ) and unlabeled (NANC) data from the source domain of news articles. In an algorithm using the two techniques of parse reranking and self-training, they improved the performance of the standard Charniak parser trained on Wall Street Journal data and tested on Brown corpus data. Instead of just using the parser output to add to the labeled set for training the next iteration of self-training, the reranker reorders the candidate parses produced by the parser in order to pick the set of best sentences to add. Their major finding is that a parser self-trained on the WSJ and the NANC data, without training on Brown, performs almost as well as a Brown-only trained model.
They found that it was helpful to weight the original source labeled data more than the self-trained labeled data in learning. In their analysis to determine when self-training worked best, they compared self-trained parsed sentences to optimal parses, and found that the predictions were better on medium-length sentences than on short or long sentences. They also concluded that the reranking is responsible for most of the improvement of their algorithm.

3.3.3  Structural correspondence learning (SCL)

Structural Correspondence Learning (SCL) is a feature-based approach to leveraging unlabeled data for domain adaptation, which is theoretically sound [8] and has been successfully applied to several NLP problems. SCL is a complex algorithm, and as we have found its subtleties to be important in our application, we will discuss it in detail. It is inspired by Alternating Structure Optimization (ASO), a semi-supervised learning algorithm due to Ando and Zhang [2], and extends it to learning across domains. The problem that Structural Correspondence Learning aims to solve is that many of the features that are useful for supervised learning in the source domain may be expressed differently in the target domain, and may therefore be misleading to a classifier for the target. SCL learns a correspondence θ and then applies it to the labeled source data to obtain a feature representation more effective for learning across domains than the original features. One important assumption of SCL is that of so-called pivot features: features that are frequent, are expressed similarly in the two domains, and can be correlated with the remaining, more domain-dependent features through unlabeled data. The pivots hence bring together the feature spaces of the source and target. The Structural Correspondence Learning algorithm is most effective in a sparse, high-dimensional feature space with plenty of unlabeled data for detecting correlations.
The steps of the Structural Correspondence Learning algorithm are outlined in Algorithm 1. The first step, the choice of m pivot features, is very important, since the pivot features must both be predictive of the label and correspond with features that are similar between the two domains. In the next step, for each pivot feature, a model is trained to predict it from the rest of the features using unlabeled data from the two domains. The weight vectors learned in each of the models are concatenated into a correlation matrix W. The top k left singular vectors of the SVD of W (k is written h in Algorithm 1) yield the projection matrix θ. θ can then be applied to the source and target data to project them into a lower-dimensional feature space. The new representation of the training data, made up of the original feature values x and the SCL representation θx, is then used in supervised learning. We note that these SCL features derived from unlabeled data can be integrated with other domain adaptation methods.

Algorithm 1
Inputs: labeled source data $\{(x_t, y_t)\}_{t=1}^{T}$, unlabeled data from both domains $\{x_j\}$
1. Choose m pivot features.
2. Create m binary prediction problems $p_l(x)$, $l = 1 \ldots m$
   for l = 1 to m do
      $\hat{w}_l = \arg\min_w \left( \sum_j L(w \cdot x_j, p_l(x_j)) + \lambda \|w\|^2 \right)$
   end for
3. $W = [\hat{w}_1 | \ldots | \hat{w}_m]$, $[U D V^T] = \mathrm{SVD}(W)$, $\theta = U_{[1:h,:]}^T$
4. Return f, a predictor trained on $\left\{ \left( \begin{bmatrix} x_t \\ \theta x_t \end{bmatrix}, y_t \right) \right\}_{t=1}^{T}$

Structural Correspondence Learning has been effective for domain adaptation in several applications. Blitzer et al. [6] used SCL in adapting a POS tagger from WSJ financial news, with a large amount of data (40,000 labeled and 100,000 unlabeled sentences), to biomedical MEDLINE abstracts, using 200,000 unlabeled target instances.

Table 3.3: POS tagging results with SCL
model                 accuracy
supervised baseline   87.9
ASO                   88.4
SCL                   88.9

Results that were obtained in Blitzer et al.
[6], which we list in Table 3.3, show improvements of SCL for POS tagging over a supervised baseline using only the source labeled data, and over ASO, which makes use of the unlabeled target data but not the unlabeled source data. Blitzer et al. [7] also applied SCL to sentiment analysis, to adapt from reviews of one type of product to another. Their data consists of product reviews crawled from Amazon for four domains of products: books, DVDs, electronics, and kitchen appliances. The data set contains 2000 labeled and 3000 to 6000 unlabeled reviews from each domain. A review was labeled positive if its human star rating was greater than 3 and negative if the rating was less than 3, and the set was balanced to contain similar numbers of positive and negative reviews. See the results comparing the baseline and different variants of the algorithm in Table 3.4, taken from [8]. Blitzer et al. [7] compared using feature frequency versus mutual information with the label in selecting the m pivot features. Although for part-of-speech tagging choosing the m most frequent features was sufficient, for sentiment analysis mutual information was found to reduce error. Refer to the columns scl-f, which uses frequency, and scl, which uses mutual information, in Table 3.4. Beyond the method of selecting pivots, other parameters to be determined in SCL are the number of pivot features m, the truncation factor k, which determines the number of projected SCL features, and how to combine the original and projected features. Since the projected features are derived from more domain-specific initial features, we would expect that they would be more important in domain adaptation than the original features.
And indeed, when [8] used twice as much unlabeled data to find more meaningful correspondences (refer to the last column, scl-only, in Table 3.4), they got a much smaller error than for any of the other variants of SCL. This shows that the projected features are actually useful without the large set of raw input features.

Table 3.4: Sentiment analysis results with SCL from [8]
domain \ model   base   scl-f   scl   scl+target   scl-only
books            8.9    7.4     5.8   4.4          1.8
dvd              8.9    7.8     6.1   5.3          3.8
electronics      8.3    6       5.5   4.8          1.3
kitchen          10.2   7       5.6   5.1          3.9
average          9.1    7.1     5.8   4.9          2.7

Although SCL was intended to be applied in the unsupervised scenario where there is only unlabeled training data for the target domain, injecting some information on labeling in the target can be necessary to correct an issue with the correspondences learned, namely that the projected features θx_i sometimes misalign features from the two domains. Blitzer et al. [7] proposed a modified algorithm that also uses a small number of labeled target instances to adjust the weights of the projected features in the final model. After this modification, in all cases the semi-supervised SCL with mutual information and 50 labeled target instances (column scl+target in Table 3.4) reduces the error compared to the supervised baseline and to SCL with mutual information.

3.3.4  Easyadapt with unlabeled data

As previously mentioned, Daumé’s easyadapt is a domain adaptation approach that is simple to implement as a pre-processing step and is not dependent on any particular classifier or application. However, it only uses labeled data from the source and target and not the commonly more plentiful unlabeled data. A semi-supervised extension of the feature copying method, easyadapt++, proposed in [14], uses unlabeled data to co-regularize the source and target by making the predictions of the source and target hypotheses agree on unlabeled data.
Similarly to easyadapt, easyadapt++ maps the input data by copying features, though it adds a new mapping for unlabeled data. Here is the mapping function Φ(x) from the original features:

\[ \Phi(x) = \begin{cases} \langle x, x, 0 \rangle & \text{if } x \in \text{labeled source} \\ \langle x, 0, x \rangle & \text{if } x \in \text{labeled target} \\ \langle 0, x, -x \rangle & \text{if } x \in \text{unlabeled} \end{cases} \]

In practice, [14] creates for each unlabeled instance two instances, both of which have the augmented features as in Φ(x), one with a positive label and one with a negative label. Kumar et al. [25] provide both a theoretical proof that easyadapt++ generalizes well from source to target domains and results showing improvement on sentiment analysis of product reviews. However, they offer no direct comparison to the SCL results on that data [7]. Similarly to Blitzer [8], they also use the proxy A-distance to measure how far apart the domains of the different product types are. In [25], the following error rates are compared: easyadapt, easyadapt++, transfer by training on a plentiful amount of source data, training on the same amount of target data (as a gold standard), training on a small amount of target data, and training on the combination of the small amount of target data with the source data. For adapting from DVDs to books, which are distant domains given an A-distance of 0.7616, easyadapt++ outperforms all but the gold standard. For adapting from kitchen to apparel, domains that are closer (A-distance of 0.0459), easyadapt++ outperforms all other methods. These results show promise for this method for domain adaptation when the target domain has both labeled and unlabeled data available.

3.4  Summary

The goal of domain adaptation can be summed up as bridging the gap when the training data and test data come from domains with different distributions. Many domain adaptation approaches have been proposed for the supervised case where there is labeled data in the target. We implement and test pred and easyadapt for our problem.
For the scenario with less supervision, where the data in the target is mostly unlabeled, effective domain adaptation requires more complex algorithms. For unsupervised domain adaptation, we implement and test self-training, Structural Correspondence Learning, and easyadapt++.

Chapter 4  Extractive Summarization with Domain Adaptation

In our approach to summarization, the source domain is meeting transcripts and the target domain is email threads. In this chapter, we present the data sets used, the features we extracted and the classifier used, and the different domain adaptation methods as we implemented them. For the source, we use a large set of labeled meetings data, the AMI corpus. We also have available a smaller set of labeled email data, the BC3 corpus, which we use for the target domain. In the supervised scenario, we assume that we have part of the BC3 corpus as labeled training data for summarization, along with a larger labeled set of data from AMI. In the unsupervised scenario, we restrict the labeled training data to only meetings domain sentences and try to leverage unlabeled email domain data in the form of the W3C corpus, an unlabeled superset of BC3. In the semi-supervised scenario, we use a combination of some labeled email data with unlabeled data in addition to the out-of-domain meetings data. After we describe the domain adaptation algorithms we implemented, we will present a comparison of the experimental results for each of the scenarios and with different feature sets in Chapter 5.

4.1  Data

4.1.1  AMI corpus

The AMI corpus is an artificial corpus of meetings data, and we are interested in the scenario subset of the corpus. In this portion, to generate data similar to corporate meetings, people in groups of four simulated meetings in which each was assigned a role in a company. The dataset contains approximately 115,000 dialogue act (DA) segments.
The dataset contains both manual and ASR transcripts, though we only use the manual data as it is higher in transcription accuracy. Here is a description of the annotation of AMI for summarization from previous work of our group [33]:

    For the AMI corpus, annotators wrote abstract summaries of each meeting and extracted transcript DA segments that best conveyed or supported the information in the abstracts. A many-to-many mapping between transcript DAs and sentences from the human abstract was obtained for each annotator, with three annotators assigned to each meeting. It is possible for a DA to be extracted by an annotator but not linked to the abstract, but for training our binary classifiers, we simply consider a DA to be a positive example if it is linked to a given human summary, and a negative example otherwise. This is done to maximize the likelihood that a data point labeled as “extractive” is truly an informative example for training purposes. Approximately 13% of the total DAs are ultimately labeled as positive, extractive examples.

This meetings data is a valuable set of realistic, even if not naturally occurring, multi-party conversations. It is also large enough to possibly support domain adaptation for learners for domains other than corporate meeting transcripts.

4.1.2  W3C corpus

The W3C corpus is data crawled from the World Wide Web Consortium’s mailing list (w3c.org). Among several different types of data, it contains a mailing list portion with over 50,000 email threads. This is a very sizeable set of conversational data, though the topics discussed are in the technical domain and hence a summarizer for general conversational data may not be directly applicable.

4.1.3  BC3 corpus

BC3 is a subset of 40 threads, totaling 3,222 sentences, from the W3C corpus. It was annotated for summarization in [46]. W3C emails are quite technical, and even segmenting them into sentences was a challenge and had to be done manually for BC3.
The threads to annotate were selected to be less technical than average so that they would be amenable to annotation by non-domain experts, and they were also selected to have a non-trivial conversational structure; as a result, in the BC3 corpus the average number of participants per thread is 6, and the average size of a thread is 11 emails. BC3 was labeled by humans similarly to the annotation procedure in AMI: each annotator wrote an abstractive summary of each thread, then linked sentences from the summary to sentences from the thread that correspond in content. Hence, each sentence extracted from the thread can be weighted by the number of times it is linked to an abstract sentence. BC3 sentences were also annotated for whether they are meta comments, meaning that they refer to the email conversation itself, and by their speech act: propose, request, commit, agreement/disagreement, and meeting.

4.1.4  Enron corpus

The Enron email corpus is a corpus of email data from a different source of emails than the W3C mailing list. It was released as part of the legal investigation into Enron, and it soon became a popular corpus for NLP research due to being realistic, naturally occurring data similar to email conversations within a corporation. A subset of Enron of 39 threads has also been labeled for summarization, though we chose BC3 as the target set of emails for domain adaptation.

4.2  Features

The classification of sentences into informative or non-informative is based on the values of the features extracted from each sentence. We consider two sets of features: a small set of features relating to conversational structure, and a larger set of raw lexico-syntactic features. We investigate using them separately and together.

4.2.1  Conversational features

We use a set of 24 sophisticated conversational features for both the email and meetings domains.
They were designed to be modality-independent and to model attributes of conversational structure common across domains, so we hypothesize that they will be useful in domain adaptation. In Table 4.1 we list the conversational features proposed by Murray and Carenini [33], and we refer the reader to their work for a detailed description. Using these features for in-domain extractive summarization was found to be competitive with domain-specific approaches in [33], and also useful in abstractive summarization in [34]. To derive these features from emails and meetings, as in [33] we treat both sets of data in a similar way: as a succession of turns, each of which includes statements by a particular contributor. A turn in an email thread corresponds to an email by one sender, and each turn can be subdivided into sentences. A turn in a meeting is an uninterrupted set of statements by one participant, and can be subdivided into DAs. The set of features derived for each of the sentences includes length, position in the turn and in the conversation, time span between turns, lexical similarity between sentences, and specificity of terms to different turns and different participants.

4.2.2  Lexico-syntactic features

We derive an extensive set of raw lexical and syntactic features from the AMI and BC3 data sets, and then we compute their occurrence in the Enron corpus. After removing rare features, i.e., those with fewer than 5 occurrences, a considerable set of approximately 200,000 features remains. The features derived are basic word and POS features, which were introduced in [39]. We list those same feature types in Table 4.2. For each of these feature types, we extract all those present in the corpora. Each unique element that was retained is assigned a binary feature indicating its presence or absence in a given sentence.
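To make the construction of these binary features concrete, the following is a minimal sketch covering only the three feature types that need no POS tagger (character trigrams, word bigrams, and ordered word pairs). It is not the extraction code used in this thesis: the function names, the whitespace tokenization, the lowercasing, and the choice to count a feature at most once per sentence when applying the rarity threshold are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def extract_features(sentence):
    """Return the set of lexical features present in one sentence."""
    words = sentence.lower().split()
    text = " ".join(words)
    feats = set()
    # Character trigrams: triplets of consecutive characters
    for i in range(len(text) - 2):
        feats.add(("char3", text[i:i + 3]))
    # Word bigrams: pairs of consecutive words
    for w1, w2 in zip(words, words[1:]):
        feats.add(("bigram", w1, w2))
    # Word pairs: w1 occurs anywhere before w2 in the same sentence
    for w1, w2 in combinations(words, 2):
        feats.add(("pair", w1, w2))
    return feats


def build_vocabulary(sentences, min_count=5):
    """Index every feature that occurs in at least min_count sentences."""
    counts = Counter()
    for s in sentences:
        counts.update(extract_features(s))
    kept = sorted(f for f, c in counts.items() if c >= min_count)
    return {f: i for i, f in enumerate(kept)}


def binarize(sentence, vocab):
    """Binary presence/absence vector of one sentence over the vocabulary."""
    vec = [0] * len(vocab)
    for f in extract_features(sentence):
        if f in vocab:
            vec[vocab[f]] = 1
    return vec
```

In practice a sparse representation would replace the dense list, since each sentence activates only a small fraction of the retained features.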
This set of features was successfully used in the first step, interpretation, of the abstractive summarizer of [39], in mapping sentences to relations in an ontology involving a conversation participant and an entity such as decisions, action items, and subjective sentences, and also to generally important sentences. The authors attribute the high auROC values attained (from around 0.75 to 0.93 for meetings, and a lower 0.75 for emails) in classifying sentences into any of these categories to the large feature set given to the classifiers, so we are encouraged to also try applying this feature set.

Table 4.1: Conversational features as proposed in [33]
Feature ID   Description
SLEN         word count, globally normalized
SLEN2        word count, locally normalized
TLOC         position in turn
CLOC         position in conversation
TPOS1        time from beginning of conversation to turn
TPOS2        time from turn to end of conversation
PPAU         time between current and prior turn
SPAU         time between current and next turn
DOM          participant dominance in words
BEGAUTH      is first participant (0/1)
MXS          max Speaker specificity score
MNS          mean Speaker specificity score
SMS          sum of Speaker specificity scores
MXT          max Turn specificity score
MNT          mean Turn specificity score
SMT          sum of Turn specificity scores
COS1         cosine of conversation splits, with Speaker specificity
COS2         cosine of conversation splits, with Turn specificity
CENT1        cosine of sentence and conversation, with Speaker specificity
CENT2        cosine of sentence and conversation, with Turn specificity
PENT         entropy of conversation up to sentence
SENT         entropy of conversation after the sentence
THISENT      entropy of current sentence
CWS          rough ClueWordScore

In [34], this set of features was used in subjectivity detection in email and meeting conversations.
The authors compared different types of features across four tasks: subjective utterance detection, subjective question and utterance detection, classification of positive-subjective utterances, and classification of negative-subjective utterances. They found that the set of conversational features supplemented by these lexico-syntactic features gave the best results. In some cases, such as performing the tasks on ASR transcripts of meetings and classifying positive-subjective statements in email, the conversational features alone gave similar performance to using the two sets of features merged.

Table 4.2: Lexical feature types
Feature type         generic          description
Character trigrams   c1 c2 c3         triplets of consecutive characters
Word bigrams         w1, w2           pairs of consecutive words
POS bigrams          p1, p2           consecutive POS tags
Word pairs           w1, w2           if w1 occurs before w2 in the same sentence
POS pairs            p1, p2           if POS p1 occurs before p2 in the same sentence
VINs                 p(w)1, p(w)2     word bigram w1, w2 and two word-POS pairs p1, w2 and w1, p2

4.2.3  Feature selection

Because the set of lexico-syntactic features was so large, we selected a subset of the features to train the classifiers. We found we could obtain good performance and faster training while using only the top 10,000 features scored by conditional entropy. Intuitively, the most informative features are not only the more frequent ones, but the ones that are indicative of a positive or negative label in the labeled data. The formula we used for conditional entropy is (similar to [8]):

\[ H(Y \mid x_i) = -\left( \log \frac{c(x_i, +1)}{c(x_i)} + \log \frac{c(x_i, -1)}{c(x_i)} \right) \]

In this conditional entropy expression, $c(x_i)$ is the empirical count of feature $x_i$ in the instances, and $c(x_i, +1)$ and $c(x_i, -1)$ are the joint empirical counts of feature $x_i$ with each label.

4.2.4  Combining conversational and lexical features

The choice of which set of features to use for domain adaptation is important.
The conversational features on their own were shown to be effective for summarization. The lexico-syntactic features are a large set of raw features, so combining them with the 24 more sophisticated features should be done carefully, because the conversational features may not get their due importance in the learned model amidst the noise. Therefore, in our experiments we investigate using the two sets of features separately, merged into one, as well as training separate classifiers on each feature set and combining their predictions.

4.3  Classification approach

We measure the quality of the summaries produced by having the model learned on the training data, possibly modified for domain adaptation, classify sentences from the test set, and by comparing the output labels of the classifier with the ground-truth labels assigned by humans to the test sentences. We looked for a classification method that would be effective on data with numeric-valued features, binary labels, a high number of dimensions, and a large data set. We chose to train linear classifiers over SVMs because they are fast to train and effective on high-dimensional datasets. In particular, we used the implementation of L2-regularized logistic regression found in the Liblinear toolkit. This learner solves the following optimization problem [44]:

\[ \min_w \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \log\left(1 + e^{-y_i w^T x_i}\right) \]

Depending on the domain adaptation method, we sometimes use the prediction accuracy of the classifier on test data, and other times either the labels predicted on each data point or the weights learned in the model itself.

4.4  Evaluation metrics

Given the predicted labels on a test set and the existing gold-standard labels of the test set data, as a measure of the performance achieved in classification, we compute accuracy and the area under the Receiver Operating Characteristic curve (auROC).
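As a minimal sketch of how these two metrics can be computed (not the evaluation code used in this thesis), the functions below take gold labels in {+1, −1} together with either predicted labels or classifier scores. The function names are illustrative assumptions. The auROC is computed through its equivalent pairwise form, the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as one half.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the gold label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def au_roc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Quadratic in the number of instances, which is acceptable for a
    sketch; a rank-based computation would scale better.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y != 1]
    if not pos or not neg:
        raise ValueError("auROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Four test sentences: gold labels and predicted probabilities of +1
gold = [1, 1, -1, -1]
probs = [0.9, 0.4, 0.5, 0.1]
area = au_roc(gold, probs)
```

Unlike accuracy, the auROC does not depend on a decision threshold, which matters here because informative sentences are a small minority of the data.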
The auROC is a common summary statistic used to measure the quality of binary classification, where a perfect classifier would achieve an auROC of 1.0, and a random classifier would produce a value near 0.5.

4.5  Domain adaptation methods

Here we present the algorithms that we implemented for each of the methods; for the different sets of data we use the notation established in Section 3.1. D_{s,l} is always randomly selected as 10,000 instances from AMI, and D_{t,l} is split into 5 folds, of which 4 are used for training and one for testing. The classifiers C or C_i are trained using Liblinear with the logistic regression option.

4.5.1  Supervised baseline indomain

indomain
   train C on the training folds of D_{t,l}

4.5.2  Unsupervised baseline transfer

transfer
   train C on D_{s,l}

4.5.3  Merge

merge
   train C on the training folds of D_{t,l} merged with D_{s,l}

4.5.4  Ensemble

ensemble
   1. train C_1 on D_{s,l}
   2. train C_2 on the training folds of D_{t,l}
   3. run C_1 on D_{t,l} and obtain probabilities of positive label P_1
   4. run C_2 on D_{t,l} and obtain probabilities of positive label P_2
   5. for each {x_i, y_i} in D_{t,l}, predict the label $\hat{y}_i = I((P_1(x_i) + P_2(x_i)) > 0.5)$, then test the prediction {x_i, \hat{y}_i} against the actual {x_i, y_i}

4.5.5  Pred

pred
   1. train C_1 on D_{s,l}
   2. run C_1 on D_{t,l} and obtain probabilities of positive label P_1
   3. for {x_i, y_i} in D_{t,l}, add the feature P_1(x_i) to the instance, then train C on the augmented data D_{t,l}

4.5.6  Easyadapt

easyadapt
   1. augment D_{s,l} and D_{t,l} as in Section 3.2.5
   2. train C on D_{s,l} merged with the training folds of D_{t,l}, and test on the testing fold of D_{t,l}

4.5.7  Easyadapt++

easyadapt++
   1. augment D_{s,l}, D_{t,l}, and D_{t,u} as in Section 3.3.4
   2. train C on D_{s,l} merged with D_{t,u} and the training folds of D_{t,l}, and test on the testing fold of D_{t,l}

4.5.8  Selftrain

For the selftrain algorithm, we set the size of the unlabeled pool to predict on at each iteration to U=50.
We also selected p=3 and n=17 for a ratio of summary to total sentences of 15%, which is near the ratio in AMI.

selftrain
   1. Start with a labeled training set T (labeled source data)
   2. Create a subset U of a fixed size of the unlabeled data
   3. Repeat until no more unlabeled data:
      • train a classifier on T
      • make a prediction on U: take the p highest-confidence positive predictions and the n highest-confidence negative predictions from U and add them to T
      • replenish U from the remaining unlabeled data set

4.5.9  Original SCL

Because of the similarities between Blitzer’s sentiment analysis task and our supervised extractive summarization, we hypothesized that SCL would lead to a significant improvement over the baseline. On our data, after the scalability analysis described in Section 4.5.11, we apply SCL with a set of features selected by mutual information, and with a smaller number of pivots for a more efficient implementation. The pseudocode with our parameters is in Algorithm 2.

Algorithm 2 scl
Inputs: labeled source data $\{(x_t, y_t)\}_{t=1}^{T}$, unlabeled data from both domains $\{x_j\}$
Choose 100 pivot features.
Create 100 binary prediction problems $p_l(x)$, $l = 1 \ldots 100$
for l = 1 to 100 do
   $\hat{w}_l = \arg\min_w \left( \sum_j L(w \cdot x_j, p_l(x_j)) + \lambda \|w\|^2 \right)$
end for
$W = [\hat{w}_1 | \ldots | \hat{w}_{100}]$
$[U D V^T] = \mathrm{SVD}(W)$
$\theta = U_{[1:5,:]}^T$
Return f, a predictor trained on $\left\{ \left( \begin{bmatrix} x_t \\ \theta x_t \end{bmatrix}, y_t \right) \right\}_{t=1}^{T}$

4.5.10  SCL with projected features only

We also test sclsmall, which uses the same algorithm as scl to find the augmented features, except that it then uses only the SCL features to train, without adding them to the original features. The pseudocode is identical to that for scl, except that the last step is changed to:

Return f, a predictor trained on $\{(\theta x_t, y_t)\}_{t=1}^{T}$

4.5.11  Setting SCL parameters

When we re-implemented Structural Correspondence Learning, we also tuned it on the product review data set which Blitzer et al. [7] used to test domain adaptation with SCL for sentiment analysis.
We found that on the test machines we have available, running the intermediate steps of the algorithm with the original set of over 300,000 features caused it to run out of heap memory. Hence, we aimed to reduce the number of features used to represent the data while maintaining improvements of SCL over the baseline transfer similar to those obtained in [7] and [8]. While we obtained good accuracy with a subset of features on the order of tens of thousands and the original 1,000 pivots, the runtime was long because the algorithm requires optimizing a model for each of the pivots. To find an effective number of features, we performed feature selection by mutual information and visualized the relationship between the number of features selected and the accuracy of SCL on the sentiment analysis task. Across several source and target product review domains, we found that the highest accuracy was achieved between 1,000 and 5,000 features. We show the results for adaptation from books to dvd in Figure 4.1. To tune the number of pivots for a shorter running time of SCL, we studied the impact on accuracy of reducing the number of pivots for several numbers of features. As can be seen in Figure 4.2, the maximum accuracy is reached at 128 pivots. This is similar to the finding in [8] that for SCL, as for the original ASO algorithm, increasing the number of pivots above a similar number did not change the performance. Reducing the number of pivots from Blitzer's original 1,000 to 100 reduces the time taken by SCL by a factor of 10 while maintaining high accuracy. After performing this analysis, we decided that for our task of summarization, we would set the number of pivots to 100 and select by mutual information a subset of 10,000 lexical features from the original set of over 200,000.
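The feature selection step mentioned above can be illustrated with a minimal sketch. It assumes binary presence/absence features and binary summary labels, and scores each feature by its empirical mutual information with the label. The function names `mutual_information` and `select_top_features` and the toy data are hypothetical, and the exact estimator of Section 4.2.3 is not reproduced here.

```python
import math
from collections import Counter


def mutual_information(feature_column, labels):
    """Empirical mutual information (in bits) between a feature and a label."""
    n = len(labels)
    joint = Counter(zip(feature_column, labels))
    feature_counts = Counter(feature_column)
    label_counts = Counter(labels)
    mi = 0.0
    for (f, y), count in joint.items():
        p_joint = count / n
        p_f = feature_counts[f] / n
        p_y = label_counts[y] / n
        mi += p_joint * math.log2(p_joint / (p_f * p_y))
    return mi


def select_top_features(rows, labels, k):
    """Return the indices of the k features with the highest mutual information."""
    num_features = len(rows[0])
    scores = []
    for j in range(num_features):
        column = [row[j] for row in rows]
        scores.append((mutual_information(column, labels), j))
    # Highest score first; ties are broken by the lower feature index.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [j for _, j in scores[:k]]


# Toy data (assumption): 4 sentences, 3 binary features, binary summary labels.
rows = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
    [0, 1, 1],
]
labels = [1, 1, 0, 0]
top = select_top_features(rows, labels, k=2)
print(top)
```

In the setting of this chapter, k would be 10,000 and each row would be a sentence represented by the full set of lexical features.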
Figure 4.1: Effect of varying the number of features in SCL

Figure 4.2: Effect of varying number of pivots in SCL on accuracy and time

4.6  Summary

Our domain adaptation setting is from the source domain of meetings, with a large labeled data set AMI, to emails, where we have as labeled data BC3 and unlabeled data W3C. We use two different sets of features, one a set of high-level features derived from conversational structure, and the second a set of raw lexical and syntactic features. In this chapter we have outlined domain adaptation algorithms we use for summarizing conversations. This will be followed in Chapter 5 with a description of our experiments and the results.

Chapter 5  Experiments

Our goal in this thesis is to investigate the effectiveness of adapting from the meetings domain to the emails domain under the supervised and the unsupervised scenarios, with different possible feature representations. First, to investigate the difference between the two domains, we estimate the distance between sets of data from different domains, and between two sets of data from within the same domain. Then, we describe the set-up of our comparative studies between domain adaptation methods, with data represented by the separate feature sets. We report the performance of each method, note improvements relative to the baseline and finally draw insights from the different scenarios.

5.1  Distance between domains

Before implementing domain adaptation for a specific application, it is useful to estimate how well it can work. Ben-David et al. [5] derive a simple bound on the expected error on the target from a theoretical measure of distance between domains, $d_{\mathcal{H}\Delta\mathcal{H}}$, and the error of a classifier on the source. However, computing the distance between domains is intractable with a limited sample. An empirical approximation of domain distance, the proxy A-distance, is shown in [8] to correlate with loss due to adaptation.
However, A-distance is not an absolute measure of performance across domains. Performance also depends on the representation used by the classifier and the amount of data. In practice, given the choice between two possible source domains for adapting to a particular target domain, the source with the lower A-distance to the target will yield better results. A-distance was also used in [25], which measured the distance between the different product types in Blitzer's dataset of Amazon reviews. The authors of [25] bounded the expected target error using the source and target empirical errors and computed hypothesis class complexities, and showed that the feature copying method with unlabeled data, easyadapt++, has a lower expected error than easyadapt. The A-distance measure is computed from unlabeled data of the two domains by labeling instances according to the domain of origin, then training a linear classifier to distinguish between the two domains, and finally combining the empirical loss for the instances. For our experimental measure of distance between domains, we will estimate domain distance from classifier accuracy. This is similar to proxy A-distance, with the per-instance empirical loss set to 0 for a correct label and to 1 for an incorrect label.

5.1.1  Experiment to differentiate between domains

In this experiment, we produce an estimate of the domain distance between the meetings and email data that we use in our experiments with domain adaptation. To do so, we follow the approach of building a classifier to differentiate between the two domains. We use a subset of AMI as meetings data, a subset of BC3 as email data, and the labeled Enron corpus as a second email data set. Enron contains emails from within a corporation, whereas BC3 contains emails from the W3C technical mailing list. We expect the classifier to have more difficulty identifying the origin in a mixed set of BC3 and Enron data, and more ease in telling data from BC3 and AMI apart.
In this experiment, for each pair of domains, we:
1. took equal numbers of instances from each domain
2. removed the summary label and labeled the instances with 1/-1 according to the domain of origin
3. split each set into 50% train and 50% test
4. combined the two training sets and the two testing sets
5. trained a classifier and reported 1 - its fractional accuracy

As a control, we repeat the experiment above with two sets of data from the same source. For example, to compute the in-domain distance for AMI, we take the AMI data and label half with 1 and half with -1. Since the labels are assigned at random, we expect a classifier trained to distinguish between the two datasets to yield a value close to 0.5, or random guessing. The results obtained between domains and in-domain as a control are in Table 5.1. We report 1 - the fraction representing accuracy. A small estimate is obtained when the classifier was more successful in telling the two sets apart. An estimate closer to 0.5 is obtained when the classifier had difficulty telling them apart; the source of the data is then harder to distinguish, implying that with the given feature representation, data from the two sets are more similar.
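The procedure above can be sketched as follows. This is an illustration under stated assumptions, not the code used for the experiments: a small gradient-descent logistic regression stands in for Liblinear, the two-dimensional Gaussian data are synthetic, and the names `train_logistic_regression`, `predict`, `domain_distance_estimate`, and `make_domain` are hypothetical.

```python
import math
import random


def train_logistic_regression(rows, labels, epochs=100, learning_rate=0.1):
    """Fit a logistic regression by batch gradient descent; labels are +1 or -1."""
    n, dim = len(rows), len(rows[0])
    # Center each feature on its training mean so that the bias is easy to learn.
    means = [sum(row[j] for row in rows) / n for j in range(dim)]
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(centered, labels):
            margin = y * (sum(w * v for w, v in zip(weights, x)) + bias)
            # Derivative of log(1 + exp(-margin)) with respect to the score.
            coeff = -y / (1.0 + math.exp(min(margin, 50.0)))
            for j in range(dim):
                grad_w[j] += coeff * x[j]
            grad_b += coeff
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias, means


def predict(model, row):
    """Return the predicted domain label, +1 or -1."""
    weights, bias, means = model
    score = sum(w * (v - m) for w, v, m in zip(weights, row, means)) + bias
    return 1 if score >= 0 else -1


def domain_distance_estimate(domain_a, domain_b, seed=0):
    """Return 1 - accuracy of a classifier trained to tell the two sets apart."""
    rng = random.Random(seed)
    n = min(len(domain_a), len(domain_b))        # step 1: equal numbers
    sample_a = rng.sample(domain_a, n)
    sample_b = rng.sample(domain_b, n)
    half = n // 2                                # step 3: 50% train, 50% test
    # Steps 2 and 4: label by domain of origin and combine the two domains.
    train = [(x, 1) for x in sample_a[:half]] + [(x, -1) for x in sample_b[:half]]
    test = [(x, 1) for x in sample_a[half:]] + [(x, -1) for x in sample_b[half:]]
    model = train_logistic_regression([x for x, _ in train], [y for _, y in train])
    correct = sum(predict(model, x) == y for x, y in test)
    return 1.0 - correct / len(test)             # step 5


def make_domain(rng, center, size):
    """Synthetic stand-in for a domain: 2-D Gaussian points around a center."""
    return [[rng.gauss(center, 1.0), rng.gauss(center, 1.0)] for _ in range(size)]


rng = random.Random(42)
meetings_like = make_domain(rng, 0.0, 400)
email_like = make_domain(rng, 3.0, 400)
between = domain_distance_estimate(meetings_like, email_like)
# Control: two subsets of the same synthetic domain.
control = domain_distance_estimate(meetings_like[:200], meetings_like[200:])
print(f"between domains: {between:.3f}, within one domain: {control:.3f}")
```

On well-separated synthetic domains the between-domain estimate should be close to 0, while the control should be close to 0.5, mirroring the interpretation given above.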
Table 5.1: Distance estimate between domains and within a domain

BETWEEN DOMAINS                          estimate
using conversational features
   between AMI and BC3                   0.005276
   between AMI and Enron                 0.017934
   between BC3 and Enron                 0.381636
using lexical features
   between AMI and BC3                   0.016
   between BC3 and Enron                 0.002152

IN-DOMAIN                                estimate
using conversational features
   between two subsets of AMI            0.335991
   between two subsets of BC3            0.456656
   between two subsets of Enron          0.510714
using lexical features
   between two subsets of AMI            0.438556
   between two subsets of BC3            0.427245
   between two subsets of Enron          0.557143

5.1.2  Discussion

The high estimates in the control experiment, contrasted with the low estimates obtained when distinguishing AMI meetings data from email, indicate a difference in the data distribution of email and meetings. Also, conversational features appear to be more stable across variations in the data distribution, as we can see from the experiment differentiating between the two email corpora BC3 and Enron. The classifier that uses conversational features has low accuracy in telling them apart, while the classifier that uses lexical features has higher accuracy. This suggests that conversational features are more representative of the commonality between conversational data from different sources than the set of lexical features. From these results, we expect that direct transfer of a classifier from meetings to emails will not perform well because the source and target data are different with the given feature sets, hence domain adaptation is needed to bridge the gap between domains.

5.2  Comparison between domain adaptation methods

We describe here the set-up of the comparative study between the different domain adaptation methods we implemented, which we run with different sets of features. We select training and test data in a precise manner to obtain statistically sound measurements.
BC3 labeled email data totals about 3000 sentences, and AMI labeled meetings data totals over 100,000 sentences. Both for efficiency and to avoid overwhelming the in-domain data, in each of our runs we subsample 10,000 sentences from the AMI data to use for training. For the target domain, we use a subset of BC3 of 2000 sentences and a subset of W3C of 8000 unlabeled sentences in the comparison experiments, which fixes the ratio of source to target data; we vary this ratio in another experiment. We also randomly split the BC3 data into five folds, using each subset of four folds for training and testing on the remainder. We repeat this 5-fold cross-validation three times, for three different splits into folds. In each comparison of different methods, we use the same set of data for labeled source, labeled target, and unlabeled target. For each method, we process and augment the data as appropriate, and then train a logistic regression classifier on the training set of the method. For the supervised methods, the labeled training set includes both source data and target data, whereas for the unsupervised methods, the only labeled data used to train is from the source, along with unlabeled data from the target. All methods are tested on the testing fold of the set of BC3 data. For each run of a method, we compute the resulting accuracy of the classifier, the time taken, and points on the ROC curve. We use the ROC points to compute the auROC. We report the mean accuracy, auROC, and time of each method over the three runs. To test for significant differences between the performances of the various methods, we compute pairwise t-tests between different auROC values obtained in the first run.
Because in each experiment we report results of pairwise t-tests between several methods and the baseline, we compare the p-value to an α value of 0.005 rather than the customary 0.05. This accounts for the increased chance of a type I error, that is, of reporting a significant difference from the baseline where there is none.

5.2.1  Features used

To study the effect of the feature representation on domain adaptation between our source and target domains, we use the two sets of conversational and lexical features in parallel experiments, and then merge the two sets and run an experiment with all the features.

The set of conversational features. We use the set of 24 general conversational features introduced in Section 4.2.1. These features are specific to conversations and expected to be useful in transferring information between the two modalities.

The set of lexical features. We derive an extensive set of lexical features from the AMI and BC3 datasets, and then compute their occurrence in the W3C data. After throwing out features that occur fewer than five times, approximately 200,000 features remain. We further select a subset of 10,000 features from this large set by mutual information, as detailed in Section 4.2.3.

The experimental set-up is nearly identical for the different feature sets. For scl, when we use the lexical or all the features, we set the parameters of SCL to 100 pivots and k=5. The derived SCL features and the original features are used in the scl method, whereas only the derived features are used in the sclsmall method. For the experiment with conversational features, since the original number of features was 24, we only consider one version of SCL with the original and the derived features, i.e., scl and not sclsmall.

5.2.2  Supervised and unsupervised scenarios

For the supervised domain adaptation setting, indomain, or training on the labeled email data, is the baseline for the methods merge, ensemble, easyadapt, and pred.
We also consider indomain to be a baseline for easyadapt++, as this domain adaptation method trains on labeled target data. More precisely, since easyadapt++ also uses unlabeled target-domain data, unlike the supervised methods, it is a semi-supervised algorithm. For the unsupervised domain adaptation setting, transfer, or training on the labeled AMI data, is the baseline for the methods selftrain, scl, and sclsmall.

5.3  Results

In our results, we report the mean auROC and time of the different domain adaptation algorithms, as well as the p-value from a t-test between the auROCs achieved and those of the baseline. Because we consider auROC to be a better performance measure than accuracy, we don't report accuracy in this section.

5.3.1  Supervised domain adaptation results

For the supervised methods, Figure 5.1 lists for each method the mean auROC achieved, its standard deviation, the time taken for one run, and its p-value compared to the baseline indomain. P-values lower than the significance threshold of 0.005 are highlighted by shading. We visualize the auROCs achieved by the different methods with each of the feature sets in Figure 5.2, Figure 5.3, and Figure 5.4. We also plot side-by-side the performances of each method with the conversational, lexical, and merged feature sets in Figure 5.5.

Figure 5.1: Results for supervised domain adaptation

Figure 5.2: Supervised scenario auROC with the conversational features

Figure 5.3: Supervised scenario auROC with the lexical features

Figure 5.4: Supervised scenario auROC with the merged set of features

Figure 5.5: Comparing supervised performance with the different feature sets

5.3.2  Unsupervised domain adaptation results

Results for the unsupervised methods and baseline transfer are listed in Figure 5.6. Again, for each method we list the mean auROC achieved, its standard deviation, the time taken for one run, and its p-value compared to the baseline.
We also visualize the auROCs achieved with each of the feature sets in Figure 5.7, Figure 5.8, and Figure 5.9. In Figure 5.10, we plot side-by-side the performances of each method with the conversational, lexical, and merged feature sets.

5.3.3  ROC curve plots

Recall from Evaluation of Summarization that points in the ROC graph correspond to values of recall and 1−specificity, and that when we vary a classifier's threshold through its range, we obtain ROC curves. We also plot ROC curves for our experiments, as the same auROC can correspond to different curve shapes with different interpretations. ROC curves of the domain adaptation methods are plotted in Figure 5.11 and Figure 5.12, each graph for a set of features.

Figure 5.6: Results for unsupervised domain adaptation

5.4  Discussion

5.4.1  Supervised domain adaptation

With conversational features, auROCs similar to that of the baseline indomain are achieved by easyadapt, pred, and easyadapt++. These train on the labeled target data and are feature-based adaptation methods, influenced by the source data through additional features. merge and ensemble perform more poorly than the baseline. Since both of these use labeled meetings data along with the labeled email data in supervised training, we find that adding the source data from a different domain without modifying the feature space actually hurts performance. Most of the algorithms implemented achieve a worse performance with lexical features than with the conversational features. With an indomain auROC of 0.6471, the model appears to be over-fitting even after selecting the top lexical features by mutual information. Hence, we would conclude that the lexical feature representation is detrimental to supervised performance.

Figure 5.7: Unsupervised scenario auROC with the conversational features
The only method that performs well is the semi-supervised easyadapt++, which yields a significant improvement of 13% over the auROC of the baseline. easyadapt++ also outperforms the easyadapt algorithm, which augments the feature space the same way but does not make use of the unlabeled email data. This is a sign that easyadapt++ successfully uses the unlabeled W3C data to identify which features are general and which are domain-specific. It can thereby overcome the difference in how the lexical features are expressed in the two disparate domains and learn a classifier that is useful for scoring sentences in the target. To compare the different feature sets, we observe from Figure 5.5 that the conversational features perform better than or similarly to the set of all features, which in turn performs similarly to or better than the set of lexical features only.

Figure 5.8: Unsupervised scenario auROC with the lexical features

5.4.2  Unsupervised domain adaptation

With the conversational features, both selftrain and scl perform similarly to the baseline transfer. selftrain yields a statistically significant improvement (p < 0.005), though the auROC obtained is within a standard deviation of that of the baseline. With the lexical feature set, the improvement of domain adaptation over the baseline is more pronounced. The standard structural correspondence learning algorithm scl yields a gain of 3.7% over the auROC of transfer, and sclsmall, which uses only the derived SCL features, yields a large gain of 20.4%. The SCL algorithm is hence useful in unsupervised adaptation, and its benefit comes through the projected features, after the original set of lexical features is thrown out. We surmise that this is because the lexical features are raw and too domain-specific to be used directly in a classifier, though once aligned by SCL they produce new and better features.
Figure 5.9: Unsupervised scenario auROC with the merged set of features

5.4.3  Effectiveness of domain adaptation

Because the auROC results for the supervised and the unsupervised methods were obtained on the same data and the same split into folds, the performance can be compared over all algorithms. We show the baselines indomain and transfer along with the best performing algorithm in Figure 5.13 for the conversational feature set and in Figure 5.14 for the lexical feature set. With the conversational features, all unsupervised domain adaptation methods and their baseline transfer have lower auROC than indomain. This indicates both that labeled in-domain data is needed to achieve the best performance, and that when this labeled email data is available, out-of-domain meetings data does not additionally help performance. However, with the set of lexical features, both easyadapt++ and sclsmall improve over the baselines. Since both of these use unlabeled target domain data, we can conclude that meetings data is useful in unsupervised or semi-supervised domain adaptation to improve over performance with no adaptation.

Figure 5.10: Comparing unsupervised performance with the different feature sets

As for the feature sets, the conversational feature set is smaller and hence more time-effective than the lexical feature set. When merging the conversational feature set with the lexical features, auROC performance degrades, so it seems that some of the quality of the conversational features is lost when adding the more raw lexico-syntactic features. However, because the different sets of features were used in different experiments where both the data used and the training versus testing split were randomized, we cannot make a direct statistical comparison of the different feature sets for a single algorithm.
Finally, in ranking the effectiveness of the different algorithms, training indomain is less time-intensive than more complex algorithms and is hard to beat. With less labeled data in the target, domain adaptation may be more justified. In particular, algorithms that use the unlabeled data such as easyadapt++ and sclsmall yield the most improvement over the baselines, though the increased performance comes at a higher computational cost.

Figure 5.11: ROC curves of domain adaptation methods with the conversational features

5.5  Conclusion from the experiments

We experimentally measured distance between domains using our data, and found that email and meetings data are farther apart than data from different email corpora and than data from within the meetings corpus. The difference in data distribution between domains depends on the feature representation, so features that are expressed similarly between the source and target will be more useful for an adapted classifier. The conversational features and the projected SCL features are two examples of such good features. Also for the features used in our experiments, the small set of conversational features performs the best in the supervised scenario and the set of raw lexical features performs well for unsupervised and semi-supervised domain adaptation. Overall, training in-domain on the available labeled email data achieves an auROC that is hard to beat, though in the absence of this data, we've found that domain adaptation methods improve significantly over using the out-of-domain data directly.

Figure 5.12: ROC curves of domain adaptation methods with the lexical features
Figure 5.13: Baselines and best domain adaptation methods with conversational features

Figure 5.14: Baselines and best domain adaptation methods with lexical features

Chapter 6  Further Analysis and Future Work

In the previous chapter, we presented our comparative study and the main conclusions we draw from the results. Here we present further analysis and suggest a number of avenues for future research.

6.1  Amount of labeled data

Domain adaptation was found to work best when the amount of labeled source data used is not much larger than the amount of labeled target data. For an experimental example of this, see [48]. An interesting question for supervised domain adaptation, which we have explored, is whether there is a difference in performance when the amount of target data is close to the amount of source data, versus when the amount of source data used is an order of magnitude larger.

6.1.1  Dependence on the amount of source data

We investigate how varying the amount of labeled data from the meetings domain affects performance of different domain adaptation algorithms. With the set of lexical features, we vary the amount of AMI used in training while keeping the amount of target training data fixed at 2,000 sentences of BC3, and perform cross-validation on the target. We compare the trends in auROC performance given increasing amounts of source data for indomain, easyadapt, pred, transfer, ensemble, and sclsmall. The results of this experiment are plotted in Figure 6.1.

Figure 6.1: Domain adaptation performance vs. amount of source data

The most important conclusion we draw from this is that once a certain amount of source data is given, increasing that amount even by a factor of 10 does not improve the auROC in domain adaptation. The results also confirm the previously observed point that performance is best when the amount of source is similar to the amount of target, i.e., 2,000 sentences.
sclsmall achieves the top auROC even though it uses no target training data. ensemble is also better than both indomain and transfer for varied amounts of source data. This suggests that combining classifiers trained on different data can be helpful for domain adaptation.

6.1.2  Dependence on the amount of target data

We also investigate how varying the amount of labeled target domain data affects performance of domain adaptation methods. With the conversational features, keeping the amount of source data at a fixed 5,000 sentences, we vary the amount of training labeled BC3 data between 0 and 2,000 sentences, always testing on a held-out set of 1,000 lines of BC3. We compare the auROC performance of indomain, easyadapt++, pred, transfer, and scl with the different amounts of target data. We display the results of this experiment in Figure 6.2.

Figure 6.2: Domain adaptation performance vs. amount of target data

Performance of supervised domain adaptation improves very quickly with the amount of target data. With 200 labeled target instances, the auROC is near to the top achieved, and after 400 instances, we see no further improvement in performance with additional labeled target data. The performance curves for the supervised methods are close together and don't beat the indomain baseline. This experiment hence shows that with the conversational features, only a relatively small amount of labeled in-domain data is necessary for good performance, and domain adaptation is not required.

6.2  Future work

6.2.1  Using classifiers trained on different feature sets

Since we have observed that domain adaptation performance varies with the features used to represent the data from different domains, and that directly combining the lexical and conversational feature sets was not effective, future work should explore training several separate classifiers. Previous work has found that combining different learners can improve performance over the best single classifier.
For theoretical background on combining classifiers, see [24]. We expect that an ensemble of classifiers, where each exploits a different type of feature in representing the data, can improve performance. We suggest training a classifier on the lexical features, another on the conversational features, and a third on the derived SCL features and then combining the three. This can be extended to training separate classifiers on the source and target data with each of the feature representations. To improve overall performance, the final contribution of each classifier should be parametrized and the parameter values tuned on a development set. A natural extension for the unsupervised scenario would be co-training. Recall that in our implementation of self-training we used a classifier trained on the labeled source data to incrementally label target unlabeled data, and that from our experimental results, self-training didn't improve over the baseline. We suggest training two classifiers, one using the lexical features and a second using conversational features, and using both of them to incrementally label instances. Since the different feature sets are different views of the data, we expect that co-training can do better than each of the separate self-trained classifiers.

6.2.2  Weighting data

In this thesis, most domain adaptation methods treated the data coming from the source, labeled data coming from the target, and unlabeled data from the target in different ways to account for the differences between these types of data. One way in which target data has been assigned more importance compared to the source was by adding several copies of the labeled target data to the training set. From our analysis, we don't deem this necessary. The differing contributions of these sources of data, or even of individual instances, have also been modeled explicitly using instance weights, as in [21].
We suggest including domain-specific parameters for the source and the target domains, and a parameter for the data labeled by self-training, and learning their values on a development set of target data.

6.2.3  Semi-supervised SCL

SCL is an unsupervised domain adaptation algorithm, as it does not use any labeled target domain data. It can be extended to be semi-supervised. As mentioned in the background section on SCL (Section 3.3.3), Blitzer et al. [7] observed that sometimes the projected SCL features had misaligned original features from the two domains. They hence proposed a version of SCL that uses a few labeled target domain instances to correct the weights of these SCL features. In the same analysis, they found that with a much larger amount of unlabeled data and no labeled target, using only the projected features reduced the error compared to both the original SCL and SCL with target. For our adaptation setting, we got good performance with only the few projected features in sclsmall. However, as their algorithm integrates information from the labeled target to score these features, it could be used in future work as a semi-supervised version of SCL. We have also experimented with adding the labeled target data as training data for the final SCL classifier. We found that the performance was not significantly different from that of sclsmall when representing data with the lexical features, and was similar to indomain in the representation with conversational features.

6.3  Summary

The effectiveness of domain adaptation depends on the amount of data available, so we have further investigated the effect of varying the amount of data. Also, we proposed ways of integrating different sets of features and algorithms for future work on algorithms for this domain adaptation setting.
As the versions of most domain adaptation methods we implemented were basic, tuning the algorithms on a development set from the target can also be helpful for achieving top classification performance.

Chapter 7  Conclusion

The automatic summarization of conversations is an important and difficult problem. This research fits into the supervised extractive approach to summarization: sentences in the conversation are selected to include in a summary, and this selection is performed by a classifier which is learned on a human-annotated corpus of training data. Human conversations occur in many modalities, most of which lack publicly available data sets for training supervised models. Unlabeled data occurs naturally, whereas labeling data in each new modality or domain is expensive. Also, conversational data can vary in form depending on the modality, hence a model trained on one modality is rarely effective in a new modality. In our case, we have a large set of meetings recordings labeled for summarization along with a relatively small number of labeled email threads and an additional set of unlabeled emails. One limitation of our approach is that our evaluation only compares the set of sentences selected for extraction with those labeled by humans. The quality of a summary is subjective, so to show that the summaries are useful may require downstream processing and abstraction to yield coherent summaries, as well as an extrinsic evaluation by users. Domain adaptation is the general problem of how to use data outside a target domain, i.e., from a different source domain, to improve performance on the target. In the supervised setting, labeled training data is available both in the target domain and in the source, whereas in the unsupervised setting labeled data is only available outside the domain of interest. Many supervised domain adaptation algorithms have been found to be effective.
Learning in the absence of in-domain labeled data is more difficult, hence fewer unsupervised domain adaptation algorithms have been successful. We selected a number of supervised and unsupervised domain adaptation algorithms to implement for our problem, and compared their performance to baselines. We designed the comparative study to ensure that our results and conclusions are statistically sound. Also of note, we investigated the structural correspondence learning algorithm, a relatively new and successful algorithm that aligns features of high-dimensional natural language problems between domains. Another research problem is how best to represent conversational data for domain adaptation. We used two different feature sets: one small set of features specific to conversations, and one large set of raw lexical and syntactic features, and found that the features used in the adapted learners had a marked impact on their performance. The conversational features resulted in very good performance for supervised summarization, and also a high baseline performance. With this set of features and some labeled target data, we don't recommend doing domain adaptation because training on the available in-domain data is sufficient. The less sophisticated set of lexical features gave worse summarization performance. However, these were useful in leveraging unlabeled data from the target domain in easyadapt++ and scl. sclsmall was very successful, though it does not use the lexical features directly but only a projection of these features in a lower-dimensional space after inferring correspondences from unlabeled data. Given only the lexical features, domain adaptation improves over both the transfer and indomain baselines. Therefore, in the unsupervised scenario, sclsmall with lexical features is the best method and beats the baseline.
The performance of domain adaptation depends on the distance between domains, so an interesting comparison would be with adaptation between pairs of domains with different domain distances, such as different conversational domains within the same modality, or between other pairs of modalities. Since unlabeled data is often naturally available in-domain, and whatever labeled data is available in-domain is valuable, for future research we recommend investigating semi-supervised algorithms like easyadapt++ or combining different algorithms and types of features.

