Domain Adaptation for Summarizing Conversations

by

Oana Sandu

B. Science, McGill University, 2008

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Master of Science
in
THE FACULTY OF GRADUATE STUDIES
(Computer Science)

The University of British Columbia (Vancouver)
April 2011
© Oana Sandu, 2011

Abstract

The goal of summarization in natural language processing is to create abridged and informative versions of documents. A popular approach is supervised extractive summarization: given a training source corpus of documents with sentences labeled with their informativeness, train a model to select sentences from a target document and produce an extract. Conversational text is challenging to summarize because it is less formal, its structure depends on the modality or domain, and few annotated corpora exist. We use a labeled corpus of meeting transcripts as the source, and attempt to summarize a different target domain, threaded emails. We study two domain adaptation scenarios: a supervised scenario in which some labeled target domain data is available for training, and an unsupervised scenario with only unlabeled data in the target and labeled data available in a related but different domain. We implement several recent domain adaptation algorithms and perform a comparative study of their performance. We also compare the effectiveness of using a small set of conversation-specific features with a large set of raw lexical and syntactic features in domain adaptation. We report significant improvements of the algorithms over their baselines. Our results show that in the supervised case, given the amount of email data available and the set of features specific to conversations, training directly in-domain and ignoring the out-of-domain data is best. With only the more domain-specific lexical features, though overall performance is lower, domain adaptation can effectively leverage the lexical features to improve in both the supervised and unsupervised scenarios.

Preface

This work is informed by past research by Gabriel Murray and Giuseppe Carenini in summarizing conversational data. Sections 2.4 (Summarization of conversational data), 4.1 (Data), 4.2.1 (Conversational features), and 4.2.2 (Lexico-syntactic features) draw from their previous work. I wrote the rest of the manuscript, and also researched and implemented domain adaptation methods, conducted the experiments, and drew conclusions. A version of chapters 4 and 5 has been published: Oana Sandu, Giuseppe Carenini, Gabriel Murray, and Raymond Ng. 2010. Domain adaptation to summarize human conversations. In Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing (DANLP 2010), pages 16-22. I conducted the experiments, and the paper was written in conjunction with the other authors.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
2 Summarization Background
  2.1 Summarization problem
    2.1.1 Summary of a single or multiple documents
    2.1.2 Generic or query-based summary
  2.2 Extractive and abstractive summarization
    2.2.1 Comparison of extractive and abstractive summarization approaches
    2.2.2 Sentence selection
  2.3 Evaluation of summarization
    2.3.1 General classification metrics
    2.3.2 Metrics specific to summarization
  2.4 Summarization of conversational data
    2.4.1 Nature and challenges of conversations
    2.4.2 Meetings summarization
    2.4.3 Emails summarization
    2.4.4 Cross-modality summarization
  2.5 Conclusion
3 Domain Adaptation Background
  3.1 Domain adaptation problem
  3.2 Supervised methods
    3.2.1 Instance weighting
    3.2.2 Adding prediction as an input feature
    3.2.3 Using source data for a prior
    3.2.4 Maximum entropy genre adaptation model
    3.2.5 Feature copying method (easyadapt)
  3.3 Unsupervised and semi-supervised methods
    3.3.1 Baseline transfer
    3.3.2 Self-training, co-training, and boosting methods
    3.3.3 Structural correspondence learning (SCL)
    3.3.4 Easyadapt with unlabeled data
  3.4 Summary
4 Extractive Summarization with Domain Adaptation
  4.1 Data
    4.1.1 AMI corpus
    4.1.2 W3C corpus
    4.1.3 BC3 corpus
    4.1.4 Enron corpus
  4.2 Features
    4.2.1 Conversational features
    4.2.2 Lexico-syntactic features
    4.2.3 Feature selection
    4.2.4 Combining conversational and lexical features
  4.3 Classification approach
  4.4 Evaluation metrics
  4.5 Domain adaptation methods
    4.5.1 Supervised baseline indomain
    4.5.2 Unsupervised baseline transfer
    4.5.3 Merge
    4.5.4 Ensemble
    4.5.5 Pred
    4.5.6 Easyadapt
    4.5.7 Easyadapt++
    4.5.8 Selftrain
    4.5.9 Original SCL
    4.5.10 SCL with projected features only
    4.5.11 Setting SCL parameters
  4.6 Summary
5 Experiments
  5.1 Distance between domains
    5.1.1 Experiment to differentiate between domains
    5.1.2 Discussion
  5.2 Comparison between domain adaptation methods
    5.2.1 Features used
    5.2.2 Supervised and unsupervised scenarios
  5.3 Results
    5.3.1 Supervised domain adaptation results
    5.3.2 Unsupervised domain adaptation results
    5.3.3 ROC curve plots
  5.4 Discussion
    5.4.1 Supervised domain adaptation
    5.4.2 Unsupervised domain adaptation
    5.4.3 Effectiveness of domain adaptation
  5.5 Conclusion from the experiments
6 Further Analysis and Future Work
  6.1 Amount of labeled data
    6.1.1 Dependence on the amount of source data
    6.1.2 Dependence on the amount of target data
  6.2 Future work
    6.2.1 Using classifiers trained on different feature sets
    6.2.2 Weighting data
    6.2.3 Semi-supervised SCL
  6.3 Summary
7 Conclusion
Bibliography

List of Tables

Table 2.1 Example of an abstractive and an extractive summary (originally in [23])
Table 2.2 Confusion matrix
Table 3.1 Capitalization error by Maximum Entropy Markov Model (MEMM) with and without in-domain training data
Table 3.2 Results for adaptation with MEGAM
Table 3.3 Part of Speech (POS) tagging results with Structural Correspondence Learning (SCL)
Table 3.4 Sentiment analysis results with SCL from [8]
Table 4.1 Conversational features as proposed in [33]
Table 4.2 Lexical feature types
Table 5.1 Distance estimate between domains and within a domain

List of Figures

Figure 3.1 Co-training pseudocode
Figure 3.2 Self-training pseudocode
Figure 4.1 Effect of varying the number of features in SCL
Figure 4.2 Effect of varying the number of pivots in SCL on accuracy and time
Figure 5.1 Results for supervised domain adaptation
Figure 5.2 Supervised scenario auROC with the conversational features
Figure 5.3 Supervised scenario auROC with the lexical features
Figure 5.4 Supervised scenario auROC with the merged set of features
Figure 5.5 Comparing supervised performance with the different feature sets
Figure 5.6 Results for unsupervised domain adaptation
Figure 5.7 Unsupervised scenario auROC with the conversational features
Figure 5.8 Unsupervised scenario auROC with the lexical features
Figure 5.9 Unsupervised scenario auROC with the merged set of features
Figure 5.10 Comparing unsupervised performance with the different feature sets
Figure 5.11 ROC curves of domain adaptation methods with the conversational features
Figure 5.12 ROC curves of domain adaptation methods with the lexical features
Figure 5.13 Baselines and best domain adaptation methods with conversational features
Figure 5.14 Baselines and best domain adaptation methods with lexical features
Figure 6.1 Domain adaptation performance vs. amount of source data
Figure 6.2 Domain adaptation performance vs. amount of target data

Glossary

DA: dialogue act
NLP: Natural Language Processing
MEMM: Maximum Entropy Markov Model
MEGAM: Maximum Entropy Genre Adaptation Model
SCL: Structural Correspondence Learning
ASO: Alternating Structure Optimization
POS: Part of Speech
WSJ: Wall Street Journal
TF-IDF: Term Frequency - Inverse Document Frequency
MMR: Maximal Marginal Relevance
LSA: Latent Semantic Analysis
ROC: Receiver-Operator Characteristic
SVD: Singular Value Decomposition
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
EM: Expectation-Maximization algorithm

Acknowledgments

For inspiring me, professors Giuseppe Carenini, Gabriel Murray, and Raymond Ng, as well as the students in the UBC summarization group, Shafiq, Nicholas, Hammad, Shama, and in the laboratory of computational intelligence as a whole. NSERC and BIN for funding me to tackle challenging problems. My fellow graduate students who have become friends, Sancho, Abigail, Doug, April, Ben, Patrick, Kevin, Cody, Egor, and all those who show up on Fridays for decompressing. My family for their unwavering support despite our distance. I dedicate this thesis to my maternal grandmother Maria Ignat, 1930-2011, for always speaking her truth.

Chapter 1

Introduction

Conversations, or multi-party interactions through speech or text, are present everywhere in people's personal and professional lives.
The modern shift from conversations being face-to-face to increasingly mediated by technology has resulted in many different types of conversational data, such as emails, meeting recordings, phone calls, instant messages, chats, and online forums. The natural language processing community, which has extensively studied certain problems on data in written monologue such as books and newsprint, is currently tackling some of them on conversational data. Since dialogue can involve several participants and be less coherent, less fluent, and more fragmented than monologue, conversations pose new challenges.

This thesis investigates the summarization of human conversations. The ability to automate this task can help with tracking, retrieving, and understanding both informal interactions and those within a corporation. The summarization of dialogue data is challenging because a real-life conversation can comprise a long sequence of exchanges that may be synchronous or asynchronous, and may span different modalities.

The general problem of text summarization can be described according to different aspects, as we outline in Chapter 2. A distinction crucial to our work is whether a system can be trained with a corpus of examples which have sample summaries produced by humans, i.e., labeled data, or only with naturally occurring conversations, i.e., unlabeled data. Because conversational data has privacy concerns, corpora of real-world data are rare. Conversational modalities, which we also refer to as domains, vary in the amount of labeled data available, and for most there is no large standard corpus.

In our approach to summarization, we seek to use machine learning to select informative sentences from a conversation that we then concatenate to obtain an extractive summary of the conversation. We incorporate the available labeled data, and supplement it with unlabeled data that is generally more plentiful. We hypothesize that labeled data in one modality can be useful in training a summarizer for another modality, though different modalities vary widely in terms of their structural and lexical attributes. The challenges of training with data from outside the domain of interest are the subject of a recent avenue of study in machine learning called domain adaptation. Domain adaptation aims to use labeled data in a well-studied source domain and a limited amount of labeled data from a different target domain to train a supervised model that performs well in the target domain.

In this thesis, we investigate several domain adaptation algorithms for the purpose of using the AMI corpus of labeled data in the source domain of meetings to improve summarization in the target domain of email threads, where we use the small labeled BC3 corpus, a subset of the large unlabeled W3C corpus. We implement baselines of using data from only one domain, simple domain adaptation techniques, and state-of-the-art domain adaptation algorithms which we tuned for our problem. As only some of the approaches can incorporate unlabeled data, we consider separate scenarios for comparing performance, depending on what type of data is available in the target domain. We also investigate domain adaptation using a small set of focused, general conversational features which have proven useful for determining sentence informativeness in the email and meetings modalities in supervised extractive summarization [33], and compare it with a large set of simple lexical features.
From the results of our comparative studies, we glean several important insights. In the supervised scenario, where in-domain labeled data is available for training, the domain adaptation algorithms we tested perform no better than the baseline of using the in-domain data, demonstrating that in this scenario domain adaptation is not required. However, in the semi-supervised scenario with no labeled data to train on in the target domain, domain adaptation is helpful in determining sentence informativeness in the target by leveraging available labeled data from the source domain and unlabeled data. In particular, the structural correspondence learning algorithm yields a large and significant improvement in extractive summarization of email threads. This also indicates that domain adaptation can be useful for summarizing conversations in domains that have been less well studied than meetings and email. As for the usefulness of features for training an adapted classifier, we observe that conversational features are more appropriate than lexical features for the supervised problem, whereas in semi-supervised adaptation, lexical features are much more helpful.

We start this thesis in Chapter 2 by describing the summarization problem and surveying past research in methods and evaluation for general summarization and for conversations. We follow this with a survey of domain adaptation strategies in Chapter 3. In Chapter 4 we describe our setting for extractive summarization of conversational data with domain adaptation, including the data sets, features, and domain adaptation methods used. We then detail our experiments and their results in Chapter 5. A discussion of the results and suggested future work follows in Chapter 6, before our conclusion in Chapter 7.

Chapter 2

Summarization Background

Summarization in natural language processing is the task of taking an input document and creating an abridged version that preserves the important points of the original. This chapter provides a background on automatic summarization systems. We first characterize summarization along different dimensions: single versus multiple documents as input, generic summarization versus query-based, supervised versus unsupervised selection of sentences, and extractive versus abstractive summary output. We focus the survey on work related to our approach to summarization, and outline previous work on summarizing conversations. In addition, we discuss metrics used to evaluate extractive summarization.

2.1 Summarization problem

Given today's information overload, it can be very useful to automatically compress the content of documents. Many of us are exposed daily to a large amount of text from different sources: documentation and reports which we need to read and understand for our job, email discussions involving several participants with different viewpoints, online news articles and blog posts, and web page hits from our ubiquitous search engine queries. The objective of managing this data has motivated both academic research into summarization and the use of natural language processing in many practical applications. For a general definition of the summarization problem, [23] suggests:

    Text summarization is the process of distilling the most important information from a text to produce an abridged version for a particular task and user.

Several approaches to automatic summarization have been researched, varying with the type of documents and the purpose of the summaries.
We will categorize summarization according to different dimensions. We briefly define differences in the inputs to summarizers, then we focus on the distinction between extractive and abstractive summarization, and finally we outline supervised versus unsupervised sentence selection strategies.

2.1.1 Summary of a single or multiple documents

Recent research in summarization has focused on either creating a summary of a single document, or creating a summary of multiple related documents. Multi-document summarization is more complex since, for instance, the summarizer also has to reorder the important points in the documents to present them in a logical sequence, and not repeat points common to the documents many times. Summarizing email threads shares some of the complications of the multi-document summarization problem.

2.1.2 Generic or query-based summary

In generic document summarization, the input to the summarizer is a text, and the output is a summary presenting the information in the text in a compressed manner. In contrast, in query-based summarization, the user specifies a particular information need, for example by entering a set of keywords to query a search engine. The output of a query-based summarizer must not only be a good summary of the text, but also tailored to the query [23]. In this work, we focus on generic summarization.

2.2 Extractive and abstractive summarization

An extractive summary of a text contains sentences found in the original text, whereas an abstractive summary is composed of sentences that reformulate information found in the text. Table 2.1 illustrates these two types of summarization output. We reproduce Abraham Lincoln's famous Gettysburg Address, along with an example of an abstractive summary and an extractive summary of this monologue. The manually generated summaries in the example were taken from [23] and modified to be of comparable length. Now that we have defined these two approaches, we describe the differences between extractive and abstractive summarization, before focusing on content selection in the extractive approach, as it is more relevant to our domain adaptation approach.

2.2.1 Comparison of extractive and abstractive summarization approaches

In the extractive approach, sentences from the original text are selected and concatenated into a summary, which is meant to present the most important information from the text. Jurafsky [23] identifies the following three steps in creating an extractive summary:

1. content selection: selecting sentences from the original text to use in the summary
2. information ordering: arranging the selected sentences and structuring the summary
3. sentence realization: cleaning up the ordered sentences to form a coherent summary

Extractive summarization of text has been studied more extensively than abstractive summarization. Its most processing-intensive step, content selection, usually involves extracting features from sentences and using machine learning to rate their informativeness. The next steps in single-document extractive summarization can be quite straightforward: many summarizers perform no reordering or cleaning up of selected sentences and present them as they were in the original text. The information ordering component is more important in multi-document summarizers. Its goals are to reduce redundancy and to order sentences in a coherent way.
In sentence realization, the content can be made more readable after either parsing the sentences and applying rules for which parts to remove, or supervised learning on human-written summaries. For clarity, the sentence realization step can also resolve pronouns and ensure that the first and subsequent mentions of an entity are resolved [23].

Table 2.1: Example of an abstractive and an extractive summary (originally in [23])

Original document
Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate – we can not consecrate – we can not hallow – this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us – that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion – that we here highly resolve that these dead shall not have died in vain – that this nation, under God, shall have a new birth of freedom – and that government of the people, by the people, for the people, shall not perish from the earth.

Extractive summary
Four score and seven years ago our fathers brought forth upon this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation can long endure. We have come to dedicate a portion of that field. From these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion - that government of the people, by the people, for the people, shall not perish from the earth.

Abstractive summary
This speech by Abraham Lincoln commemorates soldiers who have died in the Battle of Gettysburg. It addresses the reason why they gave their lives, and the meaning of their sacrifice. He reminds us that America was founded on principles of freedom and equality, and that the civil war is a fight to maintain those values.

One disadvantage of extractive summarization is that if the relevant sentences are simply copied and pasted, the resulting summary can seem incoherent to the reader. For instance, in the extractive summary of the Gettysburg Address in Table 2.1, what the "field" and the "honored dead" refer to is not explicit from the context. Abstractive summaries can be of higher quality, as can be observed from comparing the abstractive and extractive summaries in Table 2.1. However, abstractive summarization is more complicated to automate and, as a drawback, may require applying additional domain-specific knowledge. The goal of abstractive summarization is to create summaries that emulate human-written summaries.
An abstractive system would extract information from the document, represent it in an internal structure, and then draw inferences and compress knowledge scattered across several input sentences. A natural language generation component is then required to convert the processed internal representation into a textual summary. Spärck Jones [45] describes abstractive summarization as a different three-step process:

1. interpretation of the source text into a representation
2. transformation to summarize the representation
3. generation of the text of the summary

In abstractive summarization, the interpretation and transformation steps are parallel to content selection in extractive summarization, though more complex. Abstractive systems are often tailored to a specific domain to better recognize the relevant information in the input. For instance, if the system were intended to create summaries of corporate meetings in which action items are important, the summarizer could first identify action item dialogue acts (DAs) in a meeting, then select sentences from it that support each particular item, and finally summarize them into a single sentence. Abstractive and extractive summarization are not mutually exclusive: sentence realization in extractive summarization can be considered a simple form of abstraction, and abstractive summarization can use the sentence informativeness output of an upstream extractive summarizer's content selection component. As extractive summarization is more developed and more amenable to domain adaptation, we focus on the extractive approach in the rest of this work. We nevertheless recognize that abstractive summarization research holds a lot of promise for practical applications.

2.2.2 Sentence selection

A key question in summarization, and in particular of the extractive approach, is how to rank the importance of sentences in the document so that a subset can be selected for the summary. An important goal of this thesis is to find a good algorithm for categorizing the sentences in a conversation as salient or not in order to perform content selection. The following is an overview of previous work in unsupervised and supervised content selection.

Unsupervised content selection

One approach to sentence selection, which we refer to as unsupervised since it needs no human labeling, is to rank sentences in a given text by word statistics. An early example of the statistical approach is Luhn [29]'s summarizer for literature articles, which counts frequencies of terms after the stopwords are removed; recall that stopwords are common English words that don't add information content, e.g., "the". The unsupervised content selection process can be achieved by computing a topic signature of the document as the words that are most salient, then scoring sentences based on the overlap of their words with the topic signature. An example of a successful summarizer that uses word frequency as a measure of saliency is SumBasic [40]. However, even after removing stopwords, the words that are most relevant for summarizing a document are not just those common in the document, but those that are particularly more frequent in the document than across the entire corpus. Hence, finer measures for scoring words for summarization are Term Frequency - Inverse Document Frequency (TF-IDF) and the log-likelihood ratio.
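To make the frequency-based approach concrete, the following minimal sketch scores sentences by the average document-level probability of their non-stopword words, in the spirit of SumBasic; the tokenizer, the tiny stopword list, and the fixed summary length are simplifying assumptions for illustration, not details of the systems cited above.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "that", "is", "it", "for", "we"}

def tokenize(sentence):
    # Lowercase word tokens, with punctuation and stopwords removed.
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]

def score_sentences(sentences):
    # Word probabilities estimated over the whole document, SumBasic-style.
    counts = Counter(w for s in sentences for w in tokenize(s))
    total = sum(counts.values()) or 1
    prob = {w: c / total for w, c in counts.items()}
    scores = []
    for s in sentences:
        words = tokenize(s)
        # Score a sentence by the average probability of its words.
        scores.append(sum(prob[w] for w in words) / len(words) if words else 0.0)
    return scores

def frequency_summary(sentences, n=3):
    # Select the n highest-scoring sentences, presented in document order.
    scores = score_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]
```

The same skeleton accommodates the finer word scores mentioned above: replacing the raw probabilities with TF-IDF or log-likelihood-ratio weights only changes how the per-word scores are computed.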
Another algorithm for unsupervised content selection is to compute the distance between pairs of sentences based on their terms and select the sentences that are most central, as they are close to many other sentences in the text. More details on these and other approaches to determine sentence saliency can be found in [23]. A problem with using word scores in a sentence to rank sentences is that it may result in selecting several sentences that overlap in content. To counter this redundancy, a popular, more context-aware scoring of sentences is Maximal Marginal Relevance (MMR). MMR seeks to maximize a linear combination of the relevance of a passage to a query and the novelty of the passage, determined by having low similarity to other selected passages [10].

Supervised content selection

In supervised sentence selection, a machine learning model is trained on a set of documents that are labeled for summarization by humans, and the performance of the system can be evaluated by comparing to the human gold-standard summaries on the testing set. Since Kupiec et al. [26]'s summarizer, which used a supervised classifier to determine the relevance of sentences, many summarizers have followed a supervised approach. Unlike unsupervised content selection, this requires additional effort in annotation. Usually, the training part of the corpus needs to have sentences annotated with a binary label of informativeness to summarization. In the case where the corpus contains natural human-written abstracts, e.g., a set of conference proceedings, annotation can be used to map sentences from the abstract to sentences in the text. The classifier uses various features extracted from each sentence, for instance sentence length, position, cue phrases, and word informativeness with respect to the topic signature [23]. The classifier then classifies sentences in unseen documents as relevant or not, and the relevant sentences are concatenated into an extractive summary. Note that this approach doesn't reduce redundancy between the selected sentences, so a post-processing step is needed.

2.3 Evaluation of summarization

For evaluating an automatic summary, one can use extrinsic or intrinsic measures. An extrinsic evaluation would measure how good a summary is for a specific user task, for example whether reading the summary of a document returned as a search result helps the user determine whether it matches their information need. Intrinsic metrics are more task-independent, and because our interest is in creating general as opposed to query-driven summaries, we consider task-independent metrics in the evaluation. A common intrinsic way to evaluate summarization performance is to measure how much the output summary of the automatic summarizer matches the content of one or several human-created gold-standard summaries for the text.

2.3.1 General classification metrics

Statistical measures like precision, recall, and specificity are commonly used to measure the performance of two-class classifier systems given a ground truth. They can be computed from the counts listed in the confusion matrix (Table 2.2). For the counts, each datum is assigned to one of four categories, depending on whether it is positive or negative in the ground truth, and whether the classifier predicted it as positive or negative.

Table 2.2: Confusion matrix

                     actual positive   actual negative
predicted positive   TP                FP
predicted negative   FN                TN
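The sketch below shows how these counts and the metrics defined next can be computed from gold-standard labels and classifier scores; the use of scikit-learn for the area under the ROC curve is an assumed tool choice for illustration, not the procedure of [17].

```python
from sklearn.metrics import roc_auc_score

def confusion_counts(y_true, y_pred):
    # y_true, y_pred: 0/1 labels, where 1 means "sentence selected for the summary".
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_specificity(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return precision, recall, specificity

# A probabilistic classifier outputs scores; thresholding at different values traces
# out the ROC curve, and the area under it (auROC) is threshold-independent.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.7, 0.3, 0.2, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(precision_recall_specificity(y_true, y_pred))   # each is 2/3 on this toy data
print(roc_auc_score(y_true, scores))                   # auROC over all thresholds
```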
Precision

Precision, sometimes also called accuracy, is the fraction of the data predicted positive which are actually positive:

    \text{precision} = \frac{TP}{TP + FP}    (2.1)

Recall

Recall measures, out of the actual positives, how many were correctly predicted as positive:

    \text{recall} = \frac{TP}{TP + FN}    (2.2)

Specificity

Specificity measures, out of the actual negatives, how many were correctly predicted as negative:

    \text{specificity} = \frac{TN}{TN + FP}    (2.3)

Area under the Receiver-Operator Characteristic (auROC)

Receiver-Operator Characteristic (ROC) graphs are plots of recall on the y-axis versus 1-specificity on the x-axis. A classifier that only outputs the positive or negative class on the data corresponds to a single point in ROC space. Probabilistic classifiers, however, output a probability or score. To assign a positive or negative label to a point, its score is compared to a given threshold. Hence, for classifiers such as these, the values of precision, recall, and specificity depend on the threshold chosen. When this threshold is varied between 0 and 1, several points are obtained in ROC space for one classifier. To compare classifiers, we can compare the shape of a curve through the set of ROC points of each classifier, or compute the area under the ROC curve (auROC) to obtain a score for each classifier that is not dependent on the choice of threshold. We compute the auROC as a metric in comparing classifiers following the procedure in [17].

2.3.2 Metrics specific to summarization

Recall-Oriented Understudy for Gisting Evaluation (ROUGE)

Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a commonly used intrinsic evaluation metric for summarization. It was introduced by Lin and Hovy [28], and was inspired by BLEU, an evaluation procedure embraced by the machine translation community and based on n-gram overlap between the candidate summary to evaluate and reference gold-standard summaries. BLEU averages the precision of n-grams of varying lengths. The BLEU formula is listed below, where Count_clip(ngram) refers to the maximum number of co-occurring n-grams between the reference and candidate summaries:

    p_n = \frac{\sum_{S \in CandidateSummaries} \sum_{ngram \in S} Count_{clip}(ngram)}{\sum_{S \in CandidateSummaries} \sum_{ngram \in S} Count(ngram)}    (2.4)

On the other hand, ROUGE fixes the length n of the n-grams. For example, ROUGE-2 involves counting the bigrams that match between the reference and the candidate:

    \text{ROUGE-2} = \frac{\sum_{S \in ReferenceSummaries} \sum_{bigram \in S} Count_{match}(bigram)}{\sum_{S \in ReferenceSummaries} \sum_{bigram \in S} Count(bigram)}    (2.5)

Note that ROUGE measures recall of the n-grams in the reference summaries, whereas BLEU measures the precision of the n-grams reported in the candidate summaries. In the machine translation community BLEU is generally accepted, though for extractive summarization recall is more relevant, i.e., how much information is present in the summary. The field of summarization has the added problem that summaries are quite subjective; summaries created by different people can have little overlap in sentences extracted from the text. Hence, the ROUGE score can be heavily dependent on the sentences selected in the reference.

Pyramid method

Ani Nenkova [3] identifies several problems with using a single human summary and a metric such as ROUGE:

1. different people can choose very different sentences for an extractive summary, hence using a single gold-standard summary is limiting
2. different sentences in the text can be semantically equivalent, and this overlap should be considered when scoring

As a solution, Ani Nenkova [3] proposed the pyramid method for scoring, which assumes several human-generated summaries for each document. The gold standard is composed from a set of human summaries that are merged as a pyramid of Summary Content Units (SCUs). Each summary content unit is a unit of meaning with a weight given by how many of the human summaries it matches. Note that the annotators need not pick specific sentences for an extractive reference summary, since summaries are compared in semantic terms. The pyramid method is a manual evaluation method, where humans label the SCUs in both the reference summaries and the candidate summaries. Ani Nenkova [3] investigated how the number of summaries used affects the score on a set of documents from the Document Understanding Conference 2003 and found that five or more reference summaries are needed for the score of the candidate to be independent of the set of reference summaries chosen.

Weighted recall

Weighted recall is a metric used by Murray [35] for the AMI corpus, where multiple annotators link DAs in a meeting to sentences in a human-written abstractive summary for the meeting, and there can be a many-to-many mapping between DAs and abstract sentences.

    \text{weighted recall} = \frac{\sum_{i=1}^{M} \sum_{j=1}^{N} L(s_i, a_j)}{\sum_{i=1}^{O} \sum_{j=1}^{N} L(s_i, a_j)}    (2.6)

Here, L(s_i, a_j) is the number of links for a sentence s_i in an automatic extractive summary according to annotator a_j, M is the total number of DAs in the automatic summary, N is the number of annotators, and O is the total number of DAs in the meeting.

2.4 Summarization of conversational data

2.4.1 Nature and challenges of conversations

We use conversation to refer to an interaction between several participants who take turns making statements. For our purposes, we process conversational data from text as opposed to from sound input. Both the structure of the conversation and the data available can vary with the modality. For instance, a spoken conversation can include the duration of each word, an email can contain a subject line and direct quotation of another participant, and a Twitter discussion can contain tags or direct replies. Also, depending on the intended audience, the summary may be required to have a specific format: a terse statement of the main issue, a report of how a decision was reached, or a rating of the positive or negative sentiment or of the different opinions expressed about an issue of interest being queried.

There are several differences between conversations and monologue that are relevant to summarization. For one, conversational data is less formal and less coherent, especially in speech discussions, which often include pauses, disfluencies, ungrammatical sentences, and, if transcribed by Automatic Speech Recognition (ASR), high error rates. The structure and turn-taking of conversations can be very different from text documents, and the informativeness of sentences and utterances can differ between participants. To determine topics, structure, and opinions in conversations, some extensions of traditional text techniques are required, and the same is true of summarization. A further obstacle to traditional supervised summarization is that large, openly available corpora are not as common for human conversations as they are for more traditional document types.
For instance, in the news domain, the WSJ corpus has been widely used in research. In this section, we overview previous work in summarizing conversations and particularly previous research in summarizing meetings and emails, as it is relevant to this thesis. We also note that although most conversational summarizers are specific to one domain, our aim is to summarize conversations in multiple modalities.

2.4.2 Meetings summarization

An early summarizer for meetings is Finke et al. [18]'s meeting browser. The browser interface presents the n top sentences from a meeting. These are selected by the system through an adapted MMR algorithm that scores utterances based on lexical features and iteratively adds the most informative utterances to the summary set. Their system was evaluated extrinsically through a small-scale user study. After reading a short summary of a conversation transcript from the Switchboard telephone conversation corpus, users could correctly categorize the conversation into a topic with 92.8% accuracy. In a related task of answering key questions about the information discussed, the study found that user performance depended on the size of the summary, with longer summaries being more helpful. The meeting browser summarizer, inspired by text summarization techniques, gives some evidence for the usefulness of extractive meeting summaries.

Zechner [50] implemented DIASUMM, a speech summarizer, and applied it to several different genres, e.g., phone dialogue, news shows, and project group meetings. Their algorithm is also based on MMR with TF-IDF as term weights, and in addition adds speech-specific components to identify disfluencies, sentence boundaries, and question-answer pairs. To evaluate the system's performance, the average accuracy score for the summary produced was computed by comparing the system output with a gold standard from several human annotators. When run alongside two baselines, one purely based on MMR and the other that simply extracted the beginning of each of the segments in the conversation, DIASUMM significantly improved over the two on the telephone calls and group meetings genres but not on news. This system is hence more appropriate to the specifics of informal conversations.

Murray et al. [36] compared unsupervised approaches to summarization that use MMR and Latent Semantic Analysis (LSA) with supervised approaches using lexical features and prosodic features. In their evaluation, they found that LSA and MMR outperformed the feature-based approaches according to the ROUGE-1 and ROUGE-2 metrics. However, in an extrinsic evaluation of meeting summaries comparing summarizers that use prosodic features with purely textual techniques, Murray et al. [37] found that humans prefer summaries produced by the approach that includes prosodic features, motivating conversation-specific summarization research.

2.4.3 Emails summarization

Emails summarization is also of practical importance, and most research efforts have been extractive. Some work has focused on summarizing single email messages, like that of Muresan et al. [32], who used supervised learning to label the salience of sentences in a message by extracting noun phrases. Relevant to this thesis is the summarization of email threads, which can be viewed as multi-party conversations.
For thread summarization, Rambow et al. [42] used three types of features for sentences: basic features like sentence length that consider the thread as a single document, thread-based features like the position of a message in the thread and of sentences in the message, and email conversation-specific features like the number of recipients. They found that using the email-specific and thread-specific features in addition to the basic features improves the precision, recall, and F-measure of classifying sentences into relevant or not relevant. However, the gold standard they used in computing these metrics also had an effect: the improvement varies depending on which annotator's labeling is used in the scoring, and combining annotations from two labelers actually hurt performance.

Lam's system [27] uses a hybrid of single- and multi-document summarization to create a more useful summary of incoming emails by also summarizing the preceding messages in the same conversational thread. As their target domain is corporate email, the system also extracts entities from the text representing names and dates. An extrinsic evaluation of the system used a user study to investigate the usefulness of the summaries in tasks such as an individual's mailbox cleanup, triage, and calendaring. They concluded that the users did not find the entity extraction useful, and that presenting context from the previous messages improved user satisfaction. However, when including context, the length of the summary increased with the length of the thread, and users sometimes reported the size of the summary to be unmanageable.

Email conversations are less amenable to summarization systems developed for documents, since they are less formal, less cohesive, and differ in structure. For summarizing email conversations, Carenini et al. [11] reconstruct the logical structure of a thread through a fragment quotation graph, where nodes are message fragments and directed edges are identified from explicit quotation or implicit reference between fragments. A clue word score is computed for each sentence, where clue words are words repeated between neighboring fragments in the graph. The top sentences are extracted into the summary, following the intuition that in email conversations, references between fragments and local lexical cohesion are informative for summarization. When comparing the clue word score-based summarization algorithm to a centroid-based multi-document summarization system that considers the global rather than local importance of sentences, for a summary length of 15% of the input, the clue word score improves the mean precision, recall, and F-measure (a metric combining precision and recall).

Wan and McKeown [47] also created a summarizer for ongoing email discussion. They made a number of assumptions about the input to their system: that the discussion is about making a decision, that the thread is focused on one main task and does not veer off into other issues, and that the main issue is present in the first message of the thread. Their work is hence focused on determining the main issue in a thread through Singular Value Decomposition (SVD) and word vector techniques. The replies to the first message are used to compute a comparison vector, and then the sentence in the initial message closest in cosine similarity to the vector is selected as the main issue sentence. Following the main sentence in the initial email, the first sentence in each of the replies is added to the summary.
In an evaluation of whether the issue identified matches a gold standard, they showed that the issue found by the summarizer beats the precision and recall of a baseline of selecting the first sentence of the initial message. However, Wan and McKeown's system is limited in its applicability to non-decision-making email, and limited by its assumptions.

2.4.4 Cross-modality summarization

Murray and Carenini [33] pave the way to cross-domain conversation summarization by considering a common view of both meetings and emails as conversations comprised of turns between multiple participants. For the extractive supervised summarizer ConverSumm, Murray and Carenini represent both types of data with a set of general conversational features for the purpose of supervised summarization. The features for each sentence take into account the specificity of the terms used to the current turn and participant, the length and position of the sentence in the turn, and the context of the conversation before and after the sentence. Without making any assumptions specific to emails or meetings, they achieve performance competitive with modality-specific systems. When ConverSumm was applied to meetings, its auROC and weighted F-measure scores were not significantly different from a system that uses prosodic and meeting-specific features [38]. On emails, ConverSumm reaches a significantly better auROC of 0.75 than the Rambow system's 0.64, though pyramid precision scores are not statistically different according to a paired t-test.

2.5 Conclusion

Summarizing conversations poses more challenges than general text summarization. Although less well studied than for monologue, summarization approaches specific to one type of conversation have been researched, and most have been extractive. However, for portability across types of discussion and to new modalities of conversations that are arising on the web, domain-independent summarization is an important avenue of research.

Our approach to summarization for the rest of the thesis is to create an extractive summary of a single document representing a human conversation. We take the supervised approach: a conversation is divided into sentences, each of which has a set of extracted features and an output label indicating its relevance to a summary, and machine learning techniques are used to train on the data and select sentences for inclusion in the summary. To evaluate the summaries generated, we compute the auROC of the classifier output versus the gold-standard annotation by humans.

Chapter 3

Domain Adaptation Background

In this chapter, we survey previous work in domain adaptation. We first describe the domain adaptation problem and its supervised versus unsupervised and semi-supervised variants. We then present specific domain adaptation approaches and discuss their performance in various application domains. Our specific implementation of domain adaptation methods for our summarization task is described in Chapter 4.

3.1 Domain adaptation problem

Domain adaptation is necessary when the data available for training in a target domain is not sufficient for satisfactory performance, but there is plenty of data from a source domain, and the source and target domains have related but different distributions. The goal of domain adaptation is to integrate the available out-of-domain data with some target domain-specific information, whether it be labeled or unlabeled target data.
Through domain adaptation, we hope to overcome the difference in distribution between the two domains and improve on training directly on the available training data. We define the different types of data following the notation in [22]:

- labeled source data D^{s,l} = \{(x_i^{s,l}, y_i^{s,l})\}_{i=1}^{N_{s,l}}
- labeled target data D^{t,l} = \{(x_i^{t,l}, y_i^{t,l})\}_{i=1}^{N_{t,l}}
- unlabeled target data D^{t,u} = \{x_i^{t,u}\}_{i=1}^{N_{t,u}}
- unlabeled source data D^{s,u} = \{x_i^{s,u}\}_{i=1}^{N_{s,u}}

Depending on the labeled and unlabeled data available, there are three different domain adaptation scenarios:

- supervised case: D^{s,l} and D^{t,l} available
- unsupervised case: D^{t,u} and D^{s,l} available, also possibly D^{s,u}
- semi-supervised case: D^{t,l}, D^{t,u}, and D^{s,l} available, also possibly D^{s,u}

Initial domain adaptation methods were proposed for the supervised case, and many have been successful. The unsupervised scenario is common since new domains often have only unlabeled data available, but it is more difficult to obtain good performance without labeled target data. When a small amount of data is labeled in the target domain, semi-supervised domain adaptation can be done and is possibly more effective than in the unsupervised scenario. Domain adaptation methods can roughly be categorized into instance-based algorithms, which determine how important the labeled instances are in making predictions on the test data, and feature-based algorithms, which modify the feature space with derived features to transfer information between domains.

3.2 Supervised methods

Many domain adaptation methods have been proposed for the supervised case, where an amount of labeled data in the source domain is used together with a usually smaller amount of labeled data in the target domain. The effectiveness of supervised domain adaptation methods can vary with the similarity between the two domains, the task and features used to represent the data, and the relative amounts of source and target labeled data.

A baseline for the supervised approach is not to perform domain adaptation at all and train only on the in-domain data from the target domain. We will call this baseline indomain. When there is enough labeled data from the same distribution as the target data to give good performance on the task at hand, training in-domain is hard to beat by using out-of-domain data. For the supervised scenario, this thesis will explore whether domain adaptation methods leveraging the source labeled data can improve over training in-domain.

3.2.1 Instance weighting

Jiang and Zhai's instance weighting framework [21] is an instance-based domain adaptation approach which integrates three intuitions: that misleading labeled source instances should be removed, that target data should be weighted more than source data, and that target unlabeled data can be labeled with a classifier trained on the source and added to the labeled training set. The domain adaptation problem is modeled as a complex objective function with multiple parts, in order to optimize the cost assigned to each different instance (refer to [21] for the formula). The objective contains several hyperparameters for trading off the contributions of the different data, and these are set heuristically.
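As a minimal illustration of the second intuition (weighting target instances more heavily than source instances), the sketch below pools the two labeled sets and gives each target instance a times the weight of a source instance; the scikit-learn logistic regression and the default a = 5 are illustrative assumptions, not the actual setup of [21].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_weighted(X_source, y_source, X_target, y_target, a=5.0):
    # Pool the labeled source and target data, but make each target instance
    # count a times as much as a source instance when fitting the classifier.
    X = np.vstack([X_source, X_target])
    y = np.concatenate([y_source, y_target])
    weights = np.concatenate([np.ones(len(y_source)), a * np.ones(len(y_target))])
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y, sample_weight=weights)
    return model
```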
The instance weighting framework was applied to different tasks [21]:

- Part of Speech (POS) tagging, with the source domain as WSJ text from the Penn Treebank and biomedical text as the target, specifically the Oncology section of PennBioIE
- Named Entity Type classification, with the source domain as news and the targets as blogs and telephone conversations (note that this is the same data as used by Daumé III and Marcu [15])
- personalized spam filtering, using the KDD 2006 challenge email data set with general training emails as the source and specific users' mailboxes as the targets

To implement the first intuition of removing misleading labeled source instances, in [21] a classifier is trained on the labeled target data and tested on the source, and the source instances that are misclassified relative to their true label are ranked in increasing order of prediction confidence in order to discard the top k from the training set. To test this strategy, one experiment varied the number k of misclassified source instances to be removed, without including any labeled target data. The maximum k, which removes all the misclassified instances, yielded the largest improvement, except in Named Entity Type classification on weblogs, where it actually decreased accuracy (refer to Table 1 in [21]).

For the second intuition of setting the relative contribution of data from the two domains, Jiang and Zhai [21] weight the target data more than the source data by a factor of a, and report accuracy results for several different values of a. In almost all their results, a = 5 yields accuracies about one percentage point higher than a = 1, and the marginal improvements for increasing the contribution of the target beyond a = 5 are small (refer to Table 2 in [21]).

For the third intuition of self-training to assign labels to some of the unlabeled target data and add it to the training set, experiments by Jiang and Zhai [21] found that self-training improves accuracy when no labeled target data is used, but not when it is used. They also tried assigning greater weight to the just-labeled target instances than to the labeled source instances, although this was found ineffective.

The algorithm is also highly parameterized and requires tuning. We did not implement instance weighting, in part because [8] claims that instance weighting works best for feature spaces that have few dimensions, whereas we deal with high-dimensional feature spaces, and most of the features we use are discrete.

3.2.2 Adding prediction as an input feature

A simple feature-based domain adaptation method called pred is to train a predictor on the source data, run it on the target data, and then use its predictions on each instance as additional features for a target-trained model. An initial successful use of this technique was introduced by Florian et al. [19], who used the outputs of classifiers among the input features to a multilingual named entity detection and tracking system. With most training data from English, and little in-domain data from Arabic and Chinese, their implementation performed well for all three languages when compared to others in the Automatic Content Extraction challenge. A similar approach was used by [7] to augment labeled target data for sentiment analysis with predictions from an unsupervised Structural Correspondence Learning classifier. We apply pred for our task since it is a natural way of injecting information into the target by simply adding a feature to the target data; a minimal sketch of the procedure follows below.
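In this sketch of pred, the source-trained model's positive-class probability is appended as one extra feature before the target model is trained; the choice of logistic regression and of using a probability rather than a hard label are assumptions for illustration, not the exact configuration used later in the thesis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pred_adapt(X_source, y_source, X_target, y_target, X_test):
    # 1. Train a classifier on the labeled source data.
    src_model = LogisticRegression(max_iter=1000).fit(X_source, y_source)

    # 2. Append the source model's positive-class probability as an extra feature.
    def augment(X):
        p = src_model.predict_proba(X)[:, 1].reshape(-1, 1)
        return np.hstack([X, p])

    # 3. Train the target model on the augmented labeled target data.
    tgt_model = LogisticRegression(max_iter=1000).fit(augment(X_target), y_target)

    # 4. Score unseen target-domain sentences with the augmented target model.
    return tgt_model.predict_proba(augment(X_test))[:, 1]
```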
We also expect it to do at least as well as training in-domain, although when adding one feature to data with fewer training instances than features, the extra feature may not influence the learned model significantly.

3.2.3  Using source data for a prior

Another relatively old approach is to use the source data as a prior for the model learned on the target data. Algorithms integrating such priors have been helpful in adaptation for language modeling and parsing ([4], [43]). To illustrate a specific application using a prior, in [12] the source data is used to find optimal parameter values of a maximum entropy model on that data. These values are then set as the means of the Gaussian prior of the model that is trained on the target data. Chelba and Acero [12]'s task is automatic capitalization, which they treat as a sequence labeling problem and for which they implement a Maximum Entropy Markov Model (MEMM) capitalizer. The MEMM is originally trained on out-of-domain Wall Street Journal (WSJ) data. The two target data sets are Broadcast News data, one from CNN and one from ABC. Results comparing the unadapted MEMM with domain adaptation using labeled data from either the target news domain or the other news domain, which we gleaned from [12], are listed in Table 3.1. Using a prior for domain adaptation improves capitalization over simply using a WSJ-trained model with no adaptation, for a relative reduction in capitalization error of 20% to 25% when the model is adapted on the proper in-domain training set.

Table 3.1: Capitalization error by MEMM with and without in-domain training data

Target training data \ Test data    ABC    CNN
none                                1.8    2.2
ABC                                 1.4    1.7
CNN                                 2.4    1.4

3.2.4  Maximum entropy genre adaptation model

Daume's MEGAM model is more theoretically founded than Chelba's prior model. Similar to the prior method, the MEGAM model [15] trains a discriminative MEMM. It generalizes the maximum entropy model into a Maximum Entropy Genre Adaptation Model (MEGAM), adding indicator variables for whether an instance is generated by a source, target, or general distribution as hyperparameters in the model. When trained on labeled data from the source and target, MEGAM learns through these hyperparameters which instances to consider general and which domain-specific. Unlike prior, MEGAM treats the source and target domains symmetrically. The optimal values of the model parameters are learned through conditional expectation maximization, an algorithm that alternates expectation and maximization steps over several iterations. The trained model is then applied to the test data, using the mixture of the general and target models learned. Because each iteration involves maximum entropy optimization to learn the parameters, the model is expensive to train. Daume tested MEGAM on three tasks (see [15]): Mention Type classification (MType), Mention Tagging (MTag), and the text recapitalization task (Recap) to which Chelba and Acero had applied their prior model. In Table 3.2, we reproduce accuracy results obtained in [15] to compare domain adaptation with MEGAM, prior, and the baseline merge of simply combining the source and target data sets. MEGAM outperforms prior and the baseline on all three tasks.

Table 3.2: Results for adaptation with MEGAM

% accuracy    MType    MTag    Recap(ABC)    Recap(CNN)    average
merge          84.9    80.9       96.4          95.0        89.3
Prior          87.9    85.1       97.9          95.9        91.7
MEGAM          92.1    88.2       98.1          96.8        93.9
One drawback of using MEGAM for domain adaptation is that the model is more complicated and time-intensive than competing domain adaptation methods like easyadapt.

3.2.5  Feature copying method (easyadapt)

A simpler method, easyadapt, achieves performance similar to prior and MEGAM and was successfully applied to a variety of Natural Language Processing (NLP) sequence labeling problems, such as named entity recognition, shallow parsing, and POS tagging. easyadapt augments the feature space by making three versions of each original feature: a general version, a source-specific version, and a target-specific version. For instances from the source data, the original feature values are copied into the general and source-specific versions of the features, whereas the target-specific versions are set to 0; and conversely for the target data. The mapping function Φ(x) from the original features to the easyadapt feature space is the following:

Φ(x) = ⟨x, x, 0⟩  if x is from the source domain
Φ(x) = ⟨x, 0, x⟩  if x is from the target domain

where the three blocks hold the general, source-specific, and target-specific copies of the features, respectively. Daume found easyadapt to be much easier to train than MEGAM and prior while achieving similar performance [16]. An interesting conclusion of their experiments is that this method works well when the domains can be distinguished by the examined features, and less well for very similar domains, e.g., between subdomains of the Brown Corpus. In applications of domain adaptation, easyadapt works best when the labeled data available in the target is sufficient on its own to learn a good model [8], and hence we expect it to work well in the supervised scenario with labeled data in both the source and the target.

3.3  Unsupervised and semi-supervised methods

3.3.1  Baseline transfer

We will call training only on the labeled source data, without using any information from the target data, transfer. Transfer learning works best when the target domain is very similar to the source domain. Past research has shown a linear correlation between the loss due to transfer and the degree of difference between the two domains [8].

3.3.2  Self-training, co-training, and boosting methods

Boosting

Boosting is a method for taking classifiers that perform weakly but better than random guessing, and finding a way to combine them into a strong classifier. The boosting algorithm AdaBoost [20], which iteratively adapts the learner to improve performance on hard instances in the training set, has been shown to be effective, though its performance depends on the input weak learner and the training data. TrAdaBoost [13] is an adaptation of AdaBoost to transfer learning: given a small amount of data with the same distribution as the target and a large amount of different-distribution data, it iteratively sets the contributions of the training data so as to give higher weights to data whose distribution is similar to the target. In the current thesis we instead use separate classifiers and directly combine their predictions on the test set. For future work, iterative optimization of the learner over training instances seems promising for domain adaptation.

Co-training

Co-training was first introduced in NLP by [9] and applied to classifying web pages into categories based on the text of the page and on the anchor text description of links pointing to the web page.
When the data can be represented as such by two different views or sets of features and the two views are conditionally independent of each other given the class, each of the views can be used to train a weak classifier on a small labeled set of data and bootstrap by labeling points that they are most confident about and giving them as labeled input to the other classifier. To effectively leverage the unlabeled data, co-training assumes consistency between the two views, so that for any given instance the two classifiers based on different feature sets would agree on the output label. For their experiments, Blum and Mitchell [9] use a set of web pages from computer science departments, handlabel them into categories, and analyze prediction only one of the target labels corresponding to the homepage of a course. They compare co-training on a small number of labeled instances and a larger number of unlabeled instances with the baseline of supervised training on the small labeled set. The two classifiers are  27  Figure 3.1: Co-training pseudocode  Naive Bayes classifiers, one based on the bag of words representation of the words on the page, and the other based on the words in the anchor text. Once self-trained, the two classifiers are evaluated individually on the test set and also their combination is used by multiplying the probabilistic predictions of each one. For this limited experiment, the authors find that the co-trained classifiers beat their supervised counterparts, and the combined classifier reduces the error rate from 11.1% to 5% from using co-training. Figure 3.1 shows the co-training algorithm as it appears in [1]. Variations are possible, and one of interest to us is that Blum and Mitchell [9] found that making predictions on a pool P, a small subset of the unlabeled data U, gave better accuracy than making predictions on the full set U. Nigam and Ghani [41] compare co-training with the Expectation-Maximization algorithm (EM), listing the assumptions that each of the two algorithms makes on the data, and how robust they both are to these assumptions being violated. Different variations of the algorithms are compared: • Co-training, which uses two classifiers, one per feature set, incrementally adds the unlabeled instance with the highest confidence prediction to the labeled set before retraining the classifiers on the augmented set. • EM, which doesn’t use a feature split and iteratively makes predictions on the entire unlabeled set to use them in training a model for the next iteration. • Self-training, which is incremental like co-training, though without splitting 28  the feature set. • Co-EM, which is a variant of EM that uses two classifiers each trained on a separate set of features. Nigam and Ghani [41] concluded that co-training works best when the feature space can be split into two subsets of features that are independent given the class, and when this is not the case, self-training is more appropriate. Even when the split is not known a priori, manufacturing a split of the features is beneficial, though less than a natural split of features in two sets. The authors also argue in favour of cotraining and self-training that they are less prone than EM to get stuck in local minima because of the incremental approach. Self-training Self-training is similar to co-training and simpler since a single classifier is used. The most cited self-training algorithm is the one due to [49]. Figure 3.2 lists pseudocode for this algorithm as it appears in [1]. 
This algorithm is iterative: on a set of labeled data L0 , a model c is trained and its predictions on unlabeled data are added to the labeled set L before repeating the process. The stopping criterion can vary, for instance the algorithm can be run for a fixed number of iterations or until convergence. Similarly, the number of hard labels created at each iteration can be either fixed, include a set proportion of labels of each class, or depend on the confidence of the predictions. For the version of self-training we implemented, refer to Section 4.5. Figure 3.2: Self-training pseudocode  29  In [30] and [31] McClosky applies self-training to parsing in a novel way, since previous efforts found self-training to be unsuccessful for parsing. Their goal is to train a parser for the target domain of Brown using only unlabeled target data and labeled (WSJ) and unlabeled (NANC) data from the source domain of news articles. In an algorithm using the two techniques of parse reranking and selftraining, they improved performance of the standard Charniak parser trained on Wall Street Journal and tested on Brown corpus data. Instead of just using the parser output to add to the labeled set for training the next iteration of self-training, the reranker reorders the candidate parses produced by the parser for picking the set of best sentences to add. Their major finding is that a parser self-trained on the WSJ  and the NANC data, without training on Brown, performs almost as well as a  Brown-only trained model. They found that it was helpful to weight the original source labeled data more than the self-trained labeled data in learning. In their analysis to determine when self-training worked best, they compared self-trained parsed sentences to optimal parses, and found that the predictions were better on medium-length sentences than short or long sentences. They also concluded that the reranking is responsible for most of the improvement of their algorithm.  3.3.3  Structural correspondence learning (SCL)  Structural Correspondence Learning (SCL) is a feature-based approach of leveraging unlabeled data for domain adaptation which is both theoretically sound [8] and has been successfully applied to several NLP problems. SCL is a complex algorithm, and as we have found its subtleties to be important in our application, we will discuss it in detail. It is inspired by Ando and Zhang [2]’s Alternating Structure Optimization (ASO), a semi-supervised learning algorithm, and extends it to learning across domains. The problem that Structural Correspondence Learning aims to solve is that many of the features in the source domain that are useful in supervised learning in that domain, may be expressed differently in the target domain and therefore misleading to a classifier for the target. SCL learns a correspondence θ and then applies it to the labeled source data to obtain a feature representation more effective for learning across domains than the original features. One important assumption of SCL is that of so-called pivot features: features that  30  are frequent, expressed similarly in the two domains, and can be correlated with the more domain-dependent rest of the features through unlabeled data. The pivots hence bring together the feature spaces of the source and target. The Structural Correspondence Learning algorithm is most effective in a sparse, high-dimensional feature space with plenty of unlabeled data for detecting correlations. 
The steps of the Structural Correspondence Learning algorithm are outlined in Algorithm 1. The first step, the choice of m pivot features, is very important since the pivot features must be both predictive of the label and correspond with features that are similar between the two domains. In the next step, for each pivot feature, a model is trained to predict it from the rest of the features using unlabeled data from the two domains. The weight vectors learned in each of these models are concatenated into a correlation matrix W. The top k left singular vectors of the SVD of W yield the projection matrix θ. θ can then be applied to the source and target data to project it into a lower-dimensional feature space. The new representation of the training data, made up of the original feature values x and the SCL representation θx, is then used in supervised learning. We note that these SCL features derived from unlabeled data can be integrated with other domain adaptation methods.

Algorithm 1  Structural Correspondence Learning
Inputs: labeled source data {(x_t, y_t)}_{t=1}^{T}, unlabeled data from both domains {x_j}
1. Choose m pivot features.
2. Create m binary prediction problems p_l(x), l = 1...m
   for l = 1 to m do
      ŵ_l = argmin_w ( Σ_j L(w · x_j, p_l(x_j)) + λ ||w||² )
   end for
3. W = [ŵ_1 | ... | ŵ_m],  [U D V^T] = SVD(W),  θ = U^T_{[1:k,:]}
4. Return f, a predictor trained on {((x_t, θx_t), y_t)}_{t=1}^{T}

Structural Correspondence Learning has been effective for domain adaptation in several applications. Blitzer et al. [6] used SCL to adapt a POS tagger from WSJ financial news, with a large amount of data (40,000 labeled and 100,000 unlabeled sentences), to biomedical MEDLINE abstracts, using 200,000 unlabeled target sentences. Results obtained in Blitzer et al. [6], which we list in Table 3.3, show improvements of SCL for POS tagging over a supervised baseline using only the source labeled data, and over ASO, which makes use of the unlabeled target data but not the unlabeled source.

Table 3.3: POS tagging results with SCL

model                  accuracy
supervised baseline    87.9
ASO                    88.4
SCL                    88.9

Blitzer et al. [7] also applied SCL to sentiment analysis, to adapt from reviews of one type of product to another. Their data consists of product reviews crawled from Amazon for four domains of products: books, DVDs, electronics, and kitchen appliances. The data set contains 2,000 labeled and 3,000 to 6,000 unlabeled reviews from each domain. The review label was set to positive if its human star rating is greater than 3, to negative if the rating is less than 3, and the set was balanced to contain similar numbers of positive and negative reviews. See the results comparing the baseline and different variants of the algorithm in Table 3.4, taken from [8]. Blitzer et al. [7] compared using feature frequency versus mutual information with the label when selecting the m pivot features. Although for part-of-speech tagging choosing the m most frequent features was sufficient, for sentiment analysis mutual information was found to reduce error; refer to the column scl-f, which uses frequency, versus scl, which uses mutual information, in Table 3.4. Beyond the method of selecting pivots, other parameters to be determined in SCL are the number of pivot features m, the truncation factor k which determines the number of projected SCL features, and how to combine the original and projected features. Since the projected features are derived from the more domain-specific initial features, we would expect them to be more important in domain adaptation than the original features.
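A condensed sketch of steps 2 and 3 of Algorithm 1, assuming dense NumPy arrays and scikit-learn linear models (a practical implementation would work with sparse matrices; the pivot predictors here use a modified Huber loss, and each pivot is assumed to occur in some but not all of the unlabeled instances):

import numpy as np
from sklearn.linear_model import SGDClassifier

def scl_theta(X_unlabeled, pivot_idx, k=5):
    # Step 2: for each pivot, learn a linear predictor of its presence
    # from the remaining features, using unlabeled data from both domains.
    X_rest = X_unlabeled.copy()
    X_rest[:, pivot_idx] = 0          # mask pivots so they cannot predict themselves
    columns = []
    for p in pivot_idx:
        target = (X_unlabeled[:, p] > 0).astype(int)
        clf = SGDClassifier(loss="modified_huber", alpha=1e-4, max_iter=20)
        clf.fit(X_rest, target)
        columns.append(clf.coef_.ravel())
    # Step 3: stack the weight vectors and keep the top k left singular vectors.
    W = np.column_stack(columns)      # one column per pivot predictor
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    theta = U[:, :k].T                # k x d projection matrix
    return theta

# Each training instance x is then represented as the concatenation of x and theta @ x.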
Indeed, when [8] used twice as much unlabeled data to find more meaningful correspondences (refer to the last column, scl-only, in Table 3.4), they got a much smaller error than for any of the other variants of SCL. This shows that the projected features are actually useful without the large set of raw input features.

Table 3.4: Sentiment analysis results with SCL from [8]

domain \ model    base    scl-f    scl    scl+target    scl-only
books              8.9     7.4     5.8       4.4          1.8
dvd                8.9     7.8     6.1       5.3          3.8
electronics        8.3     6.0     5.5       4.8          1.3
kitchen           10.2     7.0     5.6       5.1          3.9
average            9.1     7.1     5.8       4.9          2.7

Although SCL was intended to be applied in the unsupervised scenario where there is only unlabeled training data for the target domain, injecting some labeling information in the target can be necessary to correct an issue with the correspondences learned, in that the projected features (θx_i) sometimes misalign features from the two domains. Blitzer et al. [7] proposed a modified algorithm that also uses a small number of labeled target instances to adjust the weights of the projected features in the final model. After this modification, in all cases the semi-supervised SCL with mutual information and 50 labeled target instances (column scl+target in Table 3.4) reduces the error compared to the supervised baseline and to SCL with mutual information.

3.3.4  Easyadapt with unlabeled data

As previously mentioned, Daume's easyadapt is a domain adaptation approach that is simple to implement as a pre-processing step and is not dependent on any particular classifier or application. However, it only uses labeled data from the source and target, and not the commonly more plentiful unlabeled data. A semi-supervised extension of the feature copying method, easyadapt++, proposed in [14], uses unlabeled data to co-regularize the source and target by making the predictions of the source and target hypotheses agree on unlabeled data. Similarly to easyadapt, easyadapt++ maps the input data by copying features, though it adds a new mapping for unlabeled data. The mapping function Φ(x) from the original features is:

Φ(x) = ⟨x, x, 0⟩   if x ∈ labeled source
Φ(x) = ⟨x, 0, x⟩   if x ∈ labeled target
Φ(x) = ⟨0, x, −x⟩  if x ∈ unlabeled

In practice, [14] creates for each unlabeled instance two augmented instances, both with the features given by Φ(x), where one is assigned a positive label and the other a negative label. Kumar et al. [25] provide both a theoretical proof that easyadapt++ generalizes well from source to target domains and results showing improvement on sentiment analysis of product reviews. However, they offer no direct comparison to the SCL results on that data [7]. Similarly to Blitzer [8], they also measure the proxy A-distance between the different types of products as a measure of how far apart the domains are. In [25], the error rates of easyadapt, easyadapt++, transfer (training on a plentiful amount of source data), training on the same amount of target data (as a gold standard), training on a small amount of target data, and training on the combination of the small amount of target data with the source are compared. For adapting from DVDs to books, which are distant domains given an A-distance of 0.7616, easyadapt++ outperforms all but the gold standard. For adapting from kitchen to apparel, domains that are closer (A-distance of 0.0459), easyadapt++ outperforms all other methods. These results show promise for this method when the target domain has both labeled and unlabeled data available.
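Both mappings are simple to implement as a pre-processing step. A minimal sketch with dense NumPy arrays (sparse matrices would be used in practice; the block order general / source-specific / target-specific is one convention, and any fixed order works):

import numpy as np

def easyadapt(X, domain):
    # Feature copying of Section 3.2.5: blocks are [general, source-specific, target-specific].
    Z = np.zeros_like(X)
    if domain == "source":
        return np.hstack([X, X, Z])
    if domain == "target":
        return np.hstack([X, Z, X])
    raise ValueError(domain)

def easyadapt_pp_unlabeled(X_unlabeled):
    # easyadapt++ mapping for unlabeled target data: <0, x, -x>, with each
    # instance duplicated once with a positive and once with a negative label.
    Z = np.zeros_like(X_unlabeled)
    A = np.hstack([Z, X_unlabeled, -X_unlabeled])
    X_aug = np.vstack([A, A])
    y_aug = np.concatenate([np.ones(len(A)), -np.ones(len(A))])
    return X_aug, y_aug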
3.4  Summary  The goal of domain adaptation can be summed up as to bridge the gap when the training data and test data come from domains with different distributions. Many domain adaptation approaches have been proposed for the supervised case where there is labeled data in the target. We implement and test pred and easyadapt for our problem. For the scenario with less supervision, where the data in the target is mostly unlabeled, effective domain adaptation requires more complex algorithms. For unsupervised domain adaptation, we implement and test self-training, Structural Correspondence Learning, and easyadapt++.  34  Chapter 4  Extractive Summarization with Domain Adaptation In our approach to summarization, the source domain is meetings transcripts and the target domain is email threads. In this chapter, we present the data sets used, the features we extracted and the classifier used, and the different domain adaptation methods as we implemented them. For the source, we use large set of labeled meetings data, the AMI corpus. We also have available a smaller set of labeled email data, the BC3 corpus, which we use for the target domain. In the supervised scenario, we assume that we have part of the BC3 corpus as labeled training data for summarization along with a larger labeled set of data from AMI. In the unsupervised scenario, we restrict the labeled training data to only meetings domain sentences and try to leverage unlabeled email domain data in the form of the W3C corpus, an unlabeled superset of BC3. In the semi-supervised scenario, we use a combination of some labeled email data with unlabeled data in addition to the outof-domain meetings data. After we describe the domain adaptation algorithms we implemented, we will present a comparison of the experimental results for each of the scenarios and with different feature sets in Chapter 5.  35  4.1 4.1.1  Data AMI corpus  The AMI corpus is an artificial corpus of meetings data, and we are interested in the scenario subset of corpus. In this portion, to generate data similar to corporate meetings, people in groups of four simulated meetings in which each was assigned a role in a company. The dataset contains approximately 115,000 DAS segments. The dataset contains both manual and ASR transcripts, though we only use the manual data as it is higher in transcription accuracy. Here is a description of the annotation of AMI for summarization from previous work at of our group [33]: For the AMI corpus, annotators wrote abstract summaries of each meeting and extracted transcript DA segments that best conveyed or supported the information in the abstracts. A many-to-many mapping between transcript DAs and sentences from the human abstract was obtained for each annotator, with three annotators assigned to each meeting. It is possible for a DA to be extracted by an annotator but not linked to the abstract, but for training our binary classifiers, we simply consider a DA to be a positive example if it is linked to a given human summary, and a negative example otherwise. This is done to maximize the likelihood that a data point labeled as “extractive” is truly an informative example for training purposes. Approximately 13% of the total DAs are ultimately labeled as positive, extractive examples. This meetings data is a valuable set of, even if not naturally-ocurring, realistic multi-party conversations. It is also large enough to possibly support domain adaptation for learners for domains other than corporate meetings transcripts.  
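The resulting labeling rule is simple to state; a sketch, with hypothetical names for the annotation structures (each link pairs a dialogue act with a sentence of an annotator's abstract summary):

def extractive_labels(dialogue_act_ids, links):
    # links: iterable of (da_id, abstract_sentence_id) pairs for a human summary.
    # A DA linked to any abstract sentence is a positive (extractive) example.
    linked = {da_id for da_id, _ in links}
    return {da_id: int(da_id in linked) for da_id in dialogue_act_ids}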
4.1.2  W3C corpus  The W3C corpus is data crawled from the WorldWideWeb Consortium’s mailing list (w3c.org). Among several different types of data, it contains a mailing list portion with over 50,000 email threads. This is a very sizeable set of conversational data, though the topics discussed are in the technical domain and hence a 36  summarizer for general conversational data may not be directly applicable.  4.1.3  BC3 corpus  BC3 is a subset of 40 threads, totaling 3,222 sentences, from the W3C corpus. It was annotated for summarization in [46]. W3C emails are quite technical, and even segmenting them into sentences was a challenge and had to be done manually for BC3. The threads to annotate were selected to be less technical than average so that they would be amenable to annotation by non-domain experts, and also they were selected to be have a non-trivial conversational structure; as a result, in the BC3 corpus the average number of participants per thread is 6, and the average size of a thread is 11 emails. BC3 was labeled by humans similarly to the annotation procedure in AMI: each annotator wrote an abstractive summary of each thread, then linked sentences from the summary to sentences form the thread that correspond in content. Hence, each sentence extracted from the thread can be weighed by the number of times it is linked to an abstract sentence. BC3 sentences also were annotated for whether they are meta comments, meaning if they refer to the email conversation itself, and by their speech act: propose, request, commit, agreement/disagreement, and meeting.  4.1.4  Enron corpus  The Enron email corpus is a corpus of email data from a different source of emails than the W3C mailing list. It was released as part of the legal investigation into Enron, and it soon became a popular corpus for NLP research due to being realistic, naturally-occurring data similar to email conversations within a corporation. A subset of Enron of 39 threads also has been labeled for summarization, though we chose BC3 as the target set of emails for domain adaptation.  4.2  Features  The classification of sentences into informative or non-informative is based on the values of the features extracted from each sentence. We consider two sets of features: a small set of features relating to conversational structure, and a larger set of raw lexico-syntactic features. We investigate using them separately and together. 37  4.2.1  Conversational features  We use a set of 24 sophisticated conversational features from both the email and meetings domain. They were designed to be modality-independent and to model attributes of conversational structure common across domains, so we hypothesize that they will be useful in domain adaptation. In Table 4.1 we list the conversational features proposed by Murray and Carenini [33], and we refer the reader to their work for a detailed description. Using these features for in-domain extractive summarization was found to be competitive with domain-specific approaches in [33], and also useful in abstractive summarization in [34]. To derive these features from emails and meetings, as in [33] we treat both sets of data in a similar way: as a succession of turns, each of which includes statements by a particular contributor. A turn in an email thread corresponds to an email by one sender, and each turn can be subdivided into sentences. A turn in a meeting is an uninterrupted set of statements by one participant, and can be subdivided into DAS. 
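A minimal sketch of this shared representation (hypothetical class names; only the fields needed by the features described next are shown):

from dataclasses import dataclass
from typing import List

@dataclass
class Sentence:
    text: str                 # a sentence of an email, or a dialogue act of a meeting

@dataclass
class Turn:
    participant: str          # email sender, or meeting speaker
    timestamp: float          # send time of the email, or start time of the turn
    sentences: List[Sentence]

@dataclass
class Conversation:
    turns: List[Turn]         # an email thread, or a meeting transcript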
The set of features derived for each of the sentences includes length, position in the turn and in the conversation, time span between turns, similarity between sentences and lexical, and specificity of terms to different turns and different participants.  4.2.2  Lexico-syntactic features  We derive an extensive set of raw lexical and syntactic features from the AMI and BC3 data sets, and then we compute their occurrence in the Enron corpus. After removing rare features, i.e., those with less than 5 occurrences, a considerable set of approximately 200,000 features remain. The features derived are basic word and POS features, which were introduced in [39]. We list those same feature types inTable 4.2. For each of these feature types, we extract all those present in the corpora. Each unique element that was retained is assigned a binary feature indicating its presence or absence in a given sentence. This set of features was successfully used in the first step of interpretation of the abstractive summarizer of [39] in mapping sentences to relations in an ontology involving a conversation participant and an entity such as decisions, action items, and subjective sentences, and also to generally important sentences. The authors attribute the high auROC values attained (from around 0.75 to 0.93 for meetings,  38  Table 4.1: Conversational features as proposed in [33] Feature ID SLEN SLEN2 TLOC CLOC TPOS1 TPOS2 PPAU SPAU DOM BEGAUTH MXS MNS SMS MXT MNT SMT COS1 COS2 CENT1 CENT2 PENT SENT THISENT CWS  Description word count, globally normalized word count, locally normalized position in turn position in conversation time from beginning of conversation to turn time from turn to end of conversation time between current and prior turn time between current and next turn participant dominance in words is first participant (0/1) max Speaker specificity score mean Speaker specificity score sum of Speaker specificity scores max Turn specificity score mean Turn specificity score sum of Turn specificity scores cosine of conversation splits, with Speaker specificity cosine of conversation splits, with Turn specificity cosine of sentence and conversation, with Speaker specificity cosine of sentence and conversation, with Turn specificity entropy of conversation up to sentence entropy of conversation after the sentence entropy of current sentence rough ClueWordScore  and a lower 0.75 for emails) in classifying sentences into any of these categories to the large feature set given to the classifiers, so we are encouraged to also try applying this feature set. In [34], this set of features was used in subjectivity detection in email and meeting conversations. The authors compared different types of features across four tasks: subjective utterance detection, subjective question and utterance detection, classification of positive-subjective utterances, and classification of negativesubjective utterances. They found that the set of conversational features supplemented by these lexico-syntactic features gave the best results. 
In some cases, such as performing the tasks on ASR transcripts of meetings and classifying positive-subjective statements in email, the conversational features alone gave similar performance to using the two sets of features merged.

Table 4.2: Lexical feature types

Feature type          generic form     description
Character trigrams    c1c2c3           triplets of consecutive characters
Word bigrams          w1,w2            pairs of consecutive words
POS bigrams           p1,p2            pairs of consecutive POS tags
Word pairs            w1,w2            whether w1 occurs before w2 in the same sentence
POS pairs             p1,p2            whether POS p1 occurs before p2 in the same sentence
VINs                  p(w)1,p(w)2      word bigram w1,w2 and the two word-POS pairs p1,w2 and w1,p2

4.2.3  Feature selection

Because the set of lexico-syntactic features was so large, we selected a subset of the features to train the classifiers. We found we could obtain good performance and faster training using only the top 10,000 features scored by conditional entropy. Intuitively, the most informative features are not only the more frequent ones, but those that are indicative of a positive or negative label in the labeled data. The formula we used for conditional entropy is (similar to [8]):

H(Y | x^i) = −( log( c(x^i, +1) / c(x^i) ) + log( c(x^i, −1) / c(x^i) ) )

In this expression, c(x^i) is the empirical count of feature x^i in the instances, and c(x^i, +1) and c(x^i, −1) are the joint empirical counts of feature x^i with each label.

4.2.4  Combining conversational and lexical features

The choice of which set of features to use for domain adaptation is important. The conversational features on their own were shown to be effective for summarization. The lexico-syntactic features are a large set of raw features, so combining them with the 24 more sophisticated features should be done carefully, because the conversational feature set may or may not get its due importance in the learned model amidst the noise. Therefore, in our experiments we investigate using the two sets of features separately and merged into one, as well as training separate classifiers on each feature set and combining their predictions.

4.3  Classification approach

We measure the quality of the summaries produced by having the model learned on the training data, possibly modified for domain adaptation, classify sentences from the test set, and comparing the output labels of the classifier with the ground-truth labels assigned by humans to the test sentences. We looked for a classification method that would be effective on data with numeric-valued features, binary labels, a high number of dimensions, and a large data set. We chose to train linear classifiers rather than SVMs because they are fast to train and effective on high-dimensional datasets. In particular, we used the implementation of L2-regularized logistic regression found in the Liblinear toolkit. This learner solves the following optimization problem [44]:

min_w  (1/2) w^T w + C Σ_{i=1}^{l} log(1 + exp(−y_i w^T x_i))

Depending on the domain adaptation method, we sometimes use the prediction accuracy of the classifier on test data, and other times either the labels predicted on each data point or the weights learned in the model itself.

4.4  Evaluation metrics

Given the predicted labels on a test set and the existing gold-standard labels of the test set data, we compute accuracy and the area under the Receiver Operating Characteristic curve (auROC) as measures of classification performance.
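A minimal way to compute both measures from a classifier's output, assuming scikit-learn's metrics (in our experiments we compute points on the ROC curve from the classifier's outputs and derive the area from them, as described in Chapter 5):

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate(y_true, positive_scores, threshold=0.5):
    # y_true: gold 0/1 extractive labels; positive_scores: predicted
    # probability of the positive (extractive) class for each test sentence.
    y_pred = (np.asarray(positive_scores) >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "auROC": roc_auc_score(y_true, positive_scores),
    }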
The auROC is a common summary statistic used to measure the quality of binary classification, where a perfect classifier would achieve an auROC of 1.0, and a random classifier would produce a value near 0.5.  41  4.5  Domain adaptation methods  Here we present the algorithms that we implemented for each of the methods, and for the different sets of data we use the notation established in Section 3.1. Ds,l is always randomly selected as 10,000 instances from AMI, and Dt,l is split into 5 folds, of which 4 are used for training and one for testing. The classifiers C or Ci are trained using Liblinear with the logistic regression option.  4.5.1  Supervised baseline indomain  indomain train C on the training folds of Dt,l  4.5.2  Unsupervised baseline transfer  transfer train C on Ds,l  4.5.3  Merge  merge train C on the training fold of Dt,l merged with Ds,l  4.5.4  Ensemble  ensemble 1. train C1 on Ds,l 2. train C2 on the training fold of Dt,l 3. run C1 on Dt,l and obtain probabilities of positive label P1 4. run C2 on Dt,l and obtain as probabilities of positive label P2 5. for each {xi , yi } in Dt,l , predict the label yˆi = I((P1 xi + P2 xi ) > 0.5) then test the prediction {xi , yˆi } against the actual {xi , yi } 42  4.5.5  Pred  pred 1. train C1 on Ds,l 2. run C1 on Dt,l and obtain probabilities of positive label P1 3. for {xi , yi } in Dt,l , add the feature P1 xi to the instance, then train C on the augmented data Dt,l  4.5.6  Easyadapt  easyadapt 1. augment Ds,l and Dt,l as in Section 3.2.5 2. train C on Ds,l merged with the training fold of Dt,l and test on the testing fold of Dt,l  4.5.7  Easyadapt++  easyadapt++ 1. augment Ds,l , Dt,l , and Dt,u as in Section 3.3.4 2. train C on Ds,l merged with Dt,u and the training fold of Dt,l , and test on the testing fold of Dt,l  4.5.8  Selftrain  For the selftrain algorithm, we picked as parameters U=50 as the size of the unlabeled pool to predict on at each iteration. We also selected p=3 and n=17 for a ratio of summary to total sentences of 15%, which is near to ratio of AMI. selftrain 1. Start with a labeled training set T (labeled source data) 2. Create a subset of a fixed size of the unlabeled data U 43  3. Repeat until no more unlabeled data: • train a classifier on T • make a prediction on U: take the highest-confidence positive p predictions and highest-confidence negative n predictions from U and add them to T • replenish U from the remaining unlabeled data set  4.5.9  Original SCL  Because of the similarities between Blitzer’s sentiment analysis task and our supervised extractive summarization, we hypothesized that SCL will lead to a significant improvement over the baseline. On our data, after this scalability analysis, we apply SCL with a set of features selected by mutual information, and with a smaller number of pivots for a more efficient implementation. The pseudocode with our parameters is in Algorithm 2. Algorithm 2 scl T , unlabeled data from both domains (x ) Inputs: labeled source data (xt ,yt )t=1 j Choose 100 pivot features. Create 100 binary prediction problems, pl (x), l = 1...100 for l = 1 to 100 do wˆl = argmin(∑ j L(w · x j , pl (x j )) + λ w 2 ) end for W = [wˆ 1 |...|wˆ 100 ] [UDV T ] = SV D(W ) T θ = U[1:5,:] x T Return f , a predictor trained on ( i , yt )t=1 θ xi  4.5.10  SCL with projected features only  We also test sclsmall, which uses the same algorithm as scl to find augmented features, except it then uses only the SCL features to train, not adding them to the original features. 
The pseudocode is identical to that for scl, except the last step is changed to: 44  T Return f , a predictor trained on (θ xi , yt )t=1  4.5.11  Setting SCL parameters  When we re-implemented Structural Correspondence Learning, we also tuned it on the product review data set which Blitzer et al. [7] used to test domain adaptation with SCL for sentiment analysis. We found that on the test machines we have available, the intermediate steps of the algorithm run with the original set of features of over 300,000 caused it to run out of heap memory. Hence, we aimed to reduce the number of features used to represent the data while maintaining improvements of SCL over the baseline transfer similar to those obtained in [7] and [8]. While we obtained good accuracy results with a subset of the features of the order of tens of thousands and the original value for the number of pivots of 1,000, the runtime was quite large since the algorithm requires optimizing a model for each of the pivots. To find an effective number of features, we performed feature selection by mutual information and visualize the relationship between the number of features selected and the accuracy of SCL on the sentiment analysis task. Across several source and target product review domains, we found that the highest accuracy to be achieved between 1,000 and 5,000 features. We show the results for adaption from books to dvd in Figure 4.1. To tune the number of pivots for a smaller running time of SCL, we studied the impact on accuracy of reducing the number of pivots for several numbers of features. As can be seen in Figure 4.2, we find that the maximum accuracy is reached at 128 pivots. This is similar to [8]’s finding that for SCL, like for the original ASO algorithm, increasing the number of pivots above a similar number did increase change the performance. Reducing the number of pivots from Blitzer’s original 1,000 to 100 reduces the time taken by SCL by a factor of 10 while maintaining high accuracy. After performing this analysis, we decided that for our task of summarization, we will set the number of pivots to 100 and select a subset of the 10,000 lexical features from the original over 200,000 by mutual information.  45  Figure 4.1: Effect of varying the number of features in SCL  4.6  Summary  Our domain adaptation setting is from the source domain of meetings, with a large labeled data set AMI, to emails, where we have as labeled data BC3 and unlabeled data W3C. We use two different sets of features, one a set of high-level features derived from conversational structure, and the second a set of raw lexical and syntactic features. In this chapter we have outlined domain adaptation algorithms we use for summarizing conversations. This will be followed in Chapter 5 with a description of our experiments and the results.  46  Figure 4.2: Effect of varying number of pivots in SCL on accuracy and time  47  Chapter 5  Experiments Our goal in this thesis is to investigate the effectiveness of adapting from the meetings domain to the emails domain under the supervised and the unsupervised scenarios, with different possible feature representations. First, to investigate the difference between the two domains, we estimate the distance between sets of data from different domains, and between two sets of data from within the same domain. Then, we describe the set-up of our comparative studies between domain adaptation methods, with data represented by the separate feature sets. 
We report the performance of each method, note improvements relative to the baseline and finally draw insights from the different scenarios.  5.1  Distance between domains  Before implementing domain adaptation for a specific application, it is useful to estimate how well it can work. Ben-David et al. [5] derive a simple bound on the expected error on the target from a theoretical measure of distance between domains, dHδ H , and the error of a classifier on the source. However, computing the distance between domains is intractable with a limited sample. An empirical approximation of domain distance, the proxy A-distance, is shown in [8] to correlate with loss due to adaptation. However, A-distance is not an absolute measure of performance across domains. Performance also depends on the representation used by the classifier and the amount of data. In practice, given the choice between  48  two possible source domains for adapting to a particular target domain, the source with the lowest A-distance to the target will yield better results. A-distance was also used in [25], who measured the distance between the different product types in Blitzer’s dataset of Amazon reviews. [25] bounded the expected target error using the source and target empirical errors and computed hypothesis class complexities, and showed that the feature copying method with unlabeled data easyadapt++ has a lower expected error than easyadapt. The A-distance measure is computed from unlabeled data of the two domains by labeling instances according to the domain of origin, then training a linear classifier to distinguish between the two domains, and finally combining the empirical loss for the instances. For our experimental measure of distance between domains, we will estimate domain distance from classifier accuracy. This is similar to proxy A-distance, with the per-instance empirical loss set to 0 for a correct label and to 1 for an incorrect label.  5.1.1  Experiment to differentiate between domains  In this experiment, we produce an estimate of the domain distance between the meetings and email data that we use in our experiments with domain adaptation. To do so, we follow the approach of building a classifier to differentiate between the two domains. We use a subset of AMI as meetings data, a subset of BC3 as email data, and the labeled Enron corpus as a second email data set. Enron contains emails form within a corporation, whereas BC3 contains emails from the W3C technical mailing list. We expect the classifier to have more difficulty identifying the origin in a mixed set of BC3 and Enron data, and more ease in telling data from BC3 and AMI apart. In this experiment, for each pair of domains, we: 1. took equal numbers of instances from each domain 2. removed the label and labeled them with 1/-1 according to the domain of origin 3. split each set into 50% train and 50% test 4. combined the two training sets and the two testing sets  49  5. trained a classifier and reported 1 - its fractional accuracy For a control on the experiment, we repeat the experiment above with two sets of data from the same source of data. For example, to compute the in-domain distance for AMI, we take the AMI data, label half with 1 and half with -1. Since the labels are assigned at random, we expect a classifier trained to distinguish between the two datasets to yield a value close to 0.5, or random guessing. The results obtained between domains and in-domain as a control are in Table 5.1. 
We report 1 - the fraction representing accuracy. A small estimate value is obtained when the classifier was more successful in telling them apart. An estimate value closer to 0.5 is obtained when the classifier had difficulty telling them apart, hence the source of the data harder to distinguish, implying that with the given feature representation, data from the two sets are more similar. Table 5.1: Distance estimate between domains and within a domain BETWEEN DOMAINS using conversational features between ami and bc3 between ami and enron between bc3 and enron using lexical features between ami and bc3 between bc3 and enron IN-DOMAIN using conversational features between two subsets of ami between two subsets of bc3 between two subsets of enron using lexical features between two subsets of AMI between two subsets of BC3 between two subsets of Enron  50  estimate 0.005276 0.017934 0.381636 0.016 0.002152 estimate 0.335991 0.456656 0.510714 0.438556 0.427245 0.557143  5.1.2  Discussion  The high results of the estimate in the control experiment contrasted with the low results in distinguishing between AMI meetings data and email indicate a difference in the data distribution of email and meetings. Also, conversational features appear to be more stable across variations in the data distribution, as we can see from the experiment differentiating between the two email corpora BC3 and Enron. The classifier that uses conversational features has low accuracy in telling them apart, while the classifier that uses lexical features has higher accuracy. This suggests that conversational features are more representative of the commonality between conversational data from different sources than the set of lexical features. From these results, we expect that direct transfer of a classifier from meetings to emails will not perform well because the source and target data are different with the given feature sets, hence domain adaptation is needed to bridge the gap between domains.  5.2  Comparison between domain adaptation methods  We describe here the set-up of the comparative study between the different domain adaptation methods we implemented, which we run with different sets of features. We select training and test data in a precise manner to get a statistically sound measurements. BC3 labeled email data totals about 3000 sentences, and AMI labeled meetings data totals over 100,000 sentences, so for both efficiency and to not overwhelm the in-domain data, in each of our runs we subsample 10,000 sentences from the AMI data to use for training. For the target domain, we use a subset of BC3 of 2000 sentences, and a subset of W3C of 8000 unlabeled sentences in the comparison experiments for a set ratio source to target, which we also vary in another experiment. We also randomly split the BC3 data into five folds, using each subset of four folds for training and testing on the remainder. We repeat this 5-fold cross-validation three times, for three different splits into folds. In each comparison of different methods, we use the same set of data for labeled source, labeled target, and unlabeled target. For each method, we process and augment the data as appropriate, and then train a logistic regression classifier on the training set of the method. For the supervised methods, the labeled training set includes both source 51  data and target data, whereas for the unsupervised methods, the only labeled data used to train is from the source, along with unlabeled data from the target. 
All methods are tested on the testing fold of the set of BC3 data. For each run of a method, we compute the resulting accuracy of the classifier, the time taken, and points on the ROC curve. We use the ROC points to compute the auROC. We report the mean accuracy, auROC, and time of each method over the three runs. To test for significant differences between the performances of the various methods, we compute pairwise t-tests between different auROC values obtained in the first run. Because in each experiment we report results of pairwise ttests between several methods and the baseline, to account for an increased chance of type I error or reporting a significant difference to the baseline where there is none, we compare the p-value to an α value of 0.005 rather than the customary 0.05.  5.2.1  Features used  To study the effect of the feature representation on domain adaptation between our source and target domains, we use the two sets of conversational and lexical features in parallel experiments, and then merge the two sets and run an experiment with all the features. The set of conversational features We use the set of 24 general conversational features introduced in section Section 4.2.1. These features are specific to conversations and expected to be useful in transferring information between the two modalities. The set of lexical features We derive an extensive set of lexical features from the AMI and BC3 datasets, and then compute their occurrence in the W3C data. After throwing out features that occur less than five times, approximately 200,000 features remain. We further select a subset of 10,000 features from this large set of features by mutual information, as detailed in Section 4.2.3. The experimental set-up is nearly identical for the different feature sets. For 52  scl, when we use the lexical or all the features, we choose the parameters of SCL to 100 pivots and k=5. The derived SCL features and the original features are used in the scl method, whereas only the derived features are used in the sclsmall method. For the experiment with conversational features, since the original number of features was 24, we only consider one version of SCL with the original and the derived features, i.e., scl not sclsmall.  5.2.2  Supervised and unsupervised scenarios  For the supervised domain adaptation setting, indomain or training on the labeled email data, is the baseline for the methods merge, ensemble, easyadapt, and pred. We also consider indomain to also be a baseline for easyadapt++ as this domain adaptation method trains on labeled target data. To be more accurate, since unlike the supervised methods easyadapt++ also uses unlabeled target-domain data, it is a semi-supervised algorithm. For the unsupervised domain adaptation setting, transfer, training on the labeled AMI data, is the baseline for the methods selftrain, scl, and sclsmall.  5.3  Results  In our results, we report the mean auROC and time of the different domain adaptation algorithms, as well as the p-value from a t-test between the auROCs achieved and those of the baseline. Because we consider auROC to be a better performance measure than accuracy, we don’t report accuracy in this section.  5.3.1  Supervised domain adaptation results  For the supervised methods, Figure 5.1 lists for each method the mean auROC achieved, its standard deviation, the time taken for one run, and its p-value compared to the baselineindomain. P-values lower than the significance threshold of 0.005 are highlighted by shading. 
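The p-values in these tables come from the pairwise comparisons described in Section 5.2; a minimal sketch of that computation, assuming a paired t-test over matching per-fold auROC values and SciPy (the exact pairing used in our scripts may differ):

from scipy import stats

def significant_vs_baseline(method_fold_aurocs, baseline_fold_aurocs, alpha=0.005):
    # Paired t-test over the auROC values of matching cross-validation folds;
    # alpha = 0.005 compensates for running several comparisons per experiment.
    t_stat, p_value = stats.ttest_rel(method_fold_aurocs, baseline_fold_aurocs)
    return p_value, p_value < alpha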
We visualize the auROCs achieved by the different methods with each of the feature sets in Figure 5.2, Figure 5.3, and Figure 5.4. We also plot side-by-side the performances of each method with the conversational, lexical, and merged feature sets in Figure 5.5.  53  Figure 5.1: Results for supervised domain adaptation  Figure 5.2: Supervised scenario auROC with the conversational features  54  Figure 5.3: Supervised scenario auROC with the lexical features  Figure 5.4: Supervised scenario auROC with the merged set of features  55  Figure 5.5: Comparing supervised performance with the different feature sets  5.3.2  Unsupervised domain adaptation results  Results for the unsupervised methods and baseline transfer are listed in Figure 5.6. Again, for each method we list the mean auROC achieved, its standard deviation, the time taken for one run, and its p-value compared to the baseline. We also visualize the auROCs achieved with each of the feature sets in Figure 5.7, Figure 5.8, and Figure 5.9. We also plot side-by-side the performances of each method with the conversational, lexical, and merged feature sets in Figure 5.10.  5.3.3  ROC curve plots  Recall from Evaluation of Summarization that points in the ROC graph corresponding to values of recall and 1−specificity, and that when we vary a classifier’s threshold through its range, we obtain ROC curves. We also plot ROC curves for our experiments as the same auROC can correspond to different curve shapes with different interpretations. auROC curves of the domain adaptation methods are plotted in Figure 5.11 and Figure 5.12, each graph for a set of features.  56  Figure 5.6: Results for unsupervised domain adaptation  5.4 5.4.1  Discussion Supervised domain adaptation  With conversational features, compared to the baseline indomain, auROCs similar to the baseline are achieved by easyadapt, pred, and easyadapt++. These train on the labeled target data and are feature-based adaptation methods, influenced by the source data through additional features. merge and ensemble perform more poorly than the baseline. Since both of these use labeled meetings data along with the labeled email data in supervised training, we find that adding the source data from a different domain without modifying the feature space actually hurts performance. Most of the algorithms implemented achieve a worse performance with lexical features than with the conversational features. With an indomain auROC of 0.6471, the model appears to be over-fitting even after selecting the top lexical features by mutual information. Hence, we would conclude that the lexical  57  Figure 5.7: Unsupervised scenario auROC with the conversational features  feature representation is detrimental to supervised performance. The only method that performs well is the semi-supervised easyadapt++ which yields a significant improvement of 13% over the auROC of the baseline. easyadapt++ also outperforms the easyadapt algorithm, which augments the feature space the same way but does not make use of the unlabeled email data. This is a sign that easyadapt++ successfully uses the unlabeled W3C data to identify which features are general and which are domain-specific, hence to overcome the difference in how the lexical features are expressed in the two disparate domains and learn a classifier that is useful for scoring sentences in the target. 
To compare the different features sets, we observe from Figure 5.5 that the conversational features perform better or similarly to the set of all features, which in turn perform similarly or better than the set of lexical features only.  58  Figure 5.8: Unsupervised scenario auROC with the lexical features  5.4.2  Unsupervised domain adaptation  With the conversational features, both selftrain and scl perform similarly to the baseline transfer. selftrain yields a statistically significant improvement when p < 0.005, though the auROC obtained is within a standard deviation of that of the baseline. With the lexical features set, the improvement of domain adaptation over the baseline is more pronounced. The standard structural correspondence learning algorithm scl yields a gain of 3.7% over the auROC of transfer, and sclsmall which uses only the derived SCL features yields a large gain of 20.4%. The SCL algorithm hence is useful in unsupervised adaptation, and this through the projected features, after throwing out the original set of lexical features. We surmise that this is because the lexical features are raw and too domain-specific to be used directly in a classifier, though once aligned by SCL they produce new and better features.  59  Figure 5.9: Unsupervised scenario auROC with the merged set of features  5.4.3  Effectiveness of domain adaptation  Because the auROC results for the supervised and the unsupervised methods were obtained on the same data and the same split into folds, the performance can be compared over all algorithms. We show the baselines indomain and transfer along with the best performing algorithm in Figure 5.13 for the conversational feature set and in Figure 5.14 for the lexical feature set. With the conversational features, all unsupervised domain adaptation methods and their baseline transfer have lower auROC than indomain. This indicates both that labeled in-domain data is needed to achieve the best performance, and that when this labeled email data is available, out-of-domain meetings data does not additionally help performance. However, with the set of lexical features, both easyadapt++ and sclsmall improve over the baselines. Since both of these use unlabeled target domain data, we can conclude that meetings data is useful in unsupervised or semi-supervised domain adaptation to improve over performance with no adaptation. 60  Figure 5.10: Comparing unsupervised performance with the different feature sets  As for the feature sets, the conversational feature set is smaller and hence more time-effective than the lexical feature set. When merging the conversational feature set with the lexical features, auROC performance degrades, so it seems that some of the quality of the conversational features is lost when adding the more raw lexico-syntactic features. However, because the different sets of features were used in different experiments where both the data used and the training versus testing split were randomized, we cannot make a direct statistical comparison of the different feature sets for a single algorithm. Finally, in ranking the effectiveness of the different algorithms, training indomain is less time-intensive than more complex algorithms and is hard to beat. With less labeled data in the target, domain adaptation may be more justified. 
Figure 5.11: ROC curves of domain adaptation methods with the conversational features
Figure 5.12: ROC curves of domain adaptation methods with the lexical features

5.5 Conclusion from the experiments

We experimentally measured the distance between domains using our data, and found that email and meetings data are farther apart than data from different email corpora and than data from within the meetings corpus. The difference in data distribution between domains depends on the feature representation, so features that are expressed similarly in the source and target will be more useful for an adapted classifier. The conversational features and the projected SCL features are two examples of such good features. Also, of the features used in our experiments, the small set of conversational features performs best in the supervised scenario, and the set of raw lexical features performs well for unsupervised and semi-supervised domain adaptation. Overall, training in-domain on the available labeled email data achieves an auROC that is hard to beat, though in the absence of this data, we have found that domain adaptation methods improve significantly over using the out-of-domain data directly.

Figure 5.13: Baselines and best domain adaptation methods with conversational features
Figure 5.14: Baselines and best domain adaptation methods with lexical features

Chapter 6
Further Analysis and Future Work

In the previous chapter, we presented our comparative study and the main conclusions we draw from the results. Here we present further analysis and suggest a number of avenues for future research.

6.1 Amount of labeled data

Domain adaptation was found to work best when the amount of labeled source data used is not much larger than the amount of labeled target data; for an experimental example of this, see [48]. An interesting question for supervised domain adaptation, which we have explored, is whether there is a difference in performance when the amount of target data is close to the amount of source data, versus when the amount of source data used is an order of magnitude larger.

6.1.1 Dependence on the amount of source data

We investigate how varying the amount of labeled data from the meetings domain affects the performance of different domain adaptation algorithms. With the set of lexical features, we vary the amount of AMI data used in training while keeping the amount of target training data fixed at 2,000 sentences of BC3, and perform cross-validation on the target. We compare the trends in auROC performance given increasing amounts of source data for indomain, easyadapt, pred, transfer, ensemble, and sclsmall. The results of this experiment are plotted in Figure 6.1.

Figure 6.1: Domain adaptation performance vs. amount of source data

The most important conclusion we draw from this is that once a certain amount of source data is given, increasing that amount even by a factor of 10 does not improve the auROC in domain adaptation. The results also confirm the previously observed point that performance is best when the amount of source data is similar to the amount of target data, i.e., 2,000 sentences. sclsmall achieves the top auROC even though it uses no target training data. ensemble is also better than both indomain and transfer for varied amounts of source data. This suggests that combining classifiers trained on different data can be helpful for domain adaptation.
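For concreteness, the loop below sketches the kind of experiment behind Figure 6.1, using repeated random splits in place of our exact cross-validation setup. The helpers load_ami(), load_bc3(), and train() are hypothetical placeholders for corpus loading and for fitting one of the adaptation methods compared in this chapter, and the model object is assumed to expose a scikit-learn-style decision_function.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def source_size_curve(method, source_sizes=(500, 1000, 2000, 5000, 20000),
                      n_target=2000, n_repeats=5, seed=0):
    """auROC of `method` on held-out BC3 sentences as the amount of labeled
    AMI source data grows, with the labeled target training set fixed."""
    rng = np.random.RandomState(seed)
    X_src, y_src = load_ami()    # hypothetical: meetings (source) corpus
    X_tgt, y_tgt = load_bc3()    # hypothetical: email (target) corpus
    curve = []
    for n_src in source_sizes:
        scores = []
        for _ in range(n_repeats):
            src = rng.choice(len(y_src), size=n_src, replace=False)
            tgt = rng.permutation(len(y_tgt))
            train_tgt, test_tgt = tgt[:n_target], tgt[n_target:]
            model = train(method,                      # hypothetical trainer
                          X_src[src], y_src[src],
                          X_tgt[train_tgt], y_tgt[train_tgt])
            preds = model.decision_function(X_tgt[test_tgt])
            scores.append(roc_auc_score(y_tgt[test_tgt], preds))
        curve.append((n_src, float(np.mean(scores))))
    return curve
```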
6.1.2 Dependence on the amount of target data

We also investigate how varying the amount of labeled target domain data affects the performance of domain adaptation methods. With the conversational features, keeping the amount of source data fixed at 5,000 sentences, we vary the amount of labeled BC3 training data between 0 and 2,000 sentences, always testing on a held-out set of 1,000 lines of BC3. We compare the auROC performance of indomain, easyadapt++, pred, transfer, and scl with the different amounts of target data. We display the results of this experiment in Figure 6.2.

Figure 6.2: Domain adaptation performance vs. amount of target data

Performance of supervised domain adaptation improves very quickly with the amount of target data. With 200 labeled target instances, the auROC is close to the highest achieved, and after 400 instances we see no further improvement in performance with additional labeled target data. The performance curves for the supervised methods are close together and do not beat the indomain baseline. This experiment hence shows that with the conversational features, only a relatively small amount of labeled in-domain data is necessary for good performance, and domain adaptation is not required.

6.2 Future work

6.2.1 Using classifiers trained on different feature sets

Since we have observed that domain adaptation performance varies with the features used to represent the data from different domains, and that directly combining the lexical and conversational feature sets was not effective, future work should explore training several separate classifiers. Previous work has found that combining different learners can improve performance over the best single classifier; for theoretical background on combining classifiers, see [24]. We expect that an ensemble of classifiers, where each exploits a different type of feature in representing the data, can improve performance. We suggest training a classifier on the lexical features, another on the conversational features, and a third on the derived SCL features, and then combining the three. This can be extended to training separate classifiers on the source and target data with each of the feature representations. To improve overall performance, the final contribution of each classifier should be parametrized and the parameter values tuned on a development set.

A natural extension for the unsupervised scenario would be co-training. Recall that in our implementation of self-training we used a classifier trained on the labeled source data to incrementally label unlabeled target data, and that from our experimental results, self-training did not improve over the baseline. We suggest training two classifiers, one using the lexical features and a second using the conversational features, and using both of them to incrementally label instances. Since the different feature sets are different views of the data, we expect that co-training can do better than each of the separate self-trained classifiers.
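The sketch below illustrates the co-training loop suggested above, with one view over the lexical features and one over the conversational features. The confidence threshold, the number of instances added per round, the logistic-regression learners, and the use of a single shared labeled pool (a simplification of Blum and Mitchell's original scheme [9]) are illustrative assumptions rather than a worked-out design.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X_lex, X_conv, y, labeled_idx, unlabeled_idx,
             rounds=10, per_view=20, threshold=0.9):
    """Co-training with two feature views of the same sentences.

    X_lex, X_conv : feature matrices for the lexical and conversational views.
    y             : label array; only positions in labeled_idx are trusted.
    Each round, a classifier is trained per view, and the unlabeled sentences
    each view labels most confidently are pseudo-labeled and added to the
    shared labeled pool used by both views in the next round.
    """
    labeled = list(labeled_idx)
    unlabeled = list(unlabeled_idx)
    y = np.asarray(y, dtype=float).copy()
    for _ in range(rounds):
        clf_lex = LogisticRegression(max_iter=1000).fit(X_lex[labeled], y[labeled])
        clf_conv = LogisticRegression(max_iter=1000).fit(X_conv[labeled], y[labeled])
        if not unlabeled:
            break
        newly_labeled = set()
        for clf, X in ((clf_lex, X_lex), (clf_conv, X_conv)):
            probs = clf.predict_proba(X[unlabeled])[:, 1]
            most_confident = np.argsort(np.abs(probs - 0.5))[::-1][:per_view]
            for pos in most_confident:
                if max(probs[pos], 1.0 - probs[pos]) >= threshold:
                    idx = unlabeled[pos]
                    y[idx] = float(probs[pos] >= 0.5)   # pseudo-label
                    newly_labeled.add(idx)
        labeled.extend(newly_labeled)
        unlabeled = [i for i in unlabeled if i not in newly_labeled]
    return clf_lex, clf_conv
```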
6.2.2 Weighting data

In this thesis, most domain adaptation methods treated the data coming from the source, labeled data coming from the target, and unlabeled data from the target in different ways to account for the differences between these types of data. One way in which target data has been assigned more importance compared to the source was by adding several copies of the labeled target data to the training set. From our analysis, we do not deem this necessary. Differing contributions of these sources of data, or even of individual instances, have also been modeled explicitly using instance weights, as in [21]. We suggest including domain-specific parameters for the source and the target domains, and a parameter for the data labeled by self-training, and learning their values on a development set of target data.
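As a concrete illustration of the kind of weighting scheme we have in mind, the sketch below assigns one weight per data group (source, labeled target, self-labeled target) and passes per-instance weights to a standard weighted learner; the particular weight values and the choice of scikit-learn's logistic regression (as a stand-in for our LIBLINEAR setup) are placeholders to be tuned on development data, not values we have validated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_weighted(groups, weights):
    """Train one classifier on data from several groups, one weight per group.

    groups  : dict mapping group name -> (X, y)
    weights : dict mapping group name -> per-instance weight for that group
    """
    X = np.vstack([X_g for X_g, _ in groups.values()])
    y = np.concatenate([y_g for _, y_g in groups.values()])
    w = np.concatenate([np.full(len(y_g), weights[name])
                        for name, (_, y_g) in groups.items()])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y, sample_weight=w)   # standard weighted training
    return clf

# Example weighting: trust labeled email data most, self-labeled data least.
# These values are illustrative and would be learned on a development set.
# clf = train_weighted(
#     {"source": (X_ami, y_ami),
#      "target": (X_bc3, y_bc3),
#      "self_labeled": (X_self, y_self)},
#     {"source": 0.3, "target": 1.0, "self_labeled": 0.1})
```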
6.2.3 Semi-supervised SCL

SCL is an unsupervised domain adaptation algorithm, as it does not use any labeled target domain data, but it can be extended to be semi-supervised. As mentioned in the background section on SCL (Section 3.3.3), Blitzer et al. [7] observed that the projected SCL features sometimes misaligned original features from the two domains. They hence proposed a version of SCL that uses a few labeled target domain instances to correct the weights of these SCL features. In the same analysis, they found that with a much larger amount of unlabeled data and no labeled target data, using only the projected features reduced the error compared to both the original SCL and SCL with labeled target data. For our adaptation setting, we obtained good performance with only the few projected features in sclsmall. However, as their algorithm integrates information from the labeled target data to score these features, it could be used in future work as a semi-supervised version of SCL. We have also experimented with adding the labeled target data as training data for the final SCL classifier, and found that the performance was not significantly different from that of sclsmall when representing the data with the lexical features, and was similar to indomain with the conversational features.

6.3 Summary

The effectiveness of domain adaptation depends on the amount of data available, so we have further investigated the effect of varying the amount of data. We also proposed ways of integrating different sets of features and algorithms as future work for this domain adaptation setting. As the versions of most domain adaptation methods we implemented were basic, tuning the algorithms on a development set from the target can also be helpful for achieving top classification performance.

Chapter 7
Conclusion

The automatic summarization of conversations is an important and difficult problem. This research fits into the supervised extractive approach to summarization: sentences in the conversation are selected for inclusion in a summary, and this selection is performed by a classifier learned on a human-annotated corpus of training data. Human conversations occur in many modalities, most of which lack publicly available data sets for training supervised models. Unlabeled data occurs naturally, whereas labeling data in each new modality or domain is expensive. Also, conversational data can vary in form depending on the modality, hence a model trained on one modality is rarely effective in a new modality. In our case, we have a large set of meeting recordings labeled for summarization, along with a relatively small number of labeled email threads and an additional set of unlabeled emails.

One limitation of our approach is that our evaluation only compares the set of sentences selected for extraction with those labeled by humans. The quality of a summary is subjective, so showing that the summaries are useful may require downstream processing and abstraction to yield coherent summaries, as well as an extrinsic evaluation by users.

Domain adaptation is the general problem of how to use data outside a target domain, i.e., from a different source domain, to improve performance on the target. In the supervised setting, labeled training data is available both in the target domain and in the source, whereas in the unsupervised setting labeled data is only available outside the domain of interest. Many supervised domain adaptation algorithms have been found to be effective. Learning in the absence of in-domain labeled data is more difficult, hence fewer unsupervised domain adaptation algorithms have been successful. We selected a number of supervised and unsupervised domain adaptation algorithms to implement for our problem, and compared their performance to baselines. We designed the comparative study to ensure that our results and conclusions are statistically sound. Also of note, we investigated the structural correspondence learning algorithm, a relatively new and successful algorithm that aligns features of high-dimensional natural language problems between domains.

Another research problem is how best to represent conversational data for domain adaptation. We used two different feature sets, one small set of features specific to conversations and one large set of raw lexical and syntactic features, and found that the features used in the adapted learners had a marked impact on their performance. The conversational features resulted in very good performance for supervised summarization, and also a high baseline performance. With this set of features and some labeled target data, we do not recommend doing domain adaptation, because training on the available in-domain data is sufficient. The less sophisticated set of lexical features gave worse summarization performance. However, these features were useful in leveraging unlabeled data from the target domain in easyadapt++ and scl. sclsmall was very successful, though it does not use the lexical features directly but only a projection of these features into a lower-dimensional space, after inferring correspondences from unlabeled data. Given only the lexical features, domain adaptation improves over both the transfer and indomain baselines. Therefore, in the unsupervised scenario, sclsmall with lexical features is the best method and beats the baseline.

The performance of domain adaptation depends on the distance between domains, so an interesting comparison would be with adaptation between pairs of domains at different domain distances, such as different conversational domains within the same modality, or between other pairs of modalities. Since unlabeled data is often naturally available in-domain, and whatever labeled data is available in-domain is valuable, for future research we recommend investigating semi-supervised algorithms like easyadapt++ or combining different algorithms and types of features.

Bibliography

[1] Steven Abney. Semisupervised Learning for Computational Linguistics. Chapman & Hall/CRC, 1st edition, 2007. ISBN 1584885599, 9781584885597. → pages 28, 29

[2] Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, December 2005. ISSN 1532-4435. URL http://portal.acm.org/citation.cfm?id=1046920.1194905. → pages 30

[3] Ani Nenkova, Rebecca Passonneau, and Kathleen McKeown. The pyramid method. ACM Transactions on Speech and Language Processing, 4(2), 2007. URL http://portal.acm.org/citation.cfm?doid=1233912.1233913. → pages 13, 14
[4] M. Bacchiani and B. Roark. Unsupervised language model adaptation. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), volume 1, pages I-224–I-227, 2003. doi:10.1109/ICASSP.2003.1198758. → pages 24

[5] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of Representations for Domain Adaptation. NIPS, 20:137–144, 2007. → pages 48

[6] J. Blitzer, R. McDonald, and F. Pereira. Domain Adaptation with Structural Correspondence Learning. Proc. of EMNLP 2006, pages 120–128, 2006. → pages 31, 32

[7] J. Blitzer, M. Dredze, and F. Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. Proc. of ACL 2007, 2007. → pages 23, 32, 33, 34, 45, 70

[8] John Blitzer. Domain Adaptation of Natural Language Processing Systems. PhD thesis, University of Pennsylvania, 2008. → pages viii, 23, 26, 30, 32, 33, 34, 40, 45, 48

[9] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proc. Conference on Computational Learning Theory, pages 92–100, 1998. → pages 27, 28

[10] Jaime Carbonell and Jade Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98), pages 335–336, New York, NY, USA, 1998. ACM. ISBN 1-58113-015-5. doi:10.1145/290941.291025. URL http://doi.acm.org/10.1145/290941.291025. → pages 10

[11] Giuseppe Carenini, Raymond T. Ng, and Xiaodong Zhou. Summarizing email conversations with clue words. In Proceedings of the 16th International Conference on World Wide Web (WWW '07), pages 91–100, 2007. URL http://portal.acm.org/citation.cfm?doid=1242572.1242586. → pages 17

[12] C. Chelba and A. Acero. Adaptation of Maximum Entropy Capitalizer: Little data can help a lot. Computer Speech & Language, 20(4):382–399, October 2006. ISSN 08852308. → pages 24

[13] W. Dai, Q. Yang, Gui-Rong Xue, and Y. Yu. Boosting for transfer learning. In Proceedings of the 24th International Conference on Machine Learning, pages 193–200, 2007. URL http://portal.acm.org/citation.cfm?doid=1273496.1273521. → pages 27

[14] Hal Daume III, Abhishek Kumar, and Avishek Saha. Frustratingly Easy Semi-Supervised Domain Adaptation. In Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, pages 53–59, Uppsala, Sweden, July 2010. Association for Computational Linguistics. URL http://www.aclweb.org/anthology-new/W/W10/W10-2608.bib. → pages 33, 34

[15] H. Daume III and D. Marcu. Domain Adaptation for Statistical Classifiers. Journal of Artificial Intelligence Research, 26:101–126, 2006. → pages 22, 25

[16] Hal Daume III. Frustratingly easy domain adaptation. In Proc. of ACL 2007, 2007. → pages 26

[17] Tom Fawcett. ROC graphs: Notes and practical considerations for researchers. Technical Report HPL-2003-4, HP Laboratories, 2004. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.9777&rep=rep1&type=pdf. → pages 12

[18] A. Waibel, M. Bett, M. Finke, and R. Stiefelhagen. Meeting browser: tracking and summarizing meetings. In Proceedings of the DARPA Broadcast News Workshop, pages 281–286. Morgan Kaufmann, 1998. → pages 15

[19] R. Florian, H. Hassan, A. Ittycheriah, H. Jing, N. Kambhatla, X. Luo, H. Nicolov, S. Roukos, and T. Zhang. A statistical model for multilingual entity detection and tracking. In Proc. HLT-NAACL 2004, pages 1–8, 2004. → pages 23
[20] Yoav Freund, Robert E. Schapire, and N. Abe. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771–780, 1999. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.5148&rep=rep1&type=pdf. → pages 27

[21] J. Jiang and C. Zhai. Instance Weighting for Domain Adaptation in NLP. In ACL 2007, 2007. → pages 22, 23, 70

[22] Jing Jiang. Domain Adaptation in Natural Language Processing. PhD thesis, University of Illinois at Urbana-Champaign, 2008. → pages 20

[23] Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition (Prentice Hall Series in Artificial Intelligence). Prentice Hall, 1st edition, 2000. ISBN 0130950696. → pages viii, 4, 5, 6, 7, 8, 10

[24] Josef Kittler, Mohamad Hatef, Robert P. W. Duin, and Jiri Matas. On combining classifiers. IEEE Trans. Pattern Anal. Mach. Intell., 20:226–239, March 1998. ISSN 0162-8828. doi:10.1109/34.667881. URL http://portal.acm.org/citation.cfm?id=279005.279007. → pages 69

[25] Abhishek Kumar, Avishek Saha, and Hal Daumé III. A Co-regularization Based Semi-supervised Domain Adaptation. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), Vancouver, Canada, 2010. URL http://hal3.name/docs/#daume10coreg. → pages 34, 49

[26] Julian Kupiec, Jan Pedersen, and Francine Chen. A trainable document summarizer. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95), pages 68–73, New York, NY, USA, 1995. ACM. ISBN 0-89791-714-6. doi:10.1145/215206.215333. URL http://doi.acm.org/10.1145/215206.215333. → pages 10

[27] Derek Lam, Steven L. Rohall, Chris Schmandt, and Mia K. Stern. Exploiting E-mail Structure to Improve Summarization. Technical Report TR2002-02, 2002. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.7056. → pages 17

[28] Chin-Yew Lin and Eduard Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL '03), pages 71–78, 2003. URL http://portal.acm.org/citation.cfm?doid=1073445.1073465. → pages 12

[29] H. P. Luhn. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2):159–165, 1958. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5392672. → pages 9

[30] David McClosky, Eugene Charniak, and Mark Johnson. Effective self-training for parsing. In HLT-NAACL 2006, pages 152–159, 2006. → pages 30

[31] David McClosky, Eugene Charniak, and Mark Johnson. Reranking and self-training for parser adaptation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the ACL (ACL '06), pages 337–344, July 2006. URL http://portal.acm.org/citation.cfm?doid=1220175.1220218. → pages 30

[32] Smaranda Muresan, Evelyne Tzoukermann, and Judith L. Klavans. Combining linguistic and machine learning techniques for email summarization. In Proceedings of the 2001 Workshop on Computational Natural Language Learning (CoNLL '01), pages 1–8, 2001. URL http://dx.doi.org/10.3115/1117822.1117837. → pages 16

[33] G. Murray and G. Carenini. Summarizing spoken and written conversations. In Proc. of EMNLP, pages 773–782, 2008. → pages viii, 2, 18, 36, 38, 39
[34] G. Murray and G. Carenini. Subjectivity Detection in Spoken and Written Conversations. Journal of Natural Language Engineering, 2010. → pages 38, 39

[35] Gabriel Murray. Using Speech-Specific Characteristics for Automatic Speech Summarization. PhD thesis, University of Edinburgh, 2008. → pages 14

[36] Gabriel Murray, Steve Renals, and Jean Carletta. Extractive summarization of meeting recordings. In Proceedings of the 9th European Conference on Speech Communication and Technology, pages 593–596, 2005. → pages 16

[37] Gabriel Murray, Steve Renals, Jean Carletta, and Johanna Moore. Evaluating automatic summaries of meeting recordings. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, Workshop on Machine Translation and Summarization Evaluation (MTSE), Ann Arbor, pages 39–52. Rodopi, 2005. → pages 16

[38] Gabriel Murray, Steve Renals, and Jean Carletta. Extractive summarization of meeting recordings. Analysis, 2006. URL http://www.isca-speech.org/archive/interspeech_2005. → pages 18

[39] Gabriel Murray, Giuseppe Carenini, and Raymond Ng. Interpretation and transformation for abstracting conversations. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT '10), pages 894–902, Morristown, NJ, USA, 2010. Association for Computational Linguistics. ISBN 1-932432-65-5. URL http://portal.acm.org/citation.cfm?id=1857999.1858131. → pages 38

[40] Ani Nenkova and Lucy Vanderwende. The impact of frequency on summarization. Technical Report MSR-TR-2005-101, Microsoft Research, Redmond, Washington, 2005. URL http://www.cs.bgu.ac.il/~elhadad/nlp09/sumbasic.pdf. → pages 9

[41] Kamal Nigam and Rayid Ghani. Analyzing the effectiveness and applicability of co-training, pages 86–93. ACM Press, 2000. URL http://portal.acm.org/citation.cfm?doid=354756.354805. → pages 28, 29

[42] O. Rambow, L. Shrestha, and J. Chen. Summarizing email threads. In Proc. of HLT-NAACL 2004, pages 105–108, 2004. → pages 16

[43] Brian Roark and Michiel Bacchiani. Supervised and unsupervised PCFG adaptation to novel domains. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1 (NAACL '03), pages 126–133, Morristown, NJ, USA, 2003. Association for Computational Linguistics. doi:10.3115/1073445.1073472. URL http://dx.doi.org/10.3115/1073445.1073472. → pages 24

[44] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871–1874, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf. → pages 41

[45] Karen Spärck Jones. Automatic summarising: The state of the art. Information Processing & Management, 43(6):1449–1481, 2007. → pages 8

[46] J. Ulrich, G. Murray, and G. Carenini. A publicly available annotated corpus for supervised email summarization. In AAAI08 EMAIL Workshop, Chicago, USA, 2008. AAAI. → pages 37

[47] S. Wan and K. McKeown. Generating overview summaries of ongoing email thread discussions. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 549–555, 2004. → pages 17, 18

[48] Christian Widmer. Domain adaptation in sequence analysis. Master's thesis, University of Tuebingen, 2008. → pages 66

[49] David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196, Cambridge, Massachusetts, USA, June 1995. Association for Computational Linguistics. doi:10.3115/981658.981684. URL http://www.aclweb.org/anthology/P95-1026. → pages 29
[50] Klaus Zechner. Automatic summarization of open-domain multiparty dialogues in diverse genres. Computational Linguistics, 28:447–485, December 2002. ISSN 0891-2017. doi:10.1162/089120102762671945. URL http://dx.doi.org/10.1162/089120102762671945. → pages 16
