Automatic Abstractive Summarization of Meeting Conversations

by

Tatsuro Oya

B.S., The University of Washington, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

August 2014

© Tatsuro Oya, 2014

Abstract

Nowadays, there are various ways for people to share and exchange information. Phone calls, e-mails, and social networking applications have made it much easier for us to communicate. Despite these convenient methods for exchanging ideas, meetings remain one of the most important ways for people to collaborate, share information, discuss their plans, and make decisions for their organizations. However, meetings have drawbacks as well. They are generally time consuming and require the participation of all members, and taking meeting minutes for the benefit of those who miss a meeting also requires considerable time and effort. For these reasons, there has been increasing demand for systems that automatically summarize meetings. So far, most summarization systems have applied extractive approaches, whereby summaries are created simply by extracting important phrases or sentences and concatenating them in sequence. However, because meeting transcripts consist of spontaneous utterances containing speech disfluencies such as repetitions and filled pauses, traditional extractive summarization approaches do not work effectively in this domain.

To address these issues, we present a novel template-based abstractive meeting summarization system requiring less annotated data than previous abstractive summarization approaches. In order to generate abstract and robust templates that can guide the summarization process, our system extends a novel multi-sentence fusion algorithm and utilizes lexico-semantic information. It also leverages the relationship between human-authored summaries and their source meeting transcripts to select the best templates for generating abstractive summaries of meetings.

In our experiments, we use the AMI corpus to instantiate our framework and compare it with state-of-the-art extractive and abstractive systems as well as human extractive and abstractive summaries. Our comprehensive evaluations, based on both automatic and manual approaches, demonstrate that our system outperforms all baseline systems and human extractive summaries in terms of both readability and informativeness. Furthermore, it achieves a level of quality nearly equal to that of human abstracts based on a crowd-sourced manual evaluation.

Preface

This work is based on past research by Giuseppe Carenini, Yashar Mehdad, and Raymond Ng on the automatic summarization of conversational data. The present writer conducted all the experiments and wrote most of the manuscript for this thesis. Giuseppe Carenini, Yashar Mehdad, and Raymond Ng were the supervisory authors of this project and were involved throughout in concept formation and manuscript editing. A version of Chapters 4 and 5 has been published as: Tatsuro Oya, Yashar Mehdad, Giuseppe Carenini, and Raymond Ng. A Template-based Abstractive Meeting Summarization: Leveraging Summary and Source Text Relationships. In Proceedings of the 8th International Conference on Natural Language Generation (INLG), June 2014, Philadelphia, PA, USA.
All of the experiments performed for this version were conducted by the present writer. The published paper was written by the present writer in conjunction with the other co-authors.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgements
Chapter 1: Introduction
  1.1 Problems of Automatic Meeting Summarization
  1.2 Previous Approaches
  1.3 Our Approach
  1.4 Contributions
  1.5 Outline of Thesis
Chapter 2: Summarization Background
  2.1 Extractive Summarization
    2.1.1 Unsupervised Extractive Summarization Approaches
    2.1.2 Supervised Extractive Summarization Approaches
  2.2 Abstractive Summarization
  2.3 Summary Evaluation
    2.3.1 ROUGE
    2.3.2 Pyramid Method
Chapter 3: Meeting Summarization
  3.1 Extractive Meeting Summarization
  3.2 Abstractive Meeting Summarization
  3.3 Meeting Corpus
Chapter 4: A Template-based Automatic Meeting Summarization System
  4.1 Template Generation Module
    4.1.1 Hypernym Labeling
    4.1.2 Clustering
    4.1.3 Template Fusion
  4.2 Summary Generation Module
    4.2.1 Topic Segmentation
    4.2.2 Phrase and Speaker Extraction
    4.2.3 Template Selection and Filling
      4.2.3.1 Associating Communities with Relevant Templates
      4.2.3.2 Finding Templates for Each Topic Segment
    4.2.4 Sentence Ranking
Chapter 5: Experimental Results and Discussion
  5.1 Data
  5.2 Automatic Evaluation
  5.3 Manual Evaluation
  5.4 Further Analysis
Chapter 6: Conclusion and Future Work
  6.1 Summary of Main Contributions
  6.2 Limitations and Future Work
Bibliography

List of Tables

Table 4.1: Dominant speakers and high-scored phrases extracted from a topic segment
Table 5.1: An evaluation of summarization performance using the F1 measures of ROUGE_1, ROUGE_2, and ROUGE_SU4
Table 5.2: Average rating scores
Table 5.3: T-test results of manual evaluation

List of Figures

Figure 3.1: An excerpt of a meeting transcript
Figure 4.1: Our meeting summarization framework
Figure 4.2: Some examples of the hypernym labeling task
Figure 4.3: A word graph generated from related templates and the highest scored path (shown in bold)
Figure 4.4: A link from an abstractive summary sentence to a subset of a meeting transcript that conveys or supports the information in the abstractive sentence
Figure 4.5: Process of associating each community with a group containing templates
Figure 4.6: Process of computing the average cosine similarities between a topic segment and all sets of communities in each group
Figure 5.1: An example demonstrating how our system generates a summary sentence
Figure 5.2: Another example demonstrating how our system generates a summary sentence
Figure 5.3: A comparison between a human-authored summary and a summary created by our system

Glossary

NLP: Natural Language Processing
POS: Part of Speech
MMR: Maximal Marginal Relevance
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
CRF: Conditional Random Field
FQG: Fragment Quotation Graph
SVM: Support Vector Machine
DA: Dialog Act
WSD: Word Sense Disambiguation

Acknowledgements

I offer my enduring gratitude to Drs. Giuseppe Carenini and Raymond Ng for their supportive supervision of my entire research career at the University of British Columbia (UBC). I owe particular thanks to Dr. Yashar Mehdad for his help and guidance. Special thanks are owed to my family for supporting me throughout my academic journey.

Chapter 1: Introduction

Many people spend a vast amount of time in meetings, which play a prominent role in their lives. Meetings are an important tool for making decisions, planning new businesses, collaborating with other participants, and other tasks that help organizations function efficiently. However, meetings tend to be very time consuming and often require the attendance of numerous participants, which is a burden for those with busy schedules. Minutes are often kept of the most important meetings, but writing high-quality minutes is a time-consuming and challenging task. To solve these problems, the study of automatic meeting summarization has attracted much attention, as it can save a great deal of time and increase productivity.

1.1 Problems of Automatic Meeting Summarization

Most summarization systems studied so far are designed to work well on organized texts such as news articles, in which documents have few grammatical errors and little redundancy. Compared with such work, automatic meeting summarization is a relatively new area. In general, conversational data is very different from traditional text and is more complicated to process. For example, while written news documents are revised multiple times by the writer and are consequently well organized, meeting conversations are completely spontaneous and contain numerous disfluencies, such as repetitions and filled pauses.
In addition, because meetings require that all participants be actively involved, participants may break into the conversation before the current speaker ends his or her turn. All of these characteristics make meeting summarization difficult, and traditional text summarization approaches do not work well in this domain.

1.2 Previous Approaches

Until recently, the extractive approach has been the most commonly used for meeting summarization. Since extractive summaries are built using only existing phrases in source meeting transcripts, these approaches are simpler than abstractive ones, and some notable approaches have been proposed in the past. In most cases, extractive meeting summarization systems are created by modifying general extractive summarization techniques so that they can deal with meeting transcripts.

For example, Garg et al. [16] proposed ClusterRank, an extension of one of the most common graph-based text summarization systems, TextRank [38]. The authors applied a different approach to computing the similarity between two utterances and demonstrated its effectiveness on meeting transcripts. Murray et al. [43] proposed a speech feature-based extractive meeting summarization system. They introduced prosodic features in addition to traditional lexical and syntactic features and successfully created a model that can identify important utterances in meetings. However, considering the spontaneous speech issues described above, these extractive approaches are not sufficient in the meeting domain. In fact, a user study conducted by Murray et al. [43] indicates that users prefer abstractive summaries to extractive ones. Accordingly, more attention has since been paid to abstractive meeting summarization systems [28, 37, 43, 52]. Mehdad et al. [37] create abstractive summaries by synthesizing multiple human utterances extracted from source transcripts, and their experimental results demonstrate the informativeness of their summaries. In contrast, Murray et al. [43] proposed a system that does not rely heavily on human utterances: it first maps specific information onto conversation ontologies and then generates summaries using an existing natural language generation tool.

Even though these approaches have greatly outperformed state-of-the-art extractive ones, they still have several drawbacks. The performance of the system of Mehdad et al. [37] degrades when meeting transcripts contain structural and grammatical errors, which are common in spontaneous speech. While the system of Murray et al. [43] generates summaries that are similar to human-authored ones, it relies heavily on annotated data.

1.3 Our Approach

In this thesis, we address the issues discussed above by introducing a novel summarization approach that can create readable summaries with less need for annotated data. Our system first acquires templates from human-authored summaries using a clustering and multi-sentence fusion algorithm. It then takes a meeting transcript to be summarized, segments the transcript according to topic, and extracts important phrases from it. Finally, our system selects templates by referring to the relationship between summaries and their sources, and fills the templates with the extracted phrases to create summaries.
We instantiate our framework on the AMI corpus [5] and compare our summaries with those created by a state-of-the-art system as well as those created by human annotators. The evaluation results demonstrate that our system successfully creates informative and readable summaries.

1.4 Contributions

There are three main contributions in this thesis:

1) Abstractive meeting summarization is still a new area because of its specific challenges, such as the scarcity of available conversational data and its complexity. This thesis tackles these problems by introducing a novel summarization approach that acquires templates from human-authored summaries and creates robust summaries with less annotated data.

2) In terms of the algorithms inside our framework, by adapting and extending a novel multi-sentence fusion algorithm, our system can generate abstract templates usable for producing readable and informative abstractive summaries. In addition, a novel template selection algorithm that effectively leverages the relationship between human-authored summary sentences and their source meeting transcripts ensures that our system generates summaries comparable to human-authored ones.

3) We have conducted a thorough evaluation using both automatic and manual approaches. An automatic evaluation using ROUGE suggests that our system-generated summaries correlate highly with human-authored ones and outperform the state-of-the-art abstractive meeting summarization approach. Furthermore, a manual evaluation using a crowdsourcing tool indicates that our system generates summaries that are more readable than, and nearly as informative as, human-created extractive summaries.

1.5 Outline of Thesis

The remainder of the thesis is organized as follows. In Chapter 2, we introduce both extractive and abstractive traditional text summarization approaches and explain how they are evaluated. In Chapter 3, we outline previous work on meeting summarization systems, explain their drawbacks, and introduce a publicly available meeting corpus. Chapter 4 presents our framework, which consists of two modules, namely, template generation and summary generation. In Chapter 5, we describe our evaluation strategy and report its results. Finally, we conclude the thesis in Chapter 6.

Chapter 2: Summarization Background

According to Mani and Maybury [33], text summarization is the process of distilling the most important information from a source document in order to produce an abridged version of it for a particular user task. Generally, automatic text summarization falls into two approaches: extractive and abstractive. In the extractive approach, systems create summaries by selecting salient text units, such as phrases or sentences, from the source document(s) and concatenating them. Whether a unit is selected is generally determined by the linguistic and statistical features it possesses. An abstractive approach, by contrast, requires a deeper understanding of the source texts. Systems typically first interpret the concepts in the source text using natural language processing (NLP) techniques, such as information extraction (IE), and then generate completely new, shorter text based on the extracted information. In this chapter, we survey each of the two approaches in detail by introducing general extractive and abstractive summarization techniques, followed by an explanation of how system summaries are evaluated.
2.1 Extractive Summarization

Here, we outline extractive summarization approaches. We first introduce unsupervised approaches and then focus on supervised approaches later in this section.

2.1.1 Unsupervised Extractive Summarization Approaches

As they do not require training data, unsupervised text summarization approaches have been studied for many years. Generally, there are three approaches to unsupervised extractive summarization, namely, 1) rank-based, 2) cluster-based, and 3) graph-based.

1) Ranking-based Approach

In this approach, each text unit is ranked based on sentence-level or word-level features, and the system then decides whether to include a unit according to its ranking score. One of the most common ranking-based approaches is maximal marginal relevance (MMR) [6]. MMR is a query-based summarization approach which attempts to produce summaries satisfying requests expressed in the form of queries. The approach is based on an iterative algorithm. At each step, one candidate text unit c (e.g., a sentence) is selected and assigned the following score:

\[ Score(c) = \lambda\, Sim_1(c, q) - (1 - \lambda) \max_{s \in S} Sim_2(c, s) \tag{2.1} \]

The score consists of two functions, Sim_1 and Sim_2. Sim_1 computes the similarity of a candidate unit c to a query q, and Sim_2 computes the similarity of c to each text unit s already extracted into the summary S. By subtracting the maximum Sim_2 score from the Sim_1 score, the system avoids redundancy in the summary; the degree of redundancy is tuned by the parameter λ. At each iteration the scores are recalculated and, if the maximum score exceeds a threshold, the corresponding sentence is included in the summary; this continues until the summary reaches a certain length. The original evaluation was conducted manually and demonstrated that MMR can greatly reduce redundancy in summaries.
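As a concrete illustration, the following minimal Python sketch implements the MMR selection loop of Eq. 2.1. TF-IDF cosine similarity stands in for both Sim_1 and Sim_2; this choice, and all names below, are assumptions of the sketch rather than details of the original algorithm.

```python
# A minimal sketch of the MMR selection loop (Eq. 2.1), assuming TF-IDF
# cosine similarity as a stand-in for both Sim_1 and Sim_2.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mmr_summarize(sentences, query, lam=0.7, max_sentences=3):
    matrix = TfidfVectorizer().fit_transform(sentences + [query])
    sent_vecs, query_vec = matrix[:-1], matrix[-1]
    sim_to_query = cosine_similarity(sent_vecs, query_vec).ravel()
    sim_between = cosine_similarity(sent_vecs)

    selected, candidates = [], list(range(len(sentences)))
    while candidates and len(selected) < max_sentences:
        def mmr(c):
            # Penalize similarity to units already in the summary.
            redundancy = max((sim_between[c][s] for s in selected), default=0.0)
            return lam * sim_to_query[c] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]
```

Setting λ close to 1 emphasizes query relevance, while smaller values emphasize non-redundancy, mirroring the trade-off described above.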
2) Cluster-based Approach

The second common approach to unsupervised extractive summarization is cluster-based. Its objective is very similar to that of the rank-based approach: selecting salient units while avoiding redundancy. To achieve this goal, the system clusters the document(s) according to topics, assigns a relevance score to each text unit, and selects highly scored units from each cluster to generate the final summary. One text summarization system using this approach was introduced by Aliguliyev [1], who generates a summary by applying clustering algorithms to a single document. The algorithm clusters the sentences of a document so that intra-cluster cosine similarity is maximized while inter-cluster cosine similarity is kept as low as possible. In order to produce a summary covering all topics, a representative sentence is chosen for each cluster by measuring its proximity to the other sentences in the cluster. After collecting these candidate sentences, the algorithm computes a relevance score for each sentence and ranks them accordingly; the highest-ranked sentences are then added to the summary. Although the evaluation is not clearly reported, the author claims that the clustering approach works effectively for text summarization, as it achieves as much homogeneity as possible within each cluster while maintaining distinctions between clusters.

3) Graph-based Approach

The third common approach is graph-based. Generally, graph-based methods are used to decide on the importance of the nodes in a graph; by treating text units as nodes, they can be applied to extractive summarization. One of the most notable systems is TextRank, introduced by Mihalcea and Tarau [38]. In this approach, the sentences of a document are represented by nodes in a graph, and any two nodes are connected by an edge whose weight is computed from the lexical similarity of the two sentences, measured as the number of overlapping words between them. The graph is then processed by the PageRank [4] algorithm to rank each node, and the top-ranked sentences are selected for the summary. The framework is publicly available and has been applied in several systems [22, 52]. However, although the algorithm works well for organized documents such as news articles, it is not suited to highly redundant documents such as conversational data, since it does not handle redundancy effectively.

Another popular graph-based approach is LexRank [9], which is essentially identical to TextRank; the two methods were developed by different groups at the same time. The main difference between them is that in TextRank the edge weight is computed from the number of overlapping words between two sentences, while in LexRank it is computed from their semantic similarity.
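The following condensed sketch shows a TextRank-style pipeline. The word-overlap similarity follows the spirit of the original measure, with a +1 inside the logarithms to avoid log(0); that smoothing, and the use of networkx for PageRank, are assumptions of this sketch, not details of the original paper.

```python
# A condensed TextRank-style sentence ranker: sentences are nodes,
# word-overlap similarity gives edge weights, PageRank scores the nodes.
import math
import networkx as nx

def overlap_similarity(s1, s2):
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    denom = math.log(len(w1) + 1) + math.log(len(w2) + 1)
    return len(w1 & w2) / denom if denom > 0 else 0.0

def textrank(sentences, top_n=3):
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            sim = overlap_similarity(sentences[i], sentences[j])
            if sim > 0:
                graph.add_edge(i, j, weight=sim)
    scores = nx.pagerank(graph, weight="weight")
    top = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return [sentences[i] for i in sorted(top)]  # keep document order
```

Swapping overlap_similarity for a semantic similarity function would turn this sketch into a LexRank-style ranker, reflecting the difference noted above.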
2.1.2 Supervised Extractive Summarization Approaches

When a corpus contains a set of training documents and their extractive summaries, we can treat extractive summarization as a simple binary classification problem. In general, classification models are learned from linguistic and syntactic features extracted from each text unit in the training data. Using the learned model, text units in documents are classified as important or unimportant, and those classified as important are included in the final summary.

Hirao et al. [21] proposed a supervised method for sentence extraction based on the Support Vector Machine (SVM). In their work, both syntactic and linguistic features are extracted from sentences in the training data to learn the SVM model. They then tested their method on the Text Summarization Challenge (TSC) [12], whose corpus contains 180 Japanese documents from the Mainichi Newspapers. In their experiment, their approach was compared with several unsupervised approaches and was shown to offer the highest accuracy.

Extractive summarization using sequential labeling techniques has also been studied. Shen et al. [48] proposed a linear-chain Conditional Random Field (CRF) based approach to extractive document summarization. They treated the summarization task as a sequence labeling problem in order to take advantage of the interactions between sentences. Their approach showed significant improvement over other, non-sequential classifiers.

2.2 Abstractive Summarization

Extractive summaries have several weak points. For example, they consist of text units that cannot be divided further, so summaries tend to include unimportant information, since not all parts of an extracted unit are important. Also, readers of extractive summaries sometimes cannot completely understand the contents of a document, because extractive summarization systems cannot deal with certain linguistic problems such as coreference resolution. In such cases, readers must explore the original source documents and read the context of the extracted sentences to fully understand the document's contents. Abstractive summarization is the key to resolving these problems, since it can generate a completely new summary that conveys important information using words or phrases that do not appear in the source documents. The main difficulty of abstractive summarization is that the system tends to become more complex, as it requires a deeper understanding of texts than extractive approaches do.

In this section, we introduce several common approaches to abstractive summarization. For each approach, we first discuss its main concept and then outline the relevant research.

1) Sentence Compression-based Approach

The simplest way to create an abstractive summary is by compressing the sentences of a source document. The goal of sentence compression is to prune words from a sentence without losing its content and without degrading its grammaticality; applying this technique to an entire document thus yields an abstractive summary of it.

One of the most notable sentence compression-based abstractive summarization systems was introduced by Knight and Marcu [25], who proposed a noisy-channel model for sentence compression. The channel model was created from a stochastic context-free grammar: each rule is extracted by parsing a parallel corpus consisting of sentences and their compressed counterparts, and the probability of the rule is estimated by maximum likelihood. To compress sentences, two probabilistic models are considered: a language model P(y), whose purpose is to determine whether the compressed sentence y is grammatical, and a channel model P(x|y), which captures the probability that the source sentence x was constructed from the compressed sentence y. In the decoding process, the algorithm searches for the compressed sentence y that maximizes P(y)P(x|y). In their evaluation, participants were asked to rate both system-generated and human-made compressed sentences, and the results demonstrate that the compression approach performs as well as the manual one. One weakness of this approach is that it creates summaries using only words from the source documents; ideal abstractive summaries instead paraphrase the source sentences using words not explicitly present in the sources.
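The decoding step can be made concrete with the following schematic, which scores candidate compressions by log P(y) + log P(x|y). Both model functions below are toy stand-ins labeled as such; they are assumptions of this sketch and not Knight and Marcu's trained models.

```python
# A schematic of noisy-channel decoding for sentence compression: among
# candidate compressions y of a source x, pick the y maximizing
# log P(y) + log P(x|y). The two models below are toy placeholders.
def lm_logprob(y):                        # stand-in for log P(y)
    return -2.0 * len(y.split())          # toy: shorter strings score higher

def channel_logprob(x, y):                # stand-in for log P(x|y)
    dropped = len(set(x.split()) - set(y.split()))
    return -1.0 * dropped                 # toy: penalize dropped words

def decode(x, candidates):
    return max(candidates, key=lambda y: lm_logprob(y) + channel_logprob(x, y))
```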
2) Predicate-argument Structure-based Approach

The next approach relies on predicate-argument structures. The system generally works as follows: it extracts predicate-argument structures from a source document and leverages them to represent the document's contents, and then uses a language generator to create the summary. The most notable summarization framework using this approach was introduced by Barzilay et al. [2]. Given a set of similar sentences on one specific theme drawn from multiple documents, the system first maps these sentences onto predicate-argument tree structures. Each of these tree structures is then traversed and compared with the others so that they can be merged into new output phrases covering the most common topics across the theme. After being sorted, these phrases are passed to a language generation system, FUF/SURGE, which aggregates the selected phrases into a summary. Their experiment indicates that merging similar information into one sentence significantly improves the quality of the resulting summaries.

3) Template-based Approach

Some abstractive summarization systems use templates for creating summaries. This technique first derives templates, manually or automatically, from texts of a specific domain and then uses them to create a summary of a given document or dataset. In particular, it extracts salient information from the document or data to be summarized and creates summaries by filling that information into the templates. One example of this approach was introduced by Kondadadi et al. [26], who derive a template bank from a corpus of summary texts in a target domain. To create the template bank, the system first identifies domain-specific entities on a per-sentence basis; each sentence is labeled with tags such as "Data" and "Event" and then organized into semantically similar groups by k-means clustering. Next, the system extracts features from the templates and trains a ranking SVM model so that it can select the best templates for the input data. Finally, given the input data to be summarized, the system selects templates using the ranking SVM model, fills the data into the templates, and creates the summary. The researchers implemented this approach for the weather and biography domains and demonstrated its effectiveness through a manual evaluation using a crowdsourcing tool.

4) Rule-based Approach

Another common way to generate abstractive summaries is to manually create rules for making summaries. One example of this type of summarization system was introduced by Genest and Lapalme [18]. Their system generates abstractive summaries from news documents whose events are strictly defined and categorized. The methodology is based on an abstraction scheme, which first extracts aspects (e.g., WHAT, WHEN, and WHERE) of a given news category using a manually defined rule-based Information Extraction (IE) module. For example, in the Attack category, the IE module extracts information about the type of attack and when and where it occurred. It then chooses the most relevant aspects, which become the contents of the summary. Finally, the selected aspects are sent to a language generation module, the SimpleNLG realizer [14], to generate the summary. In their experiment, they created summaries for the Attack category and demonstrated, based on the Pyramid method (see Section 2.3.2), that their approach can create summaries with high information density. The only drawback is that creating all the rules for extracting aspects requires tremendous manual effort.
5) Semantic Model-based Approach

Studies have also been performed on creating summaries based on semantic models. In this approach, a semantic model consisting of concepts and the relationships between them is first constructed to represent the documents in question; the summary is then created using this model. As an example, Greenbacker [19] proposes a semantic model-based abstractive summarization system for multimodal documents containing both text and images. The framework is divided into three steps: building the semantic model, rating the informational content, and generating a summary. In the first step, the system builds a semantic model using a knowledge representation based on structured objects organized under a foundational ontology [32]. Once the semantic model has been constructed, the system rates each concept using information density metrics. In particular, the density of each concept is determined by the completeness of its attributes, the number of connections/relationships it has with other concepts, and the number of expressions realizing the concept in the document. Finally, the important concepts are expressed as sentences. Even though this project, particularly the final step, is still unfinished, the strength of the approach is that, using the semantic model, the system can incorporate concepts obtained from non-text components (e.g., images) and generate summaries of multimodal documents.

2.3 Summary Evaluation

A good summary should be easily readable and contain only the most important information. Several approaches have been introduced to evaluate the quality of summaries automatically. In this section, we introduce the two most common summarization evaluation techniques. Both of these widely available approaches evaluate summaries by comparing them with gold-standard human summaries. Note that summary evaluation is still an open issue: apart from manual evaluation, no perfect method has yet been found for evaluating system summaries.

2.3.1 ROUGE

The most common method for summarization evaluation is ROUGE [27], which has been widely used to evaluate various summarization systems. The algorithm measures the quality of a system-generated summary by computing its recall against one or more reference summaries, based on overlapping words or word sequences. For example, ROUGE_N is the n-gram recall between a system summary and the reference summaries, computed as follows:

\[ ROUGE\_N = \frac{\sum_{R \in References} \sum_{gram_n \in R} Count_{match}(gram_n)}{\sum_{R \in References} \sum_{gram_n \in R} Count(gram_n)} \tag{2.2} \]

where gram_n is an n-gram and Count_match(gram_n) is the maximum number of n-grams that appear in both the system and reference summaries. Note that the algorithm effectively gives more weight to system summaries containing n-grams that co-occur in more than one reference summary; thus, a summary containing words shared by several references receives a higher ROUGE_N score.

There are other evaluation metrics in the ROUGE family. For example, ROUGE_SU4 computes overlap based on skip bigrams, where any pair of words with fewer than four words between them is treated as a bigram; this gives more flexibility than the ROUGE_N metrics. ROUGE was evaluated on newswire data and has been shown to correlate well with manual evaluations. In speech domains, however, recent research has shown that the correlation between ROUGE scores and human evaluations is generally low [28, 42].
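A from-scratch sketch of the ROUGE_N recall of Eq. 2.2 is given below, with clipped n-gram matches accumulated over multiple references. Real evaluations use the official ROUGE toolkit; this version is only illustrative.

```python
# A sketch of ROUGE_N recall (Eq. 2.2): clipped n-gram matches over all
# reference summaries, divided by the total reference n-gram count.
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(system, references, n=2):
    sys_counts = ngram_counts(system.lower().split(), n)
    matched = total = 0
    for ref in references:
        ref_counts = ngram_counts(ref.lower().split(), n)
        total += sum(ref_counts.values())
        matched += sum(min(c, sys_counts[g]) for g, c in ref_counts.items())
    return matched / total if total else 0.0
```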
2.3.2 Pyramid Method

One drawback of using ROUGE for summary evaluation is that the content included in a gold-standard summary generally depends on its annotator, so ROUGE scores vary with the particular human summaries used. To address this problem, Nenkova and Passonneau [45] introduced a new evaluation approach, the pyramid method, which builds its evaluation reference by effectively combining information from multiple human summaries.

The pyramid method works as follows. First, annotators identify similar sentences. Then, they closely examine the sentences and extract the subparts that represent the content of the collected sentences; these are called Summarization Content Units (SCUs). Each SCU is weighted according to the number of summaries containing it, and a pyramid is built based on the weights of the SCUs. The score of a system summary S is computed as the ratio of the sum of the weights of the SCUs in S to the sum of the weights of an ideal summary with the same number of SCUs:

\[ Score(S) = \frac{\sum_{scu \in S} weight(scu)}{Max} \tag{2.3} \]

where Max denotes the total SCU weight of an ideal summary containing the same number of SCUs as S. The advantage of the pyramid method is that the process makes it easy to identify missing information in a system summary, which helps in tuning the summarization system to improve its results. According to the investigation of Nenkova et al., at least five reference summaries are needed for the scores to be independent of the particular set of reference summaries chosen. However, building a pyramid from reference summaries requires considerable effort.
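The pyramid score of Eq. 2.3 reduces to a short computation once the SCUs have been identified and weighted by annotators; the plain weight lists below are an illustrative simplification.

```python
# A sketch of the pyramid score (Eq. 2.3): the weight of the SCUs found in
# the system summary, over the weight of an ideal summary with the same
# number of SCUs (the heaviest SCUs in the pyramid).
def pyramid_score(summary_scu_weights, pyramid_scu_weights):
    n = len(summary_scu_weights)
    max_weight = sum(sorted(pyramid_scu_weights, reverse=True)[:n])
    return sum(summary_scu_weights) / max_weight if max_weight else 0.0

# e.g., a summary expressing SCUs of weights [4, 2, 1] against a pyramid
# whose three heaviest SCUs weigh [4, 4, 3] scores 7 / 11.
```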
Chapter 3: Meeting Summarization

In this chapter, we first survey recent work on both extractive and abstractive meeting summarization and then introduce a meeting corpus.

3.1 Extractive Meeting Summarization

Here, we describe the most notable approaches to extractive meeting summarization. In most cases, extractive meeting summarization systems are built by extending traditional summarization techniques.

1) Rank-based Approach

In Section 2.1, we introduced MMR [6] for text summarization. This approach has been applied to meeting summarization by Xie and Liu [53], who implemented the MMR framework with different similarity measures on the ICSI meeting corpus [23]. Specifically, in order to compute similarity effectively at the semantic level, they introduced a new similarity measure based on counting the number of times a word appears near each other word in the corpus. Using ROUGE [27], they compared their approach with an orthodox MMR system and demonstrated that theirs significantly outperformed the baseline.

2) Graph-based Approach

Garg et al. [16] proposed a graph-based meeting summarization system, ClusterRank, which is an extension of TextRank [38]. They modified TextRank so that the algorithm can handle the high noise and redundancy of meeting transcripts. In ClusterRank, the algorithm first clusters sentences by merging similar ones: initially, each sentence is treated as a separate cluster, and clusters are merged if their cosine similarity exceeds a threshold defined on a development set. A graph is then constructed by treating each cluster as a node and is processed with the PageRank algorithm [4]. The important difference between ClusterRank and TextRank is that ClusterRank uses the cosine similarity measure to compute edge weights, while TextRank simply counts the number of words shared by the two nodes. Garg et al. conducted their experiment on the AMI corpus [5], and their results show that ClusterRank performs better than TextRank.

3) Supervised Approaches

Just as with traditional text summarization systems, some meeting summarization systems use labeled training data to learn statistical models. Usually, supervised approaches perform better than unsupervised ones in the meeting domain. In [42], to summarize meeting recordings, Murray et al. introduced a speech feature-based extractive summarization system for meeting conversations. In addition to simple lexical features, such as the tf-idf scores of words in utterances, they used basic prosodic features, such as fundamental frequency, energy, and duration, for each utterance. Gaussian mixture models were used to classify sentences with these features. They compared their approach with MMR, LSA, and a traditional feature-based classification approach; the evaluation was conducted using human judgment, and the results showed their approach outperforming all the baselines. Later, Murray et al. [39] proposed an effective summarization technique that works in both the meeting and email domains. They introduced general conversational features that are effective across several domains: treating meetings and emails as general conversations, they extracted only these general features and trained a logistic regression model to create summaries. Their experiments demonstrate that using these general conversational features in a machine-learning sentence classification framework yields performance competitive with state-of-the-art systems that rely on domain-specific features.

3.2 Abstractive Meeting Summarization

As in other domains, the most common approach to automatic meeting summarization has been extractive. Since extractive approaches do not require natural language generation techniques, they are arguably simpler to apply and have been extensively investigated. However, meeting transcripts consist of noisy, unstructured text, such as ungrammatical, disfluent utterances. Moreover, if the input is automatically recognized transcripts, word errors will also appear in system-generated extractive summaries. Thus, simply concatenating important sentences into an extractive summary does not work effectively, and more attention has recently been paid to abstractive meeting summarization systems. Some of the most notable studies on abstractive meeting summarization are those of Liu and Liu, Mehdad et al., Murray et al., and Wang and Cardie [28, 37, 43, 52], each of which is explained below.

In [28], Liu and Liu create meeting summaries by applying several sentence compression algorithms to important extracted sentences. First, they applied an integer programming (IP) based sentence compression algorithm [8], which prunes words and preserves only the word sequence that maximizes an objective function obtained from a language model. Second, they used the lexicalized Markov grammar-based sentence compression approach [15], an extension of the noisy-channel-based compression approach [25]. They used the ICSI meeting corpus for their experiment, compared these approaches against abstractive summaries, and demonstrated that their approach improved human-judged readability and ROUGE scores.

The approach introduced by Mehdad et al. [37] first clusters human utterances into communities [41] and then builds an entailment graph over each community in order to select the salient utterances. It then applies a semantic word graph algorithm to these utterances to create abstractive summaries. Their results show some improvement in creating informative summaries.

The previous two approaches create summaries of entire meetings.
Recently, studies on creating abstractive summaries of specific aspects of meetings, such as decisions, actions, and problems, known as focused meeting summarization [7], have also been introduced. The latest studies of this type were conducted by Murray et al. [43] and Wang and Cardie [52]. The system introduced by Murray et al. [43] first classifies human utterances into specific aspects of meetings, e.g., decisions, problems, and actions, and then maps them onto conversation ontologies. It then selects the most informative subsets of these ontologies and finally generates abstractive summaries from them using a natural language generation tool, simpleNLG [17]. After creating summaries of the specific aspects, it aggregates them into one summary and thereby successfully creates summaries covering whole meetings. Wang and Cardie [52] introduced a template-based focused abstractive meeting summarization system. Their system first clusters human-authored summary sentences and applies a multiple-sequence alignment algorithm to them to generate templates. Then, given the meeting transcript to be summarized, it identifies clusters of human utterances describing a specific aspect and extracts all summary-worthy relation instances, i.e., indicator-argument pairs, from them. Finally, the templates are filled with these relation instances and ranked accordingly to generate summaries of a specific aspect of the meeting.

Although the two approaches above both succeed in creating readable summaries, they rely on a large amount of annotated information, such as dialog act types, and also require classifying human utterances that contain much noise and ill-structured grammar. Our approach is inspired by the work introduced here but improves on its shortcomings. Unlike the systems of Murray et al. and Wang and Cardie, our system relies less on annotated training data and does not require a classifier. In addition, our evaluation indicates that our system can create summaries of entire conversations that are more informative and readable than those of Mehdad et al.

3.3 Meeting Corpus

The AMI Meeting Corpus [5] is the most common publicly available meeting corpus, and thus our work uses this dataset. The corpus consists of 100 hours of recordings covering 139 different meetings. These meetings were elicited: four participants role-played a scenario in which each was assigned a particular role in a fictitious company and took part in a series of four meetings. The duration of each meeting varied between 15 and 45 minutes, depending on its progress and on the participants. All meetings were transcribed manually by annotators and also automatically by an automatic speech recognition (ASR) system. In addition, the annotators wrote an abstractive summary of each meeting, and for each abstractive summary sentence, they extracted the utterances from the meeting transcript that best explained the information in that sentence, connecting the utterances and the summary sentence with links. The set of utterances linked to an abstractive summary sentence is called a community [41], and all extractive summaries consist of these communities. Figure 3.1 is an excerpt of a manually transcribed meeting conversation, in which each line corresponds to an utterance along with its speaker and turn number.
From this figure, it is 18  obvious that meeting transcripts are different by far from the text documents usually used in text summarization research (e.g., news articles). Specifically, utterances often contain disfluencies and consist of incomplete sentences. 19    Figure 3.1: An excerpt of a meeting transcript 20  Chapter 4: A Template-based Automatic Meeting Summarization System In this chapter, we present our novel template-based abstractive meeting summarization system. In order for summaries to be readable and informative, they should be grammatically correct and contain important information in meetings. To this end, we have created our framework consisting of the following two components: 1) An off-line template generation module, which generalizes collected human-authored summaries and creates templates from them; and 2) An on-line summary generation module, which segments meeting transcripts based on the topics discussed, extracts the important phrases from these segments, and generate abstractive summaries of them by filling the phrases into the appropriate templates. Figure 4.1 depicts our framework. In the following sections, we describe each of the two components in detail. 21    Figure 4.1: Our meeting summarization framework. Top: Off-line template generation module.  Bottom: On-line summary generation module.22  4.1 Template Generation Module Our template generation module attempts to satisfy two possibly conflicting objectives. First, templates should be quite specific such that they accept only the relevant fillers. Second, our module should generate generalized templates that can be used in many situations. We assume that the former is achieved by labeling phrases with their hypernyms that are not too general and the latter by merging related templates. Based on these assumptions, we divide our module into the three tasks: 1) Hypernym labeling; 2) Clustering; and 3) Template fusion.  4.1.1 Hypernym Labeling Templates are derived from human-authored meeting summaries in the training data. We first collect sentences whose subjects are meeting participant(s) and that contain active root verbs, from the summaries. This is achieved by utilizing meeting participant information provided in the corpus and parsing sentences with the Stanford Parser [34]. The motivation behind this process is to collect sentences that are syntactically similar. We then identify all noun phrases in these sentences using the Illinois Chunker [46]. This chunker extracts all noun phrases as well as part of speech (POS) for all words. To add further information on each noun phrase, we label the right most nouns (the head nouns) in each phrase with their hypernyms using WordNet [10]. In WordNet, hypernyms are organized into hierarchies ranging from the most abstract to the most specific. For our work, we utilize the fourth most abstract hypernyms in light of the first goal discussed at the beginning of Section 4.1, i.e. not too general. To disambiguate the sense of the nouns, we simply select the sense that has the highest frequency in WordNet.  At this stage, all noun phrases in sentences are tagged with their hypernyms defined in WordNet, such as “artifact.n.01”, and “act.n.02”, where n’s stands for nouns and the two digit numbers represent their sense numbers. We treat these hypernym-labeled sentences as templates and the phrases as blanks. 
4.1.2 Clustering

Next, we cluster the templates into similar groups. We utilize root verb information for this process, assuming that the root verbs appearing in summaries, such as "discuss" and "suggest", are the most informative factors in describing meetings. After extracting the root verbs from the summary sentences, we create a fully connected graph in which each node represents a root verb and each edge carries a score denoting how similar the two word senses are. To measure the similarity of two verbs, we first identify the verb senses based on their frequency in WordNet and compute a similarity score based on the shortest path connecting the senses in the hypernym taxonomy. We then convert the graph into a similarity matrix and apply the Normalized Cuts method [49] to cluster the root verbs. Finally, all templates are organized into the groups formed by their root verbs.

4.1.3 Template Fusion

We further generalize the clustered templates by creating a word graph and selecting the best paths through it. This approach was originally proven effective in summarizing clusters of related sentences [3, 11, 37]; we extend the graph so that it can be applied to templates.

Word Graph Construction

In our system, a word graph is a directed graph with words or blanks serving as nodes and edges representing adjacency relations. Given a set of related templates in a group, the graph is constructed by first creating a start and an end node and then iteratively adding templates to it. When adding a new template, the algorithm first checks each word in the template to see whether it can be mapped onto an existing node in the graph. A word is mapped onto a node if the node consists of the same word with the same POS tag and no word from this template has yet been mapped onto the node. The algorithm then checks each blank in the template and maps it onto a node if the node consists of the same hypernym-labeled blank and no blank from this template has yet been mapped onto the node. When more than one node refers to the same word or blank in the template, or when more than one word or blank in the template can be mapped onto the same node in the graph, the algorithm checks the neighboring nodes in the current graph as well as the preceding and subsequent words or blanks in the template; the word-node or blank-node pairs with higher contextual overlap are selected for mapping. Otherwise, a new node is created and added to the graph. As a simplified illustration, Figure 4.3 shows a word graph obtained from the following four templates:

After introducing [situation.n.01], [speaker] then discussed [content.n.05] .
Before beginning [act.n.02] of [artifact.n.01], [speaker] discussed [act.n.02] and [content.n.05] for [artifact.n.01] .
[speaker] discussed [content.n.05] of [artifact.n.01] and [material.n.01] .
[speaker] discussed [act.n.02] and [asset.n.01] in attracting [living_thing.n.01] .

Figure 4.3: A word graph generated from related templates and the highest scored path (shown in bold)
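A simplified sketch of the graph construction follows. Nodes are (token, POS) pairs, with blanks carrying their hypernym label as the token; the full system's context-based disambiguation of ambiguous mappings is omitted here, and a repeated token within one template simply gets a fresh copy of the node. These simplifications are assumptions of the sketch.

```python
# A simplified word-graph builder: edges record adjacency frequencies, and
# a node may absorb at most one token per template.
import networkx as nx

START, END = ("<s>", ""), ("</s>", "")

def add_template(graph, template, tid):
    prev, used = START, {START}
    for item in template + [END]:
        node = item if item not in used else item + (tid,)  # fresh copy
        used.add(node)
        freq = (graph.get_edge_data(prev, node) or {"freq": 0})["freq"] + 1
        graph.add_edge(prev, node, freq=freq)
        prev = node

graph = nx.DiGraph()
templates = [
    [("[speaker]", "BLANK"), ("discussed", "VBD"), ("[content.n.05]", "BLANK")],
    [("[speaker]", "BLANK"), ("discussed", "VBD"), ("[act.n.02]", "BLANK")],
]
for tid, template in enumerate(templates):
    add_template(graph, template, tid)
```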
Path Selection

The word graph generates many paths connecting its start and end nodes, and not all of them are readable or usable as templates. Our aim is to create concise, generalized templates; we therefore use the following ranking strategy to select the ideal paths. First, to filter out unacceptable or overly complex templates, the algorithm prunes away paths that have more than three blanks, contain subordinate clauses, contain no verb, have two consecutive blanks, contain blanks not labeled by any hypernym, or are shorter than three words. These rules, which were defined through close observation of results on our development set, greatly reduce the chance of selecting ill-structured templates. Second, the remaining paths are reranked by 1) a normalized path weight and 2) a language model learned from the hypernym-labeled human-authored summaries in our training data, each of which is described below.

1) Normalized Path Weight

We adapt Filippova's [11] approach to computing the edge weight:

\[ w(e_{i,j}) = \frac{freq(i) + freq(j)}{\sum_{p \in P} diff(p, i, j)^{-1}} \tag{4.1} \]

where e_{i,j} is the edge connecting nodes i and j in the graph, freq(i) is the number of words and blanks in the templates that are mapped onto node i, and diff(p, i, j) is the distance between the offset positions of nodes i and j in path p. This weight is defined so that paths that are informative and contain salient (frequent) words are selected. To calculate the path score W(p), all the edge weights on the path are summed and normalized by the path length.

2) Language Model

Although the goal is to create concise templates, these templates must be grammatically correct. Hence, we train an n-gram language model on all templates generated from the training data in the hypernym labeling stage. For each path, we then compute the sum of the negative log probabilities of its n-gram occurrences and normalize the score by the path length; this score is denoted H(p). The final score of each path is calculated as follows:

\[ Score(p) = \alpha\, W(p) + \beta\, H(p) \tag{4.2} \]

where α and β are coefficient factors tuned on our development set. For each cluster group, the ten best-scored paths are selected as templates and added to the group. As an illustration, the path shown in bold in Figure 4.3 is the highest scored path obtained from this ranking strategy.

In this section, we have introduced an effective way of generating generalized templates from human-authored summaries. The approach, consisting of three tasks, i.e., hypernym labeling, clustering, and template fusion, makes it possible for the system to generate generalized templates that can describe as many different situations as possible.
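The combined score of Eq. 4.2 amounts to the short computation below. The edge weights are assumed to come from Eq. 4.1 and the n-gram log-probabilities from the language model described above; the direction of each term (reward versus cost) is absorbed by the tuned coefficients, which is an assumption of this sketch.

```python
# A sketch of the path-ranking score (Eq. 4.2): a length-normalized path
# weight W(p) combined with a length-normalized LM score H(p).
def path_score(edge_weights, ngram_logprobs, alpha=1.0, beta=1.0):
    W = sum(edge_weights) / len(edge_weights)              # normalized weight
    H = sum(-lp for lp in ngram_logprobs) / len(ngram_logprobs)
    return alpha * W + beta * H
```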
4.2 Summary Generation Module
This section explains our summary generation module, which consists of four tasks: 1) topic segmentation; 2) phrase and speaker extraction; 3) template selection and filling; and 4) sentence ranking.

4.2.1 Topic Segmentation
It is important for a summary to cover as many topics as possible. Therefore, given a meeting transcript to be summarized, and after removing speech disfluencies such as "uh" and "ah", we employ the topic segmenter LCSeg [14] to segment the meeting by topic. LCSeg assumes that topic shifts occur where strong term repetitions start and end. The algorithm first identifies lexical chains by observing word repetitions; it then ranks each chain according to its frequency and compactness, so that chains that are compact and frequent receive higher scores. For each sentence boundary, LCSeg uses the chains to compute the cosine similarity between the contexts on either side of the transition, and when a sharp change in similarity is identified, it creates a topic boundary.

One shortcoming of LCSeg is that it ignores speaker information when segmenting transcripts, even though important topics are often discussed by only one or two speakers. Therefore, to take advantage of speaker information, we extend LCSeg with the following post-processing step: if a topic segment contains more than 25 utterances, we subdivide the segment by speaker. These subsegments are then compared with one another using cosine similarity, and if the similarity score exceeds a threshold (0.05), they are merged. The two numbers, i.e., 25 and 0.05, were selected on the development set so that, when segmenting a transcript, the system can take speaker information into account without creating too many segments.

4.2.2 Phrase and Speaker Extraction
Next, all salient phrases are extracted from each topic segment in the same manner as in the template generation module of Section 4.1, by: 1) extracting all noun phrases; and 2) labeling each phrase with the hypernym of its head noun. To identify the salient phrases, these phrases are then scored and ranked based on the sum of the frequencies of their words in the segment. Finally, to handle redundancy, we remove phrases that are subsets of others; a brief sketch of this step is given below. In addition, the transcript records the speaker of each utterance in the meeting, so we extract the most dominant speakers' name(s) for each topic segment and label them as "speaker". These phrases and this speaker information are later used during the template filling process. Table 4.1 below shows an example of dominant speakers and high scored phrases extracted from a topic segment.

Dominant speakers: Project Manager; Industrial Designer
High scored phrases and their hypernyms: the whole look (appearance.n.01); the company logo (symbol.n.01); the product (artifact.n.01); the outside (region.n.01); electronics (content.n.05); the fashion (manner.n.01)

Table 4.1: Dominant speakers and high scored phrases extracted from a topic segment
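The scoring and redundancy-removal step just described can be sketched as follows. The noun-phrase extraction itself (e.g., by a syntactic chunker) is assumed to have already produced the candidate phrases; the function and variable names are illustrative, not taken from our implementation.

```python
from collections import Counter

def rank_phrases(candidate_phrases, segment_tokens, top_n=5):
    """Score candidate noun phrases by the summed frequency of their
    words in the topic segment, drop phrases subsumed by a higher-
    scored phrase, and return the top_n survivors."""
    freq = Counter(w.lower() for w in segment_tokens)
    ranked = sorted(candidate_phrases,
                    key=lambda p: sum(freq[w.lower()] for w in p.split()),
                    reverse=True)
    kept = []
    for phrase in ranked:
        words = set(phrase.lower().split())
        # Redundancy removal: skip a phrase whose words are already
        # contained in a kept (higher-scored) phrase.
        if not any(words <= set(k.lower().split()) for k in kept):
            kept.append(phrase)
    return kept[:top_n]
```

For instance, given the phrases of Table 4.1 and the tokens of their segment, the function would return the five most frequent, mutually non-redundant phrases in score order.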
Figure 4.4: A link from an abstractive summary sentence to a subset of a meeting transcript that conveys or supports the information in the abstractive sentence

4.2.3 Template Selection and Filling
In our training data, every human-authored abstractive summary sentence is linked to the subsets of its source transcript that support and convey the information in the abstractive sentence, as illustrated in Figure 4.4. These subsets are called communities [41]. Since each community is used to create one summary sentence, we hypothesize that each community covers one specific topic. Thus, to find the best templates for each topic segment, we refer to our training data: we first find communities in the training set that are similar to the topic segment and then identify the templates derived from the summary sentences linked to these communities. This process is done in two steps: 1) associating the communities in the training data with the template groups created in our template generation module; and 2) finding templates for each topic segment by comparing the similarity between the segment and the sets of communities associated with each template group. Below, we describe the two steps in detail.

4.2.3.1 Associating Communities with Relevant Templates
Recall that in the template generation module of Section 4.1, we label the human-authored summary sentences in the training data with hypernyms and cluster them into similar groups. Thus, as shown in Figure 4.5, we first associate each set of communities in the training data with one of these groups by determining to which group the summary sentence linked to the communities belongs.

Figure 4.5: Process of associating each community with a group containing templates

4.2.3.2 Finding Templates for Each Topic Segment
Next, for each topic segment, we compute the average cosine similarity between the segment and all communities in each of the groups; a sketch of this computation is given at the end of this subsection.

Figure 4.6: Process of computing the average cosine similarities between a topic segment and all sets of communities in each group

At this stage, each community is already associated with a group that contains ranked templates, and each segment has a list of average scores measuring its similarity to the communities in each group. Hence, the templates used for each segment are chosen from the groups with the highest scores. Our system now contains, for each segment, a set of scored phrases and suitable templates, as well as the most dominant speakers' name(s). Candidate sentences are therefore generated for each segment by first selecting the speakers' name(s), then selecting the phrases and templates based on their scores, and finally filling the templates with phrases whose labels match. Here, we limit the number of sentences created for each topic segment to a maximum of 30; this limit keeps the system from generating sentences built from low scored phrases and templates. Finally, the candidate sentences are passed to our sentence ranking module.
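A minimal sketch of the group-scoring step of Section 4.2.3.2 is given below, representing the segment and the communities as bags of words; the vector construction is simplified (raw term frequencies rather than any weighting scheme), and the names are illustrative.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_template_groups(segment_tokens, groups):
    """groups: {group_id: [community_token_lists]} built from the
    training data as in Figure 4.5. Returns the group ids sorted by
    the average similarity between the segment and the communities
    associated with each group (Figure 4.6)."""
    seg = Counter(segment_tokens)
    avg = {
        gid: sum(cosine(seg, Counter(c)) for c in comms) / len(comms)
        for gid, comms in groups.items() if comms
    }
    return sorted(avg, key=avg.get, reverse=True)
```

The templates for a segment are then drawn from the highest-ranked groups returned by this function.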
4.2.4 Sentence Ranking
Our system creates many candidate sentences, most of which are redundant. Hence, to select the most fluent, informative, and appropriate sentences, we create a sentence ranking model that considers 1) fluency, 2) coverage, and 3) the characteristics of the meeting, each of which is summarized below.

1) Fluency
We estimate the fluency of the generated sentences in the same manner as in Section 4.1.3. That is, we train a language model on the human-authored abstractive summaries from the training portion of the meeting data and then compute a normalized sum of the negative log probabilities of the n-gram occurrences in each sentence. The fluency score is represented as H(s) in the equation below.

2) Coverage
To select sentences that cover important topics, we give special rewards to sentences that contain the top five ranked phrases.

3) The Characteristics of the Meeting
We also add three scoring rules that are specific to meeting summaries. In particular, these rules are based on phrases often used in the openings and closings of the meetings in our development set: 1) if a sentence derived from the first segment contains the word "open" or "meeting", it is rewarded; 2) if a sentence derived from the last segment contains the word "close" or "meeting", it is likewise rewarded; and 3) if a sentence not derived from the first or last segment contains the word "open" or "close", it is penalized.

The final ranking score of a candidate sentence is computed using the following formula:

Score(s) = \alpha \cdot H(s) + \sum_{i} \beta_i \cdot R_i(s) + \sum_{i} \gamma_i \cdot M_i(s)    (4.3)

where R_i(s) is a binary indicator of whether the top-i ranked phrase appears in sentence s, M_i(s) is a binary indicator of whether the i-th meeting-specific rule is met by sentence s, and α, β_i, and γ_i are coefficient factors, all of which are tuned on our development set. Finally, the sentence ranked highest in each segment is selected as the summary sentence, and the entire meeting summary is created by collecting these sentences and sorting them in the chronological order of the topic segments.

This chapter has presented a fully automatic system for abstractive meeting summarization. As illustrated in Figure 4.1, we divided the task of creating summaries into two modules, an off-line template generation module and an on-line summary generation module, each consisting of several subtasks. The first module creates generalized templates from human-authored summary sentences by replacing their noun phrases with their hypernyms and merging related templates. The second module creates a summary of any meeting transcript by extracting important phrases and filling them into the templates. With this approach, the system successfully handles the complicated task of creating readable and informative abstractive summaries of meetings.

Chapter 5: Experimental Results and Discussion
In this chapter, we describe an evaluation of our system. First, we describe the corpus data. Next, we discuss the results of the automatic and manual evaluations of our system against various baseline approaches.

5.1 Data
For our meeting summarization experiments, we use the manually transcribed meeting data and their human-authored summaries in the AMI corpus. The corpus contains 139 meeting records in which groups of four people play different roles in a fictitious team. We reserved 20 meetings for development and performed three-fold cross-validation on the remaining data.

5.2 Automatic Evaluation
We report the F1-measure of ROUGE_1, ROUGE_2, and ROUGE_SU4 [27] to assess the performance of our system; a toy implementation of ROUGE_1 is sketched below. The scores of automatically generated summaries are calculated by comparing them with human-authored ones. For our baselines, we use the system introduced by Mehdad et al. (FUSION) [37], which creates abstractive summaries from extracted sentences and has been proven effective for abstractive meeting summarization, and TextRank [38], a graph-based sentence ranker suitable for creating extractive summaries. Our system can create summaries of any length by adjusting the number of segments created by LCSeg. Thus, we create summaries of three different lengths (10, 15, and 20 topic segments), with average lengths of 100, 137, and 173 words, respectively. These lengths generally correspond to those of the human-authored summaries in the corpus, which vary from 82 to 200 words.
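In our experiments we used the standard ROUGE toolkit [27]; the toy function below is only meant to convey what the ROUGE_1 F1 score measures, namely unigram overlap between a system summary and a human reference.

```python
from collections import Counter

def rouge1_f1(system: str, reference: str) -> float:
    """Toy ROUGE_1: unigram-overlap precision and recall, combined as F1."""
    sys_counts = Counter(system.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((sys_counts & ref_counts).values())  # clipped counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(sys_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE_2 and ROUGE_SU4 follow the same pattern over bigrams and skip-bigrams (with a skip distance of up to four), respectively.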
Table 5.1 shows the results of our system in comparison with those of the two baselines. The results show that our model significantly outperforms both baselines. Compared with FUSION, our system with 20 segments achieves an improvement of about 3% in all ROUGE scores, which indicates that our system creates summaries that are more lexically similar to human-authored ones. Surprisingly, there was no significant change in the ROUGE scores across the three summary lengths, which indicates that our system can create summaries of any length without losing content.

Models                         ROUGE_1   ROUGE_2   ROUGE_SU4
TextRank                       21.7      2.5       6.5
FUSION                         27.9      4.0       8.1
Our System with 10 Segments    28.4      6.7       10.1
Our System with 15 Segments    30.6      6.8       10.9
Our System with 20 Segments    31.5      6.7       11.4

Table 5.1: An evaluation of summarization performance using the F1 measure of ROUGE_1, ROUGE_2, and ROUGE_SU4

5.3 Manual Evaluation
We also conducted a manual evaluation using a crowdsourcing tool (CrowdFlower, http://www.crowdflower.com/). In this experiment, our system with 15 segments is compared with FUSION, human-authored summaries (Human Abstract), and human-annotated extractive summaries (Human Extract). After randomly selecting 10 meetings, we recruited 10 participants per meeting and instructed them to browse the transcript of the meeting so as to understand its gist. They were then asked to read all of the summary types described above and rate each of them on a 1-5 scale for the following three items: 1) the summary's overall quality, with "5" being the best and "1" the worst possible quality; 2) the summary's fluency, ignoring capitalization and punctuation, with "5" indicating no grammatical mistakes and "1" indicating too many; and 3) the summary's informativeness, with "5" indicating that the summary covers all of the meeting content and "1" indicating that it covers none of it. The results are given in Table 5.2. Overall, 58 people worldwide participated in this rating task; all of them were among the most reliable contributors, accounting for 7% of the overall membership, who maintain the highest levels of accuracy on test questions in previous crowdsourcing jobs. To assess statistical significance, we use the two-tailed pairwise t-test to compare our system with each of the other three approaches; the results are summarized in Table 5.3, and a sketch of the test follows.

Models           Quality   Fluency   Informativeness
Our System       3.52      3.69      3.54
Human Abstract   3.96      4.03      3.87
Human Extract    3.02      3.16      3.30
FUSION           3.16      3.14      3.05

Table 5.2: Average rating scores

Models Compared                  Quality (p-value)   Fluency (p-value)   Informativeness (p-value)
Our System vs. Human Abstract    0.000162            0.000437            0.00211
Our System vs. FUSION            0.00142             0.0000135           0.000151
Our System vs. Human Extract     0.000124            0.0000509           0.0621

Table 5.3: T-test results of the manual evaluation
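The significance values in Table 5.3 come from paired two-tailed t-tests over the ratings. A minimal sketch using SciPy is shown below; the two rating arrays are made-up stand-ins for the paired per-judgment scores of two summary types, not our actual data.

```python
from scipy import stats

# Illustrative ratings only: each position pairs the same rater/meeting
# judging two different summary types.
ours = [4, 3, 4, 5, 3, 4, 4, 3, 5, 4]
fusion = [3, 3, 3, 4, 2, 3, 4, 3, 4, 3]

# Two-tailed paired t-test, as used for Table 5.3.
t_stat, p_value = stats.ttest_rel(ours, fusion)
print(f"t = {t_stat:.3f}, p = {p_value:.4g}")
```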
As expected, Human Abstract received the highest ratings on all three items, while our system received the second highest. The t-test results indicate that the differences in the rating data are statistically significant in all cases except that of informativeness between our system and Human Extract. This is understandable, because the extractive summaries were manually created by an annotator and contain all of the important information in the meetings. From this observation, we can conclude that users prefer our template-based summaries over human-annotated extractive summaries and over abstractive summaries created from extracted salient sentences. Furthermore, this demonstrates that our summaries are as informative as human-annotated extractive ones.

5.4 Further Analysis
This section presents examples of meeting summaries produced by our system and discusses them in detail. First, we describe how the system generates a sentence from a topic segment. As shown in Figure 5.1 (2), the system successfully extracts the most important phrases covering the topic, such as "an advanced chip" and "different chip components", as well as the dominant speakers, i.e., "Project Manager" and "Industrial Designer". Notice also that the templates selected for this topic segment are all relevant and suitable, as they all express the idea of "someone explaining something". Based on this information, our system correctly captures the scenario and produces the ideal sentence, "project manager and industrial designer talked about different chip components".

We then examine a second example, focusing on several issues that remain in our system. Although the generated sentence ((3) in Figure 5.2) seems to cover what is discussed in the topic segment ((1) in Figure 5.2), several problems are apparent. First, if we look closely at the hypernyms in each phrase, we see that our word sense disambiguation (WSD) method is not functioning well. For example, the extracted phrase "a basic bateer battery" ("bateer" is a disfluency) should be categorized as an "electric battery" rather than as a "group of guns or missile launchers", whose hypernym is "organization"; likewise, "their remote control" is a "controller", not the "power to direct or determine". A better WSD technique is therefore required. Second, the system sometimes fails to generate a sufficient number of templates, plausibly because our template generation module applies many pruning rules in order to avoid producing ungrammatical templates; more sophisticated tuning is required here. For these reasons, there are cases in which the system generates incorrect sentences, or no sentence at all, for some topic segments.

Topic Segment from ES2007c
ID: Uh if we get a scroll-wheel , that's a higher price range .
ID: If we get an advanced chip which is um used for the L_C_D_ , the display thing , then that's even more expensive .
PM: Chip on print . It's a bit .
PM: Okay , uh what I'm not understanding here
PM: is uh , okay , advanced chip on print , which I presume is like one P_C_B_ and that's got all the electronics on one board including the um infra-red sender ?
PM: Um what a what alternatives do we have to that ?
PM: um what alternatives do we have to the chip on print ?
ID: Well , if if it's not chip on print then , I guess , you get different chip components , and you build them separately and doesn't include the infra-red .
ID: Technically speaking , it's not as advanced , but it does the job , too .
PM: So , why would we not go for that ?
PM: If it's something that's inside the the unit . I it doesn't affects whether the customer's gonna buy it or not .
PM: Um we wanna go for an i i all
PM: So let's not let's uh not bother with the chip on print .
1) A topic segment (ID: Industrial Designer, PM: Project Manager)

Dominant Speakers: Project Manager; Industrial Designer
Top 5 Phrases and their hypernyms (in order of their scores): an advanced chip (fragment.n.01); different chip components (content.n.05); chip (fragment.n.01); print (print.n.01); a higher price range (magnitude.n.01)
Top 5 Templates (in order of their scores):
[ speaker ] went over [ content.n.05 ] .
[ speaker ] also mentioned [ content.n.05 ] of using [ artifact.n.01 ] .
[ speaker ] talked about [ content.n.05 ] .
[ speaker ] stated [ content.n.05 ] to build [ content.n.05 ] .
[ speaker ] also stated [ content.n.05 ] to build [ content.n.05 ] .

2) Extracted speakers, phrases and selected templates

Generated Sentence: project manager and industrial designer talked about different chip components .

3) A system-generated summary sentence

Figure 5.1: An example demonstrating how our system generates a summary sentence

Topic Segment from ES2016c
ID: Um as for the energy source , um we were talking about that shortly in the other meeting .
ID: Um what we could use is , or what I was offered , or what we could use , is a basic bateer battery .
ID: Or a device that was not n not further specified that provides kinetic energy .
ID: Such as like watches you know . Where you just move them m move the the actual device and this pr uh provides it with with uh some energy .
ID: So um obviously I personally have to say that dynamo is out of the question really .
ID: You don't wanna wind up your remote control before you can use it right ?
ID: May fail though , every here and there .
ID: Would you have to leave it by the window ?
ID: Or you know you lose it , it lies behind the couch for a week
ID: Works well in Arizona but in Edinburgh not so
ID: Um the kinetic energy thing um might work , um but the same problem .
ID: You leave it lying around and you first have to shake it before it it starts to work .

1) A topic segment (ID: Industrial Designer)

Dominant Speakers: Industrial Designer
Top 5 Phrases (in order of their scores): kinetic energy (natural_phenomenon.n.01); the energy source (point.n.02); a basic bateer battery (organization.n.01); the actual device (artifact.n.01); their remote control (power.n.01)
Top 5 Templates (in order of their scores):
[ speaker ] also recommended using [ natural_phenomenon.n.01 ] , and having [ happening.n.01 ] .
[ speaker ] also recommended using [ organization.n.01 ] rather than [ power.n.01 ] .
[ speaker ] recommended using [ natural_phenomenon.n.01 ] , and having [ happening.n.01 ] .
[ speaker ] recommended using [ organization.n.01 ] rather than [ power.n.01 ] .
None

2) Extracted speakers, phrases and selected templates

Generated Sentence: industrial designer also recommended using a basic bateer battery rather than their remote control .

3) A system-generated summary sentence

Figure 5.2: Another example demonstrating how our system generates a summary sentence

Finally, Figure 5.3 shows a summary created by our system alongside a human-authored one. From this example, we can see that the sentences generated by our system capture the main concepts of the meeting; indeed, by reading the summary, one can understand what was discussed. However, the system-generated sentences are shorter and simpler than those of the human-authored summary.
This is because, in our system, the shorter a sentence is, the less likely it is to contain grammatical mistakes, since short templates reduce the chance of selecting the wrong fillers. As a result, our system consistently prefers shorter and simpler sentences.

Human-authored Summary:
The project manager opened the meeting and had the team members introduce themselves and describe their roles in the upcoming project. The project manager then described the upcoming project. The team then discussed their experiences with remote controls. They also discussed the project budget and which features they would like to see in the remote control they are to create. The team discussed universal usage, how to find remotes when misplaced, shapes and colors, ball shaped remotes, marketing strategies, keyboards on remotes, and remote sizes. The team then discussed various features to consider in making the remote.

Summary Generated by Our System:
project manager summarized their role of the meeting . user interface expert and project manager talks about a universal remote . the group recommended using the International Remote Control Association rather than a remote control . project manager offered the ball idea . user interface expert suggested few buttons . user interface expert and industrial designer then asked a member about a nice idea for The idea . project manager went over a weak point . the group announced the one-handed design . project manager and industrial designer went over their remote control idea . project manager instructed a member to research the ball function . industrial designer went over stability point . industrial designer went over definite points .

Figure 5.3: A comparison between a human-authored summary and a summary created by our system

Chapter 6: Conclusion and Future Work

6.1 Summary of Main Contributions
In this thesis, we have presented a robust abstractive meeting summarization system. Our system first creates generalized templates from collected human-authored summaries by replacing their noun phrases with their hypernyms using WordNet and applying a novel multi-sentence fusion algorithm to them. Next, given a meeting transcript to be summarized, our system segments the transcript based on the topics discussed, extracts important phrases from it, and selects the most appropriate templates by referring to the relationships between the summary sentences and their source transcripts in the training corpus. Finally, after filling the templates with the extracted phrases, each candidate sentence is ranked, and the sentences ranked highest for each topic segment are included in the summary. Unlike traditional abstractive meeting summarization methods, our approach does not require much annotated data. Moreover, using sentence fusion algorithms to generate templates and leveraging summary-source relationships to select ideal templates is a new approach that greatly improves the quality of summaries. Overall, our system not only outperforms the state-of-the-art baseline and human-annotated extractive summaries but is also capable of generating abstractive summaries similar to human-authored ones. Below, we summarize the main contributions of our system:

1. Template Generation
We have proposed a novel template generation approach, which leverages a multi-sentence fusion algorithm and lexico-semantic information. By applying this approach, we were able to generate generalized templates from human-authored summaries.
2. Template Selection
Selecting templates that can correctly describe each topically segmented conversation is a very difficult task, because it requires a deeper understanding of the content of the conversation. To this end, we have developed an effective template selection method that utilizes the relationship between human-authored summaries and their source transcripts. By leveraging this relationship, our system automatically selects appropriate templates for each topic segment, which greatly contributes toward creating robust summaries.

3. Achieving Good Results in a Comprehensive Evaluation
A comprehensive evaluation comprising both manual and automatic approaches has demonstrated that the summaries created by our system are preferred over human-created extractive ones. It has also shown that our summaries outperform those created by a state-of-the-art meeting summarization system in terms of readability and informativeness.

6.2 Limitations and Future Work
The current version of our system uses only hypernym information from WordNet to label phrases. Moreover, no sophisticated word sense disambiguation (WSD) technique has been applied, with the result that many phrases cannot be labeled or are labeled incorrectly. Given these limitations, future work includes extending our framework with a more sophisticated labeling approach that utilizes a richer knowledge base (e.g., YAGO) and robust WSD techniques. In addition, compared with human-authored summaries, our system-generated sentences are much simpler and do not resolve coreferences; these limitations should be addressed by integrating coreference resolution and sentence concatenation tools. Finally, we plan to apply our framework to other multi-party conversational domains, such as chat logs and forum discussions.

Bibliography

[1] Ramiz M. Aliguliyev. A novel partitioning-based clustering method and generic document summarization. In Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IATW '06), pages 626–629, Washington, DC, USA, 2006.
[2] Regina Barzilay, Kathleen R. McKeown, and Michael Elhadad. Information fusion in the context of multi-document summarization. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 550–557, 1999.
[3] Florian Boudin and Emmanuel Morin. Keyphrase extraction for n-best reranking in multi-sentence compression. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), 2013.
[4] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7):107–117, 1998.
[5] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, I. McCowan, W. Post, D. Reidsma, and P. Wellner. The AMI meeting corpus: A pre-announcement. In Proceedings of MLMI 2005, pages 28–39, Edinburgh, UK, 2005.
[6] Jaime G. Carbonell and Jade Goldstein. The use of MMR, diversity-based re-ranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 335–336, Melbourne, Australia, 1998.
[7] Giuseppe Carenini, Gabriel Murray, and Raymond Ng. Methods for Mining and Summarizing Text Conversations. Morgan & Claypool, 2011.
[8] James Clarke and Mirella Lapata. Global inference for sentence compression: An integer linear programming approach. Journal of Artificial Intelligence Research, 31:399–429, 2008.
[9] Gunes Erkan and Dragomir R. Radev. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457–479, 2004.
[10] Christiane Fellbaum. WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA, 1998.
[11] Katja Filippova. Multi-sentence compression: Finding shortest paths in word graphs. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10), pages 322–330, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[12] Takahiro Fukushima and Manabu Okumura. Text Summarization Challenge: Text summarization evaluation in Japan. In Proceedings of the NAACL 2001 Workshop on Automatic Summarization, pages 51–59, 2001.
[13] Michel Galley. A skip-chain conditional random field for ranking meeting utterances by importance. In Proceedings of EMNLP, 2006.
[14] Michel Galley, Kathleen McKeown, Eric Fosler-Lussier, and Hongyan Jing. Discourse segmentation of multi-party conversation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003), pages 562–569, Sapporo, Japan, 2003.
[15] Michel Galley and Kathleen R. McKeown. Lexicalized Markov grammars for sentence compression. In Proceedings of NAACL/HLT, 2007.
[16] Nikhil Garg, Benoit Favre, Korbinian Reidhammer, and Dilek Hakkani-Tur. ClusterRank: A graph based method for meeting summarization. In Proceedings of Interspeech, 2009.
[17] Albert Gatt and Ehud Reiter. SimpleNLG: A realisation engine for practical applications. In Proceedings of the 12th European Workshop on Natural Language Generation, pages 90–93, Morristown, NJ, USA, 2009. Association for Computational Linguistics.
[18] Pierre-Etienne Genest and Guy Lapalme. Fully abstractive approach to guided summarization. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, pages 354–358, 2012.
[19] Charles F. Greenbacker. Towards a framework for abstractive summarization of multimodal documents. ACL HLT 2011, page 75, 2011.
[20] Liwei He, Elizabeth Sanocki, Anoop Gupta, and Jonathan Grudin. Auto-summarization of audio-video presentations. In Proceedings of the 7th ACM International Multimedia Conference, pages 289–298, Orlando, FL, 1999.
[21] Tsutomu Hirao, Hideki Isozaki, and Eisaku Maeda. Extracting important sentences with support vector machines. In Proceedings of ACM, 2002.
[22] Hongzhao Huang, Arkaitz Zubiaga, Heng Ji, Hongbo Deng, Dong Wang, Hieu Khac Le, Tarek F. Abdelzaher, Jiawei Han, Alice Leung, John Hancock, and Clare R. Voss. Tweet ranking based on heterogeneous networks. In Proceedings of COLING, pages 1239–1256, 2012.
[23] Adam Janin, Don Baron, Jane Edwards, Dan Ellis, David Gelbart, Nelson Morgan, Barbara Peskin, Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke, and Chuck Wooters. The ICSI meeting corpus. In Proceedings of ICASSP, 2003.
[24] Atif Khan and Naomie Salim. A review on abstractive summarization methods. Journal of Theoretical and Applied Information Technology, 59:64–72, January 2014.
[25] Kevin Knight and Daniel Marcu. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139(1):91–107, 2002.
[26] Ravi Kondadadi, Blake Howald, and Frank Schilder. A statistical NLG framework for aggregated planning and realization. In Proceedings of the Annual Conference of the Association for Computational Linguistics (ACL 2013), 2013.
[27] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out, 2004.
[28] Feifan Liu and Yang Liu. Correlation between ROUGE and human evaluation of extractive meeting summaries. In Proceedings of ACL-HLT, 2008.
[29] Fei Liu and Yang Liu. From extractive to abstractive meeting summaries: Can it be done by sentence compression? In Proceedings of ACL, 2009.
[30] Shih-Hsiang Lin and Berlin Chen. A risk minimization framework for extractive speech summarization. In Proceedings of ACL, 2010.
[31] Hui Lin, Jeff Bilmes, and Shasha Xie. Graph-based submodular selection for extractive summarization. In Proceedings of ASRU, 2009.
[32] David D. McDonald. Issues in the representation of real texts: The design of KRISP. In Lucja M. Iwanska and Stuart C. Shapiro, editors, Natural Language Processing and Knowledge Representation, pages 77–110. MIT Press, Cambridge, MA, 2000.
[33] Inderjeet Mani and Mark T. Maybury. Advances in Automatic Text Summarization. The MIT Press, Cambridge, MA, 1999.
[34] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating typed dependency parses from phrase structure parses. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC '06), 2006.
[35] Kathleen McKeown, Rebecca Passonneau, David Elson, Ani Nenkova, and Julia Hirschberg. Do summaries help? A task-based evaluation of multi-document summarization. In Proceedings of SIGIR, 2005.
[36] Sameer Maskey and Julia Hirschberg. Comparing lexical, acoustic/prosodic, structural and discourse features for speech summarization. In Proceedings of Interspeech, 2005.
[37] Yashar Mehdad, Giuseppe Carenini, and Frank Tompa. Abstractive meeting summarization with entailment and fusion. In Proceedings of the 14th European Workshop on Natural Language Generation (ENLG - SIGGEN 2013), Sofia, Bulgaria, 2013.
[38] Rada Mihalcea and Paul Tarau. TextRank: Bringing order into texts. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004.
[39] Gabriel Murray and Giuseppe Carenini. Summarizing spoken and written conversations. In Proceedings of EMNLP, Waikiki, Hawaii, 2008.
[40] Gabriel Murray, Giuseppe Carenini, and Raymond Ng. Interpretation and transformation for abstracting conversations. In Proceedings of NAACL, 2010.
[41] Gabriel Murray, Giuseppe Carenini, and Raymond Ng. Using the Omega Index for evaluating abstractive community detection. In NAACL 2012 Workshop on Evaluation Metrics and System Comparison for Automatic Summarization, Montreal, Canada, 2012.
[42] Gabriel Murray, Steve Renals, Jean Carletta, and Johanna Moore. Evaluating automatic summaries of meeting recordings. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation, 2005.
[43] Gabriel Murray, Giuseppe Carenini, and Raymond T. Ng. Generating and validating abstracts of meeting conversations: A user study. In INLG 2010, 2010.
[44] Gabriel Murray, Thomas Kleinbauer, Peter Poller, Steve Renals, Jonathan Kilgour, and Tilman Becker. Extrinsic summarization evaluation: A decision audit task. ACM Transactions on Speech and Language Processing, 6(2), October 2009.
[45] Ani Nenkova and Rebecca Passonneau. Evaluating content selection in summarization: The pyramid method. In Proceedings of HLT-NAACL, 2004.
[46] Vasin Punyakanok and Dan Roth. The use of classifiers in sequential inference. In NIPS, pages 995–1001, 2001.
[47] G. J. Rath, A. Resnick, and T. R. Savage. Comparison of four types of lexical indicators of content. American Documentation, 12(2):126–130, 1961.
[48] Dou Shen, Jian-Tao Sun, Hua Li, Qiang Yang, and Zheng Chen. Document summarization using conditional random fields. In Proceedings of IJCAI, volume 7, pages 2862–2867, 2007.
[49] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, August 2000.
[50] David C. Uthus and David W. Aha. Plans toward automated chat summarization. In Proceedings of the Workshop on Automatic Summarization for Different Genres, Media, and Languages (WASDGML '11), pages 1–7, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
[51] Manisha Verma and Vasudeva Varma. Exploring keyphrase extraction and IPC classification vectors for prior art search. In CLEF (Notebook Papers/Labs/Workshop), 2011.
[52] Lu Wang and Claire Cardie. Domain-independent abstract generation for focused meeting summarization. In Proceedings of ACL, 2013.
[53] Shasha Xie and Yang Liu. Using corpus and knowledge-based similarity measure in maximum marginal relevance for meeting summarization. In Proceedings of ICASSP, 2008.
[54] Klaus Zechner. Automatic summarization of open-domain multiparty dialogues in diverse genres. Computational Linguistics, 28:447–485, 2002.
[55] Liang Zhou and Eduard Hovy. Digesting virtual "geek" culture: The summarization of technical internet relay chats. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), pages 298–305, Ann Arbor, Michigan, 2005. Association for Computational Linguistics.
