Blog Comments Classification using Tree Structured Conditional Random Fields

by

Wei Jin

B.Cs., The University of British Columbia, 2010

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate Studies (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

November 2012

© Wei Jin 2012

Abstract

The Internet provides a variety of ways for people to easily share, socialize, and interact with each other. One of the most popular platforms is the online blog. As a result, a vast amount of new text data in the form of blog comments and opinions about news, events and products is generated every day. However, not all comments are of equal quality. Informative or high-quality comments have a greater impact on readers' opinions about the original post content, such as the benefits of the product discussed in the post, or the interpretation of a political event. Therefore, developing an efficient and effective mechanism to detect the most informative comments is highly desirable. For this purpose, sites like Slashdot, where users volunteer to rate comments based on their informativeness, can be a great resource for building such an automated system using supervised machine learning techniques.

Our research concerns building an automatic comment classification system that leverages these freely available, valuable resources. Specifically, we discuss how informative comments in blogs can be detected using Conditional Random Fields (CRFs). Blog conversations typically have a tree-like structure in which an initial post is followed by comments, and each comment can be followed by other comments. In this work, we present our approach using Tree-structured Conditional Random Fields (TCRFs) to capture the dependencies in a tree-like conversational structure. This is in contrast with previous work [5] in which results produced by linear-chain CRF models had to be aggregated heuristically. As an additional contribution, we present a new blog corpus consisting of conversations of different genres from 6 different blog websites. We use this corpus to train and test our classifiers based on TCRFs.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
1.1 Motivations of blog comments classification
1.2 Tree Structured Conditional Random Fields Blog Comments Classification
1.3 Outline of the thesis
1.4 Contributions
2 Related Work
2.1 Conditional Random Fields
2.1.1 Linear Chain Conditional Random Fields
2.1.2 Bayesian Conditional Random Fields
2.1.3 Dynamic Conditional Random Fields
2.1.4 Skip-Chain Conditional Random Fields
2.1.5 Tree Structured Conditional Random Fields
2.1.6 Semi-supervised CRFs
2.2 Related Datasets
2.2.1 The Slashdot Dataset
3 Our Dataset
3.1 Background
3.2 Slashdot
3.3 DailyKos
3.4 AndroidCentral
3.5 BusinessInsider
3.6 Macrumors
3.7 TSN
3.8 Data Processing Framework
3.8.1 Pre-processing
3.8.2 Transformation
3.8.3 Feature Extraction
3.8.4 Dataset Summary
4 Our Approach
4.1 Linear Chain Conditional Random Fields
4.2 Tree Structured Conditional Random Fields
4.3 GRMM Modification
4.4 Feature Selection
5 Experiments and Results
5.1 Experiments
5.1.1 Experiments Setup
5.2 Training and Testing
5.3 Results
5.3.1 Classification
5.3.2 Result Analysis
6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work
Bibliography

List of Tables

2.1 Blog Comments Classification Result by FitzGerald et al. [5]
3.1 Dataset description
5.1 Comparison of LCCRF, TCRF and Baseline
5.2 Tree Complexity of Blogs
5.3 Tree Complexity vs Performance Improvement in percentage

List of Figures

2.1 Blog conversation tree transformation
2.2 Remerging Threads
2.3 Examples of DCRFs
2.4 Skip Chain CRFs
3.1 Slashdot
3.2 DailyKos
3.3 AndroidCentral
3.4 BusinessInsider
3.5 Macrumors
3.6 Macrumors
3.7 Sample table
3.8 XML format data
3.9 Feature Format
4.1 Linear Chain CRFs
4.2 Sample Conversation and corresponding TCRF
4.3 Parent-Child pairs from a given conversation tree
5.1 Tree Complexity vs Performance Improvement in percentage

Acknowledgements

I would like to thank my parents, Enlai Jin and Jingyuan Shao, for supporting me throughout my education, helping me all the way through my masters, and giving me the freedom to make my own choices. I would also like to thank my wife Xinyi Zhu for standing by me through my endeavour through graduate school. She is always there for me no matter what.

I would like to thank Giuseppe Carenini and Raymond Ng for being supportive supervisors throughout my research career at UBC. They have given me the freedom to follow my ideas while providing guidance to keep me on track. Our entire research group has been instrumental throughout my work. This thesis work is definitely a group effort and I could not have done it alone.

Chapter 1 Introduction

The Internet provides a variety of ways for people to easily share, socialize, and interact with each other. One of the most popular platforms is the online blog. As a result, a vast amount of new text data in the form of blog comments and opinions about news, events and products is generated every day. People using online forums are often overwhelmed by the huge amount of information they are dealing with. They need an efficient way to help them reduce the information overload. Methods to accurately classify the quality of a blog comment can be used to make various parts of viewing online blog comments easier. For example, we can highlight the high-quality comments, so people can read just those to get a good grasp of the topic and content of the conversation. Therefore, we want a tool that accurately classifies the comments in a given blog conversation. In this thesis we present an approach based on Tree-structured Conditional Random Fields.

1.1 Motivations of blog comments classification

People often face a huge number of comments when browsing an online blog forum. However, not all comments are equally informative. The most informative or high-quality comments should have a greater impact on readers' opinions about the original post content, such as the quality of the product discussed in the post, or the interpretation of a political event. Classifying comments into high-quality vs. low-quality can help people view an online blog more efficiently. For example, the high-quality comments of a given blog conversation can be selected; a user can then read just the selected part of the conversation, so that the reading time is greatly reduced.

Another useful way to apply comments classification would be to use the selected high-quality comments to generate a summary of the whole discussion. People can use this summary to make decisions without spending too much time on viewing the whole discussion. For example, a potential smartphone buyer can get a summary of the reviews submitted by past customers for a given model of phone, so he could get a general idea about the price, product quality and many other aspects of this smartphone. Therefore, developing an efficient and effective mechanism to detect the most informative comments is highly desirable.

1.2 Tree Structured Conditional Random Fields Blog Comments Classification

In this thesis we present an approach to classify the quality of blog comments using Conditional Random Fields (CRFs).
CRFs are undirected graphical models that can exploit a large number of arbitrary local and global features in predicting labels for a given observation. For many NLP tasks, the underlying graphical model is usually implemented as a linear chain [5], where observations come in a linear sequence and dependencies exist between two consecutive labels. However, since the structure of blog comments is often a tree, we propose to apply Tree Structured Conditional Random Fields (TCRFs) to take advantage of the dependencies across hierarchically laid-out information. The main challenges we faced in applying TCRFs to our work are: (1) data collection, (2) model implementation, (3) feature selection and (4) result analysis.

1.3 Outline of the thesis

In this work, we describe the process of creating a new corpus of 6 different blog sites for blog comment classification. We then develop a more advanced CRF approach that takes advantage of the underlying tree structure of the blog conversation. Finally, we test the TCRF approach with different feature sets on the corpus we have created to measure the improvement with respect to classification accuracy.

The thesis comprises six chapters. The first chapter, this one, has introduced the nature of the work and has provided the motivation. The second chapter provides background in the field of comments classification and previous applications of CRFs, and introduces the work our approach improves upon. In Chapter 3 we describe the new corpus and the effort and method of building it. In Chapter 4 we introduce our TCRF approach. Chapter 5 mainly focuses on the experiments and the results of our TCRF model. Finally, in the last chapter we conclude by recapping the achievements of our work. The bibliography and appendix are at the end of the thesis.

1.4 Contributions

As a preview, we outline the main contributions of this thesis here:

- Annotated blog comments corpora are rarely available to the public. This makes research on blog comments classification especially difficult; in the field of classifying blog comments, there was no publicly accessible labelled comment dataset. For these reasons, we compiled a new corpus comprised of the Slashdot (http://slashdot.org) dataset collected by FitzGerald et al. [5], and conversations from five other blog websites, namely DailyKos (http://www.dailykos.com), AndroidCentral (www.androidcentral.com), BusinessInsider (www.businessinsider.com), Macrumors (www.macrumors.com) and TSN (www.tsn.ca). These blog sites cover a variety of genres such as technology, business, politics and sports; they also have different conversation structures (sequential or tree). For example, Slashdot conversations have a tree-like structure; users can directly reply to a given comment, and their reply will be placed underneath that comment in a nested structure. On the contrary, comments in conversations from AndroidCentral always form a single linear thread. We will be able to test the generality of our approach on very different blogs, and verify how the performance of our proposal varies with respect to the key properties of these blogs. Since our dataset covers several different domains, we will also be able to test domain adaptation techniques in future work.

- In the previous work done in the field of comments classification using CRFs, the authors use linear-chain CRFs (LCCRFs) to detect informative blog comments through the exploration of conversational and topical features.
However, their approach has two main limitations: (1) it ignores many hierarchical dependencies between the output tags/labels, and (2) the common internal nodes fall in multiple paths, which causes them to be classified multiple times, and possibly inconsistent classifications have to be combined heuristically. Our proposed approach applies a TCRF model to better handle the dependencies across the hierarchically structured comments in a blog conversation.

- There has not been a comparison of different CRF approaches for blog comments quality classification. We have therefore created a framework in which we can compare various CRF models. Through our experiments, we found that our TCRF approach works better than the LCCRF model. TCRFs outperformed LCCRFs due to the inherent tree structure, which captures the semantic and syntactic dependencies better.

- In order to build Tree structured CRFs in GRMM, we created our own FactorTemplate class and supplied a structure file containing the parent-child pair information of a given blog conversation tree to the modified Arbitrary CRF classes.

Chapter 2 Related Work

CRFs have been successfully applied to many NLP tasks including Part of Speech (POS) tagging, chunking, syntactic parsing, word segmentation and meeting utterance classification [6]. The application most relevant to our work is presented in [5], where the authors use a linear-chain CRF (LCCRF) to detect informative comments in a blog conversation through the exploration of conversational and topical features. They apply the LCCRF model to each path (i.e., from the root to a leaf) of the conversation's tree structure. However, this approach has two main limitations: (1) it ignores many hierarchical dependencies between the output tags/labels, and (2) the common internal nodes fall in multiple paths, which causes them to be classified multiple times, and possibly inconsistent classifications have to be combined heuristically.

Our use of Tree-structured CRFs (TCRFs) is inspired by [15] and [3]. [15] applies a TCRF model to better handle the dependencies across the hierarchically laid-out information on the web in the task of semantic annotation, and shows that TCRFs outperform Support Vector Machines (SVMs) and LCCRFs. According to the work done by [3], in the task of semantic role labelling, TCRFs outperform LCCRFs because they better capture the tree-structured semantic and syntactic dependencies between a sentence's constituents. Other forms of CRFs such as Dynamic Conditional Random Fields (DCRFs) and Skip-Chain Conditional Random Fields (SCCRFs) are also introduced in this chapter. We only used the TCRF and LCCRF models in our work. It would be interesting to see whether other forms of CRFs can be applied to our project as future work.

2.1 Conditional Random Fields

Conditional Random Fields (CRFs) are a framework introduced by [7] for building probabilistic models on sequence data to perform tasks such as segmentation and labelling. CRFs offer several advantages over other machine learning models like hidden Markov models (HMMs) and maximum entropy Markov models (MEMMs) for these tasks, due to their ability to relax strong independence assumptions and avoid the label bias problem. Two independence assumptions are made by the HMM model: first, it assumes that each state depends only on its immediate predecessor, that is, each state y_t is independent of all its ancestors y_1, y_2, \dots, y_{t-2} given its previous state y_{t-1}.
Second, an HMM assumes that each observation variable x_t depends only on the current state y_t. The label bias problem exists in the MEMM model: the transitions leaving a given state compete only against each other, rather than against all other transitions in the model. This causes a bias toward states with fewer outgoing transitions. In the worst case, a state with a single outgoing transition effectively ignores the observation.

2.1.1 Linear Chain Conditional Random Fields

Linear Chain Conditional Random Fields (LCCRFs) are a special case of CRFs. They define conditional probability distributions of a label sequence given an input sequence, where both sequences have a chain-like format. Linear-chain CRFs have been applied to many natural language processing tasks such as named-entity recognition (NER) [10], feature induction for NER [9], and Chinese word segmentation (Peng et al. [12]). In more recent work, Matthieu and Isabelle [4] created a CRF-based Multiword Segmenter and Part-of-Speech Tagger.

The work most related to our project is done by FitzGerald et al. [5]. They applied LCCRFs to the task of blog comment quality prediction. They also compiled a new corpus comprised of articles and their subsequent user comments from the science and technology news aggregation website Slashdot. Slashdot comments are displayed in a threaded, conversation-tree type layout. Users can directly reply to a given comment, and their reply will be placed underneath that comment in a nested structure. However, LCCRFs can only deal with linear sequence data, and this required the authors to transform the tree-like structure of comment conversations into sequences. To solve this problem, each conversation tree is transformed into multiple threads, one for each leaf comment in the tree. The blog's conversation tree has to be transformed into multiple linear sequences before the authors apply LCCRFs to perform both learning and inference. An example is shown in Figure 2.1. On the left side, we have a sample conversation initiated by the post 1, which received one comment 2, which in turn received two comments 3 and 4. Comment 3 receives two comments 5 and 6. These comments form a tree structure. On the right side, we can see the corresponding three linear threads decomposed from this tree; they are: (a) comments 1, 2, 3, 5; (b) comments 1, 2, 3, 6; and (c) comments 1, 2, 4.

Figure 2.1: The transformation from blog conversation tree to threads

By applying sequence tagging using LCCRFs, they were able to achieve much higher accuracy than the majority class baseline of assigning GOOD to all comments, which yielded a precision of 29.7%. The results are shown in Table 2.1.

                  Predicted Not GOOD   Predicted GOOD
Actual NOT-GOOD   4160                 467
Actual GOOD       862                  1090

Precision: 0.700   Recall: 0.558   F-Score: 0.621

Table 2.1: Blog Comments Classification Result by FitzGerald et al. [5]

There is one problematic issue with this approach. A given conversation tree is decomposed into multiple threads in order to cast the problem in the form of sequence labelling. The result is that, after classification, each non-leaf comment has been classified multiple times, equal to the number of sub-comments of that comment. These different classifications may not be the same. For example, as shown in Figure 2.2, a given comment might well have been classified as GOOD in one sequence and NOT-GOOD in another. Notice that, in Figure 2.2, blue nodes are the comments classified as GOOD, while red nodes are classified as NOT-GOOD.

Figure 2.2: Remerging Threads

We can see that node 2 is classified as GOOD in the first and last threads, while it is classified as NOT-GOOD in the middle one. The authors chose to mark a comment as GOOD in the final tree (right side of the arrow) if it was classified as GOOD in at least one of the threads. The limitation of this method is that it chooses the classification result of a non-leaf node in a heuristic fashion. More importantly, it ignores the dependency between two nodes with a common parent node in the learning phase.
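To make the path decomposition and remerging heuristic described above concrete, here is a minimal sketch in Python. It is illustrative only: the data structures and function names are our assumptions, not the implementation used in [5]; only the tree from Figure 2.1 and the "GOOD if GOOD in any thread" rule come from the text.

```python
# Hedged sketch of the thread decomposition and remerging heuristic from
# Section 2.1.1 (illustrative; not the code used in [5]).

def root_to_leaf_threads(children, root):
    """Enumerate every root-to-leaf path of a conversation tree.

    `children` maps a comment id to the ids of its direct replies.
    """
    threads = []

    def walk(node, path):
        path = path + [node]
        if not children.get(node):              # leaf comment: one thread ends here
            threads.append(path)
        for reply in children.get(node, []):
            walk(reply, path)

    walk(root, [])
    return threads

def remerge(threads, labels_per_thread):
    """Mark a comment GOOD if any thread classified it as GOOD."""
    final = {}
    for thread, labels in zip(threads, labels_per_thread):
        for comment, label in zip(thread, labels):
            if label == "GOOD" or final.get(comment) == "GOOD":
                final[comment] = "GOOD"
            else:
                final[comment] = "NOT-GOOD"
    return final

# The example tree from Figure 2.1: 1 -> 2 -> {3 -> {5, 6}, 4}
children = {1: [2], 2: [3, 4], 3: [5, 6]}
print(root_to_leaf_threads(children, 1))  # [[1, 2, 3, 5], [1, 2, 3, 6], [1, 2, 4]]
```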
2.1.2 Bayesian Conditional Random Fields

Bayesian Conditional Random Fields (BCRFs) were proposed by [16]. BCRF is a Bayesian approach to training and inference for conditional random fields. The motivation for applying the Bayesian framework is to address over-fitting, model selection and many other aspects of the problem. Unlike Maximum Likelihood (ML), Maximum A Posteriori (MAP) or large-margin approaches, BCRFs are trained by estimating the posterior distribution of the model parameters. Subsequently, the posterior distribution is averaged for BCRF inference.

In training, BCRFs approximate the posterior distribution of the parameters using a variant of the power EP method. Also, BCRFs flatten approximation structures to increase algorithmic stability, efficiency, and prediction accuracy. In testing, BCRFs use approximate model averaging. On synthetic data and FAQ files, the authors compared BCRFs with ML- and MAP-trained CRFs. In almost all the experiments, BCRFs outperformed ML- and MAP-trained CRFs significantly.

2.1.3 Dynamic Conditional Random Fields

Dynamic Conditional Random Fields (DCRFs) are a generalization of linear-chain conditional random fields introduced by [2]. In DCRFs, each time slice contains a set of state variables and edges, which is a distributed state representation as in dynamic Bayesian networks (DBNs), and parameters are tied across slices. The authors performed approximate inference using several schedules for belief propagation, including tree-based reparameterization (TRP). A test on a natural-language chunking task showed that a DCRF performs better than a series of linear-chain CRFs, achieving comparable performance using only half the training data.

As described in [2], a dynamic CRF (DCRF) is a conditional distribution that factorizes according to an undirected graphical model whose structure and parameters are repeated over a sequence. As with a DBN, a DCRF can be specified by a template that gives the graphical structure, features, and weights for two time slices, which can then be unrolled given an input X. The same set of features and weights is used at each sequence position, so that the parameters are tied across the network. Several example templates are given in Figure 2.3. The dashed lines indicate the boundary between time steps. The input variables X are not shown.

Figure 2.3: Examples of DCRFs

The unrolling process is defined as follows. Let Y = \{y_1, \dots, y_T\} be a sequence of random vectors y_i = (y_{i1}, \dots, y_{im}), where y_i is the state vector at time i, and y_{ij} is the value of variable j at time i. To give the likelihood equation for arbitrary DCRFs, a way to describe a clique in the unrolled graph independent of its position in the sequence is introduced. Given a time t, any variable y_{ij} in Y is denoted by two integers: its index j in the state vector y_i, and its time offset \Delta t = i - t. A set c = \{(\Delta t, j)\} of such pairs is called a clique index, which denotes a set of variables y_{t,c} by y_{t,c} = \{ y_{t+\Delta t, j} : (\Delta t, j) \in c \}.
That is, y_{t,c} is the set of variables in the unrolled version of clique index c at time t. The formal definition of DCRFs is the following:

p(y \mid x) = \frac{1}{Z(x)} \prod_{t} \prod_{c \in C} \exp\Big( \sum_{k} \lambda_k f_k(y_{t,c}, x, t) \Big)   (2.1)

where C is a set of clique indices, F = \{ f_k(y_{t,c}, x, t) \} is a set of feature functions and \Lambda = \{\lambda_k\} is a set of real-valued weights. Then the distribution p is a dynamic conditional random field. Z(x) = \sum_{y} \prod_{t} \prod_{c \in C} \exp\big( \sum_{k} \lambda_k f_k(y_{t,c}, x, t) \big) is the normalizing factor.

In summary, dynamic CRFs are conditionally trained undirected sequence models with repeated graphical structure and tied parameters. They combine the best of both conditional random fields and the widely successful dynamic Bayesian networks (DBNs). DCRFs address difficulties of DBNs, by easily incorporating arbitrary overlapping input features, and of previous conditional models, by allowing more complex dependencies between labels.

2.1.4 Skip-Chain Conditional Random Fields

The Skip-Chain CRF is essentially a linear-chain CRF with additional long-distance edges between similar words. These additional edges are called skip edges. The features on skip edges can incorporate information from the context of both endpoints of a sequence, so that strong evidence at one endpoint can influence the label at the other endpoint. This model was introduced in the work done by [6] and [13].

In the work done by [13], a Skip-Chain CRF is applied to an information extraction problem: building a database automatically from unstructured text. The Skip-Chain CRF is used to model certain kinds of long-range dependencies between entities. In this case, one important type of dependency within information extraction occurs on repeated mentions of the same field. For example, if the same entity is mentioned more than once in a document, such as Steve Jobs, in many cases all mentions have the same label, such as Conference-Speaker. We can take advantage of this fact by favouring labelings that treat repeated words identically, and by combining features from all occurrences so that the extraction decision can be made based on global information. Furthermore, identifying all mentions of an entity can be useful in itself, because each mention might contain different useful information.

Figure 2.4: Skip Chain CRFs: Identical words are connected because they are likely to have the same label.

To perform collective labelling, dependencies between distant terms in the input should be represented. Sequence models make a Markov assumption among labels, that is, that any label y_t is independent of all previous labels given its immediate predecessors y_{t-k}, \dots, y_{t-1}. This represents dependence only between nearby nodes, for example between bigrams and trigrams, and cannot represent the higher-order dependencies that arise when identical words occur throughout a document.

The Skip-Chain CRF is used in [13] to relax this assumption. The Skip-Chain CRF is a conditional model that collectively segments a document into mentions and classifies the mentions by entity type, while taking into account probabilistic dependencies between distant mentions. These dependencies are represented in a skip-chain model by augmenting a linear-chain CRF with factors that depend on the labels of distant but similar words. This is shown graphically in Figure 2.4 (adopted from [13]). According to [13], the skip-chain CRF is defined as a general CRF with two clique templates: one for the linear-chain portion, and one for the skip edges.
For a sentence x, let I = \{(u, v)\} be the set of all pairs of sequence positions for which there are skip edges. Then the probability of a label sequence y given an input x is modelled as

p_{\Lambda}(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, x) \prod_{(u,v) \in I} \Psi_{uv}(y_u, y_v, x)   (2.2)

where \Psi_t are the factors for linear-chain edges, and \Psi_{uv} the factors over skip edges. These factors are defined as

\Psi_t(y_t, y_{t-1}, x) = \exp\Big( \sum_{k} \lambda_{1k} f_{1k}(y_t, y_{t-1}, x, t) \Big)   (2.3)

\Psi_{uv}(y_u, y_v, x) = \exp\Big( \sum_{k} \lambda_{2k} f_{2k}(y_u, y_v, x, u, v) \Big)   (2.4)

where \Lambda_1 = \{\lambda_{1k}\}_{k=1}^{K_1} are the parameters of the linear-chain template, and \Lambda_2 = \{\lambda_{2k}\}_{k=1}^{K_2} are the parameters of the skip template. The full set of model parameters is \Lambda = \{\Lambda_1, \Lambda_2\}.

Both the linear-chain features and skip-chain features are factorized into indicator functions of the outputs and observation functions. The observation functions for the skip edges are chosen to combine the observations from each endpoint. Formally, we define the feature functions for the skip edges to factorize as:

f'_k(y_u, y_v, x, u, v) = \mathbf{1}_{\{y_u = \tilde{y}_u\}} \mathbf{1}_{\{y_v = \tilde{y}_v\}} \, q'_k(x, u, v)   (2.5)

This choice allows the observation functions q'_k(x, u, v) to combine information from the neighbourhood of y_u and y_v. For example, one useful observation function is q'_k(x, u, v) = 1 if and only if x_u = x_v = "Ford" and x_{v-1} = "Speaker:". This can be a useful feature if the context around x_u, such as "Tom Ford is manager of control engineering...," may not make clear whether or not Tom Ford is presenting a talk, but the context around x_v is clear, such as "Speaker: Tom Ford." The experiments conducted in [13] showed that a Skip-Chain CRF model brings a significant improvement over an LCCRF model in the task of identifying speakers, locations and capitalized words that occur multiple times in a seminar announcement.

2.1.5 Tree Structured Conditional Random Fields

The key advantage of a Tree Structured Conditional Random Field (TCRF) model is that it can better incorporate dependencies. It has been used in many tasks such as semantic annotation [15], semantic role labelling [3], and image labelling [1].

In the work done by Jie et al. [15], TCRFs were applied to the task of annotating a large volume of web pages. More specifically, they tried to label the semantic meaning of the instances on webpages based on a given ontology. Some examples of the labels are: Email Address, Registered Address, Phone Number and so on. They compared the proposed TCRF method to LCCRFs and SVMs (Support Vector Machines). The test results show that the proposed TCRF method significantly outperforms both the SVM-based method and the linear-chain CRF-based method.

For the semantic role labelling task, Trevor Cohn and Philip Blunsom [3] defined a random field over the structure of each sentence's syntactic parse tree. For each node of the tree, the model must predict a semantic role label, which is interpreted as the labelling for the corresponding syntactic constituent. They showed how modelling the task as a tree labelling problem allows for the use of efficient CRF inference algorithms, while also increasing generalization performance when compared to the equivalent maximum entropy classifier.

TCRFs have also been successfully applied to several non-NLP tasks. For example, in the work done in [1], the authors presented a discriminative framework based on conditional random fields for stochastic modelling of images in a hierarchical fashion. The main advantage of the proposed framework is its ability to incorporate a rich set of interactions among the image sites.
They achieved this by inducing a hierarchy of hidden variables over the given label field. The proposed tree-like structure of their model eliminates the need for a huge parameter space and at the same time permits the use of exact and efficient inference procedures based on belief propagation. The authors demonstrated the generality of this approach by applying it to two important computer vision tasks, namely image labelling and object detection. The model parameters are trained using the contrastive divergence algorithm. The results show that the TCRF achieved the best classification accuracy in both the object detection and image labelling tasks. More details on TCRFs will be introduced in Chapter 4.

2.1.6 Semi-supervised CRFs

One practical difficulty in applying CRFs is that training requires obtaining true labels for potentially many sequences. This can be expensive because it is more time consuming for a human labeller to provide labels for sequence labelling than for simple classification. For this reason, it would be very useful to have techniques that can obtain good accuracy given only a small amount of labeled data. Semi-supervised CRFs are used to achieve this goal. In the work done by [8], the authors present a new semi-supervised training procedure for conditional random fields that can be used to train sequence segmentors and labellers from a combination of labeled and unlabelled training data. Their approach is based on extending the minimum entropy regularization framework to the structured prediction case, yielding a training objective that combines unlabelled conditional entropy with labeled conditional likelihood. This semi-supervised approach is applied to the problem of identifying gene and protein mentions in biological texts.

An example of this approach is described in [8]: assume we have a set of labeled examples, D_l = \{ (x^{(1)}, y^{(1)}), \dots, (x^{(N)}, y^{(N)}) \}, and unlabelled examples, D_u = \{ x^{(N+1)}, \dots, x^{(M)} \}. A CRF model is built over sequential input and output data x and y as follows:

p_{\theta}(y \mid x) = \frac{1}{Z_{\theta}(x)} \exp\Big( \sum_{k=1}^{K} \theta_k f_k(x, y) \Big) = \frac{1}{Z_{\theta}(x)} \exp\big( \langle \theta, f(x, y) \rangle \big)   (2.6)

where \theta = (\theta_1, \dots, \theta_K)^T (T stands for transpose).

The goal is to learn a model from the combined set of labelled and unlabelled examples, D_l \cup D_u. The standard supervised CRF training procedure is based upon maximizing the log conditional likelihood of the labeled examples in D_l:

CL(\theta) = \sum_{i=1}^{N} \log p_{\theta}(y^{(i)} \mid x^{(i)}) - U(\theta)   (2.7)

where U(\theta) is any standard regularizer on \theta, e.g. U(\theta) = \|\theta\|^2 / 2. Regularization can be used to limit over-fitting on rare features and avoid degeneracy in the case of correlated features. Equation 2.7 ignores the unlabelled examples in D_u.

To make full use of the available training data, a semi-supervised learning algorithm is introduced to exploit a form of entropy regularization on the unlabelled data. Specifically, for a semi-supervised CRF, the following objective is maximized:

RL(\theta) = \sum_{i=1}^{N} \log p_{\theta}(y^{(i)} \mid x^{(i)}) - U(\theta) + \gamma \sum_{i=N+1}^{M} \sum_{y} p_{\theta}(y \mid x^{(i)}) \log p_{\theta}(y \mid x^{(i)})   (2.8)

where the first term is the penalized log conditional likelihood of the labeled data under the CRF defined in Equation 2.7, and the second term is the negative conditional entropy of the CRF on the unlabelled data. In Equation 2.8, \gamma is a tradeoff parameter that controls the influence of the unlabelled data.
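As a small numeric illustration of how the two terms of Equation 2.8 interact, the sketch below evaluates the objective for toy probabilities. All numbers, names and the simple regularizer value are illustrative assumptions and are not taken from [8].

```python
# Illustrative computation of the semi-supervised objective in Eq. 2.8:
# penalized log-likelihood on labelled data plus gamma times the negative
# conditional entropy on unlabelled data (toy numbers, not from [8]).
import math

def objective(labelled_probs, unlabelled_dists, theta_sq_norm, gamma=0.5):
    likelihood = sum(math.log(p) for p in labelled_probs)       # sum of log p(y_i | x_i)
    regularizer = theta_sq_norm / 2.0                            # U(theta) = ||theta||^2 / 2
    neg_entropy = sum(p * math.log(p)
                      for dist in unlabelled_dists for p in dist if p > 0)
    return likelihood - regularizer + gamma * neg_entropy

# Two labelled examples and two unlabelled examples with binary label distributions;
# the confident unlabelled distribution (0.95/0.05) is penalized less than the flat one.
print(objective([0.9, 0.7], [[0.5, 0.5], [0.95, 0.05]], theta_sq_norm=1.0))
```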
The experimental results in [8] show that their semi-supervised CRF approach shares all of the benefits of standard CRF training, including the ability to exploit arbitrary features of the inputs, while obtaining improved accuracy through the use of unlabelled data. The main drawback of the semi-supervised CRF approach is that training time is increased compared to a standard CRF model. Nevertheless, the algorithm is sufficiently efficient to be trained on unlabelled data sets that yield a notable improvement in classification accuracy over standard supervised training.

2.2 Related Datasets

Datasets of actual blog conversations are needed for training (in supervised approaches) and testing. To be useful for this research, the blog comments we are using have to be rated, in the sense that good-quality comments have been awarded a good rating by human users. The more data is available, the better the algorithms can perform and the more accurate the results of the evaluations become. The following dataset has been used in the related work done by [5].

It is often very difficult to get access to real blog comment conversations, as the selection of the blog comment conversations needs to be done very carefully. For example, the comments of a blog site need to have ratings, and it should be possible to crawl them so that we have a sufficient amount of data to perform training and testing on. For these reasons there are no publicly available corpora for our blog comment research. The only resource we had to start with is the dataset compiled by FitzGerald et al. [5], which is the Slashdot corpus.

2.2.1 The Slashdot Dataset

In [5], the authors compiled a new corpus comprised of articles and their subsequent user comments from the science and technology news aggregation website Slashdot. This site was chosen for several reasons. Comments on Slashdot are moderated by users of the site, meaning that each comment has a score from -1 to +5 indicating the total score of moderations assigned, with each moderator able to modify the score of a given comment by +1 or -1. The authors of [5] took this score to be a gold-standard metric for comment quality. These user moderation classes provide ready-made classifications which can be used as training and testing data for supervised approaches to identifying high-quality comments. Since the default score for an unmoderated comment is either +1 or 0, depending on whether the user is registered or unregistered respectively, they treat comments with scores of 0 or +1 as having no moderation class (class NOT-GOOD). Furthermore, to be able to identify high-quality comments, most of their experiments were conducted with comments grouped into two categories: GOOD (all comments which fall into the four "Good Comment" classes) and NOT-GOOD (the rest).

Slashdot comments are displayed in a threaded, conversation-tree type layout. Users can directly reply to a given comment, and their reply will be placed underneath that comment in a nested structure. This is in contrast to other commenting schemes which are merely linear temporal sequences, where conversational links between individual comments must be inferred from context or from the user indicating the comment or user to which they are replying (i.e. @23 ...). The size of the total collection in the Slashdot corpus is 425,853 comments on 4320 articles.
Chapter 3 Our Dataset

Datasets are fundamental for the evaluation, as well as for the training, of statistical machine learning methods. Research done in blog comments classification has suffered from a lack of publicly available rated blog comment corpora. In this chapter we describe our solution to this problem. More specifically, we developed a new blog comments corpus that was used for classification. It consists of the Slashdot corpus developed by [5] along with five other blog websites, namely DailyKos, AndroidCentral, BusinessInsider, Macrumors and TSN.

3.1 Background

The corpus construction work presented in this project expands on the basis of previous work done by [5]. Following the same requirements applied to select and compile the Slashdot corpus, we chose five additional blog sites to build our datasets: DailyKos (www.dailykos.com), AndroidCentral (www.androidcentral.com), BusinessInsider (www.businessinsider.com), Macrumors (www.macrumors.com) and TSN (www.tsn.ca). These blog sites cover a variety of genres such as technology, business, politics and sports. The motivation behind this is that we want to see the performance of our TCRF model on blogs with different content, size, conversation structure and rating systems. One common feature of these blog sites is that the comments of each blog are rated by their users. In the following sections, we provide a detailed description of each blog site.

Figure 3.1: Slashdot

3.2 Slashdot

Slashdot (Figure 3.1) is a technology-related news website; the summaries of stories and links to news articles are submitted by Slashdot's own readers, and each story becomes the topic of a threaded discussion among users. The Slashdot corpus was collected by [5]. This site was chosen for several reasons. The most important one is that this site has an ideal rating system:

1. Comments already had scores assigned by the blog users.
2. Comments on Slashdot are moderated by users of the site, meaning that each comment has a score from -1 to +5 indicating the total score of moderations assigned, with each moderator able to modify the score of a given comment by +1 or -1.

These user moderation scores provide ready-made classifications which can be used as training and testing data for supervised approaches to identifying high-quality comments. We treat comments with scores of 0 or -1 as having no moderation class (class NOT-GOOD). Others are treated as GOOD comments. The collection totalled 425,853 comments on 4320 articles.

3.3 DailyKos

DailyKos (Figure 3.2) is an American political blog that publishes news and opinions about political events. It has a comment conversation structure similar to the one in Slashdot. However, its comment rating system is different: each comment can be assigned points of either -1 or +1 by the users, and there is no limit on the score a comment can get. These scores can also be used in supervised learning to classify a comment as GOOD or NOT-GOOD. The size of our collection is 1000 posts and 60,713 comments.

Figure 3.2: DailyKos

3.4 AndroidCentral

AndroidCentral (Figure 3.3) is a website providing news and reports about products based on the Android system. The motivations for collecting its comments data are: (a) the comments of AndroidCentral are structured as linear sequences, which can be used to test our model's performance on different conversation structures (both tree and linear); (b) the rating system is different from the other blogs: each comment is rated by users by assigning stars to it, with scores ranging from 0 to 3 stars, which can be used to test our model's performance with respect to different rating methods.
The size of the data collected from AndroidCentral is 1000 posts and 8,277 comments.

Figure 3.3: AndroidCentral

3.5 BusinessInsider

BusinessInsider (Figure 3.4) is a U.S. business/entertainment news website. The site provides and analyzes business news and acts as an aggregator of top news stories from around the web. Its business content, along with rated comments (in the form of Thumbs up (+1) and Thumbs down (-1)), provides us with an additional, different scenario to train and test our model. The size of the data we get from BusinessInsider is 1000 posts with 10,856 comments.

Figure 3.4: BusinessInsider

3.6 Macrumors

Macrumors (Figure 3.5) is a website that aggregates Mac and Apple related news, rumours, and reports. Macrumors is also home to one of the largest Mac-focused forum sites, with over 400,000 members and over 10,000,000 forum posts as of May 2010. We collected 1000 posts and 14,120 comments from this site.

Figure 3.5: Macrumors

3.7 TSN

TSN (The Sports Network, in Figure 3.6) is a website presenting highlights and score updates about sports such as NHL hockey, the NFL and the NBA. The reason for getting comments from TSN is the same as for the other blogs in our corpus: rated comments. Furthermore, for diversity, it is nice to cover a sports topic in our dataset. Finally, the ease of getting the comments is another reason: all the comments of a given post are included in a JSON (JavaScript Object Notation) object, which makes it very easy to process and store in a universal format. The size of the data collected is 1000 posts with 54,235 comments.

Figure 3.6: Macrumors

3.8 Data Processing Framework

3.8.1 Pre-processing

A web crawler was built to collect articles and comments from these blog sites. To handle the issue that different websites have their own way of storing and displaying the content of their comments, a conversion step was developed to store all these data in a common representation. More specifically, we process the crawled data and store it into relational database tables with the same schema. The schema of the table consists of the comment's id [cid], comment title, comment author, date, post id [pid], content of the comment [text], the web link of the comment [url], rating of the comment [score], parent comment of the given comment [parent] and so on. An example of such a database table can be seen in Figure 3.7.

Figure 3.7: Sample table

3.8.2 Transformation

The next step of our data processing framework is to transform the data stored in the database into an XML (Extensible Markup Language) like format. The motivation for this is to convert the data into the linear sequence and tree structures needed by the tool we use to train and test our LCCRF and TCRF models. A sample formatted file is shown in Figure 3.8.

Figure 3.8: XML format data

3.8.3 Feature Extraction

The last step in processing the data is to extract the feature values needed for training and testing in our CRF models. Since we are modifying the GRMM (GRaphical Models in Mallet) toolkit to perform our training and testing tasks, the training and testing files fed to GRMM should look like the format in Figure 3.9.

Figure 3.9: Feature Format
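To make the common representation from Sections 3.8.1 and 3.8.2 concrete, here is a small hedged sketch: the field names follow the schema described above, but the types, class names and the tree-building helper are illustrative assumptions rather than the actual pipeline code.

```python
# Hedged sketch of the common comment representation and the conversation
# tree rebuilt from parent pointers (names and types are assumptions).
from dataclasses import dataclass
from collections import defaultdict
from typing import Optional

@dataclass
class CommentRecord:
    cid: str                 # comment id
    pid: str                 # id of the post the comment belongs to
    parent: Optional[str]    # cid of the parent comment, None for top-level comments
    author: str
    date: str
    title: str
    text: str
    url: str
    score: float             # site-specific rating, later binarized to GOOD / NOT-GOOD

def build_conversation_tree(records):
    """Group one post's comments into a parent -> children mapping."""
    children = defaultdict(list)
    for r in records:
        children[r.parent].append(r.cid)   # key None collects the top-level comments
    return children
```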
3.8.4 Dataset Summary

In Table 3.1, we present a summary of key properties and basic statistics of the data we have collected from these blog sites. These blog sites cover a variety of genres such as technology, business, politics and sports; they also have different conversation structures (sequential or tree) and rely on different rating methods. This kind of diversity provides us with a comprehensive research target. We will be able to test the generality of our approach on very different blogs, and verify how the performance of our proposal varies with respect to the key properties of these blogs. Notice that since our dataset covers several different domains, in future work we will be able to test domain adaptation techniques. Furthermore, these real-life data make our research more applicable and interesting.

Name             Structure    Rating Method         Genre           Articles   Comments
Slashdot         Tree         -1 to 5               Technology      4320       425,853
DailyKos         Tree         positive - negative   Politics        1000       60,713
AndroidCentral   Sequential   0 - 3 Stars           Technology      1000       8,277
BusinessInsider  Tree         positive - negative   Business News   1000       10,856
Macrumors        Sequential   positive - negative   Technology      1000       14,120
TSN              Tree         positive - negative   Sports          1000       54,235

Table 3.1: Dataset description

Chapter 4 Our Approach: Tree structured Conditional Random Fields

This chapter presents our approach to classifying blog comments using Tree-structured Conditional Random Fields. As described in Chapter 2 (Related Work), there are mainly two ways to apply CRFs to our classification problem: LCCRFs and TCRFs. In this work, we focus on expanding the work done by [5]. Instead of applying LCCRFs as the authors of [5] did, we explore the results of blog comments classification using TCRFs to better handle the dependencies across the hierarchically structured comments in a blog conversation. An outline of the chapter is as follows. In Section 4.1 we describe the LCCRF model used in the previous work done by [5]. Section 4.2 introduces the TCRF model we propose in our work.

Figure 4.1: Linear Chain CRFs

4.1 Linear Chain Conditional Random Fields

Conditional Random Fields, introduced by [7], are discriminative probabilistic models which have gained much popularity in Natural Language Processing and Bioinformatics applications. Like Markov Random Fields, CRFs can be defined on an arbitrary undirected graph. However, for sequence-labelling tasks, linear-chain models are often used. These Linear-Chain CRFs have been shown to provide superior performance on many of the same tasks traditionally handled by Hidden Markov Models, their generative counterpart. A CRF is a random field globally conditioned on the observation X.

Linear-Chain CRFs were first introduced by Lafferty et al. [7]. The graphical structure of linear-chain CRFs is shown in Figure 4.1. Let X and Y be random vectors, \Lambda = \{\lambda_k\} \in \mathbb{R}^K be a parameter vector, and \{ f_k(y, y', X_t) \} be a set of real-valued feature functions. Then a linear-chain conditional random field is a distribution p(y \mid x) that takes the form:

p(y \mid x) = \frac{1}{Z(X)} \prod_{t=1}^{T} \exp\Big( \sum_{k=1}^{L} \lambda_k f_k(y_t, y_{t-1}, X_t) \Big)   (4.1)

where Z(X) is an instance-specific normalization function:

Z(X) = \sum_{y} \prod_{t=1}^{T} \exp\Big( \sum_{k=1}^{L} \lambda_k f_k(y_t, y_{t-1}, X_t) \Big)   (4.2)

The parameter values \lambda_k can be estimated from training data by several methods. In [5], Linear-Chain CRFs were applied to the problem of detecting high-quality blog comments in sequences of comments. One benefit of using linear-chain CRFs over more traditional linear classification algorithms is that the sequence of labels is considered.
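Before turning to the limitations of this model, here is a toy illustration of Equations 4.1 and 4.2 for a binary GOOD/NOT-GOOD label set. It is a sketch for intuition only: the feature functions, weights and brute-force normalization are illustrative assumptions; the thesis itself relies on MALLET/GRMM rather than hand-rolled code like this.

```python
# Minimal, illustrative implementation of Eq. 4.1 / 4.2 (not the thesis code).
import math
from itertools import product

LABELS = ("GOOD", "NOT-GOOD")

def sequence_score(y, x, feature_fns, weights):
    """Unnormalized log-score: weighted feature sum over sequence positions."""
    total = 0.0
    for t in range(1, len(y)):
        total += sum(w * f(y[t], y[t - 1], x[t]) for f, w in zip(feature_fns, weights))
    return total

def sequence_probability(y, x, feature_fns, weights):
    """p(y|x) from Eq. 4.1, with Z(x) computed by brute-force enumeration."""
    z = sum(math.exp(sequence_score(y2, x, feature_fns, weights))
            for y2 in product(LABELS, repeat=len(y)))
    return math.exp(sequence_score(y, x, feature_fns, weights)) / z

# Two toy feature functions over (current label, previous label, observation).
feature_fns = [
    lambda yt, yp, xt: 1.0 if yt == "GOOD" and xt["num_replies"] > 2 else 0.0,
    lambda yt, yp, xt: 1.0 if yt == yp else 0.0,   # label continuity along the thread
]
weights = [1.5, 0.3]

x = [{"num_replies": 0}, {"num_replies": 4}, {"num_replies": 1}]
print(sequence_probability(("NOT-GOOD", "GOOD", "NOT-GOOD"), x, feature_fns, weights))
```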
However, this approach has two main limitations: (1) it ignores many hierarchical dependencies between the output tags/labels, and (2) the common internal nodes fall in multiple paths, which causes them to be classified multiple times, and possibly inconsistent classifications have to be combined heuristically (see Section 2.1.1 for details). These limitations inspire the use of the TCRF model, which is introduced in the next section.

4.2 Tree Structured Conditional Random Fields

A Tree-structured Conditional Random Field (TCRF) model is a particular variation of the CRF framework. Since the structure of blog conversations is often a tree, TCRFs are applied to take advantage of the dependencies across hierarchically laid-out information in the blog conversation. The structure of the tree CRF mirrors the structure of the conversation, as shown in Figure 4.2. On the left side, we have a sample conversation initiated by the post x0, which received two comments x1 and x2, etc. On the right side, you can see the corresponding TCRF, in which all the posts are represented as the observed variables xi and their labels as the corresponding hidden variables yi. The TCRF models the parent-child dependencies for all the comments with their children (e.g., y0-y1 and y0-y2).

Figure 4.2: Sample Conversation and corresponding TCRF

The conditional probability of the labels given the observation in a CRF is defined as:

p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{c \in C} \sum_{k} \lambda_k f_k(c, y_c, x) \Big)   (4.3)

where C is the set of cliques, \lambda_k are the feature weights and f_k(c, y_c, x) are the feature functions for the clique. Z(x) is the normalization function. The probability distribution can be rewritten to separate the node cliques from the edge cliques (between different nodes) as:

p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{u \in C_1} \sum_{k} \lambda_k g_k(u, y_u, x) + \sum_{(u,v) \in C_2} \sum_{k} \mu_k h_k(u, v, y_u, y_v, x) \Big)   (4.4)

where C_1 is the set of node cliques and C_2 is the set of edge cliques. In the TCRF model, edge cliques are edges between parent and child nodes. We used the dependency structure in the blog conversations to model the TCRF.

4.3 GRMM Modification

GRMM (GRaphical Models in Mallet) is a general machine learning software package for implementing probabilistic graphical models, providing various fitting/learning algorithms like Limited Memory BFGS (LBFGS) and inference algorithms like Belief Propagation and Tree-based Reparameterization (TRP). We slightly modified the software to model arbitrary tree structures and suit our experimental needs.

Figure 4.3: Parent-Child pairs from a given conversation tree

In order to build Tree structured CRFs in GRMM, we create our own FactorTemplate class that: (1) defines how the tree structure should be created from our FeatureVectorSequence, and (2) defines the parameter tying (all factors created by the same template have tied parameters). Another change is that we need to supply a structure file containing the parent-child pair information of a given blog conversation tree to the modified Arbitrary CRF (ACRF) classes. An example of this process is shown in Figure 4.3, where on the left side we have a sample conversation initiated by the post x0, which received two comments x1 and x2, etc., just the same as in Figure 4.2. On the right side, you can see the corresponding parent-child pairs from the given conversation tree, in which all the child nodes in the conversation tree are mapped to their parent. For example, x1 is mapped to x0 and x3 is mapped to x2.
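The sketch below shows one plausible way of emitting that parent-child information. It is hedged: the one-pair-per-line text format is an illustrative assumption, since the exact file format consumed by our modified ACRF classes is not reproduced in the text; only the idea of listing (child, parent) pairs per conversation comes from Figure 4.3.

```python
# Hedged sketch of generating a parent-child structure file (Section 4.3).
# The file format is an assumption; the essential content is the pair list.

def parent_child_pairs(parent_of):
    """`parent_of` maps a comment index to its parent index (root omitted)."""
    return sorted(parent_of.items())

def write_structure_file(path, parent_of):
    with open(path, "w") as fh:
        for child, parent in parent_child_pairs(parent_of):
            fh.write(f"{child} {parent}\n")

# The conversation from Figure 4.3: x1 and x2 reply to x0; x3 replies to x2.
write_structure_file("conversation_0.struct", {1: 0, 2: 0, 3: 2})
```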
4.4 Feature Selection

Each comment in a given blog conversation tree was represented as a series of features. In our approach, we consider the same features used in the previous work done in [5]: unigrams, lexical similarity, and conversational features. These are described below.

Unigram Features

Unigrams are simply the words present in a given comment. Each unigram is a binary feature, indicating either the presence or absence of the given word in the comment in question. The intuition behind using these features is that good comments might be more likely to include certain words, or that replies to good comments might be more likely to include certain words (such as words which indicate agreement).

Similarity Features

Three features were used which capture the overlap of words between two comments: TF-IDF, LSA and Lexical Cohesion. For each comment, each of these three scores was calculated for both the preceding and the following comment (0 if there was no comment before or after), giving a total of six similarity features. All of these metrics are weighting schemes applied to each word in the document. These weights then form a vector representing the comment, indexed by the terms in the corpus. A given term is zero if the word does not appear in the document, or equal to the weight for the word if it does appear. The distance between two comments is then the cosine similarity between the two vectors. The intuition behind the use of these features is that good comments should be ones that tend to use language similar to those around them; this should indicate comments which are relevant and on-topic. In our work, the "document" referred to in each metric's description is the set of all comments for the article in whose conversation the comment appears.

TF-IDF (term frequency - inverse document frequency) is a weighting scheme commonly used in computational linguistics applications. The overall TF-IDF weight is the product of the Term Frequency and Inverse Document Frequency scores. The Term Frequency is simply the number of occurrences of the term in the document, divided by the total number of terms in the document:

tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}   (4.5)

The Inverse Document Frequency (IDF) is the inverse frequency of the appearance of the term in all documents in the corpus:

idf_i = \frac{|D|}{|\{ d : t_i \in d \}|}   (4.6)

The intuition behind this metric is that the most important words for determining the similarity between two documents are those which are frequent in the documents, but infrequent in the language as a whole. This means that very common words which are likely to appear in all comments (like "and" and "the") will not contribute much to the similarity score.

LSA (Latent Semantic Analysis) is a matrix-based approach to document similarity. First we form a matrix X where entry (i, j) is the occurrence of term i in document j. Then we decompose X as X = U \Sigma V^T, where U and V are orthogonal matrices and \Sigma is a diagonal matrix. The columns of V^T are now d'_j, and the similarity between two documents can be calculated as the cosine distance between d'_i and d'_j.

LCSeg (Lexical Chain Segmenter) works by first forming chains of word repetitions. These chains are then ranked on two criteria: the length of the chain (number of repetitions of the term) and the compactness of the chain (number of words the chain spans in the sequence). The chain weights for each term then form the document vectors, which again are compared with cosine similarity.
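The following small sketch shows the TF-IDF weighting of Equations 4.5 and 4.6 combined with cosine similarity, over a toy "corpus" of comments. It is for intuition only; in the thesis these scores are computed over all comments of an article, and the tokenization here is an assumption.

```python
# Illustrative TF-IDF + cosine-similarity computation (Eqs. 4.5 and 4.6).
import math
from collections import Counter

def tfidf_vector(doc_tokens, corpus):
    n_docs = len(corpus)
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    vec = {}
    for term, n in counts.items():
        tf = n / total                                 # Eq. 4.5
        df = sum(1 for d in corpus if term in d)
        vec[term] = tf * (n_docs / df)                 # Eq. 4.6 (no log, as in the text)
    return vec

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

corpus = [["the", "phone", "is", "great"],
          ["great", "battery", "life"],
          ["the", "post", "is", "off", "topic"]]
print(cosine(tfidf_vector(corpus[0], corpus), tfidf_vector(corpus[1], corpus)))
```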
Conversational Features

The conversational features capture information about where the comment is situated in the conversation as a whole. The list is as follows:

- NumReplies: The number of replies to this comment. This includes all comments which are children of this comment in the conversation tree.
- WordLength: The length of this comment in words.
- SentenceLength: The length of this comment in sentences.
- AvgReplyWordLength: The average length of replies to this comment, in words.
- AvgReplySentLength: The average length of replies to this comment, in sentences.
- TotalReplyWordLength: The total length of all replies to this comment, in words.
- TotalReplySentLength: The total length of all replies to this comment, in sentences.

In addition, we also modify and apply some of the features introduced in the work done by [11] to our experiments. They are listed as follows:

- TPOS1: The time from the post time of the original post to the time of this comment.
- TPOS2: The time from the post time of this comment to the time of the last comment in this conversation tree.
- PPAU: The time between the post time of this comment and its parent's post time.
- BEGAUTH: Whether or not the comment is the first comment on the given article.
- DOM: A measure of how dominant the current participant is, in terms of words, in the conversation.

Chapter 5 Experiments and Results

5.1 Experiments

5.1.1 Experiments Setup

Dataset

Our dataset is described in detail in Chapter 3. The size of the entire corpus is 8,400 blog conversations with 304,500 comments.

Software Package

We used the GRMM (GRaphical Models in Mallet) toolkit [14] to implement our TCRF model. GRMM is a general machine learning software package for implementing probabilistic graphical models, providing various fitting/learning algorithms like Limited Memory BFGS (LBFGS) and inference algorithms like Belief Propagation and Tree-based Reparameterization (TRP). We slightly modified the software to model arbitrary tree structures and suit our experimental needs.

Classifiers

In our experiment, we used three classifiers for blog comments classification. The first one is a majority-class baseline classifier that assigns to each testing comment the most frequent class in the training set. For example, if the training corpus contains 60% GOOD comments and 40% NOT-GOOD comments, the classifier will always classify a testing comment as a GOOD comment. The second is a CRF classifier using a linear chain as its graphical model [5]. The third is our new approach, a CRF classifier using the dependency tree as its graphical model.

5.2 Training and Testing

For the training and testing, the popular natural language machine learning toolkit MALLET (http://mallet.cs.umass.edu/index.php) and its package GRMM (http://mallet.cs.umass.edu/grmm/index.php) are used. MALLET is used for the LCCRF model, while GRMM's ACRF class is modified to create the TCRF model.

10-fold cross-validation is used for our training and testing of each blog's corpus under both the LCCRF and TCRF models. In 10-fold cross-validation, the original sample is randomly partitioned into 10 subsamples. Of the 10 subsamples, a single subsample is retained as the validation data for testing the model, and the remaining 9 subsamples are used as training data. The cross-validation process is then repeated 10 times (the folds), with each of the 10 subsamples used exactly once as the validation data. The 10 results from the folds can then be averaged (or otherwise combined) to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once.
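A minimal sketch of this protocol is given below. Splitting at the level of whole conversations (so a conversation never appears in both the training and validation folds) is an illustrative assumption; the text does not spell out the fold-assignment details.

```python
# Hedged sketch of the 10-fold cross-validation protocol described above.
import random

def ten_fold_splits(conversations, seed=0):
    conversations = list(conversations)
    random.Random(seed).shuffle(conversations)
    folds = [conversations[i::10] for i in range(10)]
    for i in range(10):
        validation = folds[i]
        training = [c for j, fold in enumerate(folds) if j != i for c in fold]
        yield training, validation

for train, valid in ten_fold_splits(range(1000)):
    pass  # train a CRF on `train`, evaluate on `valid`, then average the 10 scores
```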
Chapter 5

Experiments and Results

5.1 Experiments

5.1.1 Experiments Setup

Dataset

Our dataset is described in detail in Chapter 3. The size of the entire corpus is 8,400 blog conversations with 304,500 comments.

Software Package

We used the GRMM (Graphical Models in Mallet) toolkit [14] to implement our TCRF model. GRMM is a general machine learning software package for implementing probabilistic graphical models, providing various fitting/learning algorithms such as Limited Memory BFGS (LBFGS) and inference algorithms such as Belief Propagation and Tree-based Re-parameterization (TRP). We slightly modified the software to model arbitrary tree structures and suit our experimental needs.

Classifiers

In our experiments, we used three classifiers for blog comment classification. The first is a majority-class baseline classifier that assigns to each testing comment the most frequent class in the training set. For example, if the training corpus contains 60% GOOD comments and 40% NON-GOOD comments, the classifier will always classify a testing comment as a GOOD comment. The second is a CRF classifier using a linear chain as its graphical model [5]. The third is our new approach, a CRF classifier using a dependency tree as its graphical model.

5.2 Training and Testing

For training and testing, the popular natural language machine learning toolkit MALLET (http://mallet.cs.umass.edu/index.php) and its package GRMM (http://mallet.cs.umass.edu/grmm/index.php) are used. MALLET is used for the LCCRF model, while GRMM's ACRF class is modified to create the TCRF model.

10-fold cross-validation is used for training and testing on each blog's corpus under both the LCCRF and TCRF models. In 10-fold cross-validation, the original sample is randomly partitioned into 10 subsamples. Of the 10 subsamples, a single subsample is retained as the validation data for testing the model, and the remaining 9 subsamples are used as training data. The cross-validation process is then repeated 10 times (the folds), with each of the 10 subsamples used exactly once as the validation data. The 10 results from the folds can then be averaged (or otherwise combined) to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once.

To compare the performance difference under the LCCRF and TCRF models, the blogs in our corpus with a tree-structured conversation layout (Slashdot, DailyKos, BusinessInsider and TSN) are trained and tested under both models.

5.3 Results

5.3.1 Classification

Our goal in these experiments is to detect comments which had been identified as GOOD by blog users. This suggests that the problem should be formulated as a binary classification problem, in which we need to identify the good comments.

Our experiment is to train the LCCRF and TCRF using data where the full set of moderation labels had been grouped into GOOD and NON-GOOD comments. These two Conditional Random Field classifiers were trained on the full set of features presented in Section 4.4. The results of our experiment are presented in Table 5.1.

    LCCRF              Precision   Recall    F-Score
    BusinessInsider      62.78%    62.72%    62.75%
    DailyKos             69.40%    63.60%    66.40%
    Slashdot             70.45%    69.42%    69.93%
    TSN                  64.31%    59.22%    61.66%

    TCRF               Precision   Recall    F-Score
    BusinessInsider      65.86%    61.58%    63.65%
    DailyKos             75.20%    63.82%    72.30%
    Slashdot             73.48%    68.93%    71.13%
    TSN                  72.33%    66.31%    68.81%

    Baseline           Precision   Recall    F-Score
    BusinessInsider      54.22%    53.47%    53.84%
    DailyKos             59.23%    56.31%    57.73%
    Slashdot             60.23%    63.12%    61.64%
    TSN                  56.75%    58.12%    57.43%

    Table 5.1: Comparison of LCCRF, TCRF and Baseline

From Table 5.1 we can see that TCRF performs better than LCCRF for all four tree-structured blog corpora in terms of F-Score. The improvement is especially noticeable for DailyKos and TSN. However, for Slashdot and BusinessInsider, the improvement is very limited because of a slight decrease in Recall. The analysis of this result is presented in the next section.

5.3.2 Result Analysis

Statistical Analysis

For the statistical analysis, we performed a paired t-test to see whether the improvement we obtained using the TCRF model over the LCCRF model was statistically significant. We calculated the results for each conversation tree in each blog using TCRF and LCCRF. For example, DailyKos has 1,000 conversation trees (articles); we calculated the F-Score for each conversation tree using both TCRF and LCCRF, giving 1,000 scores under each model. We then performed the paired t-test with the following hypotheses: (1) H0: Mean_TCRF <= Mean_LCCRF; (2) Ha: Mean_TCRF > Mean_LCCRF. The chosen significance level is 5% for all tests.

In our tests for all four blogs, the TCRF model showed a statistically significant improvement over the LCCRF approach. One-tailed t-tests were always performed for these comparisons, and the null hypothesis was rejected with 95% confidence.
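The significance test above can be reproduced with a few lines of SciPy. The following is a minimal sketch under the assumption that the per-conversation F-scores for both models are available as two aligned lists; the numbers shown are made up for illustration and are not the thesis results.

    from scipy import stats

    # Per-conversation F-scores; illustrative values only
    tcrf_f1  = [0.71, 0.68, 0.74, 0.69, 0.72, 0.70]
    lccrf_f1 = [0.66, 0.67, 0.69, 0.65, 0.70, 0.64]

    # One-tailed paired t-test: H0: mean(TCRF) <= mean(LCCRF) vs Ha: mean(TCRF) > mean(LCCRF)
    # (the 'alternative' keyword requires SciPy >= 1.6)
    t_stat, p_value = stats.ttest_rel(tcrf_f1, lccrf_f1, alternative="greater")

    alpha = 0.05
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
    print("significant at the 5% level" if p_value < alpha else "not significant")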
Tree Complexity vs Performance

In this section we focus on analyzing the correlation between the performance of our TCRF model and the complexity of the tree structure of each blog corpus. The complexity of a tree structure is determined by its Branching Factor and its Height. The Branching Factor of a tree is the number of children at each node; since the number of children is not uniform across the nodes of a given blog conversation tree, an average branching factor is calculated for each tree. The Height of a tree is the height of its root node, where the height of the root node is the length of the longest path from that node to a leaf. In other words, the Height of a tree is the "number of levels" in the tree.

In order to represent the Branching Factor and the Height of a tree using a single value, so that we can more easily see the correlation between the complexity of a tree and the performance of our TCRF model, we define the complexity of a tree as follows:

    \mathrm{Complexity}(Tree) = \mathrm{BranchingFactor}(Tree) \times \mathrm{Height}(Tree)        (5.1)

The tree complexity information of all four blogs with tree-structured conversations is shown in Table 5.2.

    Blogs              Branching Factor   Height   Complexity
    BusinessInsider          3.67          1.86       6.83
    DailyKos                 4.33          2.79      12.08
    Slashdot                 6.47          2.47      15.98
    TSN                      3.92          2.92      11.45

    Table 5.2: Tree Complexity of Blogs

From Table 5.2, we can observe that Slashdot has the highest complexity score at 15.98, while BusinessInsider's 6.83 is the lowest. DailyKos and TSN are very close in terms of complexity score.

    Blogs              Complexity   Performance Improvement
    BusinessInsider        6.83             1.43%
    DailyKos              12.08             8.89%
    Slashdot              15.98             1.72%
    TSN                   11.45            11.60%

    Table 5.3: Tree Complexity vs Performance Improvement in percentage

Table 5.3 and Figure 5.1 together show the relationship between the blogs' tree complexity and the improvement they obtained using TCRFs over LCCRFs.

    [Figure 5.1: Tree Complexity vs Performance Improvement in percentage]

The observation is that if the tree complexity of a given blog is low (BusinessInsider), the improvement obtained with the TCRF model over LCCRF is limited compared to blogs with higher tree complexity (DailyKos and TSN). However, if the tree complexity is above some limit (Slashdot), the performance of TCRF decreases due to the overly complex tree structure of the blog conversation.
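A minimal sketch of how the complexity measure in Equation (5.1) could be computed for a conversation tree is shown below. The Node class is illustrative, and the branching factor is averaged over non-leaf nodes, which is one plausible reading of the definition above; the thesis does not spell out whether leaves are included in the average.

    class Node:
        def __init__(self, children=None):
            self.children = children or []

    def height(node):
        # Length of the longest path from this node down to a leaf
        if not node.children:
            return 0
        return 1 + max(height(c) for c in node.children)

    def avg_branching_factor(node):
        # Average number of children over all non-leaf nodes of the tree
        counts = []
        def walk(n):
            if n.children:
                counts.append(len(n.children))
                for c in n.children:
                    walk(c)
        walk(node)
        return sum(counts) / len(counts) if counts else 0.0

    def complexity(root):
        # Equation (5.1): Complexity(Tree) = BranchingFactor(Tree) * Height(Tree)
        return avg_branching_factor(root) * height(root)

    # Toy conversation: a post with three comments, one of which has two replies
    tree = Node([Node(), Node([Node(), Node()]), Node()])
    print(avg_branching_factor(tree), height(tree), complexity(tree))   # 2.5 2 5.0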
Chapter 6

Conclusions and Future Work

6.1 Conclusions

In this thesis we described how to classify blog comments using a TCRF model. Our classification approach is an extension of the previous work done by [5], which applied an LCCRF approach to train the classifier. Since we used supervised machine learning, we also required a dataset; to our knowledge there were no publicly available datasets for blog comments, so we collected blog comment conversations from a variety of different blogs. We compared the Tree-structured Conditional Random Fields and Linear-chain Conditional Random Fields approaches and found that the TCRF model yields a statistically significant improvement over the LCCRF approach. As a summary, we outline the main achievements of this thesis here:

- We compiled a new corpus comprising the Slashdot dataset collected by FitzGerald et al. [5] and conversations from five other blog websites, namely DailyKos, AndroidCentral, BusinessInsider, Macrumors and TSN. These blog sites cover a variety of genres such as technology, business, politics and sports; they also have different conversation structures (sequential or tree). For example, Slashdot conversations have a tree-like structure: users can directly reply to a given comment, and their reply will be placed underneath that comment in a nested structure. In contrast, comments in conversations from AndroidCentral always form a single linear thread. We were therefore able to test the generality of our approach on very different blogs and to verify how the performance of our proposal varies with respect to the key properties of these blogs. Notice that since our dataset covers several different domains, we will also be able to test domain adaptation techniques in future work.

- In the previous work done in the field of comment classification using CRFs, the authors use linear-chain CRFs (LCCRFs) to detect informative blog comments through the exploration of conversational and topical features. However, their approach has two main limitations: (1) it ignores hierarchical dependencies between the output tags/labels, and (2) common internal nodes fall on multiple paths, which causes them to be classified multiple times, so that possibly inconsistent classifications have to be combined heuristically. Our proposed approach applies a TCRF model to better handle the dependencies across the hierarchically structured comments in a blog conversation.

- For building tree-structured CRFs in GRMM, we created our own FactorTemplate class and supplied a structure file containing the parent-child pair information of a given blog's conversation trees to the modified Arbitrary CRF (ACRF) classes.

- We compared our proposed TCRF model with the LCCRF approach previously applied in [5] and found that our TCRF approach works better than the LCCRF model. TCRFs outperformed LCCRFs because the inherent tree structure captures the semantic and syntactic dependencies better.

- In this thesis we also wanted to see whether there was any correlation between the performance improvement of our TCRF model and the tree complexity of a given blog. We observed that higher complexity can bring a performance gain for the TCRF model; however, an overly complicated tree structure can hurt the performance of our model.

6.2 Future Work

Classification of blog comments is a relatively new field, and there are many areas that have yet to be explored. Research can also be done on similar but newer forms of communication such as Twitter or Facebook discussions. The following are some topics that would be interesting to explore:

- Extending this research into other forms of communication. Although online conversations such as Twitter or Facebook discussions are less structured than blog conversations, it would be interesting to see whether the same techniques can be applied in that setting.

- It would be preferable to have finer-grained classification than just GOOD and NON-GOOD. For this task, some of the finer-grained classes provided in the Slashdot corpus would be a good starting point. Models other than CRFs may be needed, since LCCRFs did not perform well on this task in the work done by [5].

- Since our dataset covers several different domains, it would be interesting to test domain adaptation techniques using our CRF models.

- Exploring more ways to refine our ability to classify comments. For example, some blogs make author information available for each comment; we could take advantage of this kind of information in our model to see whether it leads to a better outcome.

- Incorporating comment classification into an abstractive summarization system. It would be useful to have such a system to summarize blog comments for users.

Bibliography

[1] Pranjal Awasthi, Aakanksha Gagrani, and Balaraman Ravindran. Image modelling using tree structured conditional random fields. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI 2007), pages 2060-2065, 2007.

[2] Charles Sutton, Andrew McCallum, and Khashayar Rohanimanesh. Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. Journal of Machine Learning Research, 2007.

[3] Trevor Cohn and Philip Blunsom. Semantic role labelling with tree conditional random fields.
In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL '05), pages 169-172, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics.

[4] Matthieu Constant and Isabelle Tellier. Evaluating the impact of external lexical resources into a CRF-based multiword segmenter and part-of-speech tagger. In International Conference on Language Resources and Evaluation (LREC'12), 2012.

[5] Nicholas FitzGerald, Giuseppe Carenini, Gabriel Murray, and Shafiq R. Joty. Exploiting conversational features to detect high-quality blog comments. Canadian Conference on AI, pages 122-127, 2011.

[6] Michel Galley. A skip-chain conditional random field for ranking meeting utterances by importance. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 364-372, 2006.

[7] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. International Conference on Machine Learning, 2001.

[8] Feng Jiao, Shaojun Wang, Chi-Hoon Lee, Russell Greiner, and Dale Schuurmans. Semi-supervised conditional random fields for improved sequence segmentation and labeling. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ACL-44, 2006.

[9] Andrew McCallum. Efficiently inducing features of conditional random fields. Conference on Uncertainty in Artificial Intelligence, 2003.

[10] Andrew McCallum and Wei Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. Seventh Conference on Natural Language Learning (CoNLL), 2003.

[11] Gabriel Murray and Giuseppe Carenini. Summarizing spoken and written conversations. In Proceedings of EMNLP 2008, 2008.

[12] Fuchun Peng, Fangfang Feng, and Andrew McCallum. Chinese segmentation and new word detection using conditional random fields. In COLING, 2004.

[13] C. Sutton and A. McCallum. An introduction to conditional random fields for relational learning. In Lise Getoor and Ben Taskar, editors, Introduction to Statistical Relational Learning. MIT Press, 2007.

[14] Charles Sutton. GRMM: Graphical Models in Mallet, 2006.

[15] Jie Tang, Mingcai Hong, Juanzi Li, and Bangyong Liang. Tree-structured conditional random fields for semantic annotation. In Proceedings of the 5th International Conference on the Semantic Web (ISWC 2006), pages 640-653, 2006.

[16] Yuan Qi, Martin Szummer, and Thomas P. Minka. Bayesian conditional random fields. In AISTATS 2005, 2005.
