UBC Theses and Dissertations

Cluster-based information retrieval modeling Sze, Richard 2004

Cluster-based Information Retrieval Modeling

Richard Sze
B.S., University of Oregon, 2001

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN BUSINESS ADMINISTRATION IN THE FACULTY OF GRADUATE STUDIES (Sauder School of Business)

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
March 2004

© Richard Sze, 2004

Library Authorization

In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Abstract

Cluster-based information retrieval, an extension of information retrieval strategy, is based on the assumption that a document collection can be organized into a set of topics so that a user can enhance retrieval effectiveness. The cluster-based IR model assumes that queries can be associated with clusters that contain high concentrations of relevant documents, and that such association can lead to gains in retrieval effectiveness. Earlier studies, however, have provided negative to mixed results for the performance of the model.
Moreover, studies are lacking which investigate the potential of the model in situations where queries are manually associated with the appropriate clusters. The goal of this thesis is to provide evidence for the validity of the cluster-based IR model's effectiveness by conducting extensive empirical studies which explore alternative schemes of the model on a large scale and against a well-accepted benchmark. Investigation shows that the cluster-based IR model has the potential to enhance retrieval effectiveness, and yet, alternative techniques fail to actually achieve enhanced effectiveness.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
CHAPTER 1: INTRODUCTION
  1.1 Motivation
  1.2 Thesis Objective
  1.3 Thesis Outline
CHAPTER 2: REVIEW OF IR AND RELATED WORK
  2.1 Overview of the Traditional IR Model
    2.1.1 Traditional IR Automatic Text Analysis
  2.2 Overview and Related Work on Cluster-based IR Model
    2.2.1 The Cluster Hypothesis and the Cluster-based IR Model
    2.2.2 Similarity Measurement
    2.2.3 The Cluster-based IR Model Effectiveness
    2.2.4 The Manual Clusters Selection Effectiveness
    2.2.5 The Flaw in Query-Cluster Association
  2.3 Cluster-based Weighting and Matching
    2.3.1 Cluster-based Weighting
    2.3.2 Cluster-based Matching
CHAPTER 3: RESEARCH QUESTIONS
CHAPTER 4: EXPERIMENTAL SETUP
  4.1 The Data Set
  4.2 The Clustering Algorithm and Implementation
    4.2.1 Tokenizing and Indexing the Terms
    4.2.2 Clustering Algorithm
    4.2.3 Traditional Query-Cluster Association
    4.2.4 Traditional Query-Document Association
  4.3 Experimental Measures
    4.3.1 Experimental Baseline
    4.3.2 Retrieval Effectiveness
CHAPTER 5: INVESTIGATING THE POTENTIAL OF THE CLUSTERING IR MODEL
  5.1 Establishing the Baseline - The Traditional Clustering IR Model
    5.1.1 Experiment Description
    5.1.2 Results and Analysis
  5.2 Manual Selection of the Most Relevant Clusters
    5.2.1 Experiment Description
    5.2.2 Results and Analysis
CHAPTER 6: IMPROVING QUERY-CLUSTER ASSOCIATION
  6.1 Scheme Description
  6.2 Results and Analysis
  6.3 Alternative Approaches for Document-based Cluster Ranking
    6.3.1 Scoring Approach 1 Description
    6.3.2 Scoring Approach 2 Description
    6.3.3 Scoring Approach 3 Description
    6.3.4 Scoring Approaches Results
CHAPTER 7: EXTENDING THE CLUSTER-BASED IR MODEL
  7.1 Cluster-based Weighting
    7.1.1 Query Index Re-weighting Results
    7.1.2 Document Index Re-weighting Results
    7.1.3 Simple and Complete Realization Comparison
  7.2 Cluster-based Matching
    7.2.1 A Rule-based Matching Concept
    7.2.2 A Rule-based Matching Scheme
    7.2.3 Results
    7.2.4 Other Proposed Cluster-based Matching Approaches
CHAPTER 8: DISCUSSION AND FUTURE RESEARCH
  8.1 Discussion
  8.2 Contributions
  8.3 Limitations and Future Research
REFERENCES

List of Figures

Figure 2.1. A typical IR system from van Rijsbergen (1979)
Figure 2.2. Matching relevant documents with the query in the traditional IR model
Figure 2.3. Matching a query with the most similar cluster profile
Figure 2.4. Matching a query with similar documents in the selected cluster
Figure 5.1. Precision[10] for the cluster-based IR and vector space model
Figure 5.2. Precision[10] for manual clusters selection vs. vector space model (100 clusters per query)
Figure 5.3. Precision[10] for manual clusters selection versus traditional clustering IR
Figure 7.1. Comparison of P[10] for traditional global TF-IDF weighting (baseline) and three other cluster-based weighting schemes with 1-cluster query-cluster association
Figure 7.2. Comparison of P[10] for traditional global TF-IDF weighting (baseline) and three other cluster-based weighting schemes with 5-cluster query-cluster associations
Figure 7.3. Comparison of P[20] for traditional global TF-IDF weighting (baseline) and three other cluster-based weighting schemes with 5-cluster query-cluster associations
Figure 7.4. Comparison of Recall for traditional global TF-IDF weighting (baseline) and three other cluster-based weighting schemes with 5-cluster query-cluster associations

List of Tables

Table 5.1. Performance summary of the cluster-based IR vs. vector space model
Table 5.2. Precision losses for cluster-based IR compared to the vector space model
Table 5.3. Performance summary of manual clusters selection vs. vector space model
Table 5.4. Precision and Recall changes for manual clusters selection vs. vector space model
Table 6.1. Document-based cluster ranking precisions and recall summary
Table 6.2. Precision and recall gains/losses compared to traditional query-cluster association with the same number of clusters selected
Table 7.1. Query index re-weighting precisions and recall summary
Table 7.2. Query index re-weighting precision and recall gains/losses compared to traditional query index weighting with the same number of clusters selected
Table 7.3. Document index re-weighting precisions and recall summary
Table 7.4. Document index re-weighting precision and recall gains/losses compared to traditional document index weighting with the same number of clusters selected
Table 7.5. Percentage of P[10], P[20], and Recall gains over the traditional cluster-based IR model for simple and complete realization with 5-cluster query-cluster associations
Table 7.6. Precisions and recall summary for rule-based matching
Table 7.7. Rule-based matching precision and recall gains/losses compared to traditional document index weighting with the same number of clusters selected

Chapter 1: Introduction

1.1 Motivation

Conceptual modeling, which organizations use as blueprints for their databases, is currently a hot topic in Information Systems (IS) Management.
Information can be taken from structured (relational) or unstructured (text) data.[1] Structured data modeling receives most of the spotlight in conceptual modeling research; it can be done with traditional models such as E-R diagrams, among other models. In contrast, unstructured data such as documents are often stored in different ways and created individually and manually rather than automatically. Information Retrieval (IR) is the most popular technique for accessing unstructured text data. In particular, cluster-based IR modeling is a subset of the IR field for constructing an automatic topic-based organization of unstructured data in order to achieve information retrieval effectiveness.

[1] Sources of media such as videos and pictures are also unstructured data. In this thesis, we focus on textual information.

1.2 Thesis Objective

We have investigated approaches for accessing textual information. In order to restrict the scope of this study, we focus on a cluster-based retrieval model. The clustering IR model organizes a collection of documents into topically coherent sets based on the documents' textual information. When a user makes a request, the process associates the request, referred to as a query, with a small number of a collection's topical organizations, referred to as clusters, and matches the query with documents contained only in the selected clusters. Cluster-based IR methods have not been widely implemented in modern commercial systems (Cutting et al. 1992), possibly because the validity of the model's foundation is still controversial or perhaps because the model does not perform effectively enough. The cluster-based IR model was first introduced by Jardine and van Rijsbergen (1971), who developed and demonstrated the validity of a fundamental hypothesis for the model with a small collection of documents.
The fundamental hypothesis, called the cluster hypothesis, states that "the associations between documents convey information about the relevance of documents to requests" (Jardine and van Rijsbergen 1971). Their study showed that the cluster-based IR model not only can reduce the computation time, or improve the efficiency, of the retrieval process, but also has the potential to improve retrieval effectiveness. Jardine and van Rijsbergen noted that effectiveness can be enhanced only if the correct cluster is associated with the query, an assertion which was challenged by Willet (1988).[2] Willet's objection raises this question: is there potential for the cluster-based IR model?

Arazy (2004a) also suggested that query-cluster association is a critical issue for clustering retrieval effectiveness. We therefore raise another question about the model: can query-cluster associations be improved? In his thesis, Arazy (2004b) revisited the cluster-based IR model and extended it with two principles to enhance effectiveness: cluster-based weighting and cluster-based matching. His simplified realization of the principles, tested through empirical studies, showed performance gains over the basic cluster-based IR model.

Hence, using these issues as a starting point, the objectives of this thesis are the following:

• To empirically test the traditional cluster-based IR model
• To test the performance of the traditional cluster-based IR model if the clusters most relevant to queries are manually selected
• To explore techniques for enhancing the query-cluster association
• To propose and test a more complete realization of cluster-based weighting and cluster-based matching

[2] Willet (1988) argued that Jardine and van Rijsbergen's study, conducted with a small collection of documents in a specific field, was unrepresentative and could not be validated as usable for other collections.

1.3 Thesis Outline

The structure of this thesis is as follows: Chapter 2 discusses the literature on the cluster-based IR model and some of the challenges to the model, Chapter 3 raises the research questions underlying the model, and Chapter 4 describes our experimental design and implementation. In Chapter 5, we perform an analysis to test the potential effectiveness of the traditional cluster-based IR model. The results of our analysis also establish the performance measurements and the baseline for testing. Chapter 6 explores new techniques to improve the effectiveness of the query-cluster association, and Chapter 7 proposes and tests a more complete realization of the extended cluster-based IR model. Finally, Chapter 8 discusses and draws conclusions from the results of this thesis.

Chapter 2: Review of IR and Related Work

In this chapter, we provide an overview of the traditional IR field, which serves as the foundation for the cluster-based IR model. An overview of this traditional model is important because clustering is an extension of it. We then discuss van Rijsbergen's traditional cluster-based IR model by reviewing the concept and the past literature. It is important to understand the cluster-based IR model before we detail our investigations and our suggestions for enhancing the model's performance.
In addition, this chapter revisits Arazy's extended cluster-based IR model (Arazy 2004a).

2.1 Overview of the Traditional IR Model

Traditional IR is defined as an activity for accessing unstructured information requested by a user. The word "information" is substituted by "document" to describe explicitly the kind of information an IR system processes (van Rijsbergen 1979). The central idea of the traditional IR model is to use an automatic retrieval strategy which attempts to retrieve as many relevant documents as possible while avoiding retrieving any non-relevant documents (van Rijsbergen 1979). Logically, we can establish the relevance of a document to a query manually. However, as the documents of a collection grow rapidly and more and more retrievals are processed by computers, we need an automatic retrieval strategy to construct an information retrieval system to perform the task. Van Rijsbergen (1979) used the following diagram to illustrate the design of an IR system:

[Diagram: queries and documents enter as input, pass through a processor with a feedback loop, and produce output.]

Figure 2.1. A typical IR system from van Rijsbergen (1979).

The IR system diagrammed here has queries and documents as inputs. In most cases, the document representation will be stored as a list of extracted words, which the next section describes in more detail. A processor automates the document retrieval based on the query in order to yield an output: a list of documents relevant to a request.

2.1.1 Traditional IR Automatic Text Analysis

The traditional IR model defines tokens as representations of document and query text. Queries (sometimes referred to as requests or keywords) and documents are represented through a vector of weighted tokens. Using the weighted tokens, a system directly approximates the relevance of documents to queries based on the similarity of the document vector to the query vector.
Figure 2.2 shows the collection of documents with the query matching the relevant documents in the vector space (Salton and Lesk 1971).

[Diagram: relevant and non-relevant documents plotted around the query in the vector space.]

Figure 2.2. Matching relevant documents with the query in the traditional IR model.

The vector space is a representation of queries and documents through a vector of tokens. Tokens are extracted from the documents' and queries' text using the statistical indexing techniques of Luhn (1958), where frequency counts of words in a collection are analyzed in order to determine an upper and a lower cut-off point. The upper cut-off excludes common everyday English words such as "I," "he" or "for," while the lower cut-off excludes rare words. Both types of words are discriminated against because they are not significant for representing the content of a document collection.
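The cut-off filtering just described can be sketched as follows. The threshold values here are illustrative assumptions (the thesis does not report the exact cut-off points used), and the function name is our own:

```python
from collections import Counter

def luhn_filter(docs, upper_cutoff=0.5, lower_cutoff=2):
    """Keep terms between Luhn-style frequency cut-offs.

    upper_cutoff: drop terms occurring in more than this fraction of the
        documents (common everyday words such as "I", "he", "for").
    lower_cutoff: drop terms occurring in fewer than this many documents
        (rare words). Both thresholds are illustrative assumptions.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(t for doc in docs for t in set(doc.lower().split()))
    return {t for t, n in df.items()
            if n >= lower_cutoff and n / n_docs <= upper_cutoff}

docs = ["he likes retrieval", "she likes clustering",
        "he reads about retrieval", "he writes about clustering"]
vocab = luhn_filter(docs)  # "he" (too common) and "she" (too rare) are excluded
```

In a production indexer the upper cut-off is usually implemented with a stop-word list (as this thesis does in section 4.2.1 with SMART's common words list) rather than a raw frequency threshold.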
The following are the definitions of the notations used for the TF-IDF scheme, the one employed by this thesis: Wjj = the weight of token i in document j = (fy / max(fj)) x log(D / nj) f(j = the frequency of token i in document j max(fj) = the maximum frequency of any token in document j D = the number of documents in the collection nj = the number of documents in which the index term i appears 8 2.2 Overview and Related Work on Cluster-based IR Model 2.2.1 The Cluster Hypothesis and the Cluster-based IR Model Cluster-based IR is based on the assumption that a document collection can be organized into a set of topics, thus allowing retrieval effectiveness to be enhanced. Cluster-based IR is also an extension of the vector-space model. We define a vector space as a space where document contents are represented by tokens. Within the vector space, clusters define the region of the content and classify the documents. Traditionally queries are associated with the entire collection of documents in the vector space model. A cluster-based IR processor initially performs the following tasks: 1) Indexing the documents in the collection set. The indexes can be the frequency word count or the weighted value of the reduced token list of each document 2) Classifying similar documents into clusters by measuring their indexes 3) Computing a profile of each cluster. A cluster profile is an average of the index weights of all the documents contained in the cluster Once a user submits a query, the processor would match a query with documents in the following steps: 1) Indexing the query 2) Associating the query with the cluster profiles to select the most similar cluster(s) 9 (see figure 2.3) 3) Matching the query with the documents contained in the selected cluster(s) (see 10 • Cluster Profiles ® Relevant Documents # Non-relevant Documents Figure 2.4. Matching a query with similar documents in the selected cluster. 
Using a small collection of documents, Jardine and van Rijsbergen provided empirical support for the use of cluster-based IR to enhance efficiency (by reducing the time to retrieve documents) and an initial indication of improved effectiveness (by returning more relevant documents to the user).

2.2.2 Similarity Measurement

Similarity/distance functions approximate how relevant two components (query, cluster profile, or document) are to each other. The approximation can be calculated by several measures. In this thesis, we implemented the cosine function, which is the most commonly used measure. There are three similarity measurements available for the cluster-based IR model: Sim(Q,D)[3], Sim(Q,P), and Sim(P,D)[4]. The value of a similarity measure varies from 0 to 1. Two components are completely non-relevant, that is, not similar, if their similarity measure is exactly 0, and completely relevant, or similar, if their similarity measure is exactly 1. The three similarities are defined as follows:

Sim(Q_i,D_j) - a cosine measure evaluating the similarity between the query Q_i and the document D_j:

Sim(Q_i,D_j) = Σ_{n=1..k} (t_in x t_jn) / sqrt(Σ_{n=1..k} (t_in)^2 x Σ_{n=1..k} (t_jn)^2)

where, given k unique tokens, t_in = frequency of term n in query i, and t_jn = frequency of term n in document j.

Sim(Q_i,P_j) - a cosine measure evaluating the similarity between the query Q_i and the cluster profile P_j:

Sim(Q_i,P_j) = Σ_{n=1..k} (t_in x t_jn) / sqrt(Σ_{n=1..k} (t_in)^2 x Σ_{n=1..k} (t_jn)^2)

where, given k unique tokens, t_in = frequency of term n in query i, and t_jn = frequency of term n in cluster profile j.

Sim(P_i,D_j) - a cosine measure evaluating the similarity between the cluster profile P_i and the document D_j:

Sim(P_i,D_j) = Σ_{n=1..k} (t_in x t_jn) / sqrt(Σ_{n=1..k} (t_in)^2 x Σ_{n=1..k} (t_jn)^2)

where, given k unique tokens, t_in = frequency of term n in cluster profile i, and t_jn = frequency of term n in document j.

[3] Sim(Q,D) is a traditional measurement for IR.
[4] Sim(Q,P) and Sim(P,D), to the best of our knowledge, have only been referenced by Arazy (2004b).

2.2.3 The Cluster-based IR Model Effectiveness

Shaw et al. (1997) performed an analysis of cluster-based IR effectiveness and claimed that the cluster-based IR model has low operational performance, suggesting that either the cluster-based IR model or the techniques of cluster-based IR are inferior. Most of the focus of cluster-based IR research has been on its efficiency. By reducing the number of documents to be matched against the query, several experiments demonstrated some success in using automatic classification in IR (Good 1958; Fairthorne 1961). On the other hand, studies of the effectiveness of the cluster-based IR model have shown negative to mixed results at best. Effectiveness is a measurement in IR that evaluates how relevant the returned documents are. Traditionally, relevancy in IR is measured in terms of precision and recall. Van Rijsbergen (1979) defined recall as the number of retrieved relevant documents divided by the total number of relevant documents, while precision is the number of retrieved relevant documents divided by the total number of retrieved documents. Several studies (for example, van Rijsbergen 1979; Shaw et al. 1997) tested the retrieval effectiveness of cluster-based IR strategies, yielding mixed results. In this thesis, we investigate whether or not the cluster-based IR model can enhance retrieval effectiveness over the traditional vector space model.

2.2.4 The Manual Clusters Selection Effectiveness

Initially, we define manual cluster selection as the process of selecting the cluster by a person.
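The cosine measure defined in section 2.2.2 is one formula instantiated three times; a minimal sketch over sparse term-frequency vectors (the function name is our own):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two sparse term-frequency vectors
    (dicts: term -> frequency). For non-negative frequencies the value
    lies in [0, 1]: 0 means no shared terms, 1 means identical direction."""
    num = sum(w * v.get(t, 0.0) for t, w in u.items())
    den = sqrt(sum(w * w for w in u.values())) * \
          sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0
```

Sim(Q,D), Sim(Q,P), and Sim(P,D) are all this same function applied to the query, cluster-profile, and document vectors respectively.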
We will term a cluster the most relevant cluster if it contains the highest number of documents relevant[5] to a query.[6] To investigate the potential of the cluster-based IR model, Jardine and van Rijsbergen (1971) performed an empirical test by manually assigning a cluster to be associated with each query. They used the small Cranfield Aeronautics collection (Jardine and van Rijsbergen 1971), containing 200 documents and 42 queries, to perform a manual cluster selection retrieval test; their positive results suggest that clustering has the potential to improve effectiveness if the most relevant clusters are matched with the query. But Willet (1988) argued that the small Cranfield collection and the size of each cluster (in general, only 2-3 documents per cluster) made the sample data unrepresentative, and therefore the potential improvement in retrieval effectiveness of the model is still an open question.

[5] In our experiment, each document is rated as either relevant (1) or non-relevant (0) to each query.
[6] Using a query to select the most relevant cluster can only be done manually, as current automatic schemes are more often than not unable to select the most relevant cluster.

2.2.5 The Flaw in Query-Cluster Association

Jardine and van Rijsbergen (1971) noted that a cluster-based IR model can reach its potential for enhancing retrieval effectiveness only if the cluster most relevant to the query is associated with it. Arazy (2004a) suggested that query-cluster association, the process of associating a small set of clusters with a query, is the weakest aspect of the model because the clusters most relevant to a query are often not chosen by traditional clustering IR techniques. Traditionally, a single cluster is associated with each query. Arazy argued, though, that documents relevant to a query will concentrate in more than one cluster. Furthermore, little attention has been devoted to improving the effectiveness of the query-cluster association.
Indeed, until now there has been no alternative solution for query-cluster association. Therefore, we searched for such an alternative.

2.3 Cluster-based Weighting and Matching

Arazy (2004a) conducted several investigations with the model. By clustering documents into topical organizations, the cohesion of the clusters can be exploited by 1) re-interpreting the content of the documents by re-indexing the tokens, and 2) utilizing additional information for matching. Hence, Arazy suggested two extra principles to strengthen the model: cluster-based weighting and cluster-based matching.

2.3.1 Cluster-based Weighting

In traditional IR searches, weighting schemes such as TF-IDF (introduced in section 2.1.1) index the tokens of a document against the entire document collection, a process referred to as global TF-IDF weighting. Unlike the traditional IR model, clustering searches do not associate a query with the entire document collection, referred to as the entire corpus. Rather, a clustering search only matches a query against documents inside one or a small number of clusters. Therefore, the weight of each token should be adjusted to reflect the change in the "document corpus" and should only account for documents contained in the selected clusters associated with a query. Examining the formula for this weighting scheme, recall that the weight of token i in document j = (f_ij / max(f_j)) x log(D* / n_i), where D* is the total number of documents in the entire collection and n_i is the total number of documents in the entire collection containing token i. Under cluster-based weighting, D* should be the total number of documents in the selected clusters, while n_i is the total number of documents in the selected clusters containing token i.

A simple realization of the principle had been tested (Arazy 2003) by calculating TF-IDF weights on a per-cluster basis, referred to as one-cluster TF-IDF. For one-cluster TF-IDF, we define document j as contained in cluster C, and the weight of token i in document j = (f_ij / max(f_j)) x log(D* / n_i), where D* is the total number of documents in cluster C and n_i is the total number of documents in cluster C containing token i. However, one-cluster TF-IDF reflects the weight of each token only if a single cluster is associated with a query. Following Arazy's suggestion concerning the possibility of associating more than one cluster with a query, a more complete realization of cluster-based weighting should be implemented. In this more complete realization, referred to as cluster-based TF-IDF re-weighting, the weight of a token is factored against all the documents in the selected clusters. Recall the TF-IDF indexing function, where D* is the total number of documents in the cluster a document belongs to. For cluster-based TF-IDF, D* is equal to the total number of documents in the clusters associated with the query.

Simple realization (tested by Arazy): one-cluster TF-IDF
Complete realization (proposed here): cluster-based TF-IDF re-weighting

Table 2.1. Summary of the simple realization (tested by Arazy) for cluster-based weighting and our proposed complete realization.

2.3.2 Cluster-based Matching

In the basic cluster-based IR model, a query is first matched against the profile of each cluster (see Figure 2.3) and then against the documents inside the selected cluster(s) (see Figure 2.4). We define the relevance of a document to a query by R(Q,D) = Sim(Q,D), the similarity of the document and the query. There also exist two additional similarities: Sim(P,D), the similarity of the profile of a cluster and a document inside the cluster, and Sim(Q,P), the similarity of the profile of a cluster and the query. The basic cluster-based IR model uses Sim(Q,P) for query-cluster association and Sim(Q,D) for query-document matching.
But the cluster-based matching principle (Arazy 2004a) states that the relevance of a document to a query should be assessed by a function f of all three similarities: R(Q,D) = f(Sim(Q,P), Sim(Q,D), Sim(P,D)). Since the clustering model provides two additional similarities, Sim(Q,P) and Sim(P,D), carrying information about the organization of the documents, an optimal function f should utilize the combination of the three similarities. Until now, defining the function f has remained an open research topic. Using this principle, Arazy investigated a simple realization of cluster-based matching (Arazy 2004b), utilizing Sim(Q,P), Sim(Q,D) and Sim(P,D) with moderate success to calculate R(Q,D) (e.g. R(Q,D) = Sim(Q,P) + Sim(Q,D)).

Simple realization (tested by Arazy):
  f2 = Sim(Q,D) + Sim(Q,P)
  f3 = Sim(Q,D) + Sim(P,D)
  f4 = Sim(Q,D) + Sim(Q,P) + Sim(P,D)
Complete realization:
  f5 = proposed rule-based matching

Table 2.2. Summary of the simple realization (tested by Arazy) for cluster-based matching and our proposed complete realization.

Chapter 3: Research Questions

Our research questions focus on two aspects of the cluster-based IR model: 1) the performance of the basic model, and 2) a more complete realization of the extended model. We establish the following research questions for this thesis.

For the basic cluster-based IR model:
• How effective is the traditional IR model? The effectiveness of the model serves as the baseline of our study.
• Assuming that the most relevant cluster(s) can be associated with a query, what is the potential of the model for enhancing effectiveness?
• If associating the most relevant clusters shows the potential for enhancing effectiveness, and if query-cluster association is the weakest point of the model, how can we automatically select the most relevant clusters? If the selection process for the most relevant clusters exists only in principle but not in practice, how can we improve query-cluster association?

For the extended cluster-based IR model:
• How much improvement can cluster-based TF-IDF provide over the traditional cluster-based IR model weighting scheme?
• How much can the cluster-based matching scheme proposed in this thesis improve retrieval effectiveness over the current model? The simple realization of the principle only employs simple linear combinations of three measurements:
• How much improvement can cluster-based TF-IDF provide over the traditional cluster-based IR model weighting scheme? • How much can the cluster-based matching proposed in this thesis scheme improve retrieval effectiveness over the current model? The simple realization of the principle only employs simple linear combinations of three measurements: 20 Sim(Q,P), Sim(Q.D),and Sim(P,D). Our study, in contrast, explores a new scheme for the principle. Based on cluster-based weighting and matching, can a more complete realization of the two principles obtain performance gains beyond the simplified realization suggested by Arazy? 21 Chapter 4 : Experimental Setup In order to achieve the objectives of this thesis, our work borrows from Arazy's experiment by using the same data set, as well as the same clustering algorithm and implementation to ensure consistency with our baseline. 4.1 The Data Set We conducted our experiments by using the Test Retrieval Conference (TREC) database (disks 4 and 5). The database contains approximately 528,000 documents, 100 queries, and a table of query-document relevance judgments, created manually by subject experts, for all documents for each query. We chose the TREC database because it is a standard collection set for IR research and its contents represent a wide range of topics. 4.2 The Clustering Algorithm and Implementation 4.2.1 Tokenizing and Indexing the Terms We used the traditional vector space model as the baseline for our experiments, while words inside the documents are tokenized and indexed with the techniques mentioned in section 2.1.1. First, we removed the high frequency terms with SMART'S common words list (Saltan 1971). Next, we stemmed the words by removing word prefixes and suffixes with Porter's algorithm (Porter 1980). Then we removed the low frequency terms, and finally, we applied TF-IDF token weighting. 
In order to speed up the calculation for cluster similarity, only the top terms were included for each document indexed in our collection. The effect of employing only the top terms is small because those terms are the most commonly searched terms.

4.2.2 Clustering Algorithm

In chapter two, we stated that a cluster search strategy involves indexing documents, classifying them into clusters, building a profile for each cluster, matching a query to cluster profiles, and finally matching the same query to documents inside the selected cluster(s). We classified the documents into clusters using the k-means clustering algorithm (MacQueen 1967). The k-means clustering algorithm is a simple algorithm that classifies documents based on their similarities: a set of similar documents is organized into one cluster. A similarity is a measure for evaluating the resemblance between documents, here based on the squared Euclidean distance formula, defined as follows:

d_ij² = Σ_{n=1}^{k} (t_in − t_jn)²

where, given k terms in the collection set, t_in is the relative frequency of term n in document i; if a document does not contain term n, then t_in is equal to 0.

In addition, we applied the following conditions when clustering the documents:
• The experimental set contained exactly 100 clusters.
• All clusters were exclusive; that is, every document belonged to only one cluster.
• Each cluster contained at least 3,000 documents and at most 15,000 documents. Clusters with fewer than the minimum number were terminated and those with more than the maximum number were broken down into smaller clusters.

The advantage of employing the k-means clustering algorithm is its ease of use. Note, though, that cluster-based IR performance does not depend on the type of clustering algorithm (Jardine and van Rijsbergen 1971). Thus, our assumption is that the investigations and validations of this thesis are compatible with other clustering algorithms.
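The k-means procedure with the squared Euclidean distance can be sketched as below. This is a minimal sketch on toy two-term vectors: the thesis' cluster-size bounds (3,000-15,000 documents) are omitted, and the initial centroids are supplied by the caller rather than chosen randomly.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two term-frequency vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(vectors, init_centroids, iters=20):
    """Minimal k-means: alternate assignment and centroid-update steps."""
    centroids = [list(c) for c in init_centroids]
    k = len(centroids)
    assign = []
    for _ in range(iters):
        # Assignment step: each vector joins its nearest centroid.
        assign = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assign, centroids

# Two obvious groups in a two-term frequency space.
vecs = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
labels, cents = k_means(vecs, init_centroids=[vecs[0], vecs[-1]])
```

On this toy data the algorithm converges immediately: the first two vectors form one cluster and the last two the other.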
Our focus in this thesis is to examine the effects of various schemes on retrieval effectiveness. We therefore restrict our study to one clustering organization by clustering the collection into 100 subsets. To the best of our knowledge, there are no sensitivity analyses regarding how the collection should be clustered. Thus, we have chosen to set the collection into 100 clusters for simplicity of implementation.

Every cluster had a profile, created by calculating its centroid, an average of the index weights of all the documents contained in the cluster. Profiles were used to measure the similarities between the cluster and queries, and the similarities between the cluster and its documents.

4.2.3 Traditional Query-Cluster Association

The next step is to associate queries and clusters. A traditional query-cluster association is done in the following steps:
• Given a query Q, calculate the similarity between query Q and each cluster profile P, measured by Sim(Q,P).
• Associate the most similar cluster(s) with query Q by selecting the cluster(s) with the highest Sim(Q,P) value.

4.2.4 Traditional Query-Document Association

The final step is to associate queries with appropriate documents. A traditional query-document association is done in the following steps:
• Given a query Q, calculate the similarity between query Q and each document D, measured by Sim(Q,D).
• Rank the documents in the selected clusters in descending order of the Sim(Q,D) value.

4.3 Experimental Measures

4.3.1 Experimental Baseline

Due to its simplicity for calculating token weights, we implemented a traditional IR model, the vector space model, introduced by Salton and his associates (Salton et al. 1975), as this thesis' baseline. As a baseline, we performed the query-cluster and query-document associations with 1, 5, 10, 20, 30 and 100 clusters out of 100 clusters from the collection in Chapter 5 to evaluate the cluster-based IR model.
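The cluster profiles of section 4.2.2 and the query-cluster association of section 4.2.3 can be sketched as follows. This is a minimal illustration with hypothetical three-term weight vectors and invented cluster names, not the thesis implementation; cosine similarity stands in for Sim(Q,P).

```python
import math

def centroid(vectors):
    """Cluster profile: the average of the index weights of its documents."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def associate_clusters(query, profiles, top_n=1):
    """Traditional query-cluster association: keep the top_n clusters by Sim(Q,P)."""
    return sorted(profiles, key=lambda c: cosine(query, profiles[c]),
                  reverse=True)[:top_n]

# Hypothetical three-term index weights for the documents of two clusters.
clusters = {
    "sports":  [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "finance": [[0.0, 0.1, 0.9], [0.1, 0.0, 0.8]],
}
profiles = {name: centroid(docs) for name, docs in clusters.items()}
query = [1.0, 0.0, 0.1]   # a query dominated by the first term
selected = associate_clusters(query, profiles, top_n=1)
```

Here the query's weight lies almost entirely on the first term, so the profile averaged over the "sports" documents is the closest and that cluster is selected.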
Based on the results (discussed below in Chapter 5), only 1 and 5 clusters were chosen for Chapters 6 and 7. Tokens in each document and query were indexed by TF-IDF weight calculated against the collection vector space. For each query, only the top 1000 documents were retrieved for evaluation.

4.3.2 Retrieval Effectiveness

Retrieval effectiveness has been evaluated in terms of both precision and recall. Recall is the percentage of relevant documents retrieved out of the total number of relevant documents in the collection, while precision is the percentage of relevant documents retrieved out of the total number of documents (relevant and irrelevant) retrieved. In this thesis, we used Recall[1000] (recall for the 1000 top-ranked documents), and Precision[10], Precision[20] and Precision[30] (precision for the top 10, 20, and 30 documents respectively). The value of each measurement lies in the range of 0 to 1 inclusive, where 0 indicates that none of the relevant documents are retrieved and 1 indicates perfect performance (for precision, all of the top n documents are relevant to the query). Statistical significance testing was employed using a one-tailed t-test to study the effects of our results by comparing the average of the baseline, μ0, with the average of each of the other models we tested, μ1, and defining the null hypothesis as μ1 ≤ μ0.

Chapter 5 : Investigating the Potential of the Clustering IR Model

5.1 Establishing the Baseline - The Traditional Clustering IR Model

5.1.1 Experiment Description

We performed Experiment 1 using the traditional query-cluster association model as the baseline for the study.
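The precision and recall measures of section 4.3.2 can be sketched directly. This is a toy illustration with hypothetical document IDs, not the thesis evaluation code.

```python
def precision_at(ranked, relevant, n):
    """Precision[n]: fraction of the top-n retrieved documents that are relevant."""
    return sum(1 for d in ranked[:n] if d in relevant) / n

def recall_at(ranked, relevant, n):
    """Recall[n]: fraction of all relevant documents found in the top-n retrieved."""
    return sum(1 for d in ranked[:n] if d in relevant) / len(relevant)

# Toy run: six retrieved documents, four relevant documents in the collection.
ranked = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = {"d3", "d1", "d4", "d8"}
p3 = precision_at(ranked, relevant, 3)   # 2 of the top 3 are relevant
r6 = recall_at(ranked, relevant, 6)      # 3 of the 4 relevant are found
```

Note that precision is normalized by the cutoff n while recall is normalized by the size of the relevance-judgment set, which is why the two can move in opposite directions for the same ranked list.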
In this experiment, we implemented the traditional cluster-based IR model and the vector space model, associating each query with all the documents in the entire collection, in order to establish the baseline, in the following steps:
• By automatically associating clusters to a query, based on query/cluster-profile similarity (using the cosine similarity measure), with each query associated with an exact number of clusters
• By automatically matching the query to documents in associated clusters (again, using the cosine similarity measure) to produce a ranked list of relevant documents
• By measuring retrieval effectiveness through employing both recall and precision measures

5.1.2 Results and Analysis

We established that our baseline consisted of the cluster-based IR (1, 5, 10, 20, and 30 clusters) and the vector space models (100 clusters, or the entire collection). The data is presented in Figure 5.1 below.

Figure 5.1. Precision[10] for the cluster-based IR and vector space model (x-axis: number of clusters per query; y-axis: Precision[10]).

Clusters/query   1/100   5/100   10/100  20/100  30/100  100/100
Precision[10]    0.171   0.255   0.269   0.278   0.290   0.289
Precision[20]    0.124   0.195   0.213   0.227   0.232   0.240
Precision[30]    0.099   0.164   0.188   0.196   0.201   0.211
Recall[1000]     0.166   0.301   0.380   0.435   0.453   0.511

Table 5.1. Performance summary of the cluster-based IR vs. vector space model.

The vector space model, in general, has the best retrieval performance in the experiment, while retrieval performance declines when fewer clusters are associated with queries.
Referring to Table 5.1, we suggest that the traditional cluster-based IR model fails to match the effectiveness of the traditional vector space model. Instead, associating fewer clusters generated retrieval precision losses, as shown in Table 5.2.

Clusters per query      1/100   5/100   10/100  20/100  30/100
Precision[10] change    -41%    -12%    -7%     -4%     0%
Precision[20] change    -48%    -19%    -11%    -5%     -3%
Precision[30] change    -53%    -22%    -11%    -7%     -5%
Recall[1000] change     -68%    -41%    -26%    -15%    -11%

Table 5.2. Precision losses for cluster-based IR compared to the vector space model.

When only one cluster is associated with the queries, precision is 41%-53% lower than for the vector space model. Thus, the traditional approach of associating one cluster with each query yields low retrieval performance and is unable to validate the cluster hypothesis. Instead, a greater number (but still a small set) of clusters should be associated with each query in order to minimize precision losses.

5.2 Manual Selection of the Most Relevant Clusters

5.2.1 Experiment Description

In response to Willett's argument, we designed an empirical study to test the potential of the model when manually selecting clusters from a large-scale collection, as mentioned in Chapter 4. Rather than matching just one cluster to each query, we tried to match queries with 1, 5, 10, 20 and 30 clusters out of the total of 100 clusters. The purpose of this experiment was to show that, if closely related documents indeed tended to group into one or a few clusters, then by manually selecting the clusters most relevant to each query, the performance of cluster-based IR should be comparable with that of the traditional IR model.
We investigated optimal performance levels for the cluster-based IR model by
• Manually selecting the clusters that contained the highest number of documents relevant to each query
• Automatically matching the query to documents from the selected clusters and producing a ranked list of relevant documents
• Measuring retrieval effectiveness by employing both recall and precision measures

5.2.2 Results and Analysis

We first compare the distribution of relevant documents in the manual cluster selection with traditional query-cluster association. The following table shows the comparison, which was previously done by Arazy (2004a).

Clusters per query                                        1/100   5/100   10/100  20/100  30/100  100/100
% of total relevant documents, manual cluster selection   37.1%   74.3%   88.1%   97.1%   99.3%   100%
% of total relevant documents, traditional association    19.8%   34.5%   43.9%   55.9%   65.3%   100%

Table 5.3. Distribution of relevant documents in manual cluster selection vs. traditional query-cluster association.

Table 5.3 shows that, when using traditional query-cluster association, selecting a small set of clusters (1, 5, and 10) captures only about half of the relevant documents captured by manual cluster selection, indicating that the current query-cluster association technique fails to capture the appropriate clusters.

In this experiment, we also revisited Jardine and van Rijsbergen's empirical study (1971) of manual cluster selection performance, comparing the performance of manual cluster selection (1, 5, 10, 20 and 30 clusters) against the baseline vector space model (100 clusters, or the entire collection). The results are shown in the following figures and tables.

Figure 5.2. Precision[10] for manual cluster selection vs.
vector space model (100 clusters per query).

Figure 5.3. Precision[10] for manual cluster selection (optimal) versus traditional clustering IR (x-axis: number of clusters per query).

Clusters per query              1/100   5/100   10/100  20/100  30/100  100/100
% of the collection per query   1.1%    6.0%    11.8%   22.3%   31.3%   100%
Precision[10]                   0.351   0.363   0.334   0.325   0.317   0.289
Precision[20]                   0.274   0.301   0.286   0.281   0.275   0.240
Precision[30]                   0.232   0.267   0.256   0.245   0.240   0.211
Recall[1000]                    0.314   0.538   0.591   0.602   0.590   0.511

Table 5.4. Performance summary of manual cluster selection vs. vector space model.

Clusters per query      1/100   5/100   10/100  20/100  30/100
Precision[10] change    21%     26%     16%     12%     10%
Precision[20] change    14%     25%     19%     17%     15%
Precision[30] change    10%     27%     21%     16%     14%
Recall[1000] change     -39%    5%      16%     18%     15%

Table 5.5. Precision and recall changes for manual cluster selection vs. vector space model.

The results from Figure 5.2 and Tables 5.4 and 5.5 show that manual cluster selection has enhanced effectiveness over the vector space model in terms of precision. Unlike the traditional cluster-based IR model, manual cluster selection with one cluster or a small set of clusters demonstrates the potential to improve retrieval performance. When only 1 of the 100 clusters was associated with queries, precision improved 10%-21%, despite a 39% loss of recall. The shaded cells in Table 5.4 show the best precision and recall performances. Associating 5 clusters with queries yielded the best precision improvement - 25%-27% over the baseline vector space model - while associating 20 clusters with the queries yielded the best recall improvement, 18% over the baseline vector space model.
The results indicate that, based on the precision results, the cluster-based IR model has the potential to improve retrieval performance, although, based on the recall results, the cluster hypothesis was only partially valid in that relevant documents did not concentrate in one cluster but rather in a few small sets of clusters. We validated the results of this experiment by conducting one-tailed t-tests for statistical significance. We defined μ0 as the average of the measurements for the vector space model and μ1 as the average of the measurements for manual cluster selection, such that the null hypothesis was μ1 ≤ μ0. Rejecting the null hypothesis showed the statistical significance of the superior performance of manual cluster selection over the vector space model. When associating 1 cluster with the queries, the improvement in Precision[10] over the traditional vector space model was statistically significant, with P < 0.05. When associating 5 clusters with the queries, the improvements were statistically significant for Precision[10], Precision[20] and Precision[30], with P < 0.02. Meanwhile, the recall improvement was statistically significant when associating 20 clusters with the query, with P < 0.005.

Experiment 1 showed that associating few clusters led to performance loss in both recall and precision. When using the cosine similarity measure to associate 1 cluster with the queries, a traditional realization of the cluster-based IR model, our experiment showed significant effectiveness loss, roughly 50% precision loss and 68% recall loss, which supported early challenges to the model (Willett 1988; Shaw et al. 1995; Singhal and Pereira 1999; Hearst and Pedersen 1996). Experiment 2 showed that the model can potentially improve effectiveness if manually selected clusters (the ones containing the most documents relevant to the queries) are associated with the queries.
Based on our collection set, matching queries with documents contained in 1 or 5 manually selected clusters could significantly enhance precision performance over the traditional vector space model. Therefore, document distribution is not the major problem for the cluster-based IR model. Instead, the traditional query-cluster association approach of the model - associating clusters to a query by the highest Sim(Q,P) - fails to capture the right clusters, resulting in a decline of performance for recall and precision.

Chapter 6 : Improving Query-Cluster Association

In this chapter, we propose a scheme, referred to as document-based cluster ranking, to improve the query-cluster association. We first briefly describe its methodology, then provide a rationale for suggesting the scheme.

6.1 Scheme Description

Recall that every document is contained in a cluster. The scheme first retrieves the top 1000 documents from the vector space model for each query.(7) Documents are retrieved using the traditional IR approach by matching to the query the documents with the highest Sim(Q,D). The scheme then ranks all clusters in the collection by the number of these top 1000 documents each contains, and the top-ranked clusters are associated with the query. For example, after initially matching a set of 1000 documents to a query, if cluster x contains the most documents from the set, then this cluster is ranked first. Document-based cluster ranking utilizes the cluster hypothesis: that most of the documents relevant to a query are concentrated in a small number of clusters. The initial query-document matching and cluster ranking are pre-calculation processes if the queries are known in advance.

(7) In our experiment, we attempted ranking by the top 300, 500, and 1000 documents. We found that ranking with the top 1000 documents yielded the best retrieval effectiveness. Therefore, we used the top 1000 documents to investigate the document-based cluster ranking scheme and its approaches.
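The scheme just described reduces to a simple counting step once an initial retrieved set and a document-to-cluster map are available. A minimal sketch with hypothetical documents and cluster names (the retrieved list stands in for the top 1000 documents of the vector space model):

```python
from collections import Counter

def rank_clusters(retrieved, doc_cluster, top_n=1):
    """Non-scoring document-based cluster ranking: order clusters by how
    many of the initially retrieved documents they contain, keep top_n."""
    counts = Counter(doc_cluster[d] for d in retrieved)
    return [c for c, _ in counts.most_common(top_n)]

# Hypothetical cluster membership for a small retrieved set.
doc_cluster = {"d1": "A", "d2": "A", "d3": "B", "d4": "A", "d5": "C"}
retrieved = ["d1", "d2", "d3", "d4", "d5"]
associated = rank_clusters(retrieved, doc_cluster, top_n=1)
```

Cluster A holds three of the five retrieved documents, so it is ranked first and becomes the cluster associated with the query.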
Assuming the traditional vector space model can retrieve more relevant documents than the cluster-based IR model, the scheme approximates the appropriate clusters for each query based on the documents it has retrieved. Hence, a smaller set of clusters is selected when the next search is performed.

6.2 Results and Analysis

For each query, document-based cluster ranking, using a small set of ranked clusters (1 and 5 clusters), provides some enhancement in recall and precision. Table 6.1 summarizes the scheme's recall and precision.

Clusters/query   1/100   5/100
Precision[10]    0.177   0.250
Precision[20]    0.132   0.199
Precision[30]    0.108   0.174
Recall[1000]     0.169   0.339

Table 6.1. Document-based cluster ranking precision and recall summary.

When, using the scheme, only 1 cluster is associated with the query, precision improves 3.5%-8.8% over traditional 1-cluster query-cluster association. On the other hand, when 5 clusters are associated with the query, a precision loss occurs at Precision[10] (-2%) with precision gains at Precision[20] and Precision[30] (1.8% and 6.3% respectively) over traditional 5-cluster query-cluster association. In addition, associating 5 clusters with the query using the scheme yields a 12.8% gain in recall over traditional 5-cluster query-cluster association.

Clusters/query             1/100   5/100
Precision[10] % change     3.5%    -2.0%
Precision[20] % change     6.0%    1.8%
Precision[30] % change     8.8%    6.3%
Recall[1000] % change      1.9%    12.8%

Table 6.2. Precision and recall gains/losses compared to traditional query-cluster association with the same number of clusters selected.

Although this scheme showed some improvement in retrieval effectiveness - precision gains with 1-cluster query-cluster association and recall gains with 5-cluster query-cluster association - over the traditional model, the results were not statistically significant.(8)
6.3 Alternative Approaches for Document-based Cluster Ranking

We tried alternative approaches for document-based cluster ranking by implementing a scoring system, referred to as the scoring approach. The original approach, referred to as the non-scoring approach (Section 6.1), does not capture information about the rank of the top 1000 retrieved documents. Scoring approaches assign a higher score to a document that has a higher ranking in the retrieved set based on Sim(Q,D). For example, suppose cluster A contained the top 10 documents of the retrieved set while cluster B contained the bottom 10 documents of this set. The 10 retrieved documents contained in cluster A would receive higher scores than the 10 documents contained in cluster B. Therefore, cluster A will rank higher than cluster B, because cluster A has more documents at a higher rank than cluster B.

(8) Precision and recall for both 1 and 5 cluster associations were P > 0.10.

The rationale for the scoring approaches is that retrieved documents that are ranked higher are more likely to be relevant to the query. Therefore, a document with a higher ranking should be given more weight, and a cluster that has higher-ranking documents would be more likely to contain a higher number of documents relevant to a query. The non-scoring approach ranks the clusters by assuming a cluster is more relevant if it contains a higher number of retrieved documents. The scoring approach, on the other hand, can rank a cluster highest even if it does not contain the highest number of retrieved documents. We have explored three different scoring approaches, each of which is discussed below.

6.3.1 Scoring Approach 1 Description

In this approach, clusters are scored based on the sum of the Sim(Q,D) values of all the retrieved documents that belong to the cluster.
The formula for the scoring approach for query Q and cluster C_k is:

SCORE(C_k) = Σ_{j=1}^{n} Sim(Q,D_j)

where n of the retrieved documents (D_1 ... D_n) are contained in cluster C_k. This approach utilizes the similarity measure Sim(Q,D). The rationale for using the measure is that Sim(Q,D) initially provides an estimate of the relevancy of a document to a query. By summing the Sim(Q,D) of all retrieved documents that are contained in a cluster, the sum provides an indication, based on the score, of how relevant a cluster is to a query.

6.3.2 Scoring Approach 2 Description

In this approach, clusters are scored based on the average of the Sim(Q,D) values of all the retrieved documents that belong to the cluster. The formula for the scoring approach for query Q and cluster C_k is:

SCORE(C_k) = (Σ_{j=1}^{n} Sim(Q,D_j)) / n

This approach utilizes the similarity measure Sim(Q,D), similar to Scoring Approach 1. The rationale for using the measure is to try normalizing Sim(Q,D) by averaging the Sim(Q,D) of all retrieved documents that are contained in a cluster.

6.3.3 Scoring Approach 3 Description

In this approach, clusters are scored based on the sum of the rank values of the retrieved documents that a cluster contains. The formula for the scoring approach is:

SCORE(C_k) = Σ_{j=1}^{n} (|Retrieved Set| − rank(D_j))

where n documents (D_1 ... D_n) are contained in cluster C_k, rank(D_j) is the rank of document D_j in the retrieved set (if it is not in the set, rank(D_j) = 0), |Retrieved Set| is the number of retrieved documents - 1000 in our experiment - and SCORE(C_k) is the score for C_k when retrieving documents for a query. |Retrieved Set| − rank(D_j) is a rank value giving fewer points to documents at lower rankings.

6.3.4 Scoring Approaches Results

For each query, Scoring Approach 1 provides some enhancement in recall and precision over traditional query-cluster association. These enhancements are very similar to those of the non-scoring approach.
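The three SCORE functions above can be sketched as follows. This is an illustrative toy example with a four-document retrieved set and hypothetical cluster labels, not the thesis implementation.

```python
def in_cluster(retrieved, doc_cluster, cluster):
    """The (document, Sim(Q,D)) pairs of the retrieved set that belong to one cluster."""
    return [(d, s) for d, s in retrieved if doc_cluster[d] == cluster]

def score1(retrieved, doc_cluster, cluster):
    """Approach 1: sum of Sim(Q,D) over the cluster's retrieved documents."""
    return sum(s for _, s in in_cluster(retrieved, doc_cluster, cluster))

def score2(retrieved, doc_cluster, cluster):
    """Approach 2: average of Sim(Q,D) over the cluster's retrieved documents."""
    members = in_cluster(retrieved, doc_cluster, cluster)
    return sum(s for _, s in members) / len(members) if members else 0.0

def score3(retrieved, doc_cluster, cluster):
    """Approach 3: sum of (|retrieved set| - rank); rank 1 is the best match."""
    n = len(retrieved)
    rank = {d: i + 1 for i, (d, _) in enumerate(retrieved)}
    return sum(n - rank[d] for d, _ in in_cluster(retrieved, doc_cluster, cluster))

# A retrieved list already sorted by Sim(Q,D), with hypothetical clusters.
retrieved = [("d1", 0.9), ("d2", 0.8), ("d3", 0.4), ("d4", 0.3)]
doc_cluster = {"d1": "A", "d2": "B", "d3": "B", "d4": "B"}
```

On this data, Approach 1 prefers cluster B (summed similarity 1.5 versus 0.9) while Approach 2 prefers cluster A (average 0.9 versus 0.5), illustrating how the normalization in Approach 2 can reverse the ranking.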
On the other hand, Scoring Approach 2 shows a significant decrease in precision and recall relative to traditional query-cluster association, while Scoring Approach 3 shows no improvement over traditional query-cluster association. Table 6.3 summarizes the precision and recall of all three approaches.

                 Scoring Approach 1    Scoring Approach 2    Scoring Approach 3
Clusters/query   1/100    5/100        1/100    5/100        1/100    5/100
Precision[10]    0.172    0.270        0.045    0.192        0.171    0.257
Precision[20]    0.128    0.213        0.030    0.151        0.124    0.193
Precision[30]    0.104    0.184        0.025    0.125        0.099    0.162
Recall[1000]     0.170    0.352        0.025    0.136        0.166    0.301

Table 6.3. Precision and recall summary for the scoring approaches.

For Scoring Approach 1, when 1 cluster is associated with the query, precision improves 0.6%-5.1% over traditional 1-cluster query-cluster association. This improvement is less than that of the non-scoring approach, where precision improves 3.5%-8.8% over traditional 1-cluster query-cluster association. However, Scoring Approach 1 improves precision 5.9%-12.0% and recall 16.9% over traditional 5-cluster query-cluster association, compared to the -2.0% to +6.3% precision changes of the non-scoring approach. For Scoring Approach 2, both precision and recall suffer significant losses compared to traditional query-cluster association. Scoring Approach 3 provides neither gains nor losses in precision and recall relative to traditional query-cluster association. Table 6.4 summarizes the gains and losses in precision and recall for the three scoring approaches.

                 Scoring Approach 1    Scoring Approach 2    Scoring Approach 3
Clusters/query   1/100    5/100        1/100    5/100        1/100    5/100
Precision[10]    0.6%     5.9%         -73.7%   -24.7%       0.0%     0.8%
Precision[20]    2.8%     9.2%         -75.8%   -22.6%       0.0%     -1.0%
Precision[30]    5.1%     12.0%        -74.7%   -23.8%       0.0%     -1.2%
Recall[1000]     2.6%     16.9%        -84.9%   -54.8%       0.0%     0.0%

Table 6.4.
Precision and recall gains/losses compared to traditional query-cluster association with the same number of clusters selected, for each scoring approach.

Similar to the non-scoring approach, in our experiment, Scoring Approach 1 showed measurable, but statistically insignificant, improvement over the traditional model. This result indicates that an approach utilizing Sim(Q,D) scores might not be able to outperform the non-scoring approach.

Chapter 7 : Extending the Cluster-based IR Model

Cluster-based indexing and cluster-based matching are the two principles employed in the extended cluster-based IR model. In this chapter, we present new schemes based on the two principles Arazy has used to extend the model. At the conclusion of the chapter, we compare our complete realization with Arazy's simple realization.

7.1 Cluster-based Weighting

We propose two schemes to modify the indexing process of queries and documents: query index re-weighting (re-indexing the tokens of the queries) and document index re-weighting (re-indexing the tokens of the documents). Recall that the weighting function, TF-IDF, for document tokens is (f_j / max(f_j)) × log(D* / n_j), and that we define a document corpus as the set of documents contained in the clusters that are associated with a query. In the simple realization, the TF-IDF weight is calculated on a per-cluster basis (here, the value D* is the total number of documents of the cluster that a document belongs to). For the complete realization, the TF-IDF weighting can be calculated based on one or more clusters (for example, 5 clusters), where D* is the total number of documents in the document corpus space. Thus, tokens in the queries or documents are weighted against the document corpus space. Both query index re-weighting and document index re-weighting employ the complete-realization TF-IDF weighting technique separately.
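The effect of replacing the collection-wide D* with the document corpus space can be seen in a small sketch. The documents and terms here are hypothetical; the point is only how the IDF component shifts when it is computed over the associated cluster(s) instead of the whole collection.

```python
import math

def idf(term, corpus):
    """log(D* / n_j): D* documents in the corpus space, n_j containing the term."""
    D = len(corpus)
    n = sum(1 for doc in corpus if term in doc)
    return math.log(D / n) if n else 0.0

# Hypothetical tokenized documents: the whole collection vs. the corpus
# space of the cluster(s) associated with a query.
collection = [{"stock", "bond"}, {"stock", "gold"}, {"stock"},
              {"cat"}, {"dog"}, {"dog", "cat"}]
corpus_space = collection[:3]           # documents of the associated cluster

global_idf = idf("stock", collection)   # weighted against the collection
local_idf = idf("stock", corpus_space)  # weighted against the corpus space
```

Within the associated cluster, "stock" appears in every document, so its corpus-space IDF drops to zero: a term that helped select the cluster stops dominating the ranking inside it, leaving room for more discriminating terms.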
When studying the effects of query index re-weighting, traditional global TF-IDF weighting is employed for the document index weighting, whereas traditional global TF-IDF weighting is employed for the query index weighting when studying the effects of document index re-weighting.

7.1.1 Query Index Re-weighting Results

Query index re-weighting can significantly enhance effectiveness over traditional query index global TF-IDF weighting when 1 out of the 100 clusters is associated with each query:

Clusters/query   1/100   5/100
Precision[10]    0.227   0.264
Precision[20]    0.167   0.211
Precision[30]    0.139   0.181
Recall[1000]     0.182   0.312

Table 7.1. Query index re-weighting precision and recall summary.

With 1-cluster query-cluster association, precision improves by approximately 33%-40% over traditional query index global TF-IDF weighting, and recall improves by approximately 10%. Meanwhile, 5-cluster query-cluster association shows minor gains in precision (between 4% and 10%) and recall (approximately 4%). When more clusters are associated with the query, gains in precision and recall decline because the document corpus space becomes more similar to the vector space. Therefore, with more clusters associated with the query, the re-weighted query indexes are closer to the traditional query index global TF-IDF weighting index. The results show the scheme is more effective when fewer clusters are associated with queries (see Table 7.2).

Clusters/query   1/100   5/100
Precision[10]    32.7%   3.5%
Precision[20]    34.7%   8.2%
Precision[30]    40.4%   10.4%
Recall[1000]     9.6%    3.7%

Table 7.2. Query index re-weighting precision and recall gains/losses compared to traditional query index weighting with the same number of clusters selected.

One-tailed t-tests show statistical significance for 1-cluster query-cluster association at Precision[10] (P<0.1), Precision[20] (P<0.1), and Precision[30] (P<0.05).
These results indicate that the weighting scheme can enhance effectiveness, especially with fewer clusters. Also, the improvements in precision shown in Table 7.2 indicate that query index re-weighting can reduce the "noise" (non-relevant documents) of retrieval, because the improvements in precision are higher at Precision[20] and Precision[30]. Traditionally, precision declines when more documents are retrieved because of the "noise" of non-relevant documents retrieved at the lower ranks of the retrieved set.

7.1.2 Document Index Re-weighting Results

Similar to query index re-weighting, document index re-weighting can significantly enhance effectiveness over traditional document index global TF-IDF weighting when 1 out of the 100 clusters is associated with each query:

Clusters/query   1/100   5/100
Precision[10]    0.218   0.271
Precision[20]    0.167   0.210
Precision[30]    0.139   0.180
Recall[1000]     0.183   0.314

Table 7.3. Document index re-weighting precision and recall summary.

With 1-cluster query-cluster association, precision improves by approximately 28%-40% over traditional document index global TF-IDF weighting, and recall improves by approximately 10%. Meanwhile, 5-cluster query-cluster association shows minor gains in precision (approximately 6%-10%) and recall (approximately 4%); the decline of precision and recall gains occurred in a pattern similar to query index re-weighting. When more clusters are associated with the query, the adjusted document TF-IDF weighting is closer to traditional document index global TF-IDF weighting, due to the document corpus space being closer to the vector space:

Clusters/query   1/100   5/100
Precision[10]    27.5%   6.3%
Precision[20]    34.7%   7.7%
Precision[30]    40.4%   9.8%
Recall[1000]     10.4%   4.2%

Table 7.4. Document index re-weighting precision and recall gains/losses compared to traditional document index weighting with the same number of clusters selected.
One-tailed t-tests show statistical significance for 1-cluster query-cluster association at Precision[20] (P<0.1) and Precision[30] (P<0.05). Particularly with fewer clusters associated with queries, these results indicate that the weighting scheme can enhance effectiveness.

7.1.3 Simple and Complete Realization Comparison

All realizations show significant precision and recall enhancements with 1-cluster query-cluster association. When 1 cluster is associated with queries, precision and recall for cluster-based and one-cluster document re-weighting are equal, since both schemes calculate TF-IDF document indexes based on a one-cluster document corpus space. Referring to the results for query and document index re-weighting, query index re-weighting outperforms document index re-weighting, with gains of 32.7% and 27.5% respectively at P[10] over traditional global TF-IDF weighting. The precision and recall gains for 5-cluster query-cluster association are smaller for all the realizations. Furthermore, one-cluster document re-weighting underperformed at P[10] and for recall (see Table 7.5 below). On the other hand, the complete realization shows minor improvements in all precision and recall measurements for cluster-based document re-weighting and query re-weighting. The table and figures below illustrate some of the differences in retrieval effectiveness gains or losses between the three schemes for 5-cluster query-cluster associations. Based on the results, the complete realization of cluster-based weighting outperforms both the traditional cluster-based model and the simple realization.

Schemes with 5-cluster query-cluster association              P[10]   P[20]   Recall
Complete Realization: Cluster-based Document Re-weighting     6.3%    7.7%    4.2%
Complete Realization: Query Re-weighting                      3.5%    8.2%    3.7%
Simple Realization: One-cluster Document Re-weighting         -0.4%   1.8%    -0.3%

Table 7.5.
Percentage of P[10], P[20], and Recall gains over the traditional cluster-based IR model for the simple and complete realizations with 5-cluster query-cluster associations.

[Figure 7.1. Comparison of P[10] for traditional global TF-IDF weighting (baseline) and three other cluster-based weighting schemes (cluster-based document re-weighting, query re-weighting, and one-cluster document re-weighting) with a 1-cluster query-cluster association.]

[Figure 7.2. Comparison of P[10] for traditional global TF-IDF weighting (baseline) and the same three cluster-based weighting schemes with 5-cluster query-cluster associations.]

[Figure 7.3. Comparison of P[20] for traditional global TF-IDF weighting (baseline) and the same three cluster-based weighting schemes with 5-cluster query-cluster associations.]

[Figure 7.4. Comparison of Recall for traditional global TF-IDF weighting (baseline) and the same three cluster-based weighting schemes with 5-cluster query-cluster associations.]

7.2 Cluster-based Matching

In the simple realization of the extended cluster-based IR model, query-document relevance matching, R(Q,D), is calculated by a linear function f(Sim(Q,D), Sim(Q,P)). To supplement a more complete realization of the model, in this section we explore new approaches and present our best scheme for the realization.
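Before the rule-based schemes below, it helps to pin down the similarity measures they manipulate. The thesis does not restate its similarity function in this section; the sketch below assumes the standard cosine measure of the vector space model over sparse term-weight vectors (the dict representation and toy vectors are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts
    mapping term -> weight). Returns a value in [0, 1] for non-negative
    weights; 0.0 if either vector is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

query   = {"cluster": 1.0, "retrieval": 0.8}
doc     = {"cluster": 0.5, "retrieval": 0.4, "model": 0.9}
profile = {"cluster": 0.7, "model": 0.6}

sim_qd = cosine(query, doc)      # Sim(Q,D): query-document similarity
sim_qp = cosine(query, profile)  # Sim(Q,P): query-cluster-profile similarity
sim_pd = cosine(profile, doc)    # Sim(P,D): profile-document similarity
```

All three measures used by the matching schemes (Sim(Q,D), Sim(Q,P), Sim(P,D)) are instances of the same vector comparison; only the pair of vectors changes.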
7.2.1 A Rule-based Matching Concept

Cluster-based matching can be done by comparing documents and ranking them based on some conditional rules, referred to as rule-based matching. Unlike the simple realization approach, where R(Q,D) is calculated by a linear function, rule-based matching utilizes the three similarity measures (Sim(Q,D), Sim(Q,P), and Sim(P,D)) and sets different cases based on the measures. The results of our experiment indicate that utilizing both Sim(Q,D) and Sim(Q,P) shows performance gains, while Sim(P,D) fails to convey any useful information.

7.2.2 A Rule-based Matching Scheme

Consider the additional information created through the topical organization of the document collection, Sim(Q,P). We initially implemented the traditional clustering IR model to retrieve the top 1000 documents, compare them, and rank them based on a set of conditions. Using a standard sorting approach in which documents are examined in pairs, the scheme applies the following rules:

Step 1: Check to see if the two documents (D1 and D2) belong to the same cluster. If they belong to the same cluster, compare the distance between each of the two documents and the query, Sim(Q,D); increase the rank of the document with the higher Sim(Q,D), the document that is closer to the query; and terminate the comparison. If the two documents belong to distinct clusters, proceed to the next rule.

Step 2: Compare the distance between each of the two cluster profiles (P1 and P2) and the query, Sim(Q,P). If the two cluster profiles have similar distances to the query (Sim(Q,P1) ~ Sim(Q,P2)), increase the rank of the document with the higher Sim(Q,D) and quit the comparison. If the two cluster profiles do not have similar distances to the query (Sim(Q,P1) ≁ Sim(Q,P2)), proceed to the next rule.

Step 3: Calculate the "scores" of D1 and D2, referred to as SCORE(D1) and SCORE(D2), and increase the rank of the document with the higher score.
In our scheme, we define SCORE(D) = a * Sim(Q,P) + (1-a) * Sim(Q,D), where a is a constant decimal value between 0 and 1.

The rationale for this scheme is to utilize the measure Sim(Q,P), so that if the two documents being ranked have distinct query-cluster distances, the scheme puts more weight on the document whose cluster is closer to the query. Also, a controls the relative weight of Sim(Q,P) and Sim(Q,D) in calculating the score. This rule assumes that if a cluster is close to a query, the documents inside the cluster have higher relevancy to the query. In re-ranking the top 1000 documents, the goal of the scheme is to move the documents more relevant to a query to higher rankings and hence improve precision.

We explored various combinations of parameters for the scheme. First, we set the criterion Sim(Q,P1) ~ Sim(Q,P2) to true if the value difference between Sim(Q,P1) and Sim(Q,P2) is within 5%, 10%, or 20%. Then, the weight a of the score may be defined as 0.1, 0.2, 0.5, or 0.8. Our results in the next section are based on the best combination of parameters in our experiment, where a = 0.2 and the Sim(Q,P1) ~ Sim(Q,P2) tolerance is 10%.

7.2.3 Results

The results presented in the two tables below show that our realization is unable to improve retrieval effectiveness over the traditional clustering-IR model: with both 1 and 5 query-cluster associations, effectiveness gains and losses are within 1%.

Clusters/query    1/100    5/100
Precision[10]     0.171    0.257
Precision[20]     0.124    0.193
Precision[30]     0.099    0.162
Recall[1000]      0.166    0.301

Table 7.6. Summary of precision and recall for rule-based matching.

Clusters/query    1/100    5/100
Precision[10]      0.0%     0.8%
Precision[20]     -0.4%    -1.0%
Precision[30]     -0.3%    -1.0%
Recall[1000]       0.2%     0.1%

Table 7.7. Rule-based matching precision and recall gains/losses compared to traditional document index weighting with the same number of clusters selected.
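The three-step scheme of Section 7.2.2 can be sketched as a pairwise comparator. This is an illustrative reconstruction rather than the thesis's implementation: the document and profile representations and the toy similarity values are assumptions, while a = 0.2 and the 10% tolerance follow the best parameters reported above (interpreting "10% close" as a relative difference):

```python
from functools import cmp_to_key

ALPHA = 0.2   # weight a of Sim(Q,P) in SCORE (best value found above)
TOL = 0.10    # Sim(Q,P1) ~ Sim(Q,P2) when within 10% of each other

def score(doc, sim_qp):
    # Step 3: SCORE(D) = a * Sim(Q,P) + (1 - a) * Sim(Q,D)
    return ALPHA * sim_qp[doc["cluster"]] + (1 - ALPHA) * doc["sim_qd"]

def compare(d1, d2, sim_qp):
    """Pairwise rule: positive if d1 ranks above d2, negative otherwise."""
    # Step 1: same cluster -- the document closer to the query wins.
    if d1["cluster"] == d2["cluster"]:
        return d1["sim_qd"] - d2["sim_qd"]
    p1, p2 = sim_qp[d1["cluster"]], sim_qp[d2["cluster"]]
    # Step 2: profiles roughly equidistant from the query -- Sim(Q,D) wins.
    if abs(p1 - p2) <= TOL * max(p1, p2):
        return d1["sim_qd"] - d2["sim_qd"]
    # Step 3: otherwise the blended score decides.
    return score(d1, sim_qp) - score(d2, sim_qp)

def rank(docs, sim_qp):
    """Order the retrieved documents, best first."""
    key = cmp_to_key(lambda a, b: compare(a, b, sim_qp))
    return sorted(docs, key=key, reverse=True)

sim_qp = {"A": 0.9, "B": 0.3}          # Sim(Q,P) per cluster profile
docs = [{"id": 1, "cluster": "B", "sim_qd": 0.6},
        {"id": 2, "cluster": "A", "sim_qd": 0.5}]
ranked = rank(docs, sim_qp)
```

In this toy case the clusters' distances to the query differ by more than the tolerance, so Step 3 applies: document 2 wins (blended score 0.58 vs. 0.54) even though document 1 is closer to the query, which is exactly the weight-toward-nearby-clusters behavior the rationale above describes.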
The simple realization of cluster-based matching (Arazy 2004b) was also unable to yield effectiveness gains over the traditional matching scheme; there, the simple realization of cluster-based matching was employed with traditional global TF-IDF weighting to index tokens. (Arazy (2004a) also explored combining the simple realizations of both the cluster-based matching and weighting principles; the best result in that experiment was at a 20-cluster query-cluster association, where P[20] was 25% higher than for the traditional cluster-based IR model.) An explanation for our complete realization failing to yield effectiveness gains over the traditional matching scheme could be the complexity of the rule-based matching scheme. Our experiment was unable to cover all the possible cases for the scheme because the scheme contains various case scenarios and conditions. An optimal configuration of the scheme might still exist that enhances retrieval effectiveness.

7.2.4 Another Proposed Cluster-based Matching Approach

In our experiment, we also explored another rule-based cluster-based matching approach, applied to various sets of cases. This approach is described below.

The Alternative Approach

Check to see if the two documents (D1 and D2) belong to the same cluster. If the two documents belong to the same cluster, go to Rule 1; otherwise, go to Rule 2.

Rule 1: If D1 and D2 belong to the same cluster

Step 1: Compare the distance between each of the two documents and the query, Sim(Q,D). If the two documents do not have similar distances to the query (Sim(Q,D1) ≁ Sim(Q,D2)), increase the rank of the document with the higher Sim(Q,D) and quit the comparison; otherwise, go to Step 2.

Step 2: Increase the rank of the document with the shorter distance between it and the cluster profile, Sim(P,D).

Rule 2: If D1 and D2 do not belong to the same cluster

Step 1: Compare the distance between each of the two cluster profiles (P1 and P2) and the query, Sim(Q,P).
Use the percentage of distance between the two cluster profiles as a weight to calculate the score. Define the percentage of distance between the two clusters as a, and SCORE(D) = a * Sim(Q,P) + (1-a) * Sim(Q,D). Then, increase the rank of the document with the higher score.

The rationale for Rule 1, applied when two documents belong to the same cluster, is that if the distances of each of the two documents to the query are roughly similar, the rule uses the distance between each of the two documents and the cluster profile, Sim(P,D), as the tie-breaker. The tie-breaker assumes the document that is closer to the profile is more relevant to the query, because this document better represents the profile of the cluster. If the query-cluster association selected the most appropriate cluster, the document closest to the cluster profile should have the closest distance to the query among all the documents in the cluster.

The rationale for Rule 2, applied when two documents do not belong to the same cluster, is that if a cluster is close to a query, the documents inside the cluster have higher relevancy to the query. Therefore, the greater the difference between the two clusters' distances to a query, the more weight is placed on Sim(Q,P) in the scoring formula.

We explored the approach with various combinations for the criterion Sim(Q,P1) ~ Sim(Q,P2). We set the criterion Sim(Q,P1) ~ Sim(Q,P2) to be true if the value difference between Sim(Q,P1) and Sim(Q,P2) is within 5%, 10%, or 20%. Also, the weight a of the score may be defined as 0.1, 0.2, 0.5, or 0.8.

Chapter 8: Discussion and Future Research

8.1 Discussion

In this thesis we have tried to resolve the controversy surrounding the potential improvement of the cluster-based IR model's effectiveness by conducting empirical investigations and exploring possible alternatives for the model. First, we manually associated clusters with queries, a selection method known as manual cluster selection.
We discovered that the model can potentially improve retrieval effectiveness when the appropriate clusters are associated with queries. In particular, when 5 out of 100 clusters were associated with queries, there were 25% precision gains over the vector space model. The finding suggests that the cluster-based IR model can potentially enhance both run-time efficiency and retrieval effectiveness. A similar empirical investigation with manual cluster selection was performed three decades ago (Jardine and van Rijsbergen 1971) but was later criticized by Willet (1988), who claimed that the experiment was performed with a data set that was too specific, unrepresentative, and small. To the best of our knowledge, our investigation represents the first substantial empirical evidence for the model's potential since Jardine and van Rijsbergen's study. Our finding also validated an earlier claim (Arazy 2004a) that traditional query-cluster association is the weakest point of the model and results in poor retrieval effectiveness. We explored a scheme to improve query-cluster association, with minor enhancements in precision (up to 8.8%) and recall (up to 12.8%) over the traditional cluster-based IR model. Finally, we explored a complete realization of cluster-based weighting and matching. Using cluster-based TF-IDF document or query index re-weighting, this complete realization demonstrated significant retrieval effectiveness enhancement for cluster-based weighting over both the traditional cluster-based IR model and the simple realization's one-cluster TF-IDF weighting. On the other hand, we were unable to provide a complete realization of cluster-based matching with any improvement in retrieval effectiveness. We explored the matching principle with rule-based matching in order to utilize the additional information that the topical organization of the clustered document collection provides.
To summarize, we were able to demonstrate the great potential of the cluster-based IR model if any of the three principles were addressed: 1) query-cluster association, 2) cluster-based weighting, or 3) cluster-based matching. We demonstrated the importance of associating the appropriate clusters with queries, but to date there is no alternative to query-cluster association that shows positive results. Also, beyond the simple realization, cluster-based matching provided negative results, as we were unable to enhance retrieval effectiveness with the principle. The only significant result we obtained was through the realization of cluster-based weighting. We conclude that although there is great potential for the cluster-based IR model, it is not clear whether the model's retrieval effectiveness could actually be enhanced with any available alternative scheme. Our study explored various schemes to achieve such enhancement, but none of them were able to reach the potential of the model.

8.2 Contributions

A contribution this thesis makes to the field of IS is in providing unequivocal evidence, on a large scale and with a generally accepted benchmark, for the potential of the cluster-based IR model. Earlier evaluations of the model were invalidated due to the use of unrepresentative small sets of data or the employment of suboptimal implementations that failed to reach the potential of the model. Another contribution is the investigation of the three main principles of the model (described in the previous section) and the exploration of alternatives by which to enhance its retrieval effectiveness. Further, a complete realization has been developed, studied, and compared with the traditional model as well as with the simple realization. Significant improvement in retrieval effectiveness has been discovered through cluster-based weighting.
8.3 Limitations and Future Research

The main limitation of our study is the use of one set of data for all of our experiments. Regarding future research, further studies are needed to strengthen and extend our findings concerning the model. First, more investigations of the model's potential could be conducted using alternative data sets. Although we selected a representative data set for the study, employing other data sets would strengthen our findings. In addition, a future study could match queries to documents with more effective schemes, such as the cluster-based TF-IDF re-weighting proposed in this thesis. Second, alternative techniques for query-cluster association need to be explored in order to achieve the model's potential. This potential can be reached when the clusters most relevant to the query are selected. To date, there has been a lack of research toward developing an effective query-cluster association technique. Third, the proposed rule-based matching scheme needs to be investigated for its potential. Our results failed to validate rule-based matching. If the scheme does not have any potential to serve as the complete realization of the cluster-based matching principle, new schemes should be explored. Finally, further studies on the cluster-based weighting schemes, query index re-weighting and document index re-weighting, need to be performed. Such studies can validate the complete realization with other representative data sets and clustering organizations, such as clustering collections in sets of, for example, 200 clusters.

References

Arazy O., "Cluster-Based Information Retrieval: A Study of the Underlying Assumptions", paper under review at Information Processing and Management (2004a).

Arazy O., "Representing the Contents of Text Documents in an Open Environment", Ph.D. diss., University of British Columbia (2004b).

Arazy O.
and Woo C., "Textual Information Access: Enhancing the Cluster-Based Retrieval Model", Workshop on Information Technologies and Systems, 177-182 (2003).

Baeza-Yates R. and Ribeiro-Neto B., "Modern Information Retrieval", Addison Wesley / ACM Press (1999).

Cutting D.R., Pedersen J.O., Karger D., and Tukey J.W., "Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections", Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 318-329 (1992).

Dice L.R., "Measures of the amount of ecological association between species", Ecology, 26, 297-302 (1945).

Fairthorne R.A., "The mathematics of classification", Towards Information Retrieval, Butterworths, London (1961).

Good I.J., "Speculations Concerning Information Retrieval", Research Report PC-78, IBM Research Centre, Yorktown Heights, New York (1958).

Jardine N. and van Rijsbergen C.J., "The use of hierarchical clustering in information retrieval", Information Storage and Retrieval, 7, 217-240 (1971).

Luhn H.P., "The automatic creation of literature abstracts", IBM Journal of Research and Development, 2 (2), 159-165 (1958).

MacQueen J.B., "Some methods for classification and analysis of multivariate observations", Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281-296, University of California Press, Berkeley (1967).

Porter M.F., "An Algorithm for Suffix Stripping", Program, 14 (3), 130-137 (1980).

Salton G. and Lesk M.E., "Computer Evaluation of Indexing and Text Processing", in Salton G. (editor), The SMART Retrieval System: Experiments in Automatic Document Processing, 143-180, Prentice-Hall (1971).

Salton G., Wong A., and Yang C.S., "A vector space model for automatic indexing", Communications of the ACM, 18 (11), 613-620 (1975).

Shaw W.M., Burgin R., and Howell P., "Performance Standards and Evaluations in IR Test Collections: Cluster-Based Retrieval Models", Information Processing and Management, 33 (1), 1-14 (1997).
Singhal A. and Pereira F., "Document Expansion for Speech Retrieval", Research and Development in Information Retrieval, 34-41 (1999).

van Rijsbergen C.J., "Information Retrieval", 2nd edition, Butterworths, London (1979).
