Experimenting with Automatic Concept Identification in Documents

By Andy Lu Mi
B.Eng., Tsinghua University, 1996

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES, DEPARTMENT OF MANAGEMENT INFORMATION SYSTEMS, FACULTY OF COMMERCE

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
April 2003
© Andy Lu Mi, 2003

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Management Information Systems
Faculty of Commerce
The University of British Columbia
Vancouver, Canada

ABSTRACT

Many businesses in a wide range of industries are collecting and using large sets of text documents with information on their own operations and their commercial environments, and consequently they are becoming increasingly interested in tools like Automatic Concept Identification to help them manage their data files. This paper expands research into domain concepts and the processes of identifying them automatically from document collections with unrestricted yet narrowly defined domains. An automatic, scalable and consistent model of concept identification is proposed, integrating automatic text indexing techniques (for example stop-wording, stemming and phrase formation), a newly developed affinity measure, and Agglomerative Hierarchical Clustering techniques. To test the results of the proposed approach quantitatively, a system based on the proposed model has been developed and implemented, and three sensitivity studies have been conducted against three collections of technical white papers. This study contributes to the development of a pair-wise word affinity measure based on word co-occurrence and the distance between the words being evaluated, together with a variety of selection criteria and thresholds for index terms (e.g. Total Frequency and Document Frequency). The study's results demonstrate that the proposed model generally delivers positive concept identification outcomes. The results of the sensitivity studies provide empirical evidence regarding the effects on concept identification outcomes of different index term selection thresholds, different sizes of co-occurrence windows, and different characteristics of document collections.

Table of Contents

Abstract
Table of Contents
Acknowledgements
Chapter 1. Introduction and Background
  1.1 Motivation
  1.2 Parsing and Pre-processing
  1.3 Traditional Concept Identification Approach
    1.3.1 Luhn and other scholars
    1.3.2 Clustering techniques used in Concept Identification
    1.3.3 Kullback-Leibler-based clustering
    1.3.4 Mutual Information-based clustering
Chapter 2. Proposed Approach
  2.1 Objectives
  2.2 Theories influencing the current approach
    2.2.1 Semantic Network
    2.2.2 Word Association and co-occurrence study
    2.2.3 Concept Definition
  2.3 Framework of the proposed approach
  2.4 Parsing and Pre-processing
  2.5 Term Identification
  2.6 Building a similarity matrix for important terms
    2.6.1 Association Clusters
    2.6.2 Metric Clusters
    2.6.3 Scalar Clusters
    2.6.4 Other approaches
  2.7 Affinity Value Measure
    2.7.1 Term Affinity Measure
    2.7.2 Definitions
  2.8 Clustering Algorithms
    2.8.1 Hierarchical clustering
    2.8.2 Partitional Clustering
    2.8.3 Selecting a clustering algorithm
  2.9 Identifying concepts
Chapter 3. System Implementation
Chapter 4. Evaluation
  4.1 Measurements
    4.1.1 Relevance score
    4.1.2 Average within-cluster similarity
  4.2 What to measure
  4.3 Questions and Tests
Chapter 5. Experiments and Discussion
  5.1 Experimental Corpus
  5.2 Selection Criteria vs. Final Concept Identification Output
  5.3 The N Factor in the Affinity Value Definition and Concept Identification Outputs
  5.4 Characteristics of the Document Corpus and Concept Identification Outputs
  5.5 Overall results
Chapter 6. Conclusion
  6.1 Conclusions drawn from the Sensitivity Studies
  6.2 General conclusions regarding automatic concept identification
Chapter 7. Summary
  7.1 Contributions of the research
  7.2 Critiques of the research
  7.3 Future Research
References
Appendix A: Top 10 computer / expert identified concepts for the three corpuses
Appendix B: The processed document lists
Appendix C: List of selected terms for all three corpuses
Appendix D: Porter Stemming Algorithm
Appendix E: Notation / formula summary table
Appendix F: Original output files from the system (on CD-ROM)
Appendix G: System program code (on CD-ROM)

Acknowledgements

I would like to thank Professor Yair Wand for giving me invaluable instructions and suggestions, Professor Carson Woo and Professor Jacob Steif for being my thesis committee members, and PhD candidates Ofer Arazy and Edmund Zeto for brainstorming with me on the affinity value definition. I would also like to thank Sam Gao, Jimmy Luo, Eric Hu, and Frank Fan for reviewing and scoring the experimental results as the expert judges.

1. INTRODUCTION AND BACKGROUND

This experimental study of Automatic Concept Identification proposes an automatic, scalable and consistent concept identification approach for document collections with narrowly defined domains, and it analyses empirical evidence to support a proposed model for managing these document collections. In the study, we first describe a framework for automatic concept identification, and then implement a testing system based on the proposed framework. Finally, we evaluate the performance of the implemented system against several document collections.

1.1 Motivation

As businesses and other organizations use large sets of text documents for a growing range of purposes, Automatic Concept Identification is becoming an area of great interest. It has already been used extensively in various applications, notably including information retrieval, text mining, domain information acquisition, ontology engineering, automatic summarization, and automatic indexing.
Most efforts in the development of Automatic Concept Identification have been extended and applied to document collections with specific domains, for example collections of legal cases, biology studies, medical research, and software programming. Most of these identification techniques are suitable only for highly structured technical papers whose content can be organized using certain semantic frames (Paice & Jones, 1993), and which require knowledge of specific domains. However, with the overwhelming volume of information being generated and disseminated across the internet, the majority of which is produced as unstructured textual data, there has been a growing interest in finding effective automatic techniques that can be used in unrestricted domains with minimal need for specialized domain knowledge.

This paper proposes an approach to automatic concept identification that targets the problems attendant to non-specialized document collections. The process developed in this paper comprises three major phases. In the first phase, terms used with the greatest frequency in particular documents are identified. A term is considered to be frequent if it appears in at least δ documents, where δ is a given document frequency threshold (because the size of targeted document collections may vary, it is appropriate to define δ as a relative measure instead of as an absolute measure; therefore δ is defined as the percentage of the documents in which a given term appears, out of the total documents in a targeted document collection), and if the term appears a minimum of φ times in the entire collection, where φ is a given term frequency threshold (similar to the definition of δ, φ is defined as a certain term's frequency relative to the highest term frequency in the entire document collection). The second phase of the concept identification process introduces a new measurement, based on the relations between identified term pairs and the co-occurrence of those terms, to describe the Affinity between each pair of identified terms. Finally, in the third phase, the Affinity measure calculated in the second phase is applied as the similarity function in a hierarchical clustering program to cluster the identified terms into groups. In the clustering process, some terms merge with each other to form "term groups," while other terms remain independent. All of the terms represent concepts that are potentially relevant to organizing the document collection.

In order for the above approach to work, an automatic text indexing technique is necessary to parse and pre-process text documents, and appropriate similarity functions and clustering algorithms are required to cluster identified terms. The next section presents a review of relevant literature on automatic text indexing techniques, traditional concept identification approaches, and different similarity functions used in unsupervised clustering processes applied in the context of concept identification.

1.2 Parsing and Pre-Processing

The overwhelming volume of online information generated and disseminated across computer networks has created a significant burden for researchers and practitioners. For structured numeric data, database management systems (DBMSs) have typically been used to organize records and files. However, in the case of unstructured textual data, the processes of information extraction, processing and retrieval remain very complex and problematic.
To overcome the difficulties of processing unstructured text documents, G. Salton presents a blueprint for document processing in Automatic Text Processing (1989), which includes functions of looking up terms in a dictionary, removing stop words, stemming, and phrase formation. The algorithm for Salton's entire process is as follows:

1. Ignoring case, extract all the words from each document in the document collection.
2. Eliminate non-content-bearing stop words such as "the," "a," "on," "in," etc. (a sample list of stop words is identified by Frakes and Baeza-Yates, 1992, Chapter 7).
3. Apply a stemming algorithm (for example the Porter stemming algorithm: Porter, 1997) to identify the word stem of the remaining words, eliminating plurals, tenses, prefixes, and suffixes (see Frakes and Baeza-Yates, 1992, Chapter 8).
4. Formulate phrases by combining related adjacent words such as "New" and "York."

The above steps outline a simple pre-processing scheme; however, "an efficient implementation of the above scheme would use lexical analyzers, fast and scalable hash tables, advanced stemming algorithms, and advanced phrase extraction algorithms" (Dhillon & Modha, 1999).

1.3 Traditional Concept Identification Approach

1.3.1 Luhn and other scholars

H.P. Luhn's influential 1958 paper entitled "The Automatic Creation of Literature Abstracts" proposed that abstracts can be generated automatically by selecting, from a source text, sentences which contain strong clusters of significant words (Luhn, 1958). This approach requires an understanding of the processes of recognizing significant words; significant words are viewed as content words that are particularly closely associated with the subject matter of the documents in question. To recognize them, first a table is used to eliminate prepositions, articles, conjunctions and other common function words (stop words), and then the most frequent of the remaining word stems are accepted as significant. Luhn's paper has served as a starting point for subsequent research into automatic abstracting, statistical indexing, information retrieval and concept identification. Furthermore, during the past several decades researchers from different disciplines have used ontology, taxonomy, and computer algorithms to develop numerous complex models and techniques for concept identification. However, a problem remains: the frequency with which word clusters appear is by no means the only indication of which concepts are important. As a result, researchers have tended to concentrate their efforts in other directions, including the position of a sentence within a document or paragraph (Baxendale, 1958; Edmundson, 1969); the presence of cue words and expressions such as "important," "definitely," "in particular," "unclear," "perhaps," and "for example" (Edmundson, 1969; Rush et al., 1971); the presence of indicator constructs such as "The purpose of this research is" and "Our investigation has shown that" (Paice, 1981); and the number of semantic links between a sentence and its neighbors (Skorokhod'ko, 1972).

An alternative approach, using techniques from the study of artificial intelligence, entails performing a detailed semantic analysis of a source text, and thus constructing a semantic representation of the meaning of the document. "A set of frames, tailored to the domain of application, is normally used to facilitate the analysis and representation tasks. When analysis is complete, output templates are used to generate a textual summary from the instantiated frames" (DeJong, 1982; Rau et al., 1989).
INCIDENT TYPE: bombing
DATE: March 19
LOCATION: El Salvador: San Salvador (city)
PERPETRATOR: urban guerrilla commandos
PHYSICAL TARGET: power tower
HUMAN TARGET:
EFFECT ON PHYSICAL TARGET: destroyed
EFFECT ON HUMAN TARGET: no injury or death
INSTRUMENTATION: bomb

Figure 1.1 Template of extracted information from a terrorist report (Grishman, 1997)

Unfortunately, the knowledge base required for a system of this kind is of necessity large and complicated, and is moreover specific to the domain of application. Some idea of the complexities involved can be gained from Hahn's descriptions of his TOPIC system (Hahn, 1989, 1990). Although the performance of such systems may be reasonable for documents of a particular domain and specific genre, there seems to be little prospect of broadening such systems to cope with a wider variety of input.

1.3.2 Clustering techniques used in concept identification

Some notable previous studies have outlined the use of clustering techniques for Automatic Concept Identification. Clustering is, by definition, the unsupervised classification of patterns (e.g. patterns found in observations, data items, or feature vectors) into groups (clusters) (Jain and Murty, 1999). Patterns within a valid cluster (i.e. a cluster that is useful for Automatic Concept Identification) are more similar to each other than they are to patterns belonging to different clusters. Thus, in the context of concept identification, clustering techniques can often be used to group relevant terms together into term groups that represent potentially relevant concepts, provided that a proper similarity measure can be established between terms. The two most popular clustering methods currently used by researchers are the Kullback-Leibler clustering method and the Mutual Information clustering method (Chotimongkol and Rudnicky, 1999; McCandless and Glass, 1993).

The tasks involved in clustering include pattern representation, distance definition (i.e. measuring the degree of similarity between different patterns), grouping, and data abstraction (extracting data in a simple and compact form from a data set). Pattern representation refers to the number of classes of patterns available to the computers that perform the clustering, and the quantity, type and scale of the features available to the clustering algorithm. Due to technological and environmental factors, practitioners may be limited in their ability to develop clustering algorithms that can account for all of the characteristics and qualities of the patterns being analyzed. In the discussion of concept identification below, pattern representation is most prominently evident in the terms identified in the study. In other studies and in application, the distance or proximity of similarities is usually measured by a distance function defined over pairs of patterns (in the concept identification task contributing to this paper, distance is defined over pairs of terms). A variety of distance measures have been used by various communities (Anderberg, 1973; Jain and Dubes, 1988; Diday and Simon, 1976). A simple distance measure like the Euclidean distance [1] can often be used to reflect dissimilarity between two patterns, whereas other similarity measures can be used to characterize the conceptual similarity between patterns (Michalski and Stepp, 1983). The Kullback-Leibler clustering method and the Mutual Information clustering method are most evidently distinguishable from each other in this function.

[1] The Euclidean distance is the straight-line distance between two points. In a plane, with p_1 at (x_1, y_1) and p_2 at (x_2, y_2), it is \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}; in N dimensions, the Euclidean distance between two points p and q is \sqrt{\sum_{i=1}^{N} (p_i - q_i)^2}, where p_i (or q_i) is the coordinate of p (or q) in dimension i. In some information retrieval systems, researchers use vector space models to represent documents; when they use clustering techniques to cluster documents, a Euclidean function is often used to calculate the distance between two documents, with p and q representing the two documents and p_i (or q_i) the weight of document p (or q) in dimension i; the above formula is then used to calculate the Euclidean distance between the two documents.
The grouping step can be performed in a number of ways. The output clustering can be hard (a partition of the data into groups) or fuzzy (where each pattern has a variable degree of membership in each of the output clusters). "Hierarchical clustering algorithms produce a nested series of partitions based on a criterion for merging or splitting clusters based on similarity. Partitional clustering algorithms identify the partition that optimizes (usually locally) a clustering criterion" (Jain & Murty, 1999). Additional techniques for the grouping operation include probabilistic (Brailovski, 1991) and graph-theoretic (Zahn, 1971) clustering methods.

1.3.3 Kullback-Leibler Clustering

In many previous studies, a number of techniques based on an analysis of the distributional statistics of words have been proposed for semantically clustering words or phrases. The Kullback-Leibler clustering technique has been used in many different areas, including Automatic Grammar Induction (McCandless and Glass, 1993), Automatic Concept Identification (Chotimongkol and Rudnicky, 1999), Speech Recognition (Foote and Silverman, 1999), and many data mining applications. In the Kullback-Leibler clustering method, the similarity between clusters is determined by the Kullback-Leibler (KL) distance. The KL distance, also called "relative entropy," is defined as follows. Supposing Y is a collection of data, the relative entropy or Kullback-Leibler distance between two probability distributions P_a and P_b is

D(P_a, P_b) = \sum_{y \in Y} P_a(y) \log \frac{P_a(y)}{P_b(y)}    (1.1)

While D(P_a, P_b) is often called a distance, it is not a true metric because it is not symmetric and does not satisfy the triangle inequality. Therefore, when researchers use this distance function in clustering techniques, they often use the divergence measure:

Div(P_a, P_b) = D(P_a, P_b) + D(P_b, P_a)    (1.2)

The distance between cluster i and cluster j is the sum of the KL distance between the left context probabilities, p(left), of the two clusters and the KL distance between the right context probabilities, p(right), of the two clusters:

Distance(i, j) = Div(p_i(left), p_j(left)) + Div(p_i(right), p_j(right))    (1.3)

"Researchers calculate p(left) and p(right) from bi-gram probability. p_i(left(v_k)) is the probability that word v_k is found to the left of words in cluster i" (Chotimongkol and Rudnicky, 1999). Similarly, p_i(right(v_k)) is the probability that word v_k is found to the right of words in cluster i. Specifically,

p_i^{left}(v_k) = p(v_k \mid cluster_i) = \frac{p(v_k, cluster_i)}{p(cluster_i)}, \qquad p_i^{right}(v_k) = \frac{p(cluster_i, v_k)}{p(cluster_i)}    (1.4)

The above Kullback-Leibler distance defines the distance/similarity function between two clusters, in relation to the entire clustering process. The grouping step of KL clustering in concept identification applications often uses hierarchical clustering algorithms, for reasons that will be explained in the latter part of this paper.
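To make formulas (1.1) and (1.2) concrete, the following is a minimal Python sketch (an illustrative reimplementation, not part of the thesis system; the probability tables and the smoothing constant eps are assumptions) of the KL distance and the symmetrized divergence used to compare two clusters' context distributions:

```python
import math

def kl_distance(pa, pb, eps=1e-12):
    """Relative entropy D(Pa, Pb) of formula (1.1).

    pa and pb map each outcome y to its probability; eps guards against
    zero probabilities in pb (a simple smoothing assumption, not part of
    the thesis definition).
    """
    return sum(p * math.log(p / max(pb.get(y, 0.0), eps))
               for y, p in pa.items() if p > 0.0)

def divergence(pa, pb):
    """Symmetric divergence Div(Pa, Pb) of formula (1.2)."""
    return kl_distance(pa, pb) + kl_distance(pb, pa)

# Illustrative left-context distributions for two clusters.
p_left_i = {"customer": 0.5, "data": 0.3, "system": 0.2}
p_left_j = {"customer": 0.4, "data": 0.4, "market": 0.2}
print(divergence(p_left_i, p_left_j))
```

Formula (1.3) would then add this divergence computed over the left-context distributions of the two clusters to the one computed over their right-context distributions.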
1.3.4 Mutual Information-based Clustering

The Mutual Information clustering method is very popular for processing text documents and natural languages (Brown et al., 1992; Rudnicky et al., 1999). "Mutual Information," or simply "information," was introduced by Shannon in his landmark 1948 paper "A Mathematical Theory of Communication." When Mutual Information clustering is applied to concept identification, the process is very similar to that of KL-based clustering. It defines the similarity between words or clusters based on information that is shared with adjacent words or clusters. The algorithm assigns each word to its own cluster, then iteratively merges clusters in an economical manner, so that at each iteration "the loss in average mutual information is minimized" (Chotimongkol and Rudnicky, 1999). This merging process continues until the desired number of clusters is reached. Average mutual information (AMI) is defined by the following equation:

AMI = \sum_{i,j} p(i, j) \log \frac{p(i, j)}{p(i)\, p(j)}    (1.5)

where p(i, j) is the bi-gram probability of cluster i and cluster j, i.e. the probability that a word in cluster i precedes a word in cluster j (Chotimongkol and Rudnicky, 1999).

2. Proposed Approach

2.1 Objectives

Previous studies on automatic concept identification can be useful for the development of a new approach to the automatic identification of important concepts in document collections that are unrestricted, yet with narrowly defined domains. We propose several desired features for such an approach:

1. It should be easily portable across different restricted domains, to ensure that minimal specific knowledge regarding the domains is required.
2. The system should be scalable and able to deal with large document collections.
3. The final output should include both concept identification results and concept creation process information, to ensure that the process of concept identification can be examined.
4. The process should accommodate the optional injection of domain-specific knowledge to improve the final outcome.
5. If possible, the system should be independent of any specific language.

Based on the above objectives, it is apparent that automatic text indexing techniques (such as stop-word removal and stemming) are needed to identify relevant index terms for the target document collections, and that unsupervised clustering techniques are needed to classify terms into groups that represent potential concepts. In addition, the study needs to provide empirical evidence of the effects on concept identification results of different selection criteria for index terms, different parameters used in similarity functions, and different characteristics of document collections.
Because a number of previous studies have already focused on the use of unsupervised clustering techniques to group words and phrases for concept identification, the present study focuses primarily on applying the newly developed affinity value function as a similarity function in the clustering process of concept identification tasks.

2.2 Theories Influencing the Current Approach

2.2.1 Semantic Network

The first theory that has inspired the development of the present concept identification framework is the Semantic Network theory developed by Professor Joseph D. Novak at Cornell University in the 1960s. It was originally based on the theories of David Ausubel, who stressed the importance of prior knowledge to the acquisition of new concepts. Novak concluded, "Meaningful learning involves the assimilation of new concepts and propositions into existing cognitive structures" (Jonassen, 1996). David Jonassen has elaborated the relationship between knowledge and learning in more detail:

Semantic networks are knowledge representation schemes involving nodes and links (arcs or arrows) between nodes. The nodes represent objects or concepts and the links represent relations between the nodes. The links are directed and labelled; thus, a semantic network is a directed graph. In print, the nodes are usually represented by circles or boxes and the links are drawn as arrows between the circles. Semantic networks are spatial representations of ideas and their interrelationships. The semantic networks and the maps that represent them are composed of nodes (concepts or ideas) that are connected by links (statements of relationships). (Jonassen, 1996)

"The purpose of semantic networks is to represent the organization of ideas in a content domain" (Jonassen, 1996). As Jonassen has explained, semantic networks can be used to represent structural knowledge, particularly as "structural knowledge is the knowledge of how ideas within a domain are integrated and interrelated" (Jonassen, 1996). Based on Semantic Network theory, a graphic representation of a specific content domain can be built using a diagram of nodes, with each node representing a major knowledge base (concept) within that domain. When a specific domain's document collection is targeted, if the index terms of that document collection can be identified and if interrelationships between the index terms can be established, a Semantic Network (a representation of structural knowledge) of that specific domain's document collection can be built, and both the single-term and multiple-term concepts within the document collection can be identified.

2.2.2 Word Association and Co-occurrence Study

Several researchers have used statistical measures to identify sets of associated words, which can be used in various natural language processing tasks. As Church and Hanks (1990) have said, "word association has been studied for some time in the fields of psycholinguistics (by testing human subjects on words) and linguistics (where meaning is often based on how words co-occur with each other), as well as by those researching natural language processing" (Church and Hanks, 1990). Notable studies that have used statistical measures to identify sets of associated words for use in various natural language processing tasks have been produced by Hindle and Rooth (1990), Dagan (1990), McDonald et al. (1990), and Wilks et al. (1990).
Central to the study of word association is the concept that "words which occur frequently with a given word may be thought of as forming a 'neighborhood' of that word" (Guthrie et al., 1991). Moreover, these neighborhoods assist people in determining precise meanings. For example, as noted by Church and Hanks (1989), the word "bank" has different meanings in different contexts. If we assume that it has only two meanings - a repository for money and the ground at the edge of a river - we would expect the first meaning of "bank" to co-occur frequently with words like "money," "loan," and "robber," while the second meaning would be associated with "river," "bridge," and "earth." In order to disambiguate "bank" in a text, neighborhoods can be produced for each meaning and intersected with the terms within the text, under the assumption that the neighborhood that shares more words with the text will identify the correct sense. In the concept identification task associated with this paper, the true "meaning" of a concept (or term) is identified by studying its "neighborhood" - its associations with other terms and concepts.

2.2.3 Concept Definition

As the preceding discussion notes, a concept's true meaning can only be identified when its associations with other concepts are studied. Likewise, to understand the structural knowledge of a specific domain, the relationships between the concepts included in the domain must be studied. Admittedly, the term "concept" is itself complex, and in a number of papers in the areas of information retrieval and language processing, researchers have used the words "concept" and "term" interchangeably. Therefore, for the purposes of the present study, the definitions of the two words can be combined, and a concept can thus be defined as a word or group of words having a particular meaning, or a general idea derived or inferred from specific instances or occurrences. However, a specific definition of concept is still necessary for use in the study's concept identification task. For example, if we try to identify concepts in a document collection of a specific domain, "CRM system technology, strategy and implementation," we would encounter terms like "business," "system," "data," "intelligence," and "analysis." In their own right, these terms may very well be independent concepts, but in the specific domain, they can merge together to form a whole new concept: "business intelligence and data analysis system," a very important category in the CRM system. Thus, the present concept identification task should identify not only single-term concepts, but also the groups of terms that form complex concepts representing relevant knowledge within the specific domain. Furthermore, the second task is more important and challenging than the first. In the concept identification context, a concept will be defined as a term or a group of terms that refers to a relevant knowledge point of the specific domain. Hereafter, in order to avoid confusion, the final outcomes of the concept identification system will be referred to as concepts, and single words or phrases resulting from automatic text processing will be referred to as terms.

Researchers in human-computer interaction and information science have suggested that important terms should be expanded into concepts, and that terms with similar meanings should be linked. For example, Furnas et al. (1987) have proposed "unlimited aliasing," which creates multiple identities for the same underlying concepts.
In information science, Bates (1986) has proposed the use of a domain-specific dictionary to expand user vocabularies (terms) in order to allow users to "dock" onto the system more easily. The general idea of identifying important terms and linking similar terms is sound, and its usefulness has been verified in previous research. However, as Lin and Chen (1996) have noted, "the bottleneck for such techniques is often the manual process of identifying terms and linking similar or synonymous terms. For example, the effort involved in creating an up-to-date, complete, and subject-specific thesaurus is often overwhelming and the resulting thesaurus may quickly become obsolete unless consistently maintained."

2.3 Framework of the Proposed Approach

The automatic concept identification approach proposed in this study is divided into five phases (see Figure 2.1):

1. Pre-processing. At this phase, all the documents in the targeted domain are prepared for future processing. The preparation includes the conversion of all words to lower case, stop-word removal, word stemming and phrase formation.
2. Term identification. At this phase, all the terms in the document collection are scanned to record their frequency in each document and their total frequency throughout the document set. Based on these two indicators, the index terms (important terms) are identified.
3. Affinity value calculation. At this phase, the affinity value (average affinity value) between each pair of identified terms is calculated, based on the newly developed affinity value formula, and a term-term affinity matrix is generated as an input source for the next phase.
4. Clustering. At this phase, an unsupervised clustering technique is chosen to group terms into term groups based on the affinity value matrix.
5. Identifying concepts. At the last phase, domain knowledge is applied to convert the clustering results (term groups) into concepts.

The detailed processes will be further explained in the next section.

Figure 2.1 The automatic concept identification approach

2.4 Parsing and Pre-Processing

A simple parsing and pre-processing scheme was introduced above, in Section 1.2, with the acknowledgement that an efficient implementation of the scheme would use lexical analyzers, fast and scalable hash tables, advanced stemming algorithms, and advanced phrase extraction algorithms (Dhillon and Modha, 1999). Although pre-processing is an integral part of the automatic concept identification framework being proposed, those techniques are not a focus of the present research. Instead, only the most simple and popular techniques (algorithms) available have been selected for implementation in the empirical experiments conducted in this study. For stop-word removal, the stop-word list used by the Rainbow package (http://www-2.cs.cmu.edu/~mccallum/bow/) has been adopted. (The Rainbow toolkit is designed for statistical language modeling, text retrieval, classification and clustering, and is developed and maintained by Andrew McCallum of CMU.) For the stemming algorithm, the Porter stemming algorithm has been used; it is the most popular and widely used of all existing algorithms. It was originally presented by Porter in 1980, and it has since been reprinted in Jones and Willett (1997).

Removing suffixes by automatic means is an operation that is particularly useful in the field of unstructured text document processing. Typically, a vector of words or terms can represent a document, and terms with a common stem will usually have similar meanings.
For example, the following terms can be treated as equivalent (this example is taken from Porter's originally published algorithm; please refer to the details in Appendix D):

CONNECT
CONNECTED
CONNECTING
CONNECTION
CONNECTIONS

Frequently, the performance of a text document processing system will be improved if groups of terms with common stems are conflated into a single term. This may be done by removing the various suffixes (e.g. -ed, -ing, -ion, and -ions) to leave the single term (e.g. CONNECT). In addition, the suffix-stripping process reduces the total number of terms in the document collection, and hence reduces the size and complexity of the data to be processed. In the Porter stemming algorithm, the stemming process is broken into five steps; in each step, certain operations are applied to the term according to specific rules, and complex suffixes are removed bit by bit. For example, GENERALIZATIONS is stripped to GENERALIZATION (Step 1), then to GENERALIZE (Step 2), then to GENERAL (Step 3), and then to GENER (Step 4). OSCILLATORS is stripped to OSCILLATOR (Step 1), then to OSCILLATE (Step 2), then to OSCILL (Step 4), and then to OSCIL (Step 5). A complete and detailed set of examples of the Porter stemming algorithm can be found in Appendix D.

2.5 Term Identification

While automatic indexing identifies terms in the document collection, the relative importance of each term in representing the underlying concepts may vary. That is, some of the terms used may be more important than others. Salton's vector space model (1997) assigns each term a weight to represent its descriptive power (a measure of importance). Among the many probabilistic techniques that have been developed by various information science researchers, techniques that, like Salton's, incorporate term frequency and inverse document frequency have been found useful. In the context of the present study, term frequency is used to refer to the number of times a term appears in a given document, and document frequency refers to the number of documents in the entire database that contain a given term. Because the size of targeted document collections may vary, in the current study it is appropriate to use relative measures instead of absolute measures. Thus, document frequency is defined as the percentage of documents a term appears in, out of the total number of documents in the targeted document collection, and term frequency is defined as the ratio of a given term's frequency to the frequency of the most common term in the entire document collection.

For concept identification research, in contrast to studies of automatic indexing, document frequency is a more suitable means of choosing index terms than inverse document frequency. For information retrieval, where the task is to emphasize the uniqueness of a particular document, inverse document frequency is appropriate, but for concept identification, the primary task is the identification of important terms in the whole document collection; therefore, document frequency is suitable. The specific thresholds for selecting index terms may vary depending on the size of the document collection and specific domain knowledge. Because previous studies have not firmly established standards for choosing index terms, researchers are often forced to rely on their experience and common sense. In the present study, various empirical criteria are posited to set thresholds for index term selection.
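As an illustration of the relative thresholds described above, the following is a minimal Python sketch (not part of the thesis system, which was implemented in Visual Basic; function names, variable names and threshold values are illustrative assumptions) that selects index terms using a relative document frequency threshold δ and a relative term frequency threshold φ:

```python
from collections import Counter

def select_index_terms(documents, delta=0.10, phi=0.05):
    """Select index terms from tokenized documents.

    documents: list of documents, each a list of (already stemmed) tokens.
    delta: minimum fraction of documents a term must appear in
           (relative document frequency threshold).
    phi:   minimum term frequency, relative to the frequency of the most
           frequent term in the whole collection.
    """
    total_freq = Counter()            # collection-wide term frequency
    doc_freq = Counter()              # number of documents containing the term
    for tokens in documents:
        total_freq.update(tokens)
        doc_freq.update(set(tokens))

    n_docs = len(documents)
    max_freq = max(total_freq.values())
    return {term for term in total_freq
            if doc_freq[term] / n_docs >= delta
            and total_freq[term] / max_freq >= phi}

# Example: keep terms that appear in at least half of the documents and have
# at least 40% of the most frequent term's collection-wide frequency.
docs = [["custom", "relationship", "manag"], ["custom", "data", "manag"],
        ["market", "custom", "data"]]
print(select_index_terms(docs, delta=0.5, phi=0.4))
```

Because both thresholds are ratios rather than absolute counts, the same settings can be carried across document collections of different sizes, which is the motivation given above for using relative measures.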
2.6 Building a Similarity Matrix for Important Terms

The first stage of cluster analysis theory (Baeza-Yates and Ribeiro-Neto, 1999) involves the conversion of raw data (e.g. terms and weights) into a matrix of distance/similarity measures between any pair of terms. The similarity measure computation is primarily based on the probabilities of terms co-occurring in the documents within a database. The probabilistic weights of the terms indicate their strength of relevance or association. In previous studies, researchers have developed a variety of functions to calculate the similarity/distance matrix. Besides the two similarity/distance functions mentioned in Section 1.3 above (the Mutual Information and Kullback-Leibler distances), several other functions have been used primarily in information retrieval, namely association clusters, metric clusters, and scalar clusters. The following are brief descriptions of these functions, taken from Baeza-Yates and Ribeiro-Neto's Modern Information Retrieval (1999, Chapter 5).

2.6.1 Association Clusters

An association cluster is based on the co-occurrence of terms in documents. In effect, terms that occur together frequently within documents have a synonymity association. Association clusters are generated as follows.

Definition: The frequency of a term t_i in a document d_j is referred to as f(i, j). Let m = (m_{i,j}) be an association matrix with one row per term and one column per document, where m_{i,j} = f(i, j), and let m^t be the transpose of m. The matrix s = m * m^t is a local term-term association matrix. Each element S(u, v) of s expresses a correlation C(u, v) between the terms S_u and S_v, namely

C(u, v) = \sum_{j} f(S_u, j) * f(S_v, j)    (2.1)

The resulting matrix s is the similarity matrix that will be used in the clustering operation.

2.6.2 Metric Clusters

Association clusters are based on the frequency of the co-occurrence of term pairs within documents, and they do not take into account where the terms occur within the documents. However, because two terms that occur in the same sentence are likely to be more closely correlated than two terms that occur far apart in a document, it might be worthwhile to factor the distance between the two terms into the computation of their correlation factor. Metric clusters are based on this idea.

Definition: Let the distance r(k_i, k_j) between two keywords k_i and k_j be given by the number of words between them in the same document. If k_i and k_j are in distinct documents, then r(k_i, k_j) = \infty. A local term-term metric correlation matrix s is defined as follows. Each element S(u, v) of s expresses a metric correlation C(u, v) between the terms S_u and S_v, namely

C(u, v) = \sum_{k_i} \sum_{k_j} \frac{1}{r(k_i, k_j) + 1}    (2.2)

2.6.3 Scalar Clusters

One additional means of deriving a synonymity relationship between two local terms S_u and S_v is by comparing the sets S_u(n) and S_v(n), under the assumption that two terms with similar neighbourhoods have a synonymity relationship. In this case, the relationship can be seen as an indirect one, or as one that is induced by the neighbourhood. One way of quantifying such a neighbourhood relationship is to arrange all correlation values S(u, i) in a vector S_u, and all correlation values S(v, i) in another vector S_v, and then to compare these vectors through a scalar measure. For instance, the cosine of the angle between the two vectors is a popular scalar similarity measure.
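As a small illustration of the scalar (cosine) comparison just described, the sketch below (illustrative only; the association matrix rows shown are made up) computes the cosine similarity between the correlation vectors of two terms:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two correlation vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Rows of a hypothetical term-term association matrix s for two terms.
s_u = [12, 3, 0, 7]     # correlations S(u, i) of term u with every term i
s_v = [10, 1, 2, 5]     # correlations S(v, i) of term v with every term i
print(cosine(s_u, s_v))   # close to 1.0 when the two neighbourhoods are similar
```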
2.6.4 Other Approaches

There are other approaches to calculating similarity matrices; for example, some approaches consider the unique characteristics of individual terms. These include the position of a term (e.g. a term in a title compared to a term in the abstract of a paper), the number of words in a term, the date when the term appears (i.e. the publication year of the document including the term), and the identifiable type of the term (e.g. a person's name, a subject descriptor, a gene name, a company's name, etc.).

2.7 Affinity Value Measure

2.7.1 Term Affinity Measure

The proposed affinity measure between two terms is based on their co-occurrence and the distance between them when they co-occur in the text documents. If two terms constantly appear close to each other in the documents in which they co-occur, then they have a high affinity measure. Therefore the affinity value captures two aspects of the relationship between the terms: a spatial relationship (the distance between the two terms when they co-occur) and a temporal relationship (how often they co-occur). Based on these theories and assumptions, the affinity measure between two terms can be built from their spatial and temporal relationships. Because two terms that appear in the same sentence seem to have a closer affinity than two terms that occur in different sentences, and because two terms that appear close together have a closer affinity than two terms that occur far apart, it is worthwhile to factor distance and sentence factors into the computation of the affinity measure. Furthermore, because the affinity value represents the relationship between a pair of terms and is the opposite of distance (the shorter the distance, the larger the affinity), in order to be mathematically sound, the affinity value must be symmetric.

2.7.2 Definitions

Distance: The number of terms between the two terms being evaluated, plus one. Therefore, if two terms are next to each other, their distance is 1. D(t_i, t_j) is used to represent the distance between two terms t_i and t_j.

Affinity value: For two terms t_i and t_j, the affinity measure is defined as Affinity(t_i, t_j). If the distance between t_i and t_j is d (see the distance definition above), then the inverse distance factor is 1/d. If the threshold number of sentences used to evaluate the affinity measure is N (N is less than the total number of sentences in the document, as two terms appearing only in different documents are not considered to have an affinity measure), and the number of sentences between t_i and t_j is n (if t_i and t_j are in the same sentence, then n = 0; if t_i and t_j are in two adjacent sentences, then n = 1, and so on), the sentence factor is (N - n). Finally, the normalization factor 1/N is added, and the affinity value is calculated as follows:

A(t_i, t_j) = \sum W / D(t_i, t_j)    (2.7.1)

where the sentence factor W is

W = 1              if t_i and t_j appear in the same sentence
W = (N - n) / N    if n < N
W = 0              if n >= N

D(t_i, t_j) is the distance between term t_i and term t_j (see the distance definition above), and the summation accumulates over all co-occurrences of t_i and t_j in the entire document collection.
For example, if a certain document includes two occurrences of t_i and two occurrences of t_j within the evaluation scope (N), then the co-occurrence is counted four times (the affinity contributions may differ due to different values of d and n), and the contributions are accumulated together under this measurement. For all t_i = t_j, the affinity value between the terms is calculated as zero.

Co-occurrence number: This is the total number of co-occurrences of two terms in the entire document collection. For all term pairs, the co-occurrence number matrix is CN(t_i, t_j).

Average affinity value: Because the total affinity value of two terms is an accumulation over all the co-occurrences of the two terms, it is very important to know the average affinity value of each co-occurrence. Because, as stated before, the affinity value contains two aspects of the terms' relationship to each other, a spatial relationship and a temporal relationship, when the affinity value is used to build the affinity matrix for clustering analysis, it is appropriate to separate the spatial and temporal relationships. Therefore the definition of the average affinity value is:

AVE(t_i, t_j) = A(t_i, t_j) / CN(t_i, t_j)    if CN(t_i, t_j) ≠ 0
AVE(t_i, t_j) = 0                              if CN(t_i, t_j) = 0    (2.7.2)

The detailed algorithm that realizes this definition is as follows:

1. Each token is evaluated from left to right, and as each token is evaluated, it is treated as the pivot point.
2. The term is then compared to all succeeding tokens in the next N sentences (including the one it is in), or until the end of the document.
3. The affinity measure is only given to term pairs which consist of two different terms.
4. When a co-occurrence of t_i and t_j is found, (N - n) / (N * D) is calculated and accumulated into Affinity(t_i, t_j).
5. A co-occurrence is added to CN(t_i, t_j) to record the co-occurrence number of the two terms.
6. Sentences are separated by a period, an exclamation mark, or a question mark.
7. The program runs the scan process from the beginning to the end of each document.
8. After processing all the documents, the program calculates the average affinity value matrix AVE(t_i, t_j), based on the affinity value matrix Affinity(t_i, t_j) and the co-occurrence number matrix CN(t_i, t_j).

The following is an example of the calculation of the affinity value and the average affinity value. Assume there is one document, Doc1, with the following content:

Doc1: "A B. B A." (where A and B are identified terms)

In calculating the affinity value, N = 2 is selected as the parameter.

Pair | Affinity                              | CN (number of co-occurrences) | Average Affinity
A-A  | 0                                     | NA                            | 0
A-B  | 1 + 0.25 = 1.25; 1.25 + 1.25 = 2.5    | 1 + 1 = 2; 2 + 2 = 4          | 0.625
B-A  | 0.25 + 1 = 1.25; 1.25 + 1.25 = 2.5    | 1 + 1 = 2; 2 + 2 = 4          | 0.625
B-B  | 0                                     | NA                            | 0

Table 2.1 Example of affinity and average affinity calculation
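To make the definitions concrete, the following is a minimal Python sketch of formulas (2.7.1) and (2.7.2) and the scanning algorithm above (an illustrative reimplementation, not the thesis's Visual Basic module; the document representation and function names are assumptions). Run on the Doc1 example, it reproduces the values in Table 2.1.

```python
from collections import defaultdict

def affinity_matrices(documents, terms, N=2):
    """Compute Affinity(ti, tj), CN(ti, tj) and AVE(ti, tj).

    documents: list of documents, each a list of sentences, each a list of
    (already stemmed) tokens.  terms: the set of identified index terms.
    N: the sentence window used to evaluate co-occurrence.
    """
    affinity = defaultdict(float)   # accumulated affinity values A(ti, tj)
    cn = defaultdict(int)           # co-occurrence numbers CN(ti, tj)

    for doc in documents:
        flat, pos = [], 0           # (sentence index, document position, token)
        for s, sentence in enumerate(doc):
            for token in sentence:
                flat.append((s, pos, token))
                pos += 1
        for idx, (s_i, p_i, ti) in enumerate(flat):   # each token is a pivot
            if ti not in terms:
                continue
            for s_j, p_j, tj in flat[idx + 1:]:        # succeeding tokens only
                n = s_j - s_i                          # sentences between them
                if n >= N:
                    break                              # outside the window
                if tj == ti or tj not in terms:
                    continue
                d = p_j - p_i                          # terms between them, plus one
                w = 1.0 if n == 0 else (N - n) / N     # sentence factor W
                for pair in ((ti, tj), (tj, ti)):      # the measure is symmetric
                    affinity[pair] += w / d
                    cn[pair] += 1

    ave = {pair: affinity[pair] / cn[pair] for pair in affinity if cn[pair]}
    return affinity, cn, ave

# Doc1 = "A B. B A." with N = 2 reproduces Table 2.1:
doc1 = [["A", "B"], ["B", "A"]]
aff, cn, ave = affinity_matrices([doc1], {"A", "B"}, N=2)
print(aff[("A", "B")], cn[("A", "B")], ave[("A", "B")])   # 2.5 4 0.625
```

Crediting each co-occurrence to both orderings of the pair keeps the resulting matrix symmetric, as required in Section 2.7.1.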
2.8 Clustering Algorithms

In the following sections, several clustering algorithms are briefly reviewed. The following definitions, descriptions and algorithms are taken from "Data Clustering: A Review" (Jain and Murty, 1999).

2.8.1 Hierarchical Clustering

Hierarchical techniques produce a nested sequence of partitions, with a single, all-inclusive cluster at the top and single clusters of individual points at the bottom. Each intermediate level can be viewed as combining two clusters from the next lower level (or splitting a cluster from the next higher level). The results of a hierarchical clustering algorithm can be graphically displayed using a tree diagram, called a dendrogram. This tree diagram graphically displays the merging process and the intermediate clusters. The dendrogram in Figure 2.2 demonstrates how four points can be merged into a single cluster: P2 and P3 merge first into (P2, P3), which then merges with P4 into (P2, P3, P4), and finally merges with P1 into (P1, P2, P3, P4).

Figure 2.2 Example of hierarchical clustering

For document clustering, this dendrogram provides a taxonomy, or hierarchical index. There are two basic approaches to generating a hierarchical clustering:

Agglomerative: Begin with the points as individual clusters and, at each step, merge the most similar or closest pair of clusters. This requires a definition of cluster similarity or distance.

Divisive: Begin with one all-inclusive cluster and, at each step, split a cluster until only singleton clusters of individual points remain. In this case, each step requires decisions about which cluster to split and how to perform the split.

Agglomerative techniques are more common, and the traditional agglomerative hierarchical clustering procedure is summarized as follows:

Simple Agglomerative Clustering Algorithm
1. Compute the similarity between all pairs of clusters; i.e. calculate a similarity matrix whose ij-th entry gives the similarity between the i-th and j-th clusters.
2. Merge the most similar (closest) two clusters.
3. Update the similarity matrix to reflect the pair-wise similarity between the new cluster and the original clusters.
4. Repeat steps 2 and 3 until only a single cluster remains.
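The procedure above can be written down compactly. The following is a minimal Python sketch (an illustrative reimplementation, not the thesis's Visual Basic clustering module) of simple agglomerative clustering over an affinity matrix; average-link similarity between clusters is used as one reasonable choice for step 3, since the procedure itself does not fix the update rule. The sketch records every merge so that the clustering process, and not just the final partition, can be examined.

```python
def agglomerative_clustering(items, sim):
    """Simple agglomerative hierarchical clustering.

    items: list of terms.  sim: dict mapping (ti, tj) pairs to a similarity
    value (e.g. the average affinity values of Section 2.7).  Returns the
    list of merges, from first to last, so the dendrogram can be inspected.
    """
    clusters = [frozenset([t]) for t in items]

    def cluster_sim(a, b):
        # Average-link: mean pairwise similarity between members (an assumption;
        # the thesis text does not specify the update rule of step 3).
        pairs = [(x, y) for x in a for y in b]
        return sum(sim.get(p, 0.0) for p in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        # Step 2: find and merge the most similar (closest) pair of clusters.
        a, b = max(((c1, c2) for i, c1 in enumerate(clusters)
                    for c2 in clusters[i + 1:]),
                   key=lambda pair: cluster_sim(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((set(a), set(b), cluster_sim(a, b)))
    return merges

# Toy affinity values for the CRM example terms (illustrative numbers only).
terms = ["business", "intelligence", "data", "analysis", "system"]
sim = {("business", "intelligence"): 0.9, ("data", "analysis"): 0.8,
       ("intelligence", "data"): 0.3, ("analysis", "system"): 0.4}
sim.update({(b, a): v for (a, b), v in sim.items()})   # keep it symmetric
for left, right, s in agglomerative_clustering(terms, sim):
    print(left, "+", right, round(s, 3))
```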
2.8.2 Partitional Clustering

In contrast to hierarchical techniques, partitional clustering techniques create a one-level (un-nested) partitioning of the data points. If K is the desired number of clusters, partitional approaches typically find all K clusters at once. This is in contrast to traditional hierarchical schemes, which bisect a cluster to get two clusters or merge two clusters to get one. Of course, a hierarchical approach can be used to generate a flat partition of K clusters, and likewise, the repeated application of a partitional scheme can provide a hierarchical clustering. K-means is based on the idea that a center point can represent a cluster. In particular, K-means uses the notion of a centroid, the mean or median point of a group of points. Note that a centroid almost never corresponds to an actual data point.

2.8.3 Selecting a Clustering Algorithm

Of the many clustering algorithms, the agglomerative hierarchical clustering (AHC) algorithm has been selected and implemented in the system being developed. Because AHC iteratively merges terms or clusters together in the order of their affinity value, terms are clustered into a treelike structure in which each leaf cluster corresponds to an identified term of the specific document collection. The intermediate nodes that are close to the leaves represent simpler (general) term groups (concepts), while the intermediate nodes that are close to the root represent more complex (specific) term groups (concepts) within the specific document collection.

Using the previous example, concepts are identified in a specific domain document collection, "CRM technology, strategy, implementation, and products." The terms "business," "system," "data," "intelligence," and "analysis" are first identified as important terms. If they are placed into a treelike structure, they are at the leaves of the tree (see Figure 2.3). Agglomerative hierarchical clustering provides a more flexible method of understanding and interpreting the structure of the concepts, inasmuch as a group of concepts can be merged into a more complex and domain-specific concept. In this example, "business" first merges with "intelligence" to form a domain-specific concept: "business intelligence." "Data" merges with "analysis" to form a complex concept: "data analysis." Finally, all the terms merge to form a complex domain-specific concept: "business intelligence, data analysis system," which represents a very important knowledge base in the specific domain. In this way, not only can the final clustering outcome be examined, but also the clustering process, which can be used to evaluate the concept identification (creation) process; this is also highly significant and valuable in the knowledge recognition/development process.

Figure 2.3 Agglomerative hierarchical clustering example (the terms "business," "intelligence," "data," "analysis," and "system" merge into the concept "business intelligence, data analysis system")

2.9 Identifying Concepts

The last phase of the current approach is closely linked to the clustering process. Up to this point, domain-specific knowledge has not been used in the concept identification effort. At this step, however, expert knowledge is introduced to convert the results of the last phase (term groups) into final output concepts. Also, because agglomerative hierarchical clustering stops only when a single cluster remains, expert knowledge must be incorporated to determine the stopping point for each cluster. For example, regarding the above cluster, expert knowledge is necessary to decide whether it should merge with other clusters or stop at this point. Expert knowledge is also needed to summarize the concepts of each cluster of terms. The currently proposed approach involves the submission of the output clusters of each merging step of the clustering process to four individual judges, who rate them on a five-point scale according to their relevance to the specific domain. (The details of the measurement will be introduced in the "Evaluation" section below.) Subsequently, the top ten term clusters (based on their scores) are selected and summarized into concepts as the final output of the system.

3. System Implementation

To create a visual presentation of the entire concept identification process, which includes stop-word removal, tokenizing, statistics collecting, threshold determination, affinity measurement calculation, and clustering, a Windows-based program with a GUI interface has been developed using Visual Basic 6.0 (see Figure 3.1).
Figure 3.1 Screenshot of the application window

Users can load documents into the system, remove stop-words, use the Porter algorithm to stem words into tokens, retrieve important statistics about the document corpus (e.g. document frequency and word frequency), set different thresholds based on the aforementioned statistics to filter out unimportant tokens, calculate affinity matrices based on the newly developed affinity measures, and, finally, put important tokens into several clusters using the selected clustering algorithm. The complete process can be monitored in the application's display window.

Figure 3.2 A document loaded into the application

Figure 3.2 displays a document named "D:\papers\02.txt" loaded into the application. Figure 3.3 then illustrates that "D:\papers\02.txt" has been tokenized, meaning that it has gone through the stop-word removal, stemming and phrase formation processes. The new file is saved as "D:\papers\02.tok."

Figure 3.3 The document has been tokenized

The distributable program is 2.3 MB in size, and can be executed on Windows 98/2000/NT/XP platforms. It consists of a single-form interface and five general modules:

1. The stop-word removal module. This uses the list of stop words from the Rainbow toolkit package (http://www-2.cs.cmu.edu/~mccallum/bow/), which includes 545 common stop words.
2. The stemming module. This uses the Porter stemming algorithm (1997) to remove suffixes.
3. The statistics module. This records important statistics for all terms in the document collection, including the total number of terms, the total frequency of each term, and the proportion of documents in which each term appears. The last two statistics are used to identify important index terms.
4. The affinity measure module. This calculates the affinity value between each pair of index terms based on the newly developed measure. The results are saved in a file in term-term matrix format.
5. The clustering module. The agglomerative hierarchical clustering algorithm has been adopted (please refer to the previous section for details). The final output is saved in a file.

4. Evaluation
4. Evaluation

The ultimate goal of the concept identification task is to obtain a set of clusters that, as precisely as possible, identifies and organizes all of the significant concepts in the domain of interest. Consequently, the quality of the output clusters themselves must be evaluated. Supposing that the correct set of concepts and their members is given, the quality of the output clusters can then be evaluated by comparing them against these reference concepts. However, if the pre-defined concepts are found only in specialized fields requiring advanced technical knowledge, the concepts cannot be obtained beforehand, and a subjective evaluation method must be adopted to test the concept identification system.

4.1 Measurements

4.1.1 Relevance score

A straightforward approach to measuring outcomes involves the recruitment of a group of judges to look at each set of clusters and to identify concepts. Using the results of this exercise, the mean score of every cluster and the correlation among the concepts generated by the judges can be calculated. Higher correlation among judges should indicate better concept identification results. Although this approach is rather subjective and time-consuming, as well as being dependent on the judges' domain knowledge and ability to use the English language, it seems to be the only viable evaluation method available at present. For the present experiments, four individual judges have been employed, all of whom are IT professionals working for high-profile technology companies, with an average of over six years of experience in network and MIS system technology and implementation. A five-point scale measuring instrument has been adopted to assess the results, with a rating of 5 denoting that the posited correlations are highly reasonable, a rating of 1 indicating that proposed pairs are nonsensical, and a rating of 3 indicating a moderate amount of correlation between the items in a pair. For example, one clustering iteration on corpus one (CRM) merges the tokens "communication" and "channel." This may be rated "highly relevant," while another iteration, in which the tokens "enterprise" and "planning" are merged, may be rated 4 by some judges - still "relevant," but less relevant than "communication channel." In contrast, the tokens "make" and "expect," when merged, may be rated 1 or "totally irrelevant." Through this process, the average "relevance" value can be examined, the point at which the "relevance" value for a cluster begins to decline (an indicator to prevent a particular cluster from continuing to merge with other clusters) can be located, and the trends of the average "relevance" value over the clustering process (the number of terms per cluster compared to the average relevance score) can be compared.

4.1.2 Average Within-Cluster Similarity

The average similarity value within a cluster is an objective measurement that indicates the quality of each cluster, calculated as the average affinity value over all members of the cluster. For a cluster "A," ti and tj are two distinct terms in cluster "A," and N is the number of unique terms in cluster "A." The number of pairs in cluster "A" can thus be calculated using the following formula:

    CO(N) = N * (N - 1) / 2    (4.1.1)

The average affinity value within a particular cluster can be defined by the following formula:

    AveSim(cluster A) = Σ(ti, tj) Affinity(ti, tj) / CO(N)    (4.1.2)

where Σ(ti, tj) Affinity(ti, tj) is the accumulated affinity value over all unique combinations of ti and tj (ti ≠ tj), and CO(N) is calculated using formula 4.1.1.
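Formulas 4.1.1 and 4.1.2 translate directly into a few lines of code. The Python sketch below is an illustration only, not the thesis implementation; the affinity lookup is the same hypothetical term-term dictionary assumed earlier.

    def average_within_cluster_similarity(cluster, affinity):
        # cluster: a set of unique index terms; affinity: dict of frozenset pairs -> affinity value.
        terms = sorted(cluster)
        n = len(terms)
        pairs = n * (n - 1) // 2                      # CO(N) = N*(N-1)/2, formula 4.1.1
        if pairs == 0:
            return 0.0                                # a single-term cluster has no pairs
        total = sum(affinity.get(frozenset({terms[i], terms[j]}), 0.0)
                    for i in range(n) for j in range(i + 1, n))
        return total / pairs                          # AveSim, formula 4.1.2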
This measurement is based on the distance between terms in the cluster, term co-occurrence in the cluster, and the clustering method.

4.2 What to Measure

As stated before, because the agglomerative hierarchical clustering algorithm provides a more flexible method of understanding and interpreting the structure of concepts, the process of clustering represents the process of recognizing and developing knowledge. Each step in the clustering process generates a cluster of terms, although the number of terms in the cluster may vary widely. The challenge then is to test whether the cluster (of terms) can become a potential concept for the specific domain. When evaluating the concept identification system, each step of the clustering process is evaluated until the subsequent merging steps cease to be reasonable. Based on the results of the clustering process, the top ten judge-identified concepts (those clusters with the highest relevance scores) and the top ten computer-identified concepts (those clusters with the highest average within-cluster similarity values) are selected as the final concept identification outputs. Limiting the number of concepts to ten may be arbitrary, but it is generally appropriate given the size of the document collections, the number of index terms selected, and the need of human beings to work with a manageable number of concepts.

4.3 Questions and Tests

The following are several questions and tests that were posed before conducting an empirical study of the proposed concept identification system.

Question 1: Do different selection thresholds for index terms have different impacts on concept identification outputs?

Lower thresholds for index term selection lead to the selection of more terms for the clustering process; therefore, it is logical to pose the following test:

Test 1a: Do different selection thresholds for index terms (document frequency and term frequency) affect the recall of concept identification (a measurement of the completeness of concept identification)? The rationale for this question is that an increased quantity of selected terms should lead to the identification of more concepts.

Test 1b: Do different selection thresholds for index terms (while maintaining a certain minimal number of index terms) have any impact on the precision of identifying the top ten concepts (a measure of accuracy, where accuracy requires that a certain number of the top ten concepts identified by the approach are correct reference concepts)?

The rationale behind this question is that the number of documents in which a given term appears and the frequency of that term across the entire set of documents are, generally speaking, correlated (please refer to the token list table, Appendix C), and thus when the thresholds are changed (e.g. expressed as a percentage of the highest term frequency and a percentage of the total number of documents in the corpus), the important terms at the top of the term list (those with both high document frequency and high term frequency) will not change much. In addition, because the affinity value is linear in nature and does not take into account interactions with other, relatively infrequent terms, the affinity value between two terms will not change as the number of index terms changes.
Because the important terms and the affinity values between them remain the same, and because the hierarchical clustering algorithm is also linear in nature, the concept identification output for those terms should remain the same.

Question 2: Does the application of a different N factor (the size of the window of co-occurrence) in the affinity value definition have an impact on concept identification outputs?

The sentence factor is an empirically decided factor that determines the size of the window within which two terms are considered to have affinity. In theory, when a large N is used in the affinity definition (meaning that more sentences are considered), it tends to discover the semantic relationships among terms; alternatively, when a small N is chosen (and fewer sentences are considered), it tends to discover the phrasal relationships among terms. Therefore it is logical to conduct the following tests:

Test 2a: Will semantically relevant concepts (concepts with more terms) have a higher score than phrasally relevant concepts (concepts with fewer terms) if a larger N factor is used to measure affinity values?

Test 2b: Will semantically relevant concepts (concepts with more terms) have a lower score than phrasally relevant concepts (concepts with fewer terms) if a smaller N factor is used to define affinity values?

Question 3: Is the proposed concept identification approach more suitable for document collections with narrowly defined domains than for document collections with less narrowly defined domains?

The rationale behind this question is based on the possibility that the concept identification framework is better suited to document collections with narrowly defined domains. First, these specialized document collections have limited and focused vocabularies, so the proposed method of selecting index terms (using Document Frequency and Term Frequency as indicators) should perform well. Second, specialized document collections are often more structured, so the affinity value definition should perform well in the clustering process and subsequently generate useful clustering outputs (concepts).

5. Experiments and Discussion

5.1 Experimental Corpus

The experiments conducted for this study used three groups of documents. The first corpus of documents was taken from www.searchCRM.com, which contains 23 white papers in the domains of CRM (Customer Relationship Management) system technology, strategy, markets and implementation. The second corpus of documents was taken from the UBC library website, www.library.ubc.ca, and consisted of publications by the Gartner Group: 30 white papers on VPN (Virtual Private Network) technology, markets and product reviews. The third corpus of documents was also taken from www.library.ubc.ca, and consisted of other Gartner Group publications, including 20 white papers on SAN (Storage Area Network) technology reviews, product reviews and market trends. All the statistics of the experimental documents are summarized in the following table, and detailed indexes (title, author, publication date, etc.) for all of the documents in the experiments are provided in Appendix A.

             Number of   Average Length of      Number of       Highest Token
             Documents   the Document (words)   Unique Tokens   Frequency
    Corpus 1     23            1881                 3169            1260
    Corpus 2     30            1794                 2496            1370
    Corpus 3     20            1603                 2245             612

Table 5.1. Statistics of experimental documents
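The corpus statistics in Table 5.1, together with the per-term document frequency and total frequency used later as selection criteria, come from a single counting pass over the tokenized files. The following Python sketch shows one plausible way to gather them; it is an assumption-laden illustration (documents are assumed to have already been stop-worded and stemmed into whitespace-separated tokens, as in Section 3), not the thesis code.

    from collections import Counter

    def corpus_statistics(tokenized_docs):
        # tokenized_docs: list of strings, one per document, containing stemmed tokens.
        total_freq = Counter()     # total frequency (TF) of each token across the corpus
        doc_freq = Counter()       # number of documents in which each token appears (DF)
        lengths = []
        for doc in tokenized_docs:
            tokens = doc.split()
            lengths.append(len(tokens))
            total_freq.update(tokens)
            doc_freq.update(set(tokens))
        return {
            "number_of_documents": len(tokenized_docs),
            "average_length": sum(lengths) / len(lengths) if lengths else 0,
            "unique_tokens": len(total_freq),
            "highest_token_frequency": max(total_freq.values()) if total_freq else 0,
            "total_freq": total_freq,
            "doc_freq": doc_freq,
        }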
5.2 Selection Criteria and Final Concept Identification Output

As previously mentioned, index tokens were selected for the clustering process based on two criteria: total frequency (TF) and document frequency (DF). Because there is a lack of established standards for selecting index terms for concept identification tasks, researchers have often relied on personal experience and common sense in their studies. One example is Chen & Schatz's "Semantic Retrieval for the NCSA Mosaic" (1999), which proposed an approach that would first identify important concepts in the specific domain (their use of "concept" means a term, unlike the concept definition used in the current study). They then applied the concepts to build a concept space model and to complete the semantic retrieval task. They arbitrarily chose certain thresholds (which can be altered by the user) for Document Frequency and Term Frequency to eliminate concept descriptors with low frequency.

The selection of index terms for concept identification determines how many terms are included in the clustering process, which may subsequently affect the final results of the process. In order to determine the effects of index term selection on the concept identification outputs, a sensitivity study has been conducted for the current paper, using different combinations of Document Frequency measures (expressed as a percentage of the total number of documents in the corpus) and Total Frequency measures (expressed as a percentage of the frequency of the most frequent token in the corpus) as criteria to select index terms, and examining the resulting top ten computer-identified and expert-identified concepts. Due to the complexity of the computation, this study was conducted on the second corpus only (VPN technology, products, and markets). The corpus contained 30 white papers with 2496 different tokens; the highest token frequency in the corpus was 1370 (VPN).

                               Total Frequency (%)
    Document Frequency (%)   2% (27)   3% (42)   4% (55)
    30% (9)                    221       134       102
    50% (15)                   149       101        82

Table 5.2: Combinations of document frequency and total frequency, compared to the number of selected tokens

In the above table, the first row shows the different thresholds of Total Frequency: two percent of the highest term frequency, 1370, is 27, so the selected index terms must have a term frequency of at least 27, and so on. The first column shows the different thresholds of Document Frequency: the number of documents in the corpus is 30, and 30 percent of 30 is nine, which means that a selected index term must appear in at least nine documents in the corpus. The other cells show the number of selected index terms that satisfy the two criteria.
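The two-threshold filter described above amounts to a pair of cutoffs applied to the collected statistics. The following is a minimal Python sketch, assuming the total_freq and doc_freq counters from the statistics step; the default percentages correspond to the first cell of Table 5.2 and are assumptions for illustration only.

    def select_index_terms(total_freq, doc_freq, num_docs, tf_pct=0.02, df_pct=0.30):
        # Keep a token only if its total frequency is at least tf_pct of the highest
        # token frequency AND it appears in at least df_pct of the documents.
        tf_cutoff = tf_pct * max(total_freq.values())
        df_cutoff = df_pct * num_docs
        return sorted(t for t in total_freq
                      if total_freq[t] >= tf_cutoff and doc_freq[t] >= df_cutoff)

For the second corpus (highest token frequency 1370, 30 documents) these defaults correspond to cutoffs of 27 occurrences and 9 documents, which Table 5.2 reports as selecting 221 tokens.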
The following table exhibits the results of the concept identification task for the first group (document frequency = 30 percent, total frequency = 2 percent). The results of the other five groups can be found in Appendix A (Tables 1 to 6).

    No.  Tokens                           Average within-cluster similarity   Grade
    1    servic, tim                      1.0                                 4.5
    2    system, cisco                    1.0                                 4.5
    3    access, technologi               1.0                                 4
    4    encryption, wan                  1.0                                 4
    5    ipsec, standard                  1.0                                 4.75
    6    privat, virtual                  1.0                                 4.75
    7    sourc, destin                    1.0                                 4
    8    authent, commun                  1.0                                 4
    9    limit, number                    1.0                                 4
    10   chang, subject                   1.0                                 4

Table 5.3a: Document Frequency (30 percent) and Total Frequency (two percent): the top ten concepts identified by the computer

    No.  Tokens                           Average within-cluster similarity   Grade
    1    privat, virtual, network         0.58654                             5
    2    router, configur, criteria       0.33333                             5
    3    ipsec, standard                  1.0                                 4.75
    4    simultan, encrypt, session       0.50163                             4.75
    5    internet, connect, traffic       0.54857                             4.75
    6    client, softwar, system, cisco   0.38153                             4.5
    7    number, limit, packet            0.46564                             4.5
    8    offic, small, central            0.47862                             4.25
    9    gatewai, connec, enterpr         0.32516                             4.25
    10   end, user, shar, devic           0.28767                             4.25

Table 5.3b: Document Frequency (30 percent) and Total Frequency (two percent): the top ten concepts identified by experts

Table 5.3a lists the top ten computer-identified concepts for the second corpus (VPN technology, products and markets). The ranking is based on the average within-cluster similarity value according to the output of the clustering program, which groups terms based on their affinity values. The average within-cluster similarity value is the average affinity value between all the members of a cluster. It can be calculated using the following formulas:

    AveSim(cluster A) = Σ(ti, tj) Affinity(ti, tj) / CO(N)    (4.1.2)
    CO(N) = N * (N - 1) / 2    (4.1.1)

Please refer to the detailed definition of average within-cluster similarity in Section 4 for the explanations of CO(N), AveSim(cluster A), and Σ(ti, tj) Affinity(ti, tj). The "Grade" column is the average score of each identified concept given by the expert judges, to provide an empirical and quantifiable measurement. To measure the performance of the clustering process and the overall performance of the proposed concept identification framework, four judges were enlisted, and a five-point scale measuring instrument was adopted, as discussed in Section 4.1.1. In sample results from corpus one (CRM system implementation, technology and strategies), two tokens, "communication" and "channel," were merged to form the concept "communication channel," which could be rated 'highly relevant.' Another concept, in which the tokens "enterprise" and "planning" were merged, could be rated 4 - still relevant, but less relevant than "communication channel." If, for example, the tokens "make" and "expect" are merged, the result may be rated 1, or 'totally irrelevant.' Because an agglomerative hierarchical clustering algorithm was used, smaller clusters could be merged with others to form larger clusters in subsequent iterations. For example, in corpus two (VPN technology, markets, and product reviews), the tokens "virtual" and "private" were merged into the concept "virtual private," which in itself may not be very reasonable, but later they merged with the token "network" to form a very relevant concept: "virtual private network."

Table 5.3b lists the top ten expert-identified concepts ranked by their average score. These concepts have more semantic meaning than other concepts do, and they are more specific to the narrowly defined domain. They are the concepts that the concept identification experiments are designed to identify. When the test results were submitted to the four individual judges, the following information was provided:

1. A list of the titles of all documents in the three document collections. For the first document collection (CRM), brief abstracts for all of the documents, as shown in Appendix B, were also provided.
2. An overview of the research approach for this study, and a statement of the study's goals.
3. The final output from the clustering process, including all of the term clusters whose number of terms does not exceed ten.
Based on the information provided, the judges were asked to give the following responses:

1. Grade all term clusters using the aforementioned "relevance" score measure.
2. For the clusters graded better than the neutral (3) score, provide a brief description of the cluster (i.e. what the judges think the concept of the cluster is), if it is not self-descriptive.

In order to present the effects of index term selection on the concept identification task, two matrices are used to present the data. As detailed in the Appendix, the concept identification results of all six groups are listed in the same format as Table 5.3. Then the top ten expert-identified concepts across all six groups are noted. The resulting concepts can be regarded with a certain degree of satisfaction, as they are the ten most relevant concepts for the second corpus (VPN technology, markets, and product reviews). They are listed in the following table, recovered from their token format and returned to their original word forms. The sequences of the terms in the clusters are also adjusted accordingly.

    No.  Concepts                                      Average     Max Average   Mean Average   Std. Dev. of         Average
                                                       Similarity  Similarity    Similarity     Average Similarity   Grade
    1    Virtual, private, network                     0.58654     0.83333       0.4507722      0.1755389            5
    2    Additional, enterprise, router, gateway       0.19980     0.72222       0.3788553      0.1422380            4.75
    3    IPSEC, standard                               1.0         1.0           0.6479464      0.2987877            4.75
    4    Small, office, remote, internet, connection   0.33377     0.64167       0.2687632      0.0665883            4.75
    5    Limit, number, packet                         0.46564     0.83333       0.4507722      0.1755389            4.5
    6    Simultaneous, session, encryption             0.50163     0.83333       0.4507722      0.1755389            4.5
    7    CISCO, software, system, check, point         0.35971     0.64167       0.2687632      0.0665883            4.5
    8    Integrate, secure, authentication, protocol   0.31550     0.72222       0.3788553      0.1422380            4.5
    9    Complete, operation, data                     0.31258     0.83333       0.4507722      0.1755389            4.5
    10   End, user, sharing, device                    0.28767     0.72222       0.3788553      0.1422380            4.25

Table 5.4: Top ten expert-identified concepts of Corpus 2

The next table (Table 5.5) demonstrates how many of the top ten concepts each experiment identified. A perfect match scores 100 percent; a partial match scores a percentage between zero and 100 percent, depending on the degree of similarity in semantic meaning; no match scores zero. For all partial matches, the number of correct terms identified is divided by the total number of terms in the reference concept to determine the appropriate percentage (a small code illustration of this calculation follows Table 5.5). For example, "virtual private" would be a 67 percent match of the top concept "virtual private network." The last column of the following table shows the average hit rate (averaged over all ten reference concepts) of each experiment.

                          C1    C2    C3    C4    C5    C6    C7    C8    C9    C10    Ave
    E1 (DF=30%, TF=2%)   100%   67%  100%   67%  100%  100%   50%    0     0   100%   68.4%
    E2 (DF=30%, TF=3%)   100%  100%   33%   50%  100%  100%   50%    0   100%  100%   73.3%
    E3 (DF=30%, TF=4%)   100%  100%   33%   50%  100%  100%   67%    0   100%  100%   75%
    E4 (DF=50%, TF=2%)   100%  100%  100%  100%  100%  100%  100%    0     0   100%   80%
    E5 (DF=50%, TF=3%)   100%  100%   33%  100%    0     0   100%  100%  100%  100%   73.3%
    E6 (DF=50%, TF=4%)   100%  100%   33%  100%    0     0   100%  100%  100%  100%   73.3%

Table 5.5: Matrix of the sets of top ten concepts identified in six experiments (E1 to E6 refer to the six experiments, C1 to C10 refer to the ten concepts)

As this table reveals, all the experiments have identified a majority of the top ten concepts.
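The per-concept hit rate in Table 5.5 is simply the fraction of a reference concept's terms recovered by an identified cluster. A small Python illustration follows; the term lists are examples taken from the discussion above, not additional data.

    def hit_rate(identified, reference):
        # Perfect match -> 1.0; partial match -> fraction of reference terms recovered; no match -> 0.0.
        return len(set(identified) & set(reference)) / len(set(reference))

    # "virtual private" recovers 2 of the 3 terms of "virtual private network", i.e. about 67%.
    print(hit_rate(["virtual", "private"], ["virtual", "private", "network"]))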
Furthermore, when more index terms are used, the concept identification is more accurate, with the exception of Experiment 4 (document frequency = 50%, total frequency = 2%). Overall, however, there is not much difference between the experiments. This indicates that in identifying the most relevant concepts of a specific domain of documents, as long as a minimal number of index terms are selected, the addition of more index terms will not improve the performance of the proposed concept identification approach. In addition, it is clear that an increased quantity of index terms included in the clustering process will identify more concepts. However, because the affinity value definition only considers terms co-occurring within a certain range, beyond a minimal threshold, even if more index terms are selected for the clustering, their affinity values with other terms will be very small. Therefore the inclusion of those terms will not improve the performance of the concept identification approach; on the contrary, it may degrade the performance by introducing noise into the clustering process.

A review of the top ten computer-identified concepts in each experiment reveals an even greater similarity across all six experiments. The following table demonstrates how many of the top ten computer-identified concepts are the same in each pair of experiments.

    # of common concepts   E1 (30%, 2%)   E2 (30%, 3%)   E3 (30%, 4%)   E4 (50%, 2%)   E5 (50%, 3%)   E6 (50%, 4%)
    E1 (DF=30%, TF=2%)          -
    E2 (DF=30%, TF=3%)          6              -
    E3 (DF=30%, TF=4%)          5              8              -
    E4 (DF=50%, TF=2%)         10              6              5              -
    E5 (DF=50%, TF=3%)          6              9              8              6              -
    E6 (DF=50%, TF=4%)          5              8             10              5              8              -

Table 5.6: Top ten computer-identified concepts across six experiments

In the above table, it is evident that the top ten computer-identified concepts are more sensitive to Total Frequency (TF) than to Document Frequency (DF), and therefore E1 and E4, E2 and E5, and E3 and E6 have almost the same top ten computer-identified concepts.

5.3 The N Factor in Affinity Value Definition and Concept Identification Outputs

In Section 2.7.2, a new affinity value was defined based on the following formula:

    A(ti, tj) = Σ W / D(ti, tj)

where D(ti, tj) is the distance between term ti and term tj for a given co-occurrence, Σ means accumulation over all co-occurrences of terms ti and tj, and the weight W is defined as:

    W = 1              if ti and tj appear in the same sentence
    W = (N - n) / N    if n < N
    W = 0              if n >= N

Here N is the size of the window, in number of sentences, within which two terms are considered to co-occur (please refer to the definition in Section 2.7.2), and n is the number of sentences between the two terms plus one. The N factor is a significant parameter because it decides the size of the window within which two terms are considered to have affinity, and thus the choice of N may have an important effect on the final concept identification output (a code illustration of this computation is given below). In all the previous experiments, N was arbitrarily set at 5. Now, however, a sensitivity study can test the effect of different N values on the final concept identification outputs. Due to computational complexity, this experiment is performed only on the second corpus (VPN technology, markets, and product reviews), and a Document Frequency of 30 percent and a Total Frequency of 4 percent are used consistently. The following six tables present the top ten computer- and expert-identified concepts for the second corpus, with DF=30%, TF=4%, and N=1, 3, 5.
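The affinity definition above can be written out directly in code. The Python sketch below is an illustration of the formula as stated, not the thesis implementation; in particular, it assumes that a document is represented as a list of sentences (each a list of index terms) and that the distance D is measured in tokens, since the exact distance definition of Section 2.7.2 is not reproduced in this chapter.

    def affinity(doc_sentences, t1, t2, N=5):
        # doc_sentences: list of sentences, each a list of index terms (one document).
        # Accumulates W / D(t1, t2) over every co-occurrence of t1 and t2 inside the
        # N-sentence window, following the piecewise definition of W given above.
        occ = []                       # (sentence index, absolute token offset, token)
        offset = 0
        for s, sent in enumerate(doc_sentences):
            for w, tok in enumerate(sent):
                if tok == t1 or tok == t2:
                    occ.append((s, offset + w, tok))
            offset += len(sent)
        total = 0.0
        for s1, p1, a in occ:
            for s2, p2, b in occ:
                if a == t1 and b == t2 and p1 != p2:
                    n = abs(s2 - s1) + 1           # number of sentences between the terms, plus one
                    if s1 == s2:
                        w = 1.0                    # both terms in the same sentence
                    elif n < N:
                        w = (N - n) / N            # partial credit inside the window
                    else:
                        continue                   # n >= N: outside the window, W = 0
                    total += w / abs(p2 - p1)      # W / D for this co-occurrence
        return total

Accumulated over all documents in the collection, these values populate the term-term matrix consumed by the clustering module.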
    No.  Tokens                        Average within-cluster similarity   Grade
    1    vpn, protocol                 1.0                                 4
    2    enterpr, network              1.0                                 4
    3    enterpr, network, connect     1.0                                 4
    4    privat, virtual               1.0                                 4.75
    5    tunnel, require               1.0                                 2
    6    point, check                  1.0                                 4.5
    7    number, limit                 1.0                                 4.0
    8    system, cisco                 1.0                                 4.0
    9    access, technologi            1.0                                 4.0
    10   encryption, wan               1.0                                 4.25

Table 5.7a: DF (30%), TF (4%) and N (=1). Top ten computer-identified concepts

    No.  Tokens                               Average within-cluster similarity   Grade
    1    virtual, privat                      1.0                                 4.75
    2    enterpr, network, connect, traffic   0.75                                4.75
    3    simultan, encrypt, session           0.50163                             4.75
    4    secur, router, gatewai               0.33796                             4.75
    5    system, cisco, check, point          0.50555                             4.5
    6    number, limit, packet                0.66666                             4.5
    7    vpn, protocol, user, authent         0.61574                             4.5
    8    complet, oper, data                  0.37049                             4.5
    9    firewal, vendor, solution            0.54861                             4.5
    10   tunnel, requir, ipsec                0.5                                 4.25

Table 5.7b: DF (30%), TF (4%) and N (=1). Top ten expert-identified concepts

    No.  Tokens                 Average within-cluster similarity   Grade
    1    network, sit           1.0                                 3
    2    tunnel, require        1.0                                 2
    3    system, cisco          1.0                                 4
    4    access, technologi     1.0                                 4.75
    5    encryption, wan        1.0                                 4
    6    privat, virtual        1.0                                 4.75
    7    limit, number          1.0                                 4.5
    8    user, end              1.0                                 4.5
    9    secur, gatewai         0.75                                4.5
    10   remot, small           0.55556                             4

Table 5.8a: DF (30%), TF (4%) and N (=3). Top ten computer-identified concepts

    No.  Tokens                                   Average within-cluster similarity   Grade
    1    virtual, privat                          1.0                                 4.75
    2    secur, gatewai, connec, enterpris        0.51965                             4.75
    3    simultan, encrypt, session               0.50163                             4.75
    4    internet, connect, traffic               0.42460                             4.75
    5    system, cisco, softwar, check, point     0.35317                             4.5
    6    number, limit, packet                    0.52222                             4.5
    7    remot, small, office, connect            0.47076                             4.5
    8    complet, oper, data                      0.37049                             4.5
    9    enterpr, router, solution                0.23751                             4.25
    10   tunnel, encryption, wan, requir, ipsec   0.38683                             4.25

Table 5.8b: DF (30%), TF (4%) and N (=3). Top ten expert-identified concepts

    No.  Tokens                     Average within-cluster similarity   Grade
    1    system, cisco              1.0                                 4.5
    2    access, technologi         1.0                                 4.5
    3    encryption, wan            1.0                                 4
    4    privat, virtual            1.0                                 4.75
    5    limit, number              1.0                                 4
    6    encryption, wan, require   0.73333                             4
    7    remot, small               0.56667                             4
    8    internet, connect          0.54857                             4
    9    oper, data                 0.53636                             4
    10   user, end                  0.50357                             4

Table 5.9a: DF (30%), TF (4%) and N (=5). Top ten computer-identified concepts

    No.  Tokens                                   Average within-cluster similarity   Grade
    1    privat, virtual, network                 0.58654                             5
    2    enterpr, router, addition, gatewai       0.19980                             4.75
    3    simultan, encrypt, session               0.50163                             4.75
    4    internet, connect, traffic               0.54857                             4.75
    5    softwar, system, cisco, check, point     0.38153                             4.5
    6    number, limit, packet                    0.46564                             4.5
    7    remot, small, office                     0.43642                             4.5
    8    complet, oper, data                      0.31258                             4.5
    9    end, user, shar, devic                   0.33568                             4.25
    10   tunnel, encryption, wan, requir, ipsec   0.36191                             4.25

Table 5.9b: DF (30%), TF (4%) and N (=5). Top ten expert-identified concepts

Table 5.10 extrapolates the data from the above tables to compare the average score and the average similarity of the top ten expert-identified concepts.

           Average Similarity   Average number of terms   Average Score
    N=1    0.579664             3.2                       4.575
    N=3    0.478686             3.4                       4.55
    N=5    0.413031             3.6                       4.575

Table 5.10: N=1, 3, 5. Average similarity, average number of terms and average score for the top ten expert-identified concepts
Average similarity is based on the average affinity value between each pair of terms, and thus when the average affinity is calculated, the value should be higher than if N=3 or N=5 The average scores of the top ten expert-identified concepts across three experiments are very close, suggesting that, for identifying the top ten concepts, a different value of N will not greatly affect the performance of the proposed concept identification system. The following table lists the number of identical concepts identified by experts in each pair of experiments. # of identical N=l N=3 N=5 concepts (expert-identified top ten) N=l N=3 7 N=5 6 7 Table 5.11. The number of identical concepts identified by experts across three experiments 53 Although the average scores of the top ten expert-identified concepts are very close across three experiments, the same concepts were not identified. From the above table, it is evident that the number of identical concepts is only 60 percent to 70 percent. When N=l, more information is generated regarding the phrasal structure of the terms, while when N=5, more information is given concerning the semantic structure of the terms. For this reason, although the average scores may be close, they actually identify different concepts. Table 5.10 lists the average number of terms per cluster for the top ten expert-identified concepts throughout all three experiments, indicating that as N increases, the average number of terms per cluster of the top ten concepts increases from 3.2 to 3.6. Although this may not offer conclusive evidence (because only the top 10 concepts are considered, the concepts with fewer terms have a higher average score than concepts with more terms, as discussed later in this chapter), the results suggest that using a larger N factor tends to identify more semantically relevant concepts, and using a small N factor tends to identify concepts that are more relevant to phrasing. Further study may be needed to investigate this point. 5.4 Characteristics of a Document Corpus and Concept Identification Outputs After all three document corpuses were processed, it became evident that the second corpus has a better average score (in terms of both top ten and overall concepts) than the first corpus, particularly in regard to clusters that have more tokens. The total average score is 3.9917. If the clusters that contain more than ten tokens are not counted, the average score is 4.072. 54 The reasons for this difference could be the following: first, the second group of documents are all produced by the same source, the Gartner Group; second, VPN is a narrower domain than CRM; and third, compared to CRM, VPN is more technical, hence the articles about it are more structured. Thus, the characteristics of the document corpus evidently might affect the output of the proposed concept identification system. i A sensitivity study was performed to compare corpus one (CRM system implementation, technology, and markets) with corpus two (VPN technology, markets, and product reviews). Again, due to computational complexity, DF=30%, TF=4% and N=l were set as parameters for the study. No. Tokens Average within-cluster similarity Grade 1 custom, bas 1.0 4.5 2 process, develop 1.0 4 3 cost, benefit 1.0 4 4 manag, relationship 1.0 4.5 5 market, commun 1.0 3 6 inform, provide 1.0 4.0 7 servic, sal(e) 1.0 4.0 8 Product, model 1.0 4.0 9 tim(e), respons 1.0 5.0 10 success, offer 1.0 4.0 Table 5.12a: C R M corpus (D=30%, F=4, N=l). Top ten computer-identified concepts •No. 
    No.  Tokens                                 Average within-cluster similarity   Grade
    1    management, relationship, software     0.6805                              4.75
    2    process, develop, strategy             0.8571                              4.5
    3    time, response, efficiency             0.6667                              4.5
    4    user, environment                      0.5                                 4.25
    5    effect, measure, knowledge             0.6667                              4.25
    6    custom, bas, tool                      0.15944                             4.25
    7    include, target, group                 0.61574                             4
    8    crm, operation, application, support   0.6667                              4
    9    data, integration, technology          0.5708                              4
    10   experience, information, provider      0.66667                             4

Table 5.12b: CRM corpus (DF=30%, TF=4%, N=1). Top ten expert-identified concepts

Above, tables 5.12a and 5.12b list the results for corpus 1 (CRM). The results for corpus 2 (VPN) are shown in tables 5.9a and 5.9b.

               Average score of top ten      Standard deviation of average score       Average similarity of top ten
               expert-identified concepts    for top ten expert-identified concepts    computer-identified concepts
    Corpus 1   4.25                          0.2635                                    1
    Corpus 2   4.575                         0.23717                                   1

Table 5.13: Average score and average similarity of top ten concepts for two corpuses

The next table delineates the characteristics of the two corpuses and the number of tokens selected using the aforementioned DF and TF values.

               Number of   Average length of      Number of       Highest term   Number of selected       Percentage of selected
               documents   the document (words)   unique tokens   frequency      tokens (DF=30%, TF=4%)   tokens relative to total tokens
    Corpus 1       23           1881                  3169           1260               99                      3.12%
    Corpus 2       30           1794                  2496           1370              102                      4.09%

Table 5.14: Statistics for corpus 1 and corpus 2

Table 5.14 indicates that corpus 2 has more documents than corpus 1, and its average document length is slightly shorter. On the other hand, corpus 2 has far fewer unique tokens than corpus 1, and a higher top term frequency. In effect, corpus 2 is defined and structured more narrowly than corpus 1. After the token selection criteria are applied, the numbers of selected tokens are very close (99 and 102). The results in table 5.13 demonstrate that the top ten expert-identified concepts in corpus 2 have a higher average score than those in corpus 1 (the mean and standard deviation indicate that the results are significant).

5.5 Overall Results

Figure 5.1: Highest affinity value at each merging step of clustering (corpus 1)

Figure 5.2: Highest affinity value at each merging step of clustering (corpus 2)

Because the agglomerative hierarchical clustering algorithm was used in the clustering process, the computer identifies the highest affinity value between any two clusters at each step and merges them into one cluster. The above two figures demonstrate the highest affinity value the computer used to merge two clusters at each step.

    Number of tokens in one cluster   Mean
    2                                 4.12
    3                                 3.96
    4                                 4.03
    5                                 3.56
    6                                 3.63
    7                                 3.47
    8                                 2.33
    9                                 2.33

Table 5.15: Average scores compared to the number of tokens in one cluster (corpus 1)

    Number of tokens in one cluster   Mean
    2                                 4.17
    3                                 3.96
    4                                 4.09
    5                                 3.86
    6                                 4.0
    7                                 3.6
    8                                 3.33
    9                                 3.2

Table 5.16: Average scores compared to the number of tokens in one cluster (corpus 2)

    Number of tokens in one cluster   Mean
    2                                 4.15
    3                                 4.06
    4                                 4.01
    5                                 3.76
    6                                 3.53
    7                                 3.17
    8                                 2.03
    9                                 2.34

Table 5.17: Average scores compared to the number of tokens in one cluster (corpus 3)

The above tables reveal the relationships between the number of tokens in one cluster and their average score. When a cluster includes more tokens, the related concept is, on average, more complex.
Thus, the above tables reveal that two-token concepts have the highest average score, and there is a tendency for average scores to decrease as the number of tokens grows. Therefore, the proposed concept identification approach is better suited to identifying simpler concepts (concepts with two, three, or four terms) than more complex concepts (those with more than five terms). The same conclusion is also supported by the following three figures.

Figure 5.3: Mean of average similarity compared to the number of tokens in a cluster (corpus 1: SAN technology)

Figure 5.4: Mean of average similarity compared to the number of tokens in a cluster (corpus 2: CRM marketing and technology)

Figure 5.5: Mean of average similarity compared to the number of tokens in a cluster (corpus 3: VPN technology)

The above three figures illustrate the correlation between average similarity and the number of tokens within a cluster (defined in Section 4.1.2) for all three corpuses. The bars in the figures represent the mean average similarity value of clusters with two, three, or four tokens, and so on, respectively. As the figures demonstrate, the clusters that contain two tokens have the highest average cluster similarity. In general, as the number of tokens per cluster grows, the mean of the average within-cluster similarity decreases. This holds true for all three corpuses. Based on the definition of average within-group affinity accepted for this study, as the number of terms in a concept grows, the average concept affinity (within-group affinity) must decrease. The tendency of average within-cluster similarity values to decrease as the number of tokens per cluster grows is consistent with the tendency of average scores (given by judges) to decrease as the number of tokens per cluster grows (please refer to Tables 5.15, 5.16, and 5.17).

6. Conclusions

A list of desired features for a concept identification system was included at the beginning of Section 2, as follows:

1. It should be easily portable across different restricted domains, and minimal specific knowledge of the domain should be required.
2. The system should be scalable.
3. The output should include concept identification results and information on the concept creation process.
4. The process should accommodate the optional injection of domain-specific knowledge to improve the final outcome.
5. If possible, the system should be language independent.

The concept identification system that has been developed in this study is not perfect, but it achieves most of the objectives set out at the beginning. It is easily portable across different domains, because it does not involve any specific domain knowledge until the last step (summarizing the concepts from term clusters). However, introducing domain knowledge into the clustering process will improve the clustering results and the overall concept identification. Domain knowledge can be used to determine the stopping point in the clustering process, inasmuch as the criteria for stopping points must be decided for the agglomerative hierarchical clustering for each output cluster.
Domain knowledge can also be used to adjust the clustering process: if some merging steps are semantically or conceptually unreasonable (judging by prior familiarity with the domain), the program can be modified to skip those steps. Finally, domain knowledge is necessary to interpret the final clustering outputs, and to summarize term clusters into concepts.

Based on the algorithm used, the concept identification system is scalable, because the calculation of word affinities has linear dependence on the length of the documents and the quantity of documents, and square dependence on the number of terms selected. Furthermore, because the clustering process depends only on the number of index terms selected, clustering will not be affected if the quantity of index terms to be considered is not increased. Therefore the system has no problem dealing with large document collections.

The final output of the system is the step-by-step term merging process. For the same reason, expert knowledge can be injected into the clustering process to determine when the merging process should be stopped and when subsequent steps are irrelevant. A good example can be found in Section 2 above, where the CRM corpus is processed and several terms are identified: "business," "intelligence," "data," "analysis," and "system." This cluster of five terms can be manually combined into a composite concept that is significant in discussions of CRM systems.

Based on the techniques and architecture used in the concept identification system, it is also possible to achieve portability across different languages. The last three steps are language independent, and techniques and methods are already available for automatic indexing in multiple languages such as Chinese, German, Japanese and Spanish.

6.1 Conclusions Drawn from Sensitivity Studies

Three sensitivity studies have been conducted for this paper, to generate empirical evidence to answer the three main questions and four sub-questions in Section 4 (Evaluation). The results of the studies are presented in Section 5, with further questions and tests suggested by the initial studies.

Question 1: Do different selection thresholds for index terms have different impacts on concept identification outputs?

Test 1a: Does the use of different selection thresholds for index terms (document frequency and term frequency) affect the recall of concept identification (a measurement of the completeness of concept identification)?

Test 1b: Does the use of different selection thresholds for index terms (as long as a minimal number of index terms are retained) have any impact on the precision of identifying the top ten concepts (a measure of the accuracy of identifying the top ten concepts)?

The results of the first sensitivity study, assessing selection criteria for index tokens against the concept identification output, reveal that in terms of identifying the top ten concepts from a document set, there is not much difference in the results across the six experiments. In effect, in the process of identifying the most relevant concepts from a specific domain of documents, as long as a minimal number of index terms are selected, the addition of more index terms will not improve the performance of the concept identification approach. Because the affinity value definition considers terms only if they co-occur within a certain range (e.g.
within the same sentence, or within a group of three or five sentences), after a minimal threshold (a minimal number of terms selected), even if more index terms are selected for the clustering, the affinity values among terms beyond that minimal number would be very small. Therefore the inclusion of those terms will not improve the performance of the concept identification approach; on the other hand, it may affect the performance by introducing distortions or noise into the clustering process.

Question 2: Does using a different N factor in the affinity value definition have a different impact on concept identification outputs?

Test 2a: Will semantically relevant concepts (concepts with more terms) receive a higher score than phrasally relevant concepts (concepts with fewer terms) if a larger sentence factor N is used in the affinity value definition?

Test 2b: Will semantically relevant concepts (concepts with more terms) score lower than phrasally relevant concepts (concepts with fewer terms) if a smaller sentence factor N is used in the affinity value definition?

The results of the second sensitivity study, analyzing the use of the sentence factor N in the affinity value definition and its relation to the outputs of the concept identification system, indicate that although the average scores of the top ten expert-identified concepts are very close across the three experiments, the experiments do not identify all of the same concepts. For the N=1 experiment, the average number of terms in the top ten concepts is 3.2, while for the N=3 experiment the number is 3.4, and for the N=5 experiment the number is 3.6. However, there is not enough variation between these numbers to provide definite evidence.

Question 3: Is the proposed concept identification approach more suitable for document collections with narrowly defined domains than for document collections with less narrowly defined domains?

The results of the third sensitivity study, assessing the particular characteristics of a document corpus against the outputs of the concept identification process, indicate that the top ten expert-identified concepts in corpus 2 (VPN technology, markets and product reviews) have higher average scores than those in corpus 1 (CRM strategy, implementation and technology). Thus, it is apparent that the system is indeed more suitable and accurate for document collections with narrow domains.

6.2 General Conclusions Regarding the Automatic Concept Identification Approach

This section presents some broad conclusions that can be drawn from the Automatic Concept Identification study. It was not clear at the outset of the study whether the deployment of newly developed techniques for clustering and for determining affinity values could effectively and efficiently address the goals of concept identification. The overall empirical results demonstrate that the measures and techniques applied in this approach point to a very promising concept identification framework, although there is still a lot of room for improvement. Although the experimental results varied among different document collections, the scores are all above average (3). This means that the overall concept identification results are positive, leading to several implications. First, deploying automatic text processing techniques from modern Information Retrieval systems (stop-wording, stemming, phrase extraction) provides the groundwork for automatic concept identification tasks.
Second, using unsupervised clustering techniques, particularly choosing the proper clustering algorithm to group individual index terms into term groups and subsequently to identify them as concepts, is a viable approach for automatic concept identification. Third, it is critical that a proper distance/similarity function between term pairs is developed for the clustering process. Using the newly developed affinity value to describe the similarity of terms and deploying it in the clustering algorithm has already yielded positive results in our concept identification experiments. Therefore, it should be possible to develop more mathematically accurate and informative affinity measures (e.g. employing certain weighting schemes to measure the effects of term frequency in word co-occurrence, or using non-linear functions to define the relationships among affinity, distance, co-occurrence, and windows of co-occurrence), which may perform better in concept identification tasks. Fourth, it is also possible to improve the performance of the concept identification system by incorporating new algorithms, techniques and approaches from the Information Retrieval and Information Extraction areas (e.g. a stemming algorithm, a clustering algorithm, or automatic indexing approaches).

The results also indicate that as the average number of terms in clusters grows, the average score becomes lower. This result may be explained by a tendency of the system to work better when identifying simple concepts (phrasal relationships), as longer lists of terms might not represent any specific concepts. Alternatively, it could be a problem with the measuring method: it is easier for judges to identify simpler concepts than more complex concepts, and the more difficult it is for them to identify a concept, the more likely they are to give up and give a lower score. The first problem can be addressed by developing a more advanced affinity value definition; for example, the sentence factor (N) can be increased so the affinity value can reveal more semantic relationships among terms. If the problem results from the measuring method, it can be addressed by deploying more objective measurements. For example, if a document collection with a certain number of pre-defined concepts is used, measures such as counting the correct concepts identified, precision, recall, entropy, and similarity matrices could be used.

7. Summary

7.1 Contributions of the Research

The research for this paper makes contributions to a few areas of study. First, it develops an automatic and scalable process for concept identification on unrestricted yet narrowly defined domain document collections. This has been the main purpose of the study, and the results from the empirical studies demonstrate that the proposed system delivers positive performance on concept identification tasks. Second, this study offers a means of reducing the use of domain-specific knowledge (such as taxonomies and ontologies) in concept identification. By deploying automatic text processing techniques and unsupervised clustering algorithms, the need for domain-specific knowledge in the concept identification system is minimized, which makes the system scalable (as it has the ability to deal with large document collections). Third, the study proposes a new measurement: the "Affinity" value between index terms. This measurement defines the relationships between each pair of index terms.
Fourth, the study provides empirical evidence regarding the effects of different parameters that may be used in the proposed approach on the final concept identification outcomes. Those parameters include index term selection thresholds (TF, DF), the size of the window within which two terms are considered to co-occur (N), and particular characteristics of the targeted document collections. Fifth, the research demonstrates that it is possible to reveal semantic relationships among relevant concepts for certain document collections.

7.2 Critique of the Research

Although the outcomes of the research are positive, there are still some aspects that need further improvement. The proposed approach lacks a strong theoretical foundation, although it seems reasonable to expect that the proximity of words to each other increases the chance that the words are connected as parts of a concept. This concept identification study faces problems similar to those encountered in information extraction and information retrieval research: there is no formal, strong and complete theoretical foundation that deals with the kinds of numbers generated in this process. In addition, there is no well-established, formal method for selecting index terms. Index term selection is a very important part of the entire concept identification framework, because the final outcomes, the identified concepts, are all based on the index terms that are selected. Furthermore, the newly developed affinity value lacks a strong theoretical foundation. Although the linear affinity value definition seems simple, straightforward and intuitive, there is no proof that it is the best way to describe the relationships between each pair of index terms. There is no proof that clustering based on affinity values is better for concept identification tasks than clustering based on the KL or mutual information methods. Added to the theoretical limitations, there is also a lack of objective measurement in the present study. The principal measurement used in the study (the average grade of an identified concept) is a subjective measure. It is constrained by the number of judges involved in the process, the domain knowledge of the judges, and the language skills of the judges. If a document collection with a certain number of pre-defined concepts is used, then objective measurements such as precision, recall, entropy, and similarity matrices could be used.

7.3 Future Research

Future research should aim at improving the performance of the concept identification system. Because document pre-processing and automatic index word identification have already been studied extensively by researchers in the field of information retrieval, it would be more productive to focus on developing a more elaborate affinity definition to better describe the similarity of index terms (the affinity value should be based on the distance and co-occurrence of terms, but certain weighting schemes could be used to assess the effect of term frequency, and the basic relationship could be non-linear), finding clustering algorithms that are better suited to the concept identification application, and discovering means of incorporating domain knowledge into the clustering process to achieve better results. For example, researchers might investigate methods of combining spatial clustering (based on average affinity values between pairs of terms) and temporal clustering (based on the co-occurrence of two terms) in the system.
The present approach, on the other hand, uses only spatial clustering (average affinity value), based on the average distance between two terms when they co-occur, in the clustering process. The value of temporal measures (number of co-occurrences) has been omitted because it is too sensitive to term frequency, according to the simple definition used in this study. For example, in the first document collection (CRM technology, strategies, implementation and products), the term "CRM" is the most frequent term in the whole collection, with a term frequency of 1260. This is close to 30 times the average term frequency (45) of all the chosen terms. If the temporal affinity measure is defined only by counting the incidence of co-occurrences, the affinity values between "CRM" and any other terms are always very high, and when such temporal affinity values are used in the clustering process, the results will always show one cluster growing and absorbing other terms. Therefore, in the proposed system, the affinity value is defined as an average affinity, which is the total affinity value divided by the number of co-occurrences of the particular terms. However, this definition ignores one important aspect of affinity, the co-occurrence of terms: if two words often (although not necessarily always) appear together, this indicates a certain relationship between them. The temporal aspects of relationships between terms are therefore also important for concept identification; however, a good definition has not yet been developed that calculates the temporal affinity value based on term co-occurrence while at the same time weighing out the impact of term frequency. There may be two solutions to that problem: first, a good weighting scheme could be identified for counting co-occurrences; and second, high-frequency terms such as "CRM" could be treated as limiting factors in the clustering process, omitted from the early phases of the process and re-introduced in subsequent steps. Another possible focus for future research could involve separation of the spatial (distance) and temporal (co-occurrence) aspects of the affinity value. Spatial clustering could be used first to identify more phrasally relevant concepts, and then temporal affinity clustering could be used to identify more semantically relevant concepts; alternatively, spatial and temporal affinities could be used in alternation at each step of the clustering process.

9. References

Allan, J., Carbonell, J., Doddington, G., Yamron, J., and Yang, Y. "Topic Detection and Tracking Pilot Study: Final Report." Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop. Lansdowne, VA, 1998.

Arai, K., Wright, J., Riccardi, G., and Gorin, A. "Grammar Fragment Acquisition Using Syntactic and Semantic Clustering." Speech Communication. 27(1), 1999.

Baeza-Yates, Ricardo and Ribeiro-Neto, Berthier. Modern Information Retrieval. Addison-Wesley, 1999.

Baxendale, P. B. "Machine-Made Index for Technical Literature - An Experiment." IBM Journal of Research and Development. 2(2): 1958. 354-361.

Brown, P., Della Pietra, V., deSouza, P., Lai, J., and Mercer, R. "Class-Based N-Gram Models of Natural Language." Computational Linguistics. 18(4): 1992. 467-479.
Cheeseman, P., Kelly, J., Self, M., Stutz, J., Taylor, W., and Freeman, D. "Autoclass: Bayesian Classification System." Proceedings of the Fifth International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann, 1988.

Chen, H. and Dhar, V. "User Misconceptions of Online Information Retrieval Systems." International Journal of Man-Machine Studies. 32(6): 1990. 673-692.

Chotimongkol, A. and Rudnicky, A.I. "Automatic Concept Identification in Goal-Oriented Conversations." Proceedings of ICSLP 2002. Denver, CO, 2002. 1153-1156.

Church, Kenneth and Hanks, Patrick. "Word Association Norms, Mutual Information, and Lexicography." Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics. 1989. 76-83.

Cowie, J., Guthrie, J. A., and Guthrie, L. "Lexical Disambiguation Using Simulated Annealing." Proceedings of the International Conference on Computational Linguistics. 1992. 359-365.

Dagan, I., Lee, L., and Pereira, F. "Similarity-Based Models of Word-Cooccurrence Probabilities." Machine Learning. 34(1-3): 1999. 43-69.

DeJong, G. "An Overview of the FRUMP System." Strategies for Natural Language Processing. Ed. W.G. Lehnert and M.H. Ringle. Lawrence Erlbaum Associates, 1982. 149-176.

Dhillon, I. and Modha, D. "A Data-Clustering Algorithm on Distributed Memory Multiprocessors." Proceedings of the KDD'99 Workshop on High Performance Knowledge Discovery. 1999.

Diday, E. and Simon, J. C. "Clustering Analysis." Digital Pattern Recognition. Ed. K. S. Fu. New York: Springer-Verlag, 1976.

Doszkocs, T.E., Reggia, J., and Lin, X. "Connectionist Models and Information Retrieval." Annual Review of Information Science and Technology (ARIST). 25: 1990. 209-260.

Dubes, R. and Jain, A. K. "Clustering Methodologies in Exploratory Data Analysis." Advances in Computers. Ed. M.C. Yovits. 19: 1980. 113-228.

Edmundson, H. "New Methods in Automatic Abstracting." Journal of the ACM. 16(2): 1969. 264-285.

Foote, J.T. and Silverman, H. F. "A Model Distance Measure for Talker Clustering and Identification." Proceedings of the 1994 IEEE International Conference on Acoustics, Speech, and Signal Processing. Adelaide, SA, 1994.

Frakes, W. and Baeza-Yates, R. (eds.). Information Retrieval. Englewood Cliffs, NJ: Prentice-Hall, 1992.

Fuhr, N. and Buckley, C. "A Probabilistic Learning Approach for Document Indexing." ACM Transactions on Information Systems. 9(3): 1991. 223-248.

Furnas et al. The Vocabulary Problem in Human-System Communication. New York: ACM Press, 1989.

Grishman, Ralph. "Information Extraction: Techniques and Challenges." Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology. Ed. Maria Teresa Pazienza. Lecture Notes in Artificial Intelligence, Volume 1299. Frascati, Italy: Springer International Summer School, 1997. 10-27.

Griffiths, A., Robinson, L.A., and Willett, P. "Hierarchic Agglomerative Clustering Methods for Automatic Document Classification." Journal of Documentation. 40: 1984. 175-205.

Hindle, Donald and Rooth, Mats. "Structural Ambiguity and Lexical Relations." Meeting of the Association for Computational Linguistics. 1991. 229-236.

Jain, A.K. and Dubes, R.C. Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice Hall, 1988.

Jain, A. K., Murty, M. N., and Flynn, P. J. "Data Clustering: A Review." ACM Computing Surveys. 31(3): 1999. 264-323.

Johnson, F. C., Paice, C. D., Black, W., and Neal, A. "The Application of Linguistic Processing to Automatic Abstract Generation." Journal of Document and Text Management. 1: 1993. 215-241.

Jonassen, D. Computers in the Classroom: Mindtools for Critical Thinking. New Jersey: Merrill, 1996.

Jonassen. Handbook of Research for Educational Communications and Technology. 1996.
Lin, Chung-Hsin and Chen, Hsinchun. "An Automatic Indexing and Neural Network Approach to Concept Retrieval and Classification of Multilingual (Chinese-English) Documents." IEEE Transactions on Systems, Man and Cybernetics. 26(1): 1996. 75-88. http://ai.bpa.arizona.edu/papers/chinese93/chinese93.html.
Lovins, J.B. "Development of a Stemming Algorithm." Mechanical Translation and Computational Linguistics. 11: 1968. 22-31.
Luhn, H. P. "The Automatic Creation of Literature Abstracts." IBM Journal of Research and Development. 2(2): 1958. 159-165.
Michalski, R.S. and Stepp, R.E. "Learning from Observation: Conceptual Clustering." Machine Learning, An Artificial Intelligence Approach. Ed. R.S. Michalski, J.G. Carbonell, and T.M. Mitchell. Palo Alto, CA: Tioga Publishing Company, 1983. 331-363.
Paice, C. D. and Jones, P. A. "The Identification of Important Concepts in Highly Structured Technical Papers." Proceedings of the 16th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval. Ed. R. Korfhage, E. Rasmussen, and P. Willett. Pittsburgh, 1993. 69-75.
Pollock, J.J. and Zamora, A. "Automatic Abstracting Research at the Chemical Abstracts Service." Journal of Chemical Information & Computer Science. 15(4): 1975. 226-232.
Porter, M.F. "An Algorithm for Suffix Stripping." Readings in Information Retrieval. Ed. Karen Sparck Jones and Peter Willett. San Francisco: Morgan Kaufmann, 1997.
Rasmussen, E. "Clustering Algorithms." Information Retrieval: Data Structures and Algorithms. Ed. W. B. Frakes and R. Baeza-Yates. Englewood Cliffs, NJ: Prentice Hall, 1992.
Rau, L.F., Jacobs, P.S., and Zernik, U. "Information Extraction and Text Summarization Using Linguistic Knowledge Acquisition." Information Processing & Management. 25(4): 1989. 419-428.
Reimer, Ulrich and Hahn, Udo. "An Overview of the Text Understanding System TOPIC." Linguistic Approaches to Artificial Intelligence. Ed. U. Schmitz, R. Schütz, and A. Kunz. Frankfurt: Lang, 1990. 305-320.
Rush, J.E., Salvador, R., and Zamora, A. "Automatic Abstracting and Indexing. II. Production of Indicative Abstracts by Application of Contextual Inference and Syntactic Coherence Criteria." Journal of the American Society for Information Science. 22(3): 1971. 260-274.
Salton, G. and McGill, M.J. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
Salton, G. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.
Salton, G., Allan, J., and Buckley, C. "Automatic Structuring and Retrieval of Large Text Files." Communications of the ACM. 37(2): 1994. 97-108.
Siu, K. and Meng, H. "Semi-Automatic Acquisition of Domain-Specific Semantic Structures." Proceedings of EUROSPEECH-99. Budapest, Hungary, 1999.
Skorochod'ko, E. "Adaptive Method of Automatic Abstracting and Indexing." Information Processing 71: Proceedings of the IFIP Congress 71. Ed. C. Freiman. North-Holland Publishing Company, 1973. 1179-1182.
Sparck Jones, Karen and Willett, Peter. Readings in Information Retrieval. San Francisco: Morgan Kaufmann, 1997.
Stepp, R.E. A Description and User's Guide for CLUSTER/2, a Program for Conceptual Clustering. Urbana, IL: Department of Computer Science, University of Illinois, 1983.
Taycher, L., La Cascia, M., and Sclaroff, S.
Image Digestion and Relevance Feedback in the ImageRover WWW Search Engines, Proceedings of Visual. San Diego, 1997. Wilks et al., "A Tractable Machine Dictionary as a Resource for Computational Semantics." Computational Lexicography for Natural Language Processing. White Plains, NY: Longman, 1989. [avoid using "et al." in the bibliography-all of the authors' names should be listed] Zechner, K. "Automatic Generation of Concise Summaries of Spoken Dialogues in Unrestricted Domains." SIGIR'01. New Orleans, 2001. 77 Appendix A: Top Ten Computer- and Expert-Identified Concepts for Corpus 1, 2, and 3 (Including Sensitivity Studies Results). Table 1: Document Frequency (30%) and Total Frequency (2%). Top ten concepts. Computer-Identified Expert-Identified No. Tokens Ave Sim Grade No. Tokens Ave Sim Grade 1 servic, tim 1.0 4.5 1 privat, virtual, network 0.58654 5 2 system, cisco 1.0 4.5 2 router, configur, criteria 0.33333 5 3 access, technologi 1.0 4 3 ipsec, standard 1.0 4.75 4 encryption, wan 1.0 4 4 simultan, encrypt, session 0.50163 4.75 5 ipsec, standard 1.0 4.75 5 internet, connect, traffic 0.54857 4.75 6 privat, virtual 1.0 4.75 6 client, softwar, system, cisco 0.38153 4.5 7 sourc, destin 1.0 4 7 number, limit, packet 0.46564 4.5 8 authent, commun 1.0 4 8 offic, small, central 0.47862 4.25 9 Limit, number 1.0 4 9 gatewai, connec, enterpr 0.32516 4.25 10 chang, subject 1.0 4 10 end, user, shar, devic 0.28767 4.25 Table 2: Document Frequency (30%) and Tota Computer-Identified Expert-Identified No. Tokens Ave Sim Grade No. Tokens Ave Sim Grade 1 servic, tim 1.0 4.5 1 privat, virtual, network 0.58654 5 2 system, cisco 1.0 4.5 2 enterpr, router, addition, gatewai 0.19980 4.75 3 access, technologi 1.0 4 3 simultan, encrypt, session 0.50163 4.75 4 encryption, wan 1.0 4 4 internet, connect, traffic 0.43174 4.75 Frequency (3%). Top ten concepts. 78 5 privat, virtual 1.0 4.75 5 client, softwar, system, cisco 0.38152 4.5 6 number,limit 1.0 4 6 number, limit, packet 0.46564 4.5 7 encryption, wan, require 0.73333 4 7 remot, small, office 0.43642 4.75 8 access, technologi, level 0.66667 3.75 8 complet, oper, data 0.31258 4.5 9 remot, small 0.56874 4.0 9 end, user, shar, devic 0.33568 4.25 10 internet, connect 0.54857 4.5 10 tunnel, encryption, wan, requir, ipsec 0.36191 4.25 Table 3: Document Frequency (30%) and Total Frequency (4%). Top ten concepts. Computer-Identified Expert-Identified No. Tokens Ave Sim Grade No. Tokens Ave Sim Grade 1 system, cisco 1.0 4.5 1 privat, virtual, network 0.58654 5 2 access, technologi 1.0 4.5 2 enterpr, router, addition, gatewai 0.19980 4.75 3 encryption, wan 1.0 4 3 simultan, encrypt, session 0.50163 4.75 4 privat, virtual 1.0 4.75 4 internet, connect, traffic 0.54857 4.75 5 limit, number 1.0 4 5 softwar, system, cisco 0.38153 4.5 6 encryption, wan, require 0.73333 4 6 number, limit, packet 0.46564 4.5 7 remot, small 0.56667 4 7 remot, small, office 0.43642 4.5 8 internet, connect 0.54857 4 8 complet, oper, data 0.31258 4.5 9 oper, data 0 .53636 4 9 end, user, shar, devic 0.33568 4.25 10 user, end 0.50357 4 10 tunnel, encryption, wan, requir, ipsec 0.36191 4.25 Table 4: Document Frequency (50%) and Total Frequency (2%). Top ten concepts Computer-Identified Expert-Identified No. Tokens Ave Grade No. 
Tokens Ave Sim Grade Sim 1 servic, tim 1.0 4.5 1 privat, 0.58654 5 79 virtual, network 2 system, cisco 1.0 4.5 2 ipsec, standard 1.0 4.75 3 access, technologi 1.0 4 3 remot, small, office, internet, connect 0.33377 4.75 4 encryption, wan 1.0 4 4 gatewai, additional, plan, enterpr, router 0.32516 4.75 5 ipsec, standard 1.0 4.75 5 number, limit, packet 0.46564 4.5 6 privat, virtual 1.0 4.75 6 simultan, encrypt, session 0.50163 4.5 7 sourc, destin 1.0 4 7 system, cisco, softwar, check, point 0.35971 4.25 8 authent, commun 1.0 4 8 end, user, shar, devic 0.28767 4.25 9 Limit, number 1.0 4 9 servic, tim, authent, commun 0.45009 4.25 10 chang, subject 1.0 4 10 tunnel, encryption, wan, requir, type, ipsec, standard 0.29333 4.25 Table 5: Document Frequency (50%) and Total Frequency (3%). Top ten concepts. Computer-Identified Expert-Identified No. Tokens Ave Sim Grade No. Tokens Ave Sim Grade 1 servic, tim 1.0 4.5 1 privat, virtual, network 0.58654 5 2 system, cisco 1.0 4.5 2 enterpr, router, addition, gatewai 0.19980 4.75 3 access, technologi 1.0 4 3 remot, small, office, internet, connect 0.43174 4.75 4 encryption, wan 1.0 4 4 system, cisco, 0.35971 4.5 80 softwar, check, point 5 privat, virtual 1.0 4.75 5 complet, oper, data 0.31258 4.5 6 number,limit 1.0 4 6 end, user, shar, devic 0.33568 4.25 7 encryption, wan, require 0.73333 4 7 tunnel, encryption, wan, requir, ipsec 0.36191 4.25 8 remot, small 0.56667 3.75 8 access, technologi, level 0.66667 4. 9 internet, connect 0.54857 4.5 9 integr, secur, authent, protocol 0.31550 4 10 vendor, sing 0.53636 3.75 10 servic, content, sourc, 0.34285 3.75 Table 6: Document Frequency (50%) and Total Frequency (4%). Top ten concepts. Computer-Identified Expert-Identified No. Tokens Ave Sim Grade No. Tokens Ave Sim Grade 1 system, cisco 1.0 4.5 1 Privat, virtual, network 0.58654 5 2 access, technologi 1.0 4.5 2 enterpr, router, addition, gatewai 0.19980 4.75 3 encryption, wan 1.0 4 3 remot, small, office, internet, connect 0.43174 4.75 4 privat, virtual 1.0 4.75 4 system, cisco, softwar, check, point 0.35971 4.5 5 limit, number 1.0 4 5 complet, oper, data 0.31258 4.5 6 encryption, wan, require 0.73333 4 6 end, user, shar, devic 0.33568 4.25 7 remot, small 0.56667 4 7 tunnel, encryption, wan, requir, ipsec 0.36191 4.25 8 internet, connect 0.54857 4 8 access, technologi, level 0.66667 4. 9 oper, data 0 .53636 4 9 integr, secur, authent, protocol 0.31550 4. 10 user, end 0.50357 4 10 servic, content, sourc, 0.34285 3.5 81 Table 7: DF (30%) TF (4%) N (=3) Top ten concepts. Computer-Identified Expert-Identified No. Tokens Ave Sim Grade No. Tokens Ave Sim Grade 1 network, sit 1.0 3 1 virtual, privat 1 4.75 2 tunnel, requir 1.0 2 2 secur, gatewai, connec 0.51965 4.75 3 system, cisco 1.0 4 3 simultan, encrypt, session 0.50163 4.75 4 access, technologi 1.0 4.75 4 internet, connect, traffic 0.42460 4.75 5 encryption, wan 1.0 4 5 system, cisco, softwar, check, point 0.35317 4.5 6 privat, virtual 1.0 4.75 6 number, limit, packet 0.52222 4.5 7 limit, number 1.0 4.5 7 remot, small, office 0.47076 4.5 8 user, end 1.0 4.5 8 complet, oper, data 0.37049 4.5 9 secur, gatewai 0.75 4.5 9 enterpr, router, solution 0.23751 4.25 10 remot, small 0.55556 4 10 tunnel, encryption, wan, requir, ipsec 0.38683 4.25 Table 8: DF (30%) TF (4%) and N (=1). Top ten.concepts. Computer-Identifiec Expert-Identified No. Tokens Ave Sim Grade No. 
Tokens Ave Sim Grade 1 vpn, protocol 1.0 4 1 virtual, privat 1 4.75 2 enterpr, network 1.0 4 2 enterpr, network, connect, traffic 0.75 4.75 3 enterpr, network, connect 1.0 4 3 simultan, encrypt, session 0.50163 4.75 4 privat, virtual 1.0 4.75 4 secur, integr, router, gatewai 0.33796 4.75 5 tunnel, requir 1.0 2 5 system, cisco, check, point 0.50555 4.5 6 point, check 1.0 4.5 6 number, limit, packet 0.66666 4.5 7 number, limit 1.0 4.0 7 vpn, protocol, user, authent 0.61574 4.5 82 8 system, cisco 1.0 4.0 8 complet, oper, data 0.37049 4.5 9 access, technologi 1.0 4.0 9 firewal, vendor, solution 0.54861 4.5 10 encryption, wan 1.0 4.25 10 tunnel, requir, ipsec 0.5 4.25 Table 9: CRM corpus (D=30%, F=4, N=5). Top ten concepts. Computer-Identifie( Expert-Identified No. Tokens Ave Sim Grade No. Tokens Ave Sim Grade 1 custom, bas 1.0 4.5 1 manag, relationship, software 0.6805 4.75 2 process, develop 1.0 4 2 process, develop, strategi 0.8571 4.5 3 cost, benefit 1.0 4 3 tim, respons, effici 0.6667 4.5 4 manag, relationship 1.0 4.5 4 user, environ 0.5 4.25 5 market, commun 1.0 3 5 effect, measure, knowledg 0.6667 4.25 6 inform, provide 1.0 4.0 6 custom, bas, tool 0.15944 4.25 7 servic, sal(e) 1.0 4.0 7 includ, target, group 0.61574 4 8 product, model 1.0 4.0 8 crm, oper, applic, support 0.6667 4 9 tim(e), respons 1.0 5.0 9 data, integr, technologi 0.5708 4 10 success, offer 1.0 4.0 10 experi, inform, provid 0.66667 4 Table 10: Top ten computer- and expert-identified concepts for Corpus 3 (DF=50%, F=4%, N=5). Computer-Identifiec Expert-Identified No. Tokens Ave Sim Grade No. Tokens Ave Sim Grade 1 switch, fabric 1.0 4.25 1 storage, area, network 0.3562 5 2 EMC, IBM 1.0 4 2 small, computer, system, interface 0.2764 5 3 disk, independ 1.0 4 3 key, business, issue 0.3667 4.5 4 Compaq, 1.0 4.25 4 performance, 0.5 4.5 83 comput high, availability 5 high, availability 1.0 4.25 5 price, configuration 0.6667 4.5 6 capacity, scalability 1.0 4.0 6 independent, array, disk 0.1254 4.5 7 Note, research 1.0 4.0 7 hardware, software 0.4574 4.25 8 network, area 0.8 4.0 8 host, server, protocol 0.2661 4.25 9 vendor, deliver 0.8 4.0 9 capacity, limit 0.2703 4.25 10 channel, fibr 0.8 4.25 10 market, share 0.1667 4.25 84 Appendix B: The Processed Document Lists The detailed document lists and a brief introduction of corpus one (CRM system, implementation, strategies, and review) 1. CRM: Well what is it? by Simon Harper, Director, NHANZ Ltd Is your business designed to "Churn and Burn" (A) or "Learn and Earn" (B)? If your business model is based around model A, then a lot of this white paper is irrelevant as the business focus is about costly one off sales and immediate return. This paper is aimed at Model B businesses working towards long-term customer profitability & reciprocal benefits that in turn create a proactive business and see customer information as a powerful business asset. 2. What Really is Retention Marketing? by Kathleen Goodwin, Chief Executive Officer, iMakeNews This article reviews the latest methods for increasing e-marketing ROI by focusing on retention. It also uncovers the latest e-metrics for segmenting customers into "buckets" according to behaviors/attitudes, narrowcasting content to each bucket accordingly, and improving branding/retention as a result. Read the entire report! 3. Channel Partner e-Communications Workbook by Todd A. Brehe, Director of Communications Products, Gallatin Technologies, Inc. 
Many companies looking to implement a partner relationship management (PRM) application would be better served by first starting a channel partner communications program. This Workbook provides an outline for defining and implementing a world-class partner communications initiative so that channel executives and managers can reach partners and earn their efforts and attention. 4. Secrets of CRM Success: Achieving the Vision Through end user adoption by Kristina Markstrom, Director Client Services, Tech Resource Group End users hold the keys to the CRM kingdom. Sales reps and call center staff that don't support the effort or understand how to use and apply the new system and business practices can thwart successful?CRM implementation efforts. Learn best practices for managing change, training end users, and providing ongoing support. 5. Business Process Integration. Simplified. by Anand Vaidya, CEO and Chief Technologist, AVS SYSTEMS, INC Today? economy demands cohesive handshake among all business process so that companies can understand the real benefits that are hidden underneath the guise of voluminous storage of data spread across disparate systems. 6. Improving Agent Performance While Maintaining High Levels of Motivation by Jose Roza, Product Manager, Altitude Software This white paper outlines critical steps to improve contact center performance and, at the same time, keep high levels of agent motivation. The white paper also addresses several themes, such as how to turnover agents and how to improve agent performance while keeping high customer satisfaction levels. 7. Knowledge Management for Customer Service by Anand Subramaniam, Sr. Marketing Director, eGain Communications Corp. 85 As enterprises increasingly use customer service to differentiate themselves, knowledge management has gained prominence as a strategic initiative. A key enabler, it allows businesses to use their knowledge assets to provide better customer service. This paper discusses the ingredients of success as it relates to knowledge management strategy, technology, people, and processes. 8. Real Time Service & Support by Anand Subramaniam, Sr. Marketing Director, eGain Communications Corp. Web collaboration is delivering strategic and operational benefits to organizations in the form of improved customer satisfaction, reduced costs, and increased revenues. In this paper, we discuss the pain points that are driving the demand for web collaboration, the benefits of this set of technologies, and what to look for in a solution. We also describe how eGain, a leader in this space, can help enterprises provide best-in-class real-time customer service and support on the web. 9. Positioning C R M Programs for Value Creation by Steve LaValle, Marco Cucchi, Scott Lieberman, C R M Vision Practice, IBM Business Consulting Services Today, organizations must focus on identifying clear business cases to justify CRM investments. This white paper outlines how to setup a successful C R M Program focused on value-creating initiatives. 10. Distinguish Your Business Through Excellent Customer Service by Craig Bailey, President, Customer Centricity, Inc. Everyone recognizes that in today? business environment you cannot afford to lose a single profitable customer. However, the current business climate also provides a window of opportunities for companies to leverage excellent customer service to differentiate their business from the competiton and gain a competitive advantage. 11. 
The CIO Agenda: Taking Care of Business by Getronics/IDG Research Many in the IT community talk about ROI, but a comprehensive global survey of IT executives at large organizations finds that CIOs don't use formal ROI tools ~ CFOs do. Conducted by IDG Research and commissioned by global solutions provider Getronics, this paper outlines many key trends for 2003 and beyond. Some stats of note: CIOs are cautiously optimistic for 2003, with 41% predicting they will increase spending. Projects will focus on decreasing costs, increasing customer loyalty, and increasing profits in the upcoming year. CRM will be the top spending priority in the US and security for Europe. 12. Turn your Customer Data into Gold! by Dr. Raj Menon, Managing Partner, The Menon Group, Inc. This white paper outlines a framework for analyzing your existing customer data which will help you develop a consistent, firm-wide understanding of your customer and the notion of customer value. Knowing exactly what your customers are worth is the first step towards developing specifically targeted and customer focused strategies. We also outline seven habits of highly customer focused organizations -which are strategies used by successful companies to become and remain profitable. 13. What Does It Take for CRM Success? 86 by Alan F. Atkinson, Vice President, Consulting Services, Equinox Corporation Little can be added to the literature and positive experiences that describe the value of CRM as a business system and/or culture in a business. Then why are management, users and the IT professionals still wondering why their project has lived up to expectations? Read more... 14. Peek-a-boo at CRM Market by Siddhartha Jain, Consultant, Digitra Infostructure Ltd. This white paper from Digitra delves into the effects of various C R M technologies on your organization and which vendors offer the technology you need. The paper is a structured approach to understand the CRM market. 15. Why CRM Projects Fail by Doug Tanoury & Kit Ireland, President - Founder, Customer Interactions Consulting This is an indepth and comprehensive white paper that discusses common mistakes and pitfalls that CRM initiatives are prone to encounter. More importantly the paper examines how your company can avoid these common mistakes. 16. Web-Enabled CRM: Today? Reality ~ Tomorrow? Vision by Doug Tanoury, President - Founder, Customer Interactions Consulting This paper is a look at CRM in its present state and where it is going based on the migration of current applications to the Internet, and the next generation of web-based tools. While there changes in technology open new opportunities for visionary companies, there are distinct dangers to those who bring old service paradigms to the new media. The Internet has set new standards for responsiveness, quality and customer segmentation. A candid and honest look at successes and failures is provided here and a look to where current technology and trends will take C R M in coming years. 17. CRM Initiatives Taking It Personal 7 Key Steps for Personalization by Thomas Hannigan & Christina Palandrano, Senior Consultant & Senior Project Manager, Chatham System Group This paper presents the 7 critical steps to effectively prepare your company to successfully implement personalization as part of your CRM strategy. 
Based on extensive experience with implementing large-scale CRM initiatives, the authors cover data, technology, organizational and strategic issues that must be addressed during the planning process to avoid costly barriers to success. Each step includes a checklist to use during the planning process. 18. eCRM:Theel l way by Vineet Agrawal & Manish Mittal, Consultants, SHTR Consulting Group With the advent of internet the role of CRM within an organization is changing. Managers and team members need to stay ahead of the competition, collaborate on projects and help the flow of information internally and externally. This white paper is aimed at providing organizations understanding of how a web-based Customer Relationship Management (CRM) system can dramatically improve their ability to effectively run web based campaigns, track leads, convert leads into sales, deliver responsive service and support to internal, external customers and partners. 87 19. Adaptive CRM and Knowledge Turnover by Vince Kellen, Principal, Blue Wolf In the 21st century, businesses will discover that the key driver to improving performance is how quickly and correctly it can discover and apply new customer knowledge. Knowledge turnover is a new metric for measuring how quickly companies turn data into profitable action. Nimble business that quickly learn which products and services to add, change or drop, or can figure out how and which customers to attract and retain will outperform their slower-moving competitors. Companies need to practice Adaptive CRM. 20. Strategic Sales Processes For Improved Customer Relationship Management by Russ Lombardo, President/Owner, PEAK Sales Consulting Customer Relationship Management (CRM), in and of itself, is not a solution; it is a means to an end ?enhancing the sales process so you can better manage your relationships with your customers. In most companies, this involves multiple departments, such as Sales, Marketing, Customer Service, Technical Support, and even Accounting, to name a few. 21. Implementation Is the Easy Part: Action Items for CRM Success by Bryan J. Stone, Manager, Countrywide Stories of high-profile CRM failures abound. Your organization can avoid becoming one of these stories by following four planning strategies to avoid common pitfalls. 22. Defining CRM for Business Success by Hector D. Trestini, The Internet has brought about a new level of productivity and competitiveness for individuals and businesses worldwide. But, this online revolution has also created significant challenges for organizations. As customers rush to use web services that provide them with instant and personalized attention, businesses are struggling with managing increased competition, eroding customer bases, and growing financial pressures through outdated business models that are supported by a multitude of un-integrated technologies in their environments. 
The detailed documents list for Corpus two (VPN technology, markets, and product review) 1 IP VPN Equipment: Worldwide, 2001 Update http://riondel.library.ubc.ca/gartnergrp/research/109000/109002/109002.html 2 Carrier Buying Patterns for IP VPN Connectivity: EMEA, 2002 http://riondel.library.ubc.ca/gartnergrp/research/l 12400/112470/112470.html 3 Firewall and IP VPN Equipment Market: Worldwide, 1H02 http://riondeI.library.ubc.ca/gartnergrp/research/l 12200/112241/112241 .html 4 IP VPN Equipment: Worldwide, 2001 http://riondel.library.ubc.ca/gartnergrp/research/105400/105406/105406.html 5 IP VPN Services: It's Still Early in Asia/Pacific and Japan, but Speed Is Picking Up 88 http://riondel.library.ubc.ca/gartnergrp/research/l 04300/104359/104359.html 6 Firewall and IP VPN Equipment Market: Worldwide, 1H02 http://riondel.library.ubc.ca/gartnergrp/research/l 12200/112242/112242.html 7 Enterprise VPN Product Magic Quadrant Evaluation Criteria http://riondel.library.ubc.ca/gartnergrp/research/109600/109632/109632.html 8 IP VPN Equipment: Worldwide, 2001 Update (Executive Summary) http://riondel.library.ubc.ca/gartnergrp/research/109000/109003/109003.html 9 IP VPN Equipment: Worldwide, 2001 (Executive Summary) http://riondel.library.ubc.ca/gartnergrp/research/105400/105408/105408.html 10 CIO Update: Enterprise VPN Product Magic Quadrant for 2H02 http://riondel.library.ubc.ca/gartnergrp/research/l 10200/110217/110217.html 11 Enterprise VPN Product 2H02 Magic Quadrant http://riondel.library.ubc.ca/gartnergrp/research/109600/109633/109633.html 12 Management Update: Magic Quadrant Update for Enterprise VPN Products http://riondel. 1 ibrary.ubc.ca/gartnergrp/research/99000/99036/ 99036.html 13 Management Update: Enterprise VPN Products Magic Quadrant Update for 1H01 http://riondel.library.ubc.ca/gartnergrp/research/98500/98559/98559.html 14 Enterprise VPN Products Magic Quadrant Update for 1H01 http://riondel.library.ubc.ca/gartnergrp/research/98100/98130/ 98130.html 15 CIO Update: VPN Security and Placement http://riondel.library.ubc.ca/gartnergrp/research/101100/101169/101169.html 16 Management Update: The Pros and Cons of Building a VPN http://riondel.library.ubc.ca/gartnergrp/research/101100/101167/101167.html 17 VPN Security and Placement http://riondel.library.ubc.ca/gartnergrp/research/100500/100595/100595.html 18 The Pros and Cons of Building a VPN http://riondel.library.ubc.ca/gartnergrp/research/100500/100565/100565.html 19 VPN: Define Before You Design! 
http://riondel.library.ubc.ca/gartnergrp/research/100300/100379/100379.html 20 Virtual Private Networks (VPNs): Comparison Columns http://riondel.library.ubc.ca/gartnergrp/research/106500/106563/106563_r.html 21 Virtual Private Networks (VPNs): Comparison Columns http://riondel.library.ubc.ca/gartnergrp/research/106500/106563/106563_p.html 89 22 IP VPNs: 2000 and Beyond http://riondel.library.ubc.ca/gartnergrp/research/97700/97746/ 97746.html 23 IP VPNs, Worldwide: First Half 2001 Update http://riondel.library.ubc.ca/gartnergrp/research/l 03100/103108/103108.html 24 IP VPN Services: It's Still Early in Asia/Pacific and Japan, but Speed Is Picking Up http://riondel.library.ubc.ca/gartnergrp/research/102000/102036/102036.html 25 Lucent Technologies SpringTide IP VPN Product Offering http://riondel.library.ubc.ca/gartnergrp/research/98600/98690/98690.html 26 Lucent Technologies VPN Firewall Brick Appliances http://riondel.library.ubc.ca/gartnergrp/research/97300/97318/ 97318.html 27 Cisco Systems 7100 Series VPN Router http://riondel.library.ubc.ca/gartnergrp/research/96300/96343/ 96343.html 28 Cisco Systems VPN 5000 Concentrator http://riondel.library.ubc.ca/gartnergrp/research/95300/95365/ 95365.html 29 Cisco Systems VPN 3000 Concentrators http://riondel.library.ubc.ca/gartnergrp/research/95100/95143/ 95143.html 30 Nortel Networks Contivity VPN Switches http://riondel.library.ubc.ca/gartnergrp/research/90600/90684/ 90684.html The detailed documents list for Corpus three (SAN technology, markets, and product review) 1. Fibre Channel SAN Components: The Switch in 2000 http://riondel.library.ubc.ca/gartnergrp/research/98500/98542/ 98542.html 2. Compaq's MSA 1000: A SAN for Beginners? http://riondel.library.ubc.ca/gartnergrp/research/103900/103949/103949.html 3. NAS vs. SAN: Technology Overview http://riondel.library.ubc.ca/gartnergrp/research/97400/97417/97417.html 4. iSCSI Could Offer a Cost-Effective Solution for SAN Architectures http://riondel.library.ubc.ca/gartnergrp/research/108200/108257/108257.html 5. Scaling SANs to the Enterprise: Islands Still Recommended http://riondel.Iibrary.ubc.ca/gartnergrp/research/109400/109409/109409.html 6. NAS and SAN Blurring: Fast File Systems and Fibre Channel http://riondel.library.ubc.ca/gartnergrp/research/107900/107930/107930.html 7. SAN 1-Gbps Fibre Channel Switch Magic Quadrant N 90 http://riondel.library.ubc.ca/gartnergrp/research/l 07500/107586/107586.html 8. SAN Solution Magic Quadrant: Second Assessment http://riondel.library.ubc.ca/gartnergrp/research/l 01900/101922/101922.html 9. Semiconductors in the SAN Landscape http://riondel.library.ubc.ca/gartnergrp/research/97000/97038/ 97038.html 10. EMC Celerra: NAS and SAN Together http://riondel.library.ubc.ca/gartnergrp/research/94800/94855/ 94855.html 11. Fibre Channel SAN Extenders: The Last 10,000 Miles http://riondel.library.ubc.ca/gartnergrp/research/94300/94340/ 94340.html 12. SAN Management Tools Are Finally Here http://riondel.library.ubc.ca/gartnergrp/research/102100/102130/102130.html 13. Network Appliance Offers NAS and SAN on One Platform http://riondeI.library.ubc.ca/gartnergrp/research/l 11700/111700/111700.html 14. Mainframe and Open Systems: The Same SAN? http://riondel.library.ubc.ca/gartnergrp/research/94800/94866/ 94866.html 15. StorageApps' SANLink: The Standard for SAN Appliances http://riondel.library.ubc.ca/gartnergrp/research/93600/93617/ 93617.html 16. 
SAN Solution Magic Quadrant: Preliminary Assessment http://riondel.library.ubc.ca/gartnergrp/research/92800/92854/ 92854.html 17. Cisco and Brocade Bridge SAN Islands With IP and DWDM http://riondel.library.ubc.ca/gartnergrp/research/92500/92580/ 92580.html 18. Longer-Term SAN Benefits: Better SRM via Disk Pooling http://riondel.library.ubc.ca/gartnergrp/research/86500/86573/ 86573.html 19. SAN Exploitation in the Future Server Architecture http://riondel.library.ubc.ca/gartnergrp/research/83200/83299/ 83299.html 20. SAN Exploitation for Backup: Will You Go Serverless? http://riondel.library.ubc.ca/gartnergrp/research/82900/82940/ 82940.html 91 Appendix C: Selected Token List of Three Corpus Table 1. Selected Token List for Corpus 1. (Document Frequency=30%, Term Frequency= 4%) ID Token Frequency Number of Doc 10 Custom 1160 23 590 Service 411 22 369 Crm 347 20 216 Busi 311 23 153 Manag 292 23 43 Compani 287 22 105 Bas 227 21 199 Product 219 21 417 Sal 216 20 616 Web 199 12 2 Market 194 20 620 Applic 184 17 13 Inform 179 20 418 process 175 23 664 solution 174 19 487 knowledg 169 18 156 tim 158 21 154 data 152 14 514 center 152 14 15 technologi 151 20 485 provid 144 22 352 cost 143 19 79 support 142 21 207 contact 135 16 595 system 134 19 28 commun 134 17 14 relationship 130 18 518 channel 124 16 592 call 114 19 519 partner 110 9 689 organ 107 18 383 requir 104 22 629 integr 100 14 163 mak 98 21 123 effect 98 17 1516 model 91 8 298 increas 89 18 388 point 87 16 229 tool 86 18 131 strategi 86 16 292 sell 86 16 997 vendor 86 13 170 group 85 19 33 measur 85 14 326 issu 84 16 169 internet 84 15 290 includ 83 22 628 interac 81 14 336 import 79 19 409 respons 79 18 866 enterpr 79 18 126 enabl 79 17 306 mail 79 11 895 autom 78 13 183 success 77 20 302 improv 76 20 579 result 75 18 1059 user 73 14 343 develop 71 16 143 level 68 17 1063 invest 68 16 1000 softwar 67 18 139 target 66 13 436 perform 65 15 547 number 64 16 421 address 61 17 205 offer 61 16 356 gener 61 16 711 industri 61 15 152 databas 60 15 507 expect 59 17 882 revenu 58 18 119 person 58 17 260 environ 57 18 81 year 56 18 921 roi 56 13 214 chang 55 19 233 work 55 17 21 individu 55 13 582 reduc 55 12 132 report 54 15 521 implement 53 17 142 high 52 17 372 achiev 51 17 90 creat 51 15 274 oper 50 20 437 abil 49 19 37 build 49 18 405 specif 49 16 49 experi 48 16 541 benefit 48 15 64 program 48 13 212 Goal 48 13 150 Term 47 16 52 Deliv 47 16 161 Approach 47 16 68 Understand 47 15 78 Effici | 47 14 644 Satisfac 47 11 Table 2. Selected Token List for Corpus 2. (Document Frequency=30%, Term Frequency= 2%) ID Token Frequency Number of Document 5 Vpn 1370 30 104 Service 598 28 11 Network 547 29 12 Product 453 26 125 Manag 418 25 167 Enterpr 377 28 107 Support 341 23 86 Secur 301 30 164 Tunnel 273 25 91 Client 263 22 14 Firewall 254 23 118 System 251 22 3 Softwar 248 23 16 Access 239 26 277 Provid 237 26 349 Vendor 230 23 253 Market 229 25 219 Protocol 198 25 114 Encryption 196 18 59 Bas 193 24 4 Technologi 190 24 179 Ipsec 183 22 141 User 176 28 17 Sit 167 22 • 58 Offer 166 23 140 Authent 166 21 15 Remot 160 23 2 Point 155 23 218 Port 152 13 354 Cisco 145 18 19 Content 138 30 357 Router 133 18 7 Solution 132 26 355 Concentr 128 12 57 Return 123 16 287 Inform 123 30 116 Office 121 20 80 Include 120 25 10 Privat 119 28 68 applianc 115 16 40 internet 114 26 143 capabl 108 18 376 lin 108 24 81 addition 107 22 65 window 104 16 . 
259 custom 100 19 169 power 100 12 520 carrier 99 14 392 busi 99 20 63 server 98 15 90 end 97 26 149 interfac 97 11 96 devic 96 20 172 requir 94 22 13 integr 93 24 6 gatewai 92 18 69 hardwar 92 18 356 seri 88 12 421 sourc 88 30 176 wan 87 18 128 configur 87 19 106 function 87 22 166 larg 86 19 9 virtual 86 28 344 cost 86 24 144 equip 86 25 119 address 82 18 426 complet 82 30 205 bit 82 10 233 bandwidth 81 15 132 featur 81 18 314 applic 80 19 517 figur 77 16 50 unit 77 15 180 rout 75 14 163 simultan 75 12 146 connec 74 20 152 oper 73 18 202 data 72 20 162 number 72 22 366 lucent 68 14 22 pric 68 21 816 level 68 18 138 digit 67 20 25 limit 67 20 1 check 66 18 135 certif 66 18 156 layer 66 19 270 rang 65 17 761 packet 65 15 419 contain 65 30 115 small 63 23 575 global 62 14 863 ethernet 61 11 136 intern 60 18 702 percent 59 11 216 filter 59 10 223 session 58 16 248 stat 58 12 312 connect 58 17 101 shar 57 18 278 traffic 56 13 445 chang 55 30 174 lan 55 19 413 public 55 30 295 pptp 54 17 192 radiu 54 10 350 famili 53 11 306 set 52 21 74 nortel 51 16 1170 switch 51 10 89 enabl 50 17 193 control 50 18 231 tim 50 19 441 result 49 30 617 expan 48 13 264 licens 47 12 443 express 46 30 351 challeng 46 15 293 implem 46 21 783 sha 45 10 251 nat 45 11 688 gener 45 19 565 lack 44 14 190 extern 44 10 120 environ 44 14 210 kei 44 18 224 polici 44 13 439 achiev 44 30 98 high 43 17 93 singl 43 17 725 research 43 17 353 leader 43 15 177 option 43 17 124 central 42 14 448 updat 42 12 592 design 41 15 24 strength 40 17 1028 growth 40 10 159 destin 39 13 29 corpor 39 16 437 selec 39 30 507 deploym 38 15 191 dial 38 18 487 plan 38 17 185 standard 38 19 485 process 37 18 422 reliabl 37 30 196 dynamic 37 11 257 believ 36 30 500 legaci 36 10 483 outsourc 36 10 299 perform 36 15 435 assum 36 30 459 abil 36 10 134 commun 36 18 420 obtain 35 30 408 entir 35 30 490 full 35 18 103 qualiti 35 14 416 written 35 30 288 encrypt 34 14 262 respons 34 .30 ' 476 continu 34 14 659 compani 34 15 220 type 33 17 467 establish 33 15 414 form 33 30 271 implement 33 17 867 suppli 33 10 360 enterasi 33 11 676 increas 32 19 472 issu 32 14 298 upgrad 32 12 411 reserv 32 30 442 opinion 32 30 352 rat 32 18 436 sol 32 30 338 account 32 16 425 accuraci 31 30 318 execut 31 16 440 intend 31 30 345 bundl 31 11 457 criteria 31 10 444 subject 31 30 997 expect 30 10 550 class 30 11 446 notic 30 30 409 copi 30 30 410 right 30 30 412 reproduc 30 30 415 prior 30 30 432 interpret 30 30 417 perm is 30 30 363 intel 30 12 126 person 30 10 418 forbidden 30 30 423 disclaim 30 30 496 creat 30 17 424 warranti 30 30 ( 427 adequaci 30 30 438 materi 30 30 431 inadequaci 30 30 428 liabil 30 30 430 omis 30 30 429 error 30 30 382 prioriti 29 12 139 strong 29 15 735 separ 29 15 297 deliv 29 13 290 not 29 17 267 basic 29 14 609 position 29 14 538 reduc 28 14 522 prefer 28 10 600 messag 28 10 236 directori 28 12 508 branch 28 16 1029 provision 28 11 1034 futur 28 14 88 termin 28 11 Table 3 Selected Token List for Corpus 2. 
(Document Frequency=50%, Term Frequency= 2%) ID Token Frequency Number of Document 5 vpn 1370 30 104 servic 598 28 11 network 547 29 12 product 453 26 125 manag 418 25 167 enterpr 377 28 107 support 341 23 86 secur 301 30 164 tunnel 273 25 91 client 263 22 14 firewal 254 23 118 system 251 22 3 softwar 248 23 16 access 239 26 277 provid 237 26 349 vendor 230 23 253 market 229 25 219 protocol 198 25 114 encryption 196 18 59 bas 193 24 4 technologi 190 24 179 ipsec 183 22 141 user 176 28 17 sit 167 22 140 authent 166 21 98 58 offer 166 23 15 remot 160 23 2 point 155 23 354 cisco 145 18 19 content 138 30 357 router 133 18 7 solution 132 26 57 return 123 16 287 inform 123 30 116 offic 121 20 80 includ 120 25 10 privat 119 28 68 applianc 115 16 40 internet 114 26 143 capabl 108 18 376 lin 108 24 81 addition 107 22 65 window 104 16 259 custom 100 19 392 busi 99 20 90 end 97 26 96 devic 96 20 172 requir 94 22 13 integr 93 24 6 gatewai 92 18 69 hardwar 92 18 421 sourc 88 30 128 configur 87 19 176 wan 87 18 106 function 87 22 9 virtual 86 28 166 larg 86 19 344 cost 86 24 144 equip 86 25 426 complet 82 30 119 address 82 18 132 featur 81 18 314 applic 80 19 517 figur 77 16 146 connec 74 20 152 oper 73 18 162 number 72 22 202 data 72 20 22 pric 68 21 816 level 68 18 25 limit 67 20 138 digit 67 20 1 check 66 18 135 certif 66 18 156 layer 66 19 270 rang 65 17 419 contain 65 30 115 small 63 23 136 intern 60 18 312 connect 58 17 223 session 58 16 101 shar 57 18 413 public 55 30 445 chang 55 30 174 lan 55 19 295 pptp 54 17 306 set 52 21 74 nortel 51 16 89 enabl 50 17 231 tim 50 19 193 control 50 18 441 result 49 30 443 express 46 30 293 implem 46 21 688 gener 45 19 439 achiev 44 30 210 kei 44 18 177 option 43 17 93 singl 43 17 98 high 43 17 725 research 43 17 24 strength 40 17 29 corpor 39 16 437 selec 39 30 185 standard 38 19 487 plan 38 17 191 dial 38 18 485 process 37 18 422 reliabl 37 30 257 believ 36 30 134 commun 36 18 435 assum 36 30 416 written 35 30 490 full 35 18 420 obtain 35 30 408 entir 35 30 262 respons 34 30 271 implement 33 17 220 type 33 17 414 form 33 30 411 reserv 32 30 352 rat 32 18 338 account 32 16 676 increas 32 19 442 opinion 32 30 436 sol 32 30 444 subject 31 30 425 accuraci 31 30 318 execut 31 16 440 intend 31 30 438 materi 30 30 418 forbidden 30 30 417 permis 30 30 423 disclaim 30 30 496 creat 30 17 424 warranti 30 30 412 reproduc 30 30 415 prior 30 30 446 notic 30 30 410 right 30 30 432 interpret 30 30 431 inadequaci 30 30 427 adequaci 30 30 428 liabil 30 30 430 omis 30 30 429 error 30 .. -30 409 copi 30 30 290 not 29 17 508 branch 28 16 Table 4. 
Selected Token List for Corpus 3 (Document Frequency= 50%, Term Frequency = 4%) ID Token Frequency Number of Document 46 storag 612 20 3 san 574 20 211 manag 435 19 253 product 382 19 82 network 288 19 165 vendor 247 18 47 system 232 19 60 server 223 20 26 na 218 11 100 switch 189 18 41 data 189 15 261 market 162 17 103 support 149 19 726 devic 147 13 451 fil 143 13 58 technologi 136 19 106 bas 132 16 199 provid 131 18 212 softwar 129 17 252 applic 123 16 399 emc 114 14 363 gartner 107 20 270 disk 105 14 169 offer 102 14 143 inform 101 20 1 Compaq 100 11 198 requir 94 15 217 solution 91 16 157 includ 87 18 156 user 82 17 52 virtual 82 12 15 arrai 80 10 295 custom 74 15 362 content 71 20 51 enterpr 71 14 455 shar 70 18 654 interoper 67 10 63 function 66 16 139 perform 65 13 315 larg 63 17 335 high 62 16 268 featur 62 14 146 environ 62 12 359 area 60 19 93 channel 60 18 90 port 60 11 12 number 59 20 476 servic 59 16 323 integr 59 13 45 issu 58 20 357 attach 58 14 185 configur 57 14 241 volum 57 13 11 not 56 20 44 kei 56 20 77 standard 55 16 233 level 55 15 667 compani 55 15 130 capac 55 10 168 avail 54 15 61 control 54 14 239 access 53 12 92 fibr 52 17 360 interfac 52 11 214 connect 51 18 247 oper 50 16 216 cost 50 10 571 protocol 49 11 762 develop 47 16 345 continu 47 14 54 architectur 46 13 99 fabric 46 13 537 work 46 12 597 tool 46 11 28 lin 45 16 290 end 45 15 325 deliv 45 15 282 creat 45 13 126 instal 44 16 265 ibm 44 14 50 announc 44 13 634 compon 44 10 178 host 42 14 228 tim 42 13 55 hardwar 42 12 516 gener 41 16 124 addition 41 15 313 limit 41 15 699 approach 41 13 372 contain 40 20 203 multipl 40 12 299 result 39 20 380 complet 39 20 630 busi 39 12 281 mov 39 12 496 seal 39 10 244 chang 38 20 374 sourc 38 20 89 singl 38 10 9 type 37 20 475 block 37 10 29 comput 36 14 813 test 35 10 10 research 34 15 435 mak 34 15 24 pric 34 11 123 scalabl 34 11 367 form 33 20 257 small 33 13 551 serv 33 10 794 implement 33 10 219 part 32 15 358 independ 32 14 526 resourc 32 13 115 ad 32 12 505 increas 32 10 376 reliabl 31 20 49 year 31 17 459 current 31 13 232 set 31 13 223 enabl 31 12 294 percent 31 12 326 improv 30 13 527 common 30 12 465 expand 30 10 391 respons 29 20 213 infrastructur 29 10 21 exist 28 13 479 specif 28 12 320 challeng 27 13 808 todai 27 13 869 verita 27 10 38 public 26 20 394 achiev 26 20 27 bottom 26 13 439 expect 26 12 237 logic 26 11 404 capabl 26 10 361 entir 25 20 383 error 25 20 259 strategi 25 12 312 platform 25 11 16 position 25 10 639 perspect 25 10 Appendix D: Porter Stemming Algorithm M.F.Porter 1980 Originally published in Program. 14(3): 1980. 130-137. 1. Introduction Removing suffixes by automatic means is an operation which is especially useful in the field of information retrieval. In a typical IR environment, one has a collection of documents, each described by the words in the document title and possibly by words in the document abstract. Ignoring the issue of precisely where the words originate, we can say that a document is represented by a vetor of words, or terms. Terms with a common stem will usually have similar meanings, for example: CONNECT CONNECTED CONNECTING CONNECTION CONNECTIONS Frequently, the performance of an IR system will be improved if term groups such as this are conflated into a single term. This may be done by removal of the various suffixes -ED, -ING, -ION, IONS to leave the single term CONNECT. 
In addition, the suffix stripping process will reduce the total number of terms in the IR system, and hence reduce the size and complexity of the data in the system, which is always advantageous. Many strategies for suffix stripping have been reported in the literature.(e g 1 _ 6 ) The nature of the task will vary considerably depending on whether a stem dictionary is being used, whether a suffix list is being used, and of course on the purpose for which the suffix stripping is being done. Assuming that one is not making use of a stem dictionary, and that the purpose of the task is to improve IR performance, the suffix stripping program will usually be given an explicit list of siffixes, and, with each suffix, the criterion under which it may be removed from a word to leave a valid stem. This is the approach adopted here. The main merits of the present program are that it is small (less than 400 lines of BCPL), fast (it will process a vocabulary of 10,000 different words in about 8.1 seconds on the IBM 370/165 at Cambridge University), and reasonably simple. At any rate, it is simple enough to be described in full as an algorithm in this paper. (The present version in BCPL is freely available from the author. BCPL is itself available on a wide range of different computers, but anyone wishing to use the program should have little difficulty in coding it up in other programming languages.) Given the speed of the program, it would be quite realistic to apply it to every word in a large file of continuous text, although for historical reasons we have found it convenient to apply it only to relatively small vocabulary lists derived from continuous text files. In any suffix stripping program for IR work, two points must be borne in mind. Firstly, the suffixes are being removed simply to improve IR performance, and not as a linguistic exercise. This means that it would not be at all obvious under what circumstances a suffix should be removed, even if we could exactly determine the suffixes of a word by automatic means. 105 Perhaps the best criterion for removing suffixes from two words W l and W2 to produce a single stem S, is to say that we do so if there appears to be no difference between the two statements 'a document is about W l ' and 'a document is about W2'. So if Wl=CONNECTION' and W2=CONNECT10NS' it seems very reasonable to conflate them to a single stem. But if Wl= RELATE' and W2=RELATIVITY' it seems perhaps unreasonable, especially if the document collection is concerned with theoretical physics. (It should perhaps be added that RELATE and RELATIVITY are conflated together in the algorithm described here.) Between these two extremes there is a continuum of different cases, and given two terms Wl and W2, there will be some variation in opinion as to whether they should be conflated, just as there is with deciding the relevance of some document to a query. The evaluation of the worth of a suffix stripping system is correspondingly difficult. The second point is that with the approach adopted here, i.e. the use of a suffix list with various rules, the success rate for the suffix stripping will be significantly less than 100% irrespective of how the process is evaluated. For example, if SAND and SANDER get conflated, so most probably will WAND and WANDER. The error here is that the -ER of WANDER has been treated as a suffix when in fact it is part of the stem. Equally, a suffix may completely alter the meaning of a word, in which case its removal is unhelpful. 
PROBE and PROBATE for example, have quite distinct meanings in modern English. (In fact these would not be conflated in our present algorithm.) There comes a stage in the development of a suffix stripping program where the addition of more rules to increase the performance in one area of the vocabulary causes an equal degradation of performance elsewhere. Unless this phenomenon is noticed in time, it is very easy for the program to become much more complex than is really necessary. It is also easy to give undue emphasis to cases which appear to be important, but which turn ut to be rather rare. For example, cases in which the root of a word changes with the addition of a suffix, as in DECEIVE/DECEPTION, RESUME/RESUMPTION, INDEX/INDICES occur much more rarely in real vocabularies than one might at first suppose. In view of the error rate that must in any case be expected, it did not seem worthwhile to try and cope with these cases. It is not obvious that the simplicity of the present program is any demerit. In a test on the well-known Cranfield 200 collection7 it gave an improvement in retrieval performance when compared with a very much more elaborate program which has been in use in IR research in Cambridge since ! 971/2.6). T h e t e s t was done as follows: the words of the titles and abstracts in the documents were passed through the earlier suffix stripping system, and the resultis stems were used to index the documents. The words of the queries were reduced to stems in the same way, and the documents were ranked for each query using term coordination matching of query against document. From these rankings, recall and precision values were obtained using the standard recall cutoff method. The entire process was then repeated using the suffix stripping system described in this paper, and the results were as follows: earlier system present system precision recall precision recall 0 57.24 0 58.60 10 56.85 10 58.13 20 52.85 20 53.92 30 42.61 30 43.51 40 42.20 40 39.39 106 50 39.06 50 38.85 60 32.86 60 33.18 70 31.64 70 31.19 80 ' 27.15 80 27.52 90 24.59 90 25.85 100 24.59 100 25.85 Geary, the performance is not very different. The important point is that the earlier, more elaborate system certainly performs no better than the present, simple system. (This test was done by prof. C.J. van Rijsbergen.) 2. The Algorithm To present the suffix stripping algorithm in its entirety we will need a few difinitions. A consonant in a word is a letter other than A, E, I, O or U, and other than Y preceded by a consonant. (The fact that the term "consonant' is defined to some extent in terms of itself does not make it ambiguous.) So in TOY the consonants are T and Y, and in SYZYGY they are S, Z and G. If a letter is not a consonant it is a vowel. A consonant will be denoted by c, a vowel by v. A list ccc... of length greater than 0 will be denoted by C, and a list vvv... of length greater than 0 will be denoted by V. Any word, or part of a word, therefore has one of the four forms: CVCV ... C CVCV ... V VCVC ... C VCVC ... V These may all be represented by the single form [C1VCVC ... [V] where the square brackets denote arbitrary presence of their contents. Using (VC) m to denote VC repeated m times, this may again be written as [C](VCf[V]. m will be called the measure of any word or word part when represented in this form. The case m = 0 covers the null word. Here are some examples: m=0 TR, EE, TREE, Y, BY. m=l TROUBLE, OATS, TREES, IVY. m=2 TROUBLES, PRIVATE, OATEN, ORRERY. 
The rules for removing a suffix will be given in the form (condition) SI -> S2 This means that if a word ends with the suffix SI, and the stem before SI satisfies the given condition, SI is replaced by S2. The condition is usually given in terms of m, e.g. (m > 1) EMENT -> Here SI is "EMENT' and S2 is null. This would map REPLACEMENT to REPLAC, since REPLAC is a word part for which m = 2. The 'condition' part may also contain the following: *S - the stem ends with S (and similarly for the other letters). *v* - the stem contains a vowel. *d - the stem ends with a double consonant (e.g. -TT, -SS). *o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP). 107 And the condition part may also contain expressions with and, or and not, so that (m>l and (*S or *T)) tests for a stem with m>l ending in S or T, while (*d and not (*L or *S or *Z)) tests for a stem ending witha double consonant other than L, S or Z. Elaborate conditions like this are required only rarely. In a set of rules written beneath each other, only one is obeyed, and this will be the one with the longest matching SI for the given word. For example, with SSES -> SS IES ->I SS ->SS s -> (here the conditions are all null) CARESSES maps to CARESS since SSES is the longest match for SI. Equally CARESS maps to CARESS (Sl= SS') and CARES to CARE(S1= S'). In the rules below, examples of their application, successful or otherwise, are given on the right in lower case. The algorithm now follows: Step la SSES -> SS caresses -> caress IES -> I ponies -> poni ties -> ti SS -> SS caress -> caress S -> cats -> cat Step lb (m>0) EED -> EE feed -> feed agreed -> agree (*v*) ED -> plastered -> plaster bled -> bled (*v*) ING -> motoring -> motor sing -> sing If the second or third of the rules in Step 1 b is successful, the following is done: AT -> ATE conflat(ed) -> conflate BL -> BLE troubl(ed) -> trouble IZ -> IZE siz(ed) -> size (*d and not (*L or *S or *Z)) -> single letter hopp(ing) -> hop tann(ed) -> tan fall(ing) -> fall hiss(ing) -> hiss fizz(ed) -> fizz (m=l and *o) -> E fail(ing) -> fail fil(ing) -> file The rule to map to a single letter causes the removal of one of the double letter pair. The -E is put back on -AT, -BL and -IZ, so that the suffixes -ATE, -BLE and -IZE can be recognised later. This E may be removed in step 4. 108 Step lc (*v*) Y -> I happy -> happi sky -> sky Step 1 deals with plurals and past participles. The subsequent steps are much more straightforward. Step 2 (m>0) ATIONAL (m>0)TIONAL --> > ATE TION ENCE ANCE IZE ABLE AL ENT -> -> -> -> -> -> -> E -> OUS (m>0) ENCI (m>0) ANC1 (m>0) IZER (m>0) ABL1 (m>0) ALLI (m>0) ENTL1 (m>0) ELI (m>0) OUSL1 (m>0) IZATION -> IZE (m>0)ATION -> ATE (m>0)ATOR -> ATE (m>0)ALISM -> AL (m>0) IVENESS-> IVE (m>0) FULNESS -> FUL (m>0) OUSNESS -> OUS (m>0)ALITI -> AL (m>0)IVITI -> IVE (m>0)BILITI -> BLE relational conditional rational valenci hesitanci > relate -> condition -> rational > valence -> hesitance digitizer -> digitize conformabli -> conformable radicalli -> radical differentli -> different vileli -> vile analogousli -> analogous vietnamization -> vietnamize predication -> predicate operator -> operate feudalism -> feudal decisiveness -> decisive hopefulness -> hopeful callousness -> callous formaliti -> formal sensitiviti -> sensitive sensibiliti -> sensible The test for the string SI can be made fast by doing a program switch on the penultimate letter of the word being tested. This gives a fairly even breakdown of the possible values of the string SI. 
It will be seen in fact that the SI-strings in step 2 are presented here in the alphabetical order of their penultimate letter. Similar techniques may be applied in the other steps. Step 3 (m>0)ICATE-> IC (m>0) ATIVE -> (m>0)ALIZE-> AL (m>0)ICITI-> IC (m>0) 1CAL -> IC (m>0)FUL -> (m>0)NESS -> Step 4 (m>l)AL -> (m>l)ANCE -> (m>l)ENCE -> (m>l)ER -> (m>l) IC -> (m>l) ABLE -> (m>l) IBLE -> triplicate -> triplic formative -> form formalize -> formal electriciti -> electric electrical -> electric hopeful -> hope goodness -> good revival -> reviv allowance -> allow inference -> infer airliner -> airlin gyroscopic -> gyroscop adjustable -> adjust defensible -> defens 109 (m>l) ANT -> irritant -> irrit (m>l)EMENT-> replacement -> replac (m>l)MENT -> adjustment -> adjust (m>l)ENT -> dependent -> depend (m>l and (*S or *T)) ION -> adoption -> adopt (m>l)OU -> homologou •..-> homolog (m>l)ISM -> communism -> commun (m>l) ATE -> activate -> activ (m>l) ITI -> angulariti -> angular (m>l)OUS -> homologous -> homolog (m>l)lVE -> effective -> effect (m>l) IZE -> bowdlerize -> bowdler The suffixes are now removed. Al l that remains is a little tidying up. S t e p 5a (m>l)E -> probate -> probat rate -> rate (m=l and not *o) E -> cease -> ceas Step 5b (m > 1 and *d and *L) -> single letter contrail -> control roll -> roll The algorithm is careful not to remove a suffix when the stem is too short, the length of the stem being given by its measure, m. There is no linguistic basis for this approach. It was merely observed that m could be used quite effectively to help decide whether or not it was wise to take off a suffix. For example, in the following two lists: list A list B RELATE DERIVATE PROBATE ACTIVATE CONFLATE DEMONSTRATE PIRATE NECESSITATE PRELATE RENOVATE -ATE is removed from the list B words, but not from the list A words. This means that the pairs DERIVATE/DERIVE, ACTIVATE/ACTIVE, DEMONSTRATE/DEMONS- TRABLE, NECESSITATE/NECESSITOUS, will conflate together. The fact that no attempt is made to identify prefixes can make the results look rather inconsistent. Thus PRELATE does not lose the -ATE, but ARCHPRELATE becomes ARCHPREL. In practice this does not matter too much, because the presence of the prefix decreases the probability of an erroneous conflation. Complex suffixes are removed bit by bit in the different steps. Thus GENERALIZATIONS is stripped to GENERALIZATION (Step 1), then to GENERALIZE (Step 2), then to GENERAL (Step 3), and then to GENER (Step 4). OSCILLATORS is stripped to OSCILLATOR (Step 1), then to OSCILLATE (Step 2), then to OSCILL (Step 4), and then to OSCIL (Step 5). In a vocabulary of 10,000 words, the reduction in size of the stem was distributed among the steps as follows: 110 Suffix stripping of a vocabulary of 10,000 words Number of words reduced in step 1: 3597 tl 2: 766 tl 3: 327 tl 4: 2424 11 5: 1373 Number of words not reduced: 3650 The resulting vocabulary of stems contained 6370 distinct entries. Thus the suffix stripping process reduced the size of the vocabulary by about one third. I l l Appendix E: Notation and Formula Table Formula Notation and Formula Descriptions •£><.p*., ft) =_L g_(r>iog p a i * J (1-1) The relative entropy or Kullback-Liebler distance between two probability distributions Pa and PZ>. Where Y is the data collection and Pa, Pb are two probability distributions ofY. Div (Pa, Pb) = D(Pa, Pb) + D(Pb, Pa) (1.2) Divergence measure of KL distance. D(Pa, Pb), D(Pb, Pa) are defined in formula 1.1. 
Distance(i, j) = Div(Pi(left), Pj(left)) + Div(Pi(right), Pj(right))  (1.3)
The distance between cluster i and cluster j is the sum of the KL distance between the left-context probability distributions, p(left), of the two clusters and the KL distance between the right-context probability distributions, p(right), of the two clusters.

pi(left(vk)) = p(left(vk), cluster i) / p(cluster i), and analogously for pi(right(vk))  (1.4)
Calculation formula of p(left) and p(right). pi(left(vk)) is the probability that word vk is found to the left of words in cluster i (Chotimongkol and Rudnicky, 1999). Similarly, pi(right(vk)) is the probability that word vk is found to the right of words in cluster i.

AMI = Σ(i,j) p(i, j) log [ p(i, j) / (p(i) p(j)) ]  (1.5)
Definition of Average Mutual Information (AMI), where p(i, j) is the bi-gram probability of cluster i and cluster j, i.e. the probability that a word in cluster i precedes a word in cluster j (Chotimongkol and Rudnicky, 1999).

C(u, v) = Σj f(Su, j) × f(Sv, j)  (2.1)
Definition: the frequency of a term t(i) in a document d(j) is referred to as f(i, j). Let m = (m(i, j)) be an association matrix with |Sl| rows and |Dl| columns, where m(i, j) = f(i, j). Let mt be the transpose of m. The matrix s = m × mt is a local term-term association matrix. Each element s(u, v) in s expresses a correlation C(u, v) between the terms Su and Sv, as given by formula 2.1.

C(u, v) = Σi Σj 1 / r(ki, kj)  (2.2)
Definition: let the distance r(ki, kj) between two keywords ki and kj be given by the number of words between them in the same document plus one. If ki and kj are in distinct documents, we take r(ki, kj) = ∞. A local term-term metric correlation matrix s is then defined so that each element s(u, v) expresses a metric correlation C(u, v) between the terms Su and Sv, as given by formula 2.2.

A(ti, tj) = Σ W / D(ti, tj), where W = 1 if ti and tj appear in the same sentence (n = 0); W = (N − n) / N if 0 < n < N; W = 0 if n ≥ N  (2.7.1)
Definition of the Affinity Value measure (not symmetric). Σ means that all occurrences of ti and tj in the whole document collection are accumulated. D is the distance: the number of terms between the two terms being evaluated plus one. N is the number of sentences in the evaluation scope, while n is the number of sentences between the two terms being evaluated plus one (if the two terms are in the same sentence, n = 0).

AVE(ti, tj) = A(ti, tj) / CN(ti, tj) if CN(ti, tj) ≠ 0; otherwise AVE(ti, tj) = 0  (2.7.2)
Definition of the Average Affinity value measure. CN(ti, tj) is the number of co-occurrences of terms ti and tj. A(ti, tj) is defined in formula 2.7.1.

CO(N) = N! / (2 × (N − 2)!)  (4.1.1)
For a cluster a, N is the number of unique terms in cluster a. The number of unique combinations of any two unique terms in cluster a, CO(N), is calculated using formula 4.1.1.

AveSim(cluster a) = Σ(ti, tj) A(ti, tj) / CO(N)  (4.1.2)
The average within-cluster similarity is defined by formula 4.1.2. Σ(ti, tj) means accumulating the affinity value over all unique combinations of ti and tj in cluster a. CO(N) is defined in formula 4.1.1.

Appendix F: Original Output Files from the System (provided on CD-ROM)

Appendix G: System Program Code (provided on CD-ROM)
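For readers who prefer a concrete rendering of the affinity formulas in Appendix E (2.7.1, 2.7.2, 4.1.1 and 4.1.2), the following is a minimal illustrative sketch in Python. It is not the system code referred to in Appendix G: the document representation (a corpus as a list of documents, each document a list of sentences, each sentence a list of stemmed index terms), the function names, and the default window size N are assumptions made purely for illustration.

from itertools import product, combinations

def pair_affinity(doc_sentences, ti, tj, N=5):
    # A(ti, tj) and the co-occurrence count CN(ti, tj) for one document
    # (formula 2.7.1).  doc_sentences: a list of sentences, each a list of
    # stemmed terms.  N: the evaluation scope (sentence window), assumed here.
    occurrences = {ti: [], tj: []}
    position = 0
    for s_index, sentence in enumerate(doc_sentences):
        for term in sentence:
            if term in occurrences:
                occurrences[term].append((s_index, position))
            position += 1
    total, cn = 0.0, 0
    for (si, wi), (sj, wj) in product(occurrences[ti], occurrences[tj]):
        n = abs(si - sj)                  # n = 0 when both terms share a sentence
        if n >= N or wi == wj:            # outside the scope (weight 0), or same token
            continue
        w = 1.0 if n == 0 else (N - n) / N
        d = abs(wi - wj)                  # number of terms between them, plus one
        total += w / d
        cn += 1
    return total, cn

def average_affinity(corpus, ti, tj, N=5):
    # AVE(ti, tj) of formula 2.7.2: affinity accumulated over the whole
    # collection, divided by the number of co-occurrences CN(ti, tj).
    total, cn = 0.0, 0
    for doc in corpus:
        a, c = pair_affinity(doc, ti, tj, N)
        total, cn = total + a, cn + c
    return total / cn if cn else 0.0

def ave_sim(cluster_terms, corpus, N=5):
    # Average within-cluster similarity (formulas 4.1.1 and 4.1.2).  len(pairs)
    # equals CO(N) = N! / (2 * (N - 2)!).  Using the average affinity of 2.7.2
    # as the pairwise value is an assumption of this sketch.
    pairs = list(combinations(sorted(set(cluster_terms)), 2))
    if not pairs:
        return 0.0
    return sum(average_affinity(corpus, a, b, N) for a, b in pairs) / len(pairs)

As a usage illustration, a call such as ave_sim(["privat", "virtual", "network"], corpus) would produce the kind of "Ave Sim" score reported for the expert-identified concepts in Appendix A, up to the exact definitions and parameter settings used by the implemented system.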
