"Business, Sauder School of"@en . "Management Information Systems, Division of"@en . "DSpace"@en . "UBCV"@en . "Yin, Nawei"@en . "2009-11-26T21:11:01Z"@en . "2004"@en . "Master of Science in Business - MScB"@en . "University of British Columbia"@en . "With the rapid development of information technology, individuals using the technology are liable to be overwhelmed by the excessive amounts of information available when conducting online (local or remote) document searches. It is important therefore that users specify the correct search terms. However, a user does not always know which terms to use and often the same idea can be described by different terms. Constructing lists of possible search terms for different domains would require a very substantial effort by experts in each domain. To alleviate these problems, automated techniques can be valuable to extract concepts and meaningful phrases for specific domains. This work is an exploratory study of automated extraction of compound concepts from a collection of documents in a specific domain. The concept-extraction methods used in this study employed clustering techniques based on distance measures that reflect term affinity statistics rather than techniques based on similarity measures adopted in most previous works. The study compared the effects of different methods of calculating affinities, depending on the sizes of textual units where terms co-occur and on directionality and asymmetry between terms. The accounting context was used as a case study to provide the data. An accounting expert evaluated the resulting clusters produced by the clustering program. As demonstrated by our results, the method identified meaningful accounting compound concepts and phrases. The research also indicated which affinity types generated better results. For example, affinities based on occurrence of terms within a document produced the poorest results. There was a significant manual effort involved in \"preprocessing\" the data prior to compound concept identification. However, we believe the techniques explored might be useful for users to search relevant information within individual domains and can be extended to support the construction of domain-specific thesauri."@en . "https://circle.library.ubc.ca/rest/handle/2429/15815?expand=metadata"@en . "5784465 bytes"@en . "application/pdf"@en . "USING TERM PROXIMITY MEASURES FOR IDENTIFYING COMPOUND CONCEPTS: AN EXPLORATORY STUDY by NAWEI Y I N A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN BUSINESS ADMINISTRATION in THE F A C U L T Y OF G R A D U A T E STUDIES DIVISION OF M A N A G E M E N T INFORMATION SYSTEMS SAUDER SCHOOL OF BUSINESS We accept this thesis as conforming to the required standard THE UNIVERSITY OF BRITISH C O L U M B I A August 2004 \u00C2\u00A9 Nawei Yin, 2004 THE UNIVERSITY OF BRITISH COLUMBIA FACULTY OF GRADUATE STUDIES Library Authorization In p resen t ing th is thes is in part ia l fu l f i l lment of t h e requ i remen ts fo r a n a d v a n c e d d e g r e e at the Un ivers i ty of Br i t ish C o l u m b i a , I a g r e e that t he L ibrary shal l m a k e it f ree ly ava i lab le for re fe rence a n d s tudy . I f u r the r a g r e e that pe rm iss ion for e x t e n s i v e c o p y i n g of th is thes is fo r scho la r l y p u r p o s e s m a y be g r a n t e d by t h e h e a d of m y d e p a r t m e n t or b y h is or her represen ta t i ves . 
Abstract

With the rapid development of information technology, individuals using the technology are liable to be overwhelmed by the excessive amounts of information available when conducting online (local or remote) document searches. It is therefore important that users specify the correct search terms. However, a user does not always know which terms to use, and often the same idea can be described by different terms. Constructing lists of possible search terms for different domains would require a very substantial effort by experts in each domain. To alleviate these problems, automated techniques can be valuable for extracting concepts and meaningful phrases for specific domains. This work is an exploratory study of automated extraction of compound concepts from a collection of documents in a specific domain. The concept-extraction methods used in this study employed clustering techniques based on distance measures that reflect term affinity statistics, rather than techniques based on the similarity measures adopted in most previous works. The study compared the effects of different methods of calculating affinities, depending on the sizes of the textual units where terms co-occur and on directionality and asymmetry between terms. The accounting context was used as a case study to provide the data. An accounting expert evaluated the resulting clusters produced by the clustering program. As demonstrated by our results, the method identified meaningful accounting compound concepts and phrases. The research also indicated which affinity types generated better results; for example, affinities based on occurrence of terms within a document produced the poorest results. There was a significant manual effort involved in "preprocessing" the data prior to compound concept identification. However, we believe the techniques explored might be useful for users searching for relevant information within individual domains and can be extended to support the construction of domain-specific thesauri.
Table of Contents

Abstract
Table of Contents
List of Tables
Acknowledgments
Section 1 Introduction
  1.1 Motivation
  1.2 Thesis Framework
Section 2 Literature Review
  2.1 Introduction to Thesaurus Construction
    2.1.1 Thesauri
    2.1.2 Manual Thesauri
    2.1.3 Automatic Thesauri
  2.2 Automatic Techniques Guiding This Research
    2.2.1 Document Collection
    2.2.2 Object Filtering
    2.2.3 Automatic Indexing
    2.2.4 Co-occurrence Analysis
    2.2.5 Evaluation
Section 3 Our Approach and Research Questions
Section 4 Our Affinity Measures
  4.1 Term Affinity Statistics
  4.2 Five Textual Units
    4.2.1 SentenceOtd Textual Unit
    4.2.2 SentenceUpTo5td Textual Unit
    4.2.3 SentenceNoRestriction Textual Unit
    4.2.4 ParagraphNoRestriction Textual Unit
    4.2.5 DocumentNoRestriction Textual Unit
  4.3 Estimating Fifteen Schemes' Affinity Values
Section 5 Experiment
  5.1 Domain Dependent Preprocessing - Tokenizing
    5.1.1 File Extraction
    5.1.2 Reformatting the Document Structure
    5.1.3 Changing the Short-Forms of the Words in the Abbreviation List into Term-Phrases
    5.1.4 Converting Meaningful Words to Term-Phrases
    5.1.5 Removing Stop-Words
    5.1.6 Producing a Full Token List
    5.1.7 Removing Unwanted Tokens
    5.1.8 Consolidating Wanted Tokens
    5.1.9 Generating the Final Reduced Token List
  5.2 Computing Term Affinities
    5.2.1 Removing Unwanted Punctuations
    5.2.2 Converting 1,344 Tokens to Token IDs
    5.2.3 600 Tokens' Affinity Values
      5.2.3.1 Generating the 600 Token List for Clustering
      5.2.3.2 Summary of the Steps to Obtain the 600 Tokens
      5.2.3.3 The Token IDs of Any Two Terms
      5.2.3.4 An Example of a Dummy File to Illustrate Affinity Calculations
      5.2.3.5 The Program's Affinity Calculation of Fifteen Schemes
  5.3 Clustering 600 Tokens
    5.3.1 Hierarchical Clustering Using Matlab
    5.3.2 Affinity Values Fit into Matlab
    5.3.3 Clustering Outputs
  5.4 Evaluation
    5.4.1 Clustering Terms to Be Evaluated
    5.4.2 Expert's Evaluation
      5.4.2.1 Instructions for Expert's Evaluation
      5.4.2.2 Sample Evaluation Results
      5.4.2.3 Statistical Analysis of Expert's Evaluation Results for All Fifteen Schemes
    5.4.3 Comparing the Expert's Evaluation Results with the Price Waterhouse Accounting Thesaurus
      5.4.3.1 Sample Comparison
      5.4.3.2 Statistical Comparisons of All Fifteen Schemes
  5.5 Discussion
    5.5.1 Can Relevant Concepts Be Extracted Automatically from a Set of Documents in a Given Domain? (Q1.1) What Type of Semantic Relations Can Be Identified in the Extracted Concepts? (Q1.2)
    5.5.2 What Parameters Can Affect the Quality of the Results? (Q2)
      5.5.2.1 Proximity - Sentence, Paragraph and Document Levels (Q2.1)
      5.5.2.2 Distance Within the Sentence Level (Q2.2)
      5.5.2.3 Directionality and Asymmetry (Q2.3)
Section 6 Automatically Identifying Phrases (Q3)
  6.1 Single-Word Tokenizing
    6.1.1 Broken Phrases List
    6.1.2 Wanted Single Words
    6.1.3 Converting Plural Wanted Single Words into Singulars
    6.1.4 905 Final Single Tokens
  6.2 905 Single Tokens' Affinity Calculations
    6.2.1 One Level - SentenceOtd
    6.2.2 Four Cases - Estimating Affinities
  6.3 Clustering 905 Single Tokens
  6.4 Evaluation - 905 Single Token Clustering
    6.4.1 A Sample Evaluation
    6.4.2 Statistical Evaluation Results
  6.5 Automatic Phrase Discussions - 905 Single Tokens Clustering
    6.5.1 Our Techniques Can Automatically Identify Some Accounting Phrases
    6.5.2 Directionality and Asymmetry
Section 7 Conclusions and Future Research
  7.1 Conclusions
  7.2 Contributions
  7.3 Limitations and Future Research
Bibliography
Appendices
  Appendix A Domain Dependent Preprocessing - Tokenizing
  Appendix B Computing Term Affinities
  Appendix C Clustering 600 Tokens Data Outputs
  Appendix D Statistical Evaluation Results of All Fifteen Schemes - Clustering 600 Tokens
  Appendix E Automatically Identifying Phrases
    Appendix E1: Single Word Tokenizing
    Appendix E2: Clustering 905 Single Tokens Data Outputs
    Appendix E3: Single Tokens Evaluation Results (Provided in CD-ROM)

List of Tables

Table 3.1 Summary of Our Approach
Table 4.1 Fifteen Schemes' Affinity Estimations
Table 5.1 Summary of the Steps to Obtain the 600 Tokens
Table 5.2 Example of Comparing Two Token IDs
Table 5.3 Dummy File - Two Tokens' Co-occurrences in ParagraphNoRestriction
Table 5.4 Dummy File - Two Tokens' Statistical Affinity Values in ParagraphNoRestriction
Table 5.5 Top Ten Clustering Data of SentenceOtdAB_A
Table 5.6 Sample Evaluation Terms - Top Three Clusters of SentenceUpTo5tdAB_A
Table 5.7 Sample Evaluation Relation Type Alternatives
Table 5.8 Sample Expert's Evaluation Results - Top Three Clusters of SentenceUpTo5tdAB_A
Table 5.9 Statistical Analysis of the Expert's Evaluation Results - All Fifteen Schemes
Table 5.10 Sample Comparison of Expert's Evaluation Results with the Price Waterhouse Thesaurus - Top Three Clusters of SentenceUpTo5tdAB_A
Table 5.11 Comparing Fifteen Schemes of the Expert's Evaluation Results with the Price Waterhouse Thesaurus
Table 5.12 Statistical Results of "No direction relation" Type for All Fifteen Schemes
Table 5.13 Statistical Results of Average Relevance Scores for All Fifteen Schemes
Table 5.14 Statistical Results of the Top Two Most Frequently Chosen Relation Types in All Fifteen Schemes
Table 5.15 Identical Cluster Percentages, Comparing the Same Scheme in SentenceOtd and SentenceUpTo5td with SentenceNoRestriction Respectively
Table 5.16 Comparing the Statistical Analysis of the Expert's Evaluation Results of the Same Scheme in EvaluateSentenceOtd with EvaluateSentenceUpTo5td, and with EvaluateSentenceNoRestriction
Table 5.17 Comparing the Statistical Analysis of the Expert's Evaluation Results of the Same Scheme in EvaluateSentenceUpTo5td with EvaluateSentenceNoRestriction
Table 5.18 Directionality - Comparing the Statistical Analysis of the Expert's Evaluation Results of the Scheme AB_A at Five Levels and the Scheme BA_A at Five Levels
Table 5.19 Asymmetry - Comparing the Statistical Analysis of the Expert's Evaluation Results of the Schemes AB_A and AB_B and the Schemes BA_A and BA_B at Five Levels
Table 6.1 Top Ten Automatic Phrase Clustering Data in SentenceOtdAB_A
Table 6.2 Sample Evaluation of 905 Single Token Clusters - Top Five Clusters in SentenceOtdAB_A
Table 6.3 Statistical Evaluation of Automatic Phrases for Four Cases at SentenceOtd Level
Table 6.4 Asymmetry - Comparing the Evaluation of Automatic Phrases in SentenceOtdAB_A with SentenceOtdAB_B, and in SentenceOtdBA_A with SentenceOtdBA_B

Acknowledgments

I would like to thank Professor Yair Wand for instructing and supervising me throughout this study, and Professor Carson Woo and Professor Jacob Steif for their comments on the development of this thesis. I want to express my sincere appreciation to Ph.D. candidate Ofer Arazy for his advice and extensive support during the entire process of developing this research. I also would like to thank Ms. Yongwei Yin for her expert judgement of the experimental results, and Mr. Steve Doak for editing my thesis. Furthermore, I sincerely appreciate the continuous encouragement and assistance I have received from my husband, Mr. Jianwen Zhang, and my mother, Ms. Xiuge Sun.

1. INTRODUCTION

1.1 Motivation

With the recent rapid development of advanced technology, people nowadays can easily access and locate the information they need by searching online or through local database systems. Users, however, might be overwhelmed by the excessive amounts of information available from these various sources, and they may feel confused about how to effectively retrieve the needed information. This problem was labelled information overload by Chen et al. (1995). Another common predicament searchers encounter is a vocabulary problem: people often use different terms to describe the same concept. Furnas et al. (1987) noted that when two people spontaneously made a word choice for objects from various domains, the probability that they chose the same term was lower than 20%. Subsequently, Chen et al. (1995, p.177) argued, "Due to the unique backgrounds, training and experiences of different users, the chance of two people using the same term to describe a concept is quite low and even the same person may use different terms to describe the same concept at different times (due to the learning process and the evolution of concepts)."

These problems exist in many industries, including financial accounting. KPMG Consulting LLC (2000) claimed that over two-thirds of the firms in a survey of 423 organizations from various fields were overwhelmed by the information in their systems, and 50% of the organizations complained of difficulty when attempting to locate information (Garnsey, 2002). Because financial standards change over time, and also because these standards apply to various types of organizations, financial information has become more complex. This is why searchers such as accountants, financial workers, business professionals and other general users with diversified backgrounds and various searching goals are usually unfamiliar with the varied terms that can represent the same concept in accounting (Garnsey, 2002). To handle these problems, researchers have developed various automatic thesaurus-generation methods to identify related concepts in different applications, such as Information Retrieval (IR), Latent Semantic Indexing (LSI), text mining and knowledge discovery, among others.
The traditional definition of "thesaurus" in the Merriam-Webster Online Dictionary is "a: book of words or information about a particular field or set of concepts; especially: a book of words and their synonyms; b: a list of subject headings or descriptors usually with a cross-reference system for use in the organization of a collection of documents for reference and retrieval" (http://www.m-w.com). As Milstead et al. (1993) have noted, a thesaurus is characterised both as a tool for writers "to help select the best word to convey a specific nuance of meaning," and as an indexing system that can serve as "an information storage and retrieval tool: a listing of words and phrases authorized for use in an indexing system, together with relationships, variants and synonyms, and aids to navigation through the thesaurus." WordNet, an on-line lexical database, has since become one of the most popular machine-readable thesauri. For more information about WordNet, see the introduction by Miller et al. (1993).

A thesaurus can lead searchers to concepts associated with an initial term. Regarding the various definitions of concept in the lexicon of different languages, see the review by Sowa (2000). Grefenstette (1994, p.24) explained a generalized condition for developing a thesaurus: "one of the aspects of language variability is that many different words can be used to describe the same concept, and here we have indications that an automatic means of discovering the words associated with a concept is possible." Leroy and Chen (2001, p.263) pointed out that "terms and concepts are different entities. A concept is the underlying meaning of a set of terms. As such, each concept can be expressed by many different terms. For example, the concept cancer has 20 terms associated with it, two of which are malignant tumor and malignant tumoral disease." Within the field of accounting, Garnsey (2002) suggested, "if clusters of related accounting terms/phrases can be successfully constructed for accounting, it should be possible to give users of accounting information domain-specific knowledge. Eventually, this knowledge may be integrated into a retrieval system to improve the efficiency of searches for information about specific accounting topics."

In conducting our literature survey, we noticed that studies of automatic identification of accounting concepts are few in number and are only preliminary forays into the field. Therefore, we focus our research on constructing a feasible approach to automatically grouping accounting terms and phrases into different compound concepts. A compound concept (hereinafter referred to as a concept, for simplicity) in this thesis refers to a set of related terms in a specific domain. Hence, when a user wants to describe a concept, besides those words the user is able to think of, he or she can select terms belonging to that same concept suggested by the automatic mechanism. The approach proposed in our research, if practical and useful, could lead to an automatic tool that enhances users' searching performance.

1.2 Thesis Framework

In the following section, we review previous research performed in related fields and position our research therein. Following this, our research approach is introduced, and the research questions that delineate our study's scope are presented in Section Three. In Section Four, we introduce our own statistical affinity measures to address the research questions.
An experiment on automatically extracting accounting concepts is described in Section Five, and data analysis is conducted to derive our experimental results. In Section Six, we describe another study on automatic accounting phrase generation and discuss the findings. Finally, we draw conclusions, summarize the contributions we have made, and remark on directions for future research.

2. LITERATURE REVIEW

In this section, we summarize our survey of general thesaurus construction and of the existing techniques to automatically construct thesauri, which have influenced our research position and our approach to automatically extracting accounting concepts.

2.1 Introduction to Thesaurus Construction

2.1.1 Thesauri

Grefenstette (1993) believed that a domain-specific thesaurus could identify important concepts in the domain hierarchically and could suggest alternative words as well as phrases to describe the same concept in the domain. Schütze and Pedersen (1997, p.308) explained that "a thesaurus is a data structure that defines semantic relatedness between words." Furthermore, according to Gietz (2001), "a thesaurus is a collection of relevant terms ordered in a hierarchy of super ordinate and subordinate concepts and homonyms."

2.1.2 Manual Thesauri

The most traditional way to construct a thesaurus is to build it manually using a semantic mapping table. However, this is expensive and time-consuming, inasmuch as it requires extensive involvement of domain experts. As well, manual construction is only worthwhile in a specific domain when the value of repeated use of the thesaurus exceeds the construction cost (Schütze and Pedersen, 1997). Because constructing a manual thesaurus incurs significant human and time costs, researchers have studied various approaches to building system-generated automatic thesauri.

2.1.3 Automatic Thesauri

We first need to investigate the methodologies for automatically constructing a thesaurus. Previous studies have already developed linguistic and statistical measures related to automatic thesaurus construction. For instance, in the area of linguistics, Curran and Moens (2002) noted that some systems extract related terms that appear together in particular contexts by recognising linguistic patterns (e.g. X, Y and other Zs) which link synonyms and hyponyms. Regarding statistical measures, Grefenstette (1994, p.23) claimed that most of the semantic extraction work "was based on the statistics of co-occurrence of words within the same window of text, where a window can be a certain number of words, sentences, paragraphs or an entire document". Grefenstette (1994, p.26) then noted that Church and Hanks (1990) "use textual windows to calculate the mutual information between a pair of words", employing an information-theoretic definition of mutual information over a corpus on the premise that word pairs with high mutual information are usually semantically related. Jang et al. (1999) also described techniques that use mutual information statistics to identify the lexical relations between pairs of words. However, these works did not address compound concept identification.

One approach to constructing an automatic thesaurus is to reuse existing online lexicographic databases, such as WordNet. Other attempts to incorporate existing thesauri were explained in detail by Chen et al. (1997). However, generic thesauri like WordNet are not specialized enough for domain-specific databases (Caraballo, 1999).
Chen et al. (1997, pp.20-21) reviewed in detail several algorithmic approaches to automatic thesaurus generation developed in numerous investigations, and concluded that most methodologies compute coefficients of "relatedness" between terms using statistical co-occurrence measures such as cosine, Jaccard and Dice similarity functions (Chen and Lynch, 1992; Crouch, 1990; Rasmussen, 1992; Salton, 1989).

The field that most extensively utilizes thesaurus construction is Information Retrieval (IR). The goal of IR is to develop systems that retrieve all documents relevant to a user's query while retrieving only documents containing relevant information. Because our research focus is domain-specific, we reviewed past approaches primarily in the context of domain-specific automatic thesaurus generation. We also referred to other generic automatic thesaurus techniques that we thought useful and similar to our approach; these relevant techniques will be discussed in the next section.

The University of Arizona Artificial Intelligence Lab, headed by Dr. Hsinchun Chen, conducted the most prominent series of studies on automatically developing a domain-specific thesaurus. As Chen et al. (1998) summarized, in their previous research they generated domain-specific thesauri in different domains such as Russian computing (Chen and Lynch, 1992; Chen et al., 1993), business (Chen et al., 1994), and molecular biology (Chen et al., 1995). More recently, Hauck et al. (2001) conducted an experiment in geoscience, and Leroy and Chen (2001) studied medical terminology. Because their studies and techniques have been cited in many papers on constructing automatic thesauri and have also directed our research, we introduce their essential techniques in detail in the following section.

As for our research interest, the accounting domain, we have identified only three relevant papers classifying accounting concepts. Gangolly and Wu (2000) conducted preliminary research using statistical analysis of term-document frequencies to automatically classify accounting concepts. They used Indexicon, an indexing utility, to preprocess the text and produce terms. An agglomerative nesting algorithm was then adopted to derive clusters of concepts. This was only a preliminary investigation, and their objective was only a rudimentary exploratory data analysis of financial accounting standards; hence, we could not find detailed descriptions to guide our research. Although they found that some rudimentary clusters were differentiable and could extend to further research, the authors did not interpret the resulting clusters in detail. Nevertheless, encouraged by their finding that accounting terms can be classified through hierarchical clustering, we sought other related techniques that could be used to develop our own method to automatically classify accounting concepts.

Garnsey (2001, 2002) investigated the feasibility of a statistical method using the frequencies of particular words within documents. Latent Semantic Indexing (LSI) and agglomerative clustering were used to derive clusters of related accounting concepts. This research found related terms included in the resulting clusters. We thought some techniques adopted in this study, such as accounting term/phrase identification and evaluation, were very valuable for our research. We will discuss these techniques in detail in the following section.

2.2 Automatic Techniques Guiding This Research
Chen et al. (1995, 1997) developed techniques for domain-specific automatic thesaurus construction, which we have treated as guidelines in our research to automatically identify accounting concepts. To obtain a useful domain-specific automatic thesaurus, Chen et al. suggested three criteria: a complete document collection capturing the full knowledge of the domain, an appropriate co-occurrence function, and user-directed interaction. In this section, the techniques related to our research are introduced in the following order: document collection, object filtering, automatic indexing, co-occurrence and cluster analysis, and evaluation. Meanwhile, the derived techniques that can be used in our approach are proposed and justified when discussing each step.

2.2.1 Document Collection

Chen et al. (1995, 1997) emphasized that acquiring a complete and up-to-date collection of documents from representative and authoritative domain sources was the key to creating a domain-specific thesaurus. In the accounting domain, Gangolly and Wu (2000) and Garnsey (2001, 2002) chose the FARS (Financial Accounting Research System) database as their data collection. Garnsey (2002) pointed out that FARS, "with its key word search feature, is an improvement over the print version of GAAP. However, it does not address the fact that users may be unfamiliar with the vocabulary and, therefore, do not input the required terms to achieve adequate retrieval." Garnsey's (2002) research objective was to provide sets of related accounting concepts that were derived from the terms actually used in the authoritative literature in the FARS database. Our research also selected the accounting literature from the FARS database as the collection to work with.

2.2.2 Object Filtering

Chen et al. (1997) claimed that domain-specific controlled lists of keywords in databases (for example, the subject indexes at the back of a textbook) can help identify key search vocabularies to improve information retrieval. Chen et al. used four different sources from the biological sciences to compile a vocabulary list, including researcher names, gene names, experimental methods and subject descriptors. They identified terms that matched terms in the known vocabulary, labelling this process object filtering. Garnsey (2002) adopted this method and filtered terms using a combination of GAAP guide indexes from two accounting texts, and intermediate and advanced textbooks. To make the list more complete, in several cases additional phrases were added to the indexes (for example, "public traded company"). In our study we used the following sources to guide the vocabulary composition: Accounting Dictionary - Accounting Glossary - Accounting Terms (http://www.ventureline.com/glossary.asp) and Financial Accounting: An Introduction to Concepts, Methods, and Uses, 8th Edition (Stickney and Weil, 1997).

2.2.3 Automatic Indexing
Referring to Salton's (1989) blueprint method for automatic indexing, Chen et al. (1995) implemented automatic indexing techniques in the following order: "word identification" to identify words in each document without considering punctuation and case, and "stop-wording" to create a domain-specific stop-word list by removing non-semantic words (such as "on", "in", "at" and "there") as well as verbs that were too general and irrelevant to represent the meaning of the document. Although standard "stemming" was adopted in the beginning, the researchers later realized drawbacks to the method and removed the stemming process to "avoid creating noise and ungrammatical phrases. For example, CLONING will not be stemmed as CLONE (one is a process, the other is the output)." Chen et al. added that "term-phrase formation" was then performed by the system to form phrases containing up to three adjacent words. For example, "DAUER", "LARVA", "FORMATION", "DAUER LARVA", "LARVA FORMATION", and "DAUER LARVA FORMATION" were generated from the three adjacent words "DAUER LARVA FORMATION." Chen et al. (1995) in this paper referred to these phrases simply as "terms."

Garnsey (2002) performed word identification and stop-wording in automatic indexing by following the procedures used by Chen et al. (Chen and Lynch, 1992; Chen et al., 1998; Chen et al., 1995). Stemming was not used due to the fact that, in accounting, different forms of a word may have very different meanings; for example, "warranty" means "product guarantee", whereas "warrant" means "certificate representing stock rights." Instead, a limited number of closely related terms were combined (for example, singular and plural, past and present tense). This coincided with the work of Chen et al. (1995), which found that stemming produces noise and creates ungrammatical phrases. Subsequently, stop words, non-semantic-bearing words, adjectives, adverbs, pure verbs (such as "belong" and "solve") and non-accounting terms (such as "army" and "standard") were removed. This process was similar to that used by Chen et al. (1997), where a domain-specific stop-word list for biology containing about 600 very general molecular biology terms (for example, "gene", "process", and "mutation") was created to remove general terms considered irrelevant in the thesaurus. High-frequency words, which were too general to discriminate content, were also eliminated. Consistent with the work of Chen et al. (1998), terms which did not occur in at least three documents were eliminated. As well, low-frequency words that do not contribute to content were removed.

The automatic indexing step in our research was processed similarly to both Garnsey (2002) and Chen et al. (1995), including term-phrase formation. We know that automatic indexing has some timesaving and cost-saving advantages over manual indexing, but it is less accurate. Callan (1995) declared that several experiments have demonstrated that "a combination of manual and automatic indexing is superior to either alone." Accordingly, some manual work was involved when selecting and identifying terms in our study, as is detailed later in the Experiment section of this paper.

2.2.4 Co-occurrence Analysis

Chen and Lynch (1992, p.887) summarized that "virtually all techniques for automatic thesaurus generation are based on the statistical co-occurrence of word types in text."
Chen et al. (1995), on the other hand, claimed the most popular technique for constructing automatic thesauri is to compute probabilities of terms co-occurring in all the documents composing a data collection, a process known as co-occurrence analysis. The first stage in many cluster analyses is to convert the raw data into a matrix, usually of similarity, dissimilarity or distance measures. The output of a cluster analysis is a number of groups, clusters, or classes of individuals. Lassi (2002) reviewed the co-occurrence analysis developed by Chen et al. (1995, 1997), which first computes each term's document frequency (the number of documents that a word appears in) and term frequency (the number of times a word occurs in a document). The inverse document frequency was then computed, assigning higher weights to multiple-word terms than to single-word terms, because multiple-word terms usually carry more precise semantic meaning than single words do. This co-occurrence measure was based, however, on document-level term co-occurrence analysis and was criticized by Schütze and Pedersen (1997).

Schütze and Pedersen (1997, p.309) developed a lexical co-occurrence method where "two terms lexically co-occur if they appear in the text within some distance of each other (typically a window of k words). Qualitatively, the fact that two words often occur close to each other is more likely to be significant than the fact that they occur in the same documents, especially if documents are long." Further, they noted "if the goal is to capture information about specific words, we believe that lexical co-occurrence is the preferred basis for statistical thesaurus construction." In Latent Semantic Indexing (LSI), document-by-term matrices are used in attempts to discover the relationships between terms and documents. Although lexical co-occurrence thesauri are closely related to LSI, Schütze and Pedersen (1997, p.310) worked with a "term-by-term matrix" that is more efficient. They computed a symmetric term-by-term matrix C where "the element Cij records the number of times that words i and j co-occur in a window of size k." The lexical co-occurrence method focuses on term representations with respect to local contexts rather than documents, whereas LSI only computes document representations. In addition, the region over which LSI co-occurrence is defined is the document, but Schütze and Pedersen (1997) proposed to assess a region within a window of k words, because they believed that local co-occurrence counts are better than document-based counts. In our research, since we wanted to investigate the relationships between terms rather than between documents, we also deployed term-by-term matrices to compute term co-occurrences.

Chen et al. (1994) mentioned that, after terms were identified in their automatic indexing process, cluster analysis was then used to compute co-occurrence probabilities between any two terms for domain-specific automatic thesaurus generation. However, most prevailing statistical co-occurrence functions, like those developed by Chen et al. discussed above, as well as LSI and lexical co-occurrence, were all based on similarity measures that group similar words such as synonyms into clusters. Schütze and Pedersen (1997, p.311) assumed "words with similar meanings will occur with similar neighbors if enough text material is available."
\" A s n o t e d a b o v e , t h e y d e f i n e d a t h e s a u r u s as \" a d a t a s t r u c t u r e tha t d e f i n e s s e m a n t i c r e l a t e d n e s s b e t w e e n w o r d s . \" W e b e l i e v e , t h o u g h , tha t the r e l a t i o n s b e t w e e n w o r d s are n o t o n l y s y n o n y m s , b u t a l s o c a n b e narrow term, broad term a n d o t h e r p o t e n t i a l c o m p o s i t e r e l a t i o n s h i p s . F u r t h e r m o r e , w e d o n o t t h i n k s i m i l a r i t y f u n c t i o n s c a n d e s c r i b e t h e p h e n o m e n a w h e n r e l a t e d t e r m s a p p e a r i n p r o x i m i t y b u t n o t i n s i m i l a r c o n t e x t s . I n o t h e r w o r d s , i f t w o t e r m s , r e g a r d l e s s o f w h e t h e r t h e y are s i m i l a r o r s e m a n t i c a l l y r e l a t e d t e r m s , c o - o c c u r w i t h i n a t e x t ' s s c o p e , e v e n t h o u g h t h e y d o n o t h a v e s i m i l a r s u r r o u n d i n g c o n t e x t s , t h e n c o n c e p t s c a n s t i l l be d e r i v e d f r o m the s e m a n t i c r e l a t e d n e s s ( k n o w n as affinity) b e t w e e n these t e r m s . C o n s e q u e n t l y , w e n e e d to d e v e l o p a n a f f i n i t y m e a s u r e t o s t u d y t h e i r d i s t a n c e f r o m e a c h o the r . T h i s w a s o n e o f o u r k e y t a s k s w h e n c o n d u c t i n g the p r e s e n t r e s e a r c h . 2.2.5 Evaluation A f t e r c o - o c c u r r e n c e a n a l y s i s , the c o n t e n t s i n the r e s u l t i n g c l u s t e r s w e r e t h e n e v a l u a t e d . F o l l o w i n g s i m i l a r e v a l u a t i o n m e t h o d s as t hose d e v e l o p e d b y C h e n et a l . (1995, 1998), i n G a r n s e y ' s (2002) r e s e a r c h t h i r t y t e r m s w e r e r a n d o m l y c h o s e n f r o m the c l u s t e r c o n c e p t s t h e y b e l o n g t o a n d the i n d i v i d u a l s w e r e i n v i t e d to e v a l u a t e i f e a c h t e r m w a s r e l e v a n t to o t h e r t e r m s o c c u r r i n g i n the s a m e c l u s t e r . T h i s e v a l u a t i o n w a s u s e d to d e t e r m i n e w h e t h e r c l u s t e r i n g c o u l d c l a s s i f y t e r m s tha t w e r e r e l a t e d to e a c h o the r . T h e i n d i v i d u a l a s se s so r s e v a l u a t e d s t r e n g t h o f r e l e v a n c e b e t w e e n t e r m s u s i n g the th ree a s s e s s m e n t s (Irrelevant, \" \"Somewhat Relevant\" a n d \"Very Relevant\") p r o v i d e d b y the r e sea r che r . H o w e v e r , the s p e c i f i c r e l a t i o n t y p e s o u r r e s e a r c h i s i n t e r e s t e d i n , s u c h as narrow term o r broad term, - 1 4 -were not identified in the evaluation process. With this goal, in our study we consulted an expert in accounting to identify and assess relation types generated by each cluster concept. 3. OUR APPROACH AND RESEARCH QUESTIONS In detailing the literature and our proposed positions, our objective in this admittedly preliminary research is to develop our own approach to automatically extracting valid compound concepts and to examine some possible factors that might affect clustering performances. We are interested in the following research questions: Q l . l : Can relevant concepts be extracted automatically from a set of documents in a given domain? Q1.2: What type of semantic relations can be identified in the extracted concepts? 
A thesaurus normally denotes the semantic relationships between one term and another, such as a Narrow Term (NT), a Broad Term (BT), or a Preferred Term (USE) (Lassi, 2002). Furthermore, a Related Term (RT) relationship is used to indicate related terms that cannot be represented by either a broader or a narrower semantic relationship. Broader and narrower terms form the hierarchical relationships mentioned in the literature. Therefore, we investigated each concept to determine what kinds of relations in particular could be identified among the terms related to this concept.

Q2: What parameters can affect the quality of the results?

We were also curious whether the following parameters, which almost no prior studies had examined, would influence the resulting concepts:

Q2.1 Proximity - sentence, paragraph and document levels

Salton and Buckley (1991, p.23) declared "Each available text is broken down into individual text units - for example, text sections, text paragraphs, and individual sentences." Rungsawang (1997) mentioned that word co-occurrences could be measured within "local" contexts (within sentences and paragraphs) and "global" contexts (the entire document). Like Schütze and Pedersen (1997), who suggested computing the number of times a word co-occurs with other words in a document, in a chapter or in a window of a number of words, we too were interested in computing the co-occurrences within these different local contexts. Zhang and Rudnicky (2002) investigated how multiple levels of documents, paragraphs and sentences affect the derived semantic information. However, in the literature from previous studies we could not find any comparisons across different levels of text-blocking. Therefore, in this study we wondered whether different contexts for proximity between terms, particularly at the levels of documents, paragraphs and sentences, would extract very different concepts. If so, the next objective would be to determine which level operated most accurately and effectively.

Q2.2 Distance within the sentence level

Dagan et al. (1995) employed the relation of co-occurrence of two words within a limited distance in a sentence. According to Martin et al. (1983), restricting a window to at most five words accounts for 98% of word relations within a single sentence. Experiments conducted by Losee (1994) also showed that identifying term dependence in text units of no more than five words increased information retrieval performance, but "more dependence information results in relatively little increase in performance." Hence, we wondered whether, even within the same sentence, the distance between two terms (how closely they are located in relation to each other) would affect the relevance of the resulting concepts.

Q2.3 Directionality and asymmetry

Many past attempts we have examined have treated the context as "a bag of words" in that they ignored the word order, resulting in information loss (see Zhang and Rudnicky, 2002). Hence, it is of interest to study pairs of terms that co-occur, to assess the following issues: (1) whether order-dependent co-occurrence statistics might affect the results (this is referred to as directionality); and (2) whether basing the co-occurrence measure on the more frequently-occurring word might affect the results (this is referred to as asymmetry).

Q3: Is our approach useful for the automatic formation of term-phrases?
Although term-phrases have been extensively used in automatic indexing to form index terms, few works have clearly demonstrated how to use single-word terms to automatically form multiple-word phrases. Thus, we wanted to apply our approach to explore whether term-phrases could be formed automatically.

In this research, we followed the techniques discussed in the literature review above for automatic indexing, with some changes in the domain-specific preprocessing stage. We then developed our own term co-occurrence statistical affinity measures to explore the effects of different parameters on the outputs. Our proposed approach is summarized in Table 3.1, with comparisons to the approaches used in other studies. An accounting expert scrutinized the extracted concepts after we had generated our results. The research outcomes were based on both the expert's evaluation and our analysis of the outputs.

Elements of Our Approach | Literature | Our Research Compared to Previous Studies | Our Techniques
Document Collection | FARS database: see Section 2.2.1 | Same | FARS database: also see Section 5.1.1
Object Filtering | Various external sources used to make a vocabulary list: see Section 2.2.2 | Similar | Our external sources: also see Section 2.2.2
Automatic Indexing | Word identification, stop-wording, no stemming, term-phrase formation: see Section 2.2.3 | Similar and Extended | Similar aspects: see Sections 5.1.4-5.1.7; no stemming applied; used consolidation of wanted tokens: see Section 5.1.8
Co-occurrence Analysis | Popular similarity measures grouping similar words / synonyms that co-occur with similar neighbors: see Section 2.2.4 | New: (1) we studied proximity, not similar neighbors; (2) we grouped words to form compound concepts, not only synonyms | Used our own statistical affinity measure (affinity based on co-occurrence count, normalized by term occurrence): see Section 4.1; affinity measure converted to distance measure: see Section 5.3.2
Clustering | Hierarchical clustering: see Section 2.1.3 | Similar | Hierarchical clustering - Matlab: see Section 5.3.1
Evaluation | Categorization of identified concepts: "Irrelevant", "Somewhat Relevant", "Very Relevant": see Section 2.2.5 | Similar and Extended | Relevance score: 1-5 / "mostly unrelated" - "mostly related": see Section 5.4.2.1
Semantic Relations | Narrow Term (NT), Broad Term (BT), Related Term (RT): see Section 3-Q1.2 | Similar and Extended | Extended: we used "Broader Term" to represent NT and BT, plus additional relation type alternatives to represent Related Terms (RT): see Section 5.4.2.1
Sentence, Paragraph and Document Level | Within each individual sentence, paragraph or document: see Sections 2.1.3, 2.2.4, 3-Q2.1 | Extended and New | We not only studied each individual sentence, paragraph and document, but also compared them (this is new): see Section 4.2
Within a Sentence | Five-word distance within a sentence: see Section 3-Q2.2 | Extended | Extended: different distances between two words within a sentence: see Section 4.2
Directionality | Many previous studies ignored the word order: see Section 3-Q2.3 | Somewhat New | Compared the effects of different word orders: see Section 4.1
Asymmetry | No previous studies found: see Section 3-Q2.3 | New | Compared the effect of normalization: see Section 4.1

Table 3.1: Summary of Our Approach
4. OUR AFFINITY MEASURES

4.1 Term Affinity Statistics

As we explained in the literature review section above, our research proposed an affinity measure to estimate co-occurrence probabilities between any two terms (for example, term A and term B), based on their relative distance within fixed textual scopes: within a single document, within a paragraph, and within a sentence. Similar to approaches adopted by Besancon et al. (1999) and Dagan et al. (1995), in our investigation we set the affinity between two terms (terms are identified as meaning-bearing single words or phrases) as the relative directional occurrence of two words within a textual unit, given the frequency of one of them. The statistics of the co-occurrence of a pair of terms were used to calculate the general probability of their co-occurrence. For the sake of brevity, we refer to these measures as statistical or probabilistic co-occurrence measures. To address the research questions of directionality (term order) and asymmetry, four estimations, explained below, were developed using the sample terms "reaction" (term A) and "nuclear" (term B). The numbers shown in the examples below were imagined in order to illustrate how to estimate the four different affinity values: they do not represent real cases.

• Pr(A->B | A), estimated by #(A->B)/#A: In the entire collection, given that the current term A appears within the sentences, paragraphs or documents, what is the probability that the other term B will co-occur AFTER term A in a specific predefined textual unit? For example, consider term A "reaction" and term B "nuclear": if in the entire collection, given 150 sentences, paragraphs or documents carrying the current term A "reaction", there are 30 sentences, paragraphs or documents having term B "nuclear" co-occurring after the given term A "reaction" in a specific textual unit, then Pr(A->B | A) is estimated by #(reaction->nuclear)/#reaction = 30/150. This indicates that in the entire collection, given that the current term A "reaction" appears within the sentences, paragraphs or documents, the probability that term B "nuclear" will co-occur after term A "reaction" in a specific predefined textual unit is 20%.

• Pr(B->A | A), estimated by #(B->A)/#A: In the entire collection, given that the current term A appears within the sentences, paragraphs or documents, what is the probability that the other term B will co-occur BEFORE term A in a specific predefined textual unit? For example, consider term A "reaction" and term B "nuclear": if in the entire collection, given 150 sentences, paragraphs or documents carrying the current term A "reaction", there are 70 sentences, paragraphs or documents having term B "nuclear" co-occurring before the given term A "reaction" in a specific textual unit, then Pr(B->A | A) is estimated by #(nuclear->reaction)/#reaction = 70/150. This indicates that in the entire collection, given that the current term A "reaction" appears within the sentences, paragraphs or documents, the probability that term B "nuclear" will co-occur before term A "reaction" in a specific predefined textual unit is 47%.

• Pr(A->B | B), estimated by #(A->B)/#B: In the entire collection, given that the current term B appears within the sentences, paragraphs or documents, what is the probability that the other term A will co-occur BEFORE term B in a specific predefined textual unit?
For example, consider term A "reaction" and term B "nuclear": if in the entire collection, given 100 sentences, paragraphs or documents carrying the current term B "nuclear", there are 30 sentences, paragraphs or documents having term A "reaction" co-occurring before the given term B "nuclear" in a specific textual unit, then Pr(A->B | B) is estimated by #(reaction->nuclear)/#nuclear = 30/100. This indicates that in the entire collection, given that the current term B "nuclear" appears within the sentences, paragraphs or documents, the probability that term A "reaction" will co-occur before term B "nuclear" in a specific predefined textual unit is 30%.

• Pr(B->A | B), estimated by #(B->A)/#B: In the entire collection, given that the current term B appears within the sentences, paragraphs or documents, what is the probability that the other term A will co-occur AFTER term B in a specific predefined textual unit?

In this study, we wanted to explore combinations of different distances, order and asymmetry between two terms. In order to make the research feasible and simple, we reduced the number of combinations to the first three estimates: #(A->B)/#A, #(B->A)/#A, and #(A->B)/#B. We believe these three cases are sufficient to address both issues of order and asymmetry. By definition, the higher the estimated probabilities are, the more likely it is that two terms will appear together, which also means their affinity is more proximate. Moreover, the closer together they are, the more probable it is that they belong to the same concept.

4.2 Five Textual Units

As discussed above, the predefined context scopes for our research were within a sentence, within a paragraph, and within a document, for each of which different relations for indexing can be investigated. We also noted that Martin et al. (1983) indicated that restricting a window to at most five words accounts for 98% of word relations within a single sentence; therefore, in our research the cut-off level used within the sentence level was less than or equal to five terms intervening in the same sentence. In other words, we examined the neighborhood of each word w within a span of five words (-5 words and +5 words around w). There were three representative cases within the sentence level that we were interested in investigating: two terms next to each other (no term existing between them - in this case these two terms should have the highest affinity); two terms with up to five terms between them; and two terms co-occurring in a sentence regardless of how many other terms exist between them. That is, we further divided the sentence level into three textual units: zero token difference (SentenceOtd), up to five tokens' difference (SentenceUpTo5td), and entire sentences disregarding token differences (SentenceNoRestriction). We probed the affinities of a total of five textual units, with different proximity between the terms, in our experiment: SentenceOtd, SentenceUpTo5td, SentenceNoRestriction, ParagraphNoRestriction and DocumentNoRestriction.
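To make the three retained estimates concrete, the minimal Python sketch below computes them from the imagined "reaction"/"nuclear" counts of Section 4.1. The function and variable names are ours, not part of the thesis program, and the counts are the illustrative numbers from the text rather than real collection statistics.

```python
# Minimal sketch of the three affinity estimates kept in Section 4.1, using the
# imagined counts for term A = "reaction" and term B = "nuclear".

def affinity_estimates(n_b_after_a, n_b_before_a, n_units_with_a, n_units_with_b):
    """Return (Pr(A->B|A), Pr(B->A|A), Pr(A->B|B)) estimated from unit counts."""
    pr_ab_given_a = n_b_after_a / n_units_with_a    # ~ #(A->B) / #A
    pr_ba_given_a = n_b_before_a / n_units_with_a   # ~ #(B->A) / #A
    pr_ab_given_b = n_b_after_a / n_units_with_b    # ~ #(A->B) / #B (same event, other base)
    return pr_ab_given_a, pr_ba_given_a, pr_ab_given_b

# 150 units carry "reaction", 100 carry "nuclear"; "nuclear" follows "reaction"
# in 30 units and precedes it in 70 units.
print(affinity_estimates(30, 70, 150, 100))   # -> (0.2, 0.4666..., 0.3)
```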
Using the affinity measure we developed, the following explains how to calculate two terms' affinity values in each of the five textual units, in the direction A->B (A before B) or B->A (A after B), in our experiment:

• Within the Same Sentence

4.2.1 SentenceOtd Textual Unit: as long as any two tokens in the order A->B / B->A were next to each other, with zero token distance (Otd) between them, in the same sentence, regardless of how many of these patterns were in that same sentence, we counted only 1 per sentence.

4.2.2 SentenceUpTo5td Textual Unit: as long as any two tokens in the order A->B / B->A were located in the same sentence with five tokens or fewer between them (UpTo5td), regardless of how many of these patterns were in the same sentence, we counted only 1 per sentence.

4.2.3 SentenceNoRestriction Textual Unit: as long as any two tokens in the order A->B / B->A were located in the same sentence, with any number of tokens between them (NoRestriction - no token restriction), regardless of how many of these patterns were in the same sentence, we counted only 1 per sentence.

• Within the Same Paragraph

4.2.4 ParagraphNoRestriction Textual Unit: as long as any two tokens in the order A->B / B->A were located in a single paragraph but across different sentences, with any number of sentences between them (NoRestriction - no sentence restriction), regardless of how many of these patterns were in that same paragraph, we counted only 1 per paragraph.

• Within the Same Document

4.2.5 DocumentNoRestriction Textual Unit: as long as any two tokens in the order A->B / B->A were located in a single document, across different paragraphs, with any number of paragraphs between them (NoRestriction - no paragraph restriction), regardless of how many of these patterns were in that same document, we counted only 1 per document.
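As a small illustration of these counting rules at the sentence level, the sketch below counts a directional co-occurrence at most once per sentence. The sentence representation and all names are our own assumptions, not the thesis program; the paragraph- and document-level units would apply the same once-per-unit rule over larger spans of text.

```python
# Hedged sketch of the sentence-level counting rules in Sections 4.2.1-4.2.3.
# A "sentence" is assumed to be a list of tokens produced by the preprocessing
# of Section 5.1; names are illustrative only.

def sentence_contains_pair(tokens, a, b, max_gap=None):
    """Return 1 if token b occurs AFTER token a in this sentence with at most
    max_gap tokens strictly between them (None = no restriction), else 0.
    Each sentence contributes at most 1, however many matches it holds."""
    positions_a = [i for i, t in enumerate(tokens) if t == a]
    positions_b = [j for j, t in enumerate(tokens) if t == b]
    for i in positions_a:
        for j in positions_b:
            if j > i and (max_gap is None or j - i - 1 <= max_gap):
                return 1
    return 0

# SentenceOtd:           max_gap=0   (a and b adjacent)
# SentenceUpTo5td:       max_gap=5
# SentenceNoRestriction: max_gap=None
# Summing this over all sentences gives the numerator "# of sentences having B
# AFTER A"; dividing by the number of sentences carrying A (or B) yields the
# scheme's affinity value, as laid out in Table 4.1 below.
```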
4.3 Estimating Fifteen Schemes' Affinity Values

For each of the five textual units there were three affinities we wanted to estimate: #(A->B)/#A, #(B->A)/#A and #(A->B)/#B, with values ranging from 0 to 1. All fifteen schemes' affinities can be estimated by the formulas listed in Table 4.1.

15 Schemes | #(A->B)/#A | #(B->A)/#A | #(A->B)/#B
SentenceOtd | # of sentences having B AFTER A with Otd / # of sentences carrying A | # of sentences having B BEFORE A with Otd / # of sentences carrying A | # of sentences having A BEFORE B with Otd / # of sentences carrying B
SentenceUpTo5td | # of sentences having B AFTER A with UpTo5td / # of sentences carrying A | # of sentences having B BEFORE A with UpTo5td / # of sentences carrying A | # of sentences having A BEFORE B with UpTo5td / # of sentences carrying B
SentenceNoRestriction | # of sentences having B AFTER A with no token restriction / # of sentences carrying A | # of sentences having B BEFORE A with no token restriction / # of sentences carrying A | # of sentences having A BEFORE B with no token restriction / # of sentences carrying B
ParagraphNoRestriction | # of paragraphs having B AFTER A but across different sentences with no sentence restriction / # of paragraphs carrying A | # of paragraphs having B BEFORE A but across different sentences with no sentence restriction / # of paragraphs carrying A | # of paragraphs having A BEFORE B but across different sentences with no sentence restriction / # of paragraphs carrying B
DocumentNoRestriction | # of documents having B AFTER A but across different paragraphs with no paragraph restriction / # of documents carrying A | # of documents having B BEFORE A but across different paragraphs with no paragraph restriction / # of documents carrying A | # of documents having A BEFORE B but across different paragraphs with no paragraph restriction / # of documents carrying B

Table 4.1: Fifteen Schemes' Affinity Estimations

5. EXPERIMENT

To answer our research questions, we conducted an experiment using a computerized program to automatically extract accounting concepts. The following sections describe our experiment process in detail.

5.1 Domain Dependent Preprocessing - Tokenizing

Domain dependent tokenizing is the process of breaking text in a specific domain into individual domain-specific meaning-carrying units. It includes the extraction of tokens as well as the elimination of non-domain-related tokens. Our tokenizing techniques employed an automatic indexing process similar to the one used by Chen et al. (1992, 1994, 1997, 1998). We performed the steps listed below in sequence. In the end, a specific token list and tokenized texts were generated, ready for use in the experiment's later stages.

5.1.1 File Extraction

Like Garnsey (2002), in our research we selected the accounting literature from the FARS 4.2 (Financial Accounting Research System) database, which is current through June 15, 2002. Our entire research collection comprised 835 documents, which were automatically extracted from the following categories in the FARS database:

Committee on Accounting Procedure Accounting Research Bulletins (ARB)
Accounting Principles Board Opinions (APB)
AICPA Accounting Interpretations (AIN)
FASB Statements of Financial Accounting Standards (FAS)
FASB Interpretations (FIN)
FASB Technical Bulletins (FTB)
FASB Statements of Financial Accounting Concepts (CON)
FASB Emerging Issues Task Force Abstracts (EITF) - the sections "Introduction" and "Full Text of EITF Abstracts (including Appendix D)"
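The mechanics of the extraction are not spelled out in the text. Purely as an illustration, and assuming the FARS material has been exported as one text file per document with its topic on the first line (an assumption on our part, not a detail from the thesis), the selection might look like the following sketch.

```python
# Purely illustrative sketch of the document-collection step (Section 5.1.1).
# The directory layout, file naming and encoding are assumptions; the thesis only
# states that 835 documents were extracted from the eight FARS categories below.
import os

FARS_CATEGORIES = ("ARB", "APB", "AIN", "FAS", "FIN", "FTB", "CON", "EITF")

def collect_documents(export_dir):
    """Return the exported text files whose topic line starts with one of the
    eight FARS source categories used in the study (e.g. "ARB 49: ...")."""
    selected = []
    for name in sorted(os.listdir(export_dir)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(export_dir, name), encoding="utf-8", errors="ignore") as f:
            topic = f.readline().strip()   # topic line, e.g. "ARB 49: Earnings per Share"
        if topic and topic.split(" ", 1)[0].upper() in FARS_CATEGORIES:
            selected.append(name)
    return selected
```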
5.1.2 Reformatting the Document Structure

The program first processed the original document collection in three steps: (1) The topics of each file in the original database (for example, ARB 49: Earnings per Share) were connected with their associated filenames (for example, fars-0008.txt) (see Appendix A - Section 5.1.2). (2) The periods (dots) that did not mark the end of a sentence, such as in "Ch.3A", were removed (see Appendix A - Section 5.1.2). (3) All words were converted into lowercase (see Appendix A - Section 5.1.2).

5.1.3 Changing the Short Forms of the Words in the Abbreviation List into Term-Phrases

In this step, we transformed all abbreviations into the term-phrase format by creating an abbreviation-controlled list (see Appendix A - Section 5.1.3).

5.1.4 Converting Meaningful Words to Term-Phrases

For a specific domain thesaurus, Salton (1989) as well as Chen et al. (1992, 1994, 1995, 1997, 1998) recommended and implemented the technique of forming term-phrases by combining adjacent words. Garnsey (2002) also identified additional accounting phrases and added them to the accounting indexes. Without these phrases combining multiple words, individual words and tokens cannot convey accurate meanings in accounting applications. Furthermore, the majority (82.5%) of the entries listed in the "Topical Index" (titles of each document) section of FARS are term-phrases, while only 17.5% of them are single terms; we believe the term-phrases carry more accurate meanings than the single terms. We also used two other external sources, Accounting Dictionary - Accounting Glossary - Accounting Terms (http://www.ventureline.com/glossary.asp) and Financial Accounting: An Introduction to Concepts, Methods, and Uses (8th Edition), to create a comprehensive financial accounting phrase-controlled list of 1,774 phrases (see Appendix A - Section 5.1.4, List 2), each containing between two and nine single words. Because words may appear in the text in either singular or plural form, we included both singular and plural patterns for most phrases when needed, in order to cover all the cases. The program scanned the collection and then automatically connected these single words into their associated term-phrases. For instance, if the program detected single words appearing consecutively, such as "stock option and stock purchase plan," it would link them automatically into one meaningful token, "stock_option_and_stock_purchase_plan." Similarly, the program would link the plural form, "stock option and stock purchase plans," into one token, "stock_option_and_stock_purchase_plans." In the prior step we had already connected abbreviations into a single term-phrase format (for example, the abbreviation "ACRS" had been converted into "accelerated_cost_recovery_system"); however, sometimes the text carried the individual separate words of an abbreviation instead of its abbreviated form. To cover this situation as well, our phrase-controlled list also included the probable singular and plural forms of these words, for example both "accelerated cost recovery system" and "accelerated cost recovery systems." The program then checked the text and converted the terms into "accelerated_cost_recovery_system" and "accelerated_cost_recovery_systems" respectively if it found matching terms.
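One simple way to implement this phrase-linking step is a longest-match scan against the controlled phrase list. The sketch below is only illustrative (the function and variable names, and the way the phrase list is stored, are our assumptions, not the code used in the study):

```python
def link_phrases(tokens, phrase_list):
    """Join runs of adjacent words that appear in the controlled phrase list
    into a single underscore-connected token, preferring the longest match."""
    # Phrases stored as word tuples, e.g. ("stock", "option", "and", "stock", "purchase", "plan")
    phrases = {tuple(p.split()) for p in phrase_list}
    max_len = max((len(p) for p in phrases), default=1)
    out, i = [], 0
    while i < len(tokens):
        for length in range(min(max_len, len(tokens) - i), 1, -1):  # longest match first
            candidate = tuple(tokens[i:i + length])
            if candidate in phrases:
                out.append("_".join(candidate))
                i += length
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

# Example:
# link_phrases("the stock option and stock purchase plan expired".split(),
#              ["stock option and stock purchase plan"])
# -> ['the', 'stock_option_and_stock_purchase_plan', 'expired']
```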
(In the remainder of this paper, we use "terms" or "tokens" to mean either "single-word terms" or "term-phrases.") In this step, therefore, the program connected adjacent words into single phrases, and the tokenized files then comprised phrases and other single terms.

5.1.5 Removing Stop-Words

As noted above, Chen et al. (1997) generated a biology domain-specific stop-word list by removing non-significant terms in that domain. Similarly, to arrive at practical index terms in our research, we defined an accounting domain-specific stop-word list by adding 184 additional stop-words that are irrelevant to the accounting domain to the existing SMART stop-word list, which already contained 544 stop-words. For instance, some adverbs that have little effect on accounting (such as "approximately" and "eventually") as well as some nouns that are too general to contribute to the accounting domain (such as "background", "conclusion", and "football") were added. We thus obtained a new comprehensive list of 728 stop-words (see Appendix A - Section 5.1.5, List 3), which was more tailored to the accounting domain. The program then used our domain-specific stop-word list to scan the tokenized files exported from the previous step - which carried phrases as well as many other single terms - and to remove those overly general and irrelevant words.

5.1.6 Producing a Full Token List

After the stop-words had been eliminated from the text, the program extracted the remaining unique tokens from the collection, incrementing a token's frequency each time it encountered that token. Finally, the program produced 14,437 tokens, with information on their frequencies and on the number of documents in which they appeared in the text collection.

5.1.7 Removing Unwanted Tokens

The 14,437 tokens were then processed as follows. We sorted them first by the tokens' frequencies in the entire collection, second by the number of documents in which a given token appeared, and finally by alphabetical order. Following procedures similar to those presented by Garnsey (2002), where non-semantic words, adjectives, adverbs, verbs and non-accounting terms were eliminated, we manually checked the first 10,000 tokens to further eliminate the non-accounting-related tokens. As well, very low frequency words were directly eliminated in both Garnsey (2002) and Chen et al. (1998). Consequently, the 10,001st to 14,437th tokens in our list, due to their extremely low frequencies in the entire collection (each appeared only once), were treated as unwanted tokens and discarded immediately. The resulting vocabulary then included 2,052 wanted tokens (see Appendix A - Section 5.1.7, List 4). The program scanned the documents and kept only these wanted tokens in the text.

5.1.8 Consolidating Wanted Tokens

As discussed above in the literature review section, both Chen et al. (1995) and Garnsey (2002) omitted the stemming process to avoid creating noise and ungrammatical phrases, because in specific domains stemming can cause loss of information. Similarly, standard stemming processes were not performed in our research, because different stems sometimes refer to different concepts in the accounting domain.
For example, "taxable" is different from "tax," inasmuch as "taxable income" means "the amount of income subject to income taxes" whereas a "tax" is a "fee charged (levied) by a government." Hence, in this step, we combined some terms into their common forms only when necessary. We manually checked the 2,052 wanted tokens and created a list consolidating 994 tokens (see Appendix A - Section 5.1.8, List 5) by converting every plural token to its singular form (for example, "yields" was converted to "yield") and by combining past and present tenses of a number of tokens into their most representative roots (such as "taxed" and "taxing," both converted to "tax"). On the other hand, we did not unite tokens when we believed that the different forms conveyed different meanings, such as "account," "accountant," and "accounting." The program then processed the files exported from the prior step, which contained only wanted tokens, to further consolidate some tokens according to this list.

5.1.9 Generating the Final Reduced Token List

The program rescanned the documents to obtain new token frequencies and document counts, since the content of the files had changed after consolidation. Eventually we obtained a list of 1,344 final unique tokens from the system (see Appendix A - Section 5.1.9, List 6), ordered by the tokens' descending frequencies within the entire collection. The resulting text contained 835 tokenized files. We then used this text, comprising 1,344 different domain-specific meaning-bearing units, to represent the entire collection, with the tokens' original relative distances maintained. This domain-dependent text was then processed further, as explained in the following sections, to extract accounting concepts at different textual levels.

In summary, besides the generic process for automatic indexing, in the preprocessing stages of generating domain-dependent indexing we introduced significant domain knowledge, such as creating the abbreviation list, the term-phrase controlled list and additional domain-specific stop-words, manually picking out the wanted tokens, and consolidating the closely related terms instead of performing regular stemming. In the end, we obtained exported tokenized files with 1,344 kinds of tokens located in their original positions.

5.2 Computing Term Affinities

In this section, we describe how we computed the fifteen schemes' probabilistic co-occurrences as their affinity values, based on our discussion in the Approach section above.

5.2.1 Removing Unwanted Punctuation

Before we scanned the collection to compute affinity values, we needed to further process the 835 tokenized files produced in the last step, which still contained all of their initial punctuation. We wanted, however, to keep only the three punctuation marks that represent the end of a sentence and of a paragraph: ".", "!" and "?". These three punctuation marks are important indicators for calculating the number of sentences that include each token at the sentence level. Therefore, in this step, the program removed all punctuation except for ".", "!" and "?".

5.2.2 Converting the 1,344 Tokens to Token IDs

To facilitate further processing, the program assigned each of the final 1,344 tokens a unique token ID, in the same order as in the tokenizing stage (see Appendix B - Section 5.2.2, List 7). We then replaced all tokens in the tokenized files with their matching token IDs.
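The ordering and ID substitution themselves are straightforward. A minimal sketch follows, assuming the token statistics are held in a dictionary keyed by token; the names below are illustrative, not those of the original program:

```python
def order_tokens(token_stats):
    """Order tokens by descending frequency, then by descending number of
    documents, then alphabetically - the ordering used for the token ID lists."""
    return sorted(token_stats,
                  key=lambda t: (-token_stats[t]["freq"], -token_stats[t]["docs"], t))

def assign_ids(ordered_tokens):
    """Assign sequential 1-based IDs to the ordered tokens."""
    return {token: i for i, token in enumerate(ordered_tokens, start=1)}

def replace_with_ids(tokens, token_ids):
    """Substitute each kept token by its ID; anything not in the ID map
    (e.g. the sentence-ending punctuation '.', '!', '?') is left unchanged."""
    return [token_ids.get(t, t) for t in tokens]
```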
The text now contained only tokens represented by these 1,344 different IDs and was then ready for calculating token affinities.

5.2.3 600 Tokens' Affinity Values

5.2.3.1 Generating the 600 Token List for Clustering

We believed that the tokenized files from the last stage, containing 1,344 distinct tokens, were significant and sufficient to represent the content of the original collection. It was therefore sufficient to study any two tokens' affinity within a structural context comprised only of these 1,344 different tokens. Using a similar approach, Gangolly and Wu (2000) restricted the index terms to 93 out of their previously acquired 983 terms because of the limitations of data visualization with a large index. To ensure visualization of the data, Garnsey (2001) also generated a final matrix of only 118 terms from their initial 676 terms. In order to concentrate our preliminary study on examining the relations of only those terms we were interested in, and also for purposes of visualization, from the 1,344 tokens we manually chose only 600 - the most common and representative terms in accounting according to our knowledge - to be the research subject terms investigated in the resulting clusters. When doing so, we were careful to retain words that might not have an accounting meaning in themselves (for example, "call"), but which might appear in accounting terms (for example, in "call options"). These 600 token subjects were then used in the rest of this research to compute their affinities and to perform clustering analysis. Although the remaining 744 tokens' affinities were not studied to form clusters, the 600 selected tokens still kept their original locations in the 1,344-token structure - the context including all 1,344 tokens did not change. In other words, in order to trace any two tokens' relative distances in the context of 1,344 tokens, the program still kept the 1,344 token IDs for marking the tokens' positions in the tokenized files. The program then assigned another set of IDs to these 600 final token subjects (see Appendix B - Section 5.2.3.1, List 8), different from the 1,344-token structure IDs. The new set of 600 token IDs was ordered first by the tokens' descending frequencies in the whole collection, second by the descending number of documents the tokens appeared in, and last by ascending alphabetical order. Since these 600 tokens were chosen from the 1,344 tokens, the tokens' frequencies and document counts in the 600-token list were the same as those in the 1,344-token list, with only the token IDs being different.

5.2.3.2 Summary of the Steps to Obtain the 600 Tokens

Table 5.1 summarizes the progress thus far in our experiment to obtain the resulting 600 tokens.
Step 1 - Initial Preprocessing. Method: Computer. Tokens in result: N/A.
Step 2 - Forming Term-Phrases (see Section 5.1.4) and Removing Stop-Words (see Section 5.1.5). Method: Manual (1,774 controlled phrases, e.g. "accounts payable" -> "accounts_payable"; 728 domain-specific stop-words, e.g. "approximately", "football") + Computer. Tokens in result: 14,437 tokens.
Step 3 - Removing Unwanted Tokens (see Section 5.1.7). Method: Manual (manually checked the first 10,000 tokens to further eliminate non-accounting-related tokens; discarded the 10,001st to 14,437th tokens, which appeared only once) + Computer. Tokens in result: 2,052 wanted tokens.
Step 4 - Consolidating Wanted Tokens (see Section 5.1.8); note: no stemming, because different stems refer to different concepts. Method: Manual (consolidated 994 tokens by converting plurals to singulars, e.g. "yields" -> "yield", and by combining past and present tenses into their most representative roots, e.g. "taxed" and "taxing" -> "tax"; this step kept other forms with varied meanings, such as "account", "accountant", and "accounting") + Computer. Tokens in result: 1,344 final tokens.
Step 5 - Generating 600 Tokens (see Section 5.2.3.1). Method: Manual (chose only the 600 most common and representative terms in accounting, according to our knowledge, as the subject terms used to form clusters; this step retained words with no accounting meaning in themselves, e.g. "call", that might appear in accounting terms, e.g. "call options") + Computer. Tokens in result: 600 subject tokens.

Table 5.1: Summary of the Steps to Obtain the 600 Tokens

5.2.3.3 The Token IDs of Any Two Terms

To study affinities between any two terms A and B, in this study we always treated the token with the smaller ID number as token A, and the token with the larger ID number as token B. For example, consider the two ID numbers "4" and "5." We treated "4" as token A and "5" as token B. So #<A,B>/#A would be #<4,5>/#4, #<B,A>/#A would be #<5,4>/#4, and #<A,B>/#B would be #<4,5>/#5. Since the program assigned IDs to tokens first by their descending frequencies in the whole collection, second by the descending number of documents in which the tokens appeared, and last by ascending alphabetical order, the "#Frequencies" (or, where frequencies were tied, the "#Documents") of the smaller ID, token A, would be at least as high as those of the larger ID, token B (see the example in Table 5.2).

ID  TOKEN       #FREQUENCIES  #DOCUMENTS
1   asset       15878         510
2   accounting  13950         797
3   cost        10548         434
4   loss        6993          389
5   tax         6739          261

Table 5.2: Example of Comparing Two Token IDs

5.2.3.4 An Example of a Dummy File to Illustrate Affinity Calculations

Before we implemented the program to calculate the real affinity values, we created three dummy files using five tokens, numbered 1 to 5, as examples for the three levels: within a sentence, within a paragraph and within a document. Note that the numbers 1 to 5 in the dummy files only represented the tokens' names and that we located them randomly in the texts; these numbers were therefore not assigned in any particular order. Thus, for example, token 1's frequency may or may not be greater than the other tokens' frequencies in the dummy files. In each dummy file, we manually computed the statistical affinity values and then compared each figure with the outputs from the program, so that we could be sure we would get correct results from the program when it processed the larger volume of real documents.
Below is the dummy file for the ParagraphNoRestriction level, used to illustrate how we calculated the three probabilistic affinity patterns for any two tokens that were our research subjects:

: 1 24 5 3 1. 3 1 42! 4 1.2 4. . 5 2 1 1. 3 2 5. ! 234 3. .7 13124 3. 254343 2 4! 3 1 5 : 2 4 1.3 2 3 4 5! 1 32. ? 43 5 1? ! 2 3. 2. : 42 14 1.2 5 3. ?4. 1 2 5! 3? 3 1 ! 4 2! . ? 3 1 .? : 3. ! . 1 4. ?23 1. . 4 4

In the main research we studied the affinities of only 600 tokens out of the context of 1,344 distinct tokens; analogously, in this preliminary test we selected only three tokens (tokens "1," "3" and "5") as our subjects out of the total five-token context. In other words, these three tokens were still embedded in a fixed context comprising all five tokens, so their co-occurrences (see Table 5.3) and statistical affinity values (see Table 5.4) would not change:

Two tokens, counted per whole paragraph: 1->3: 7;  1->5: 4;  3->1: 4;  3->5: 2;  5->1: 2;  5->3: 4 (pairs involving the non-subject tokens 2 and 4 were not counted).

Table 5.3: Dummy File - Two Tokens' Co-occurrences in ParagraphNoRestriction

Token 1 and Token 3:
  #<A,B>/#A = 1->3/#1 = 7/10 = 0.7
  #<B,A>/#A = 3->1/#1 = 4/10 = 0.4
  #<A,B>/#B = 1->3/#3 = 7/10 = 0.7

Token 1 and Token 5:
  #<A,B>/#A = 1->5/#1 = 4/10 = 0.4
  #<B,A>/#A = 5->1/#1 = 2/10 = 0.2
  #<A,B>/#B = 1->5/#5 = 4/7 = 0.5714286

Token 3 and Token 5:
  #<A,B>/#A = 3->5/#3 = 2/10 = 0.2
  #<B,A>/#A = 5->3/#3 = 4/10 = 0.4
  #<A,B>/#B = 3->5/#5 = 2/7 = 0.2857143

Table 5.4: Dummy File - Two Tokens' Statistical Affinity Values in ParagraphNoRestriction

For ParagraphNoRestriction:
#<A,B>/#A: # of paragraphs having B AFTER A but across different sentences with no sentence restriction / # of paragraphs carrying A
#<B,A>/#A: # of paragraphs having B BEFORE A but across different sentences with no sentence restriction / # of paragraphs carrying A
#<A,B>/#B: # of paragraphs having A BEFORE B but across different sentences with no sentence restriction / # of paragraphs carrying B

5.2.3.5 The Program's Affinity Calculation for the Fifteen Schemes

As illustrated in Table 4.1, the sentence-level analysis required the number of sentences that include each token, while the ParagraphNoRestriction condition also required the number of paragraphs in which each token appears; the program therefore counted these and added them as two more columns to the 600-token list. The program then processed all of the tokenized files and, for each of the fifteen schemes, computed each pair of terms' statistical affinity values for all 600 token subjects.

5.3 Clustering 600 Tokens

Grefenstette (1993) declared that a domain-specific thesaurus presents "a hierarchical view of the important concepts in the domain, as well as suggesting alternative words and phrases that may be used to describe the same concept in the domain." The clustering technique we adopted in this study was the hierarchical clustering produced by Matlab, which links together pairs of terms that are in close proximity. The "Tutorial - Cluster Analysis" section of the Statistics Toolbox for Use with Matlab User's Guide (Version 3, 1-54) (http://www.busim.ee.boun.edu.tr/~resources/statsJb.pdf) defines and explains the Matlab linkage function, which uses "the distance information to determine the proximity of objects to each other. As objects are paired into binary clusters, the newly formed clusters are grouped into larger clusters until a hierarchical tree is formed."
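For readers without Matlab, the same linkage step can be sketched with SciPy's hierarchical clustering. This is a hedged, illustrative translation - the use of scipy.cluster.hierarchy and the function and variable names are ours, not part of the original experiment - and it anticipates the distance transformation (1 - affinity) described in Section 5.3.2 below.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def cluster_tokens(affinity):
    """affinity: an n x n array of statistical affinity values in [0, 1] for the
    600 token subjects (one value per pair, as fixed by the chosen scheme).
    Returns a linkage matrix analogous to the Matlab output shown in Table 5.5."""
    distance = 1.0 - np.asarray(affinity, dtype=float)  # 1 - affinity, as in Section 5.3.2
    np.fill_diagonal(distance, 0.0)                     # zero self-distance
    condensed = squareform(distance, checks=False)      # condensed form required by linkage
    # 'single' linkage merges the closest pair first, like Matlab's default linkage
    return linkage(condensed, method="single")
```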
We then used the "linkage" function in Matlab to form our clusters.

5.3.1 Hierarchical Clustering Using Matlab

As noted above, the linkage function of Matlab identifies the distance values and links pairs of objects that are close together into binary clusters. It then creates larger clusters by linking those newly formed clusters to other clusters, until all the objects are linked together in a hierarchical tree (for an example, see Appendix B - Section 5.3.1).

5.3.2 Fitting Affinity Values into Matlab

Matlab first groups the objects that have the closest proximity to each other. In other words, the first two terms linked together by Matlab always have the smallest relative distance from each other in the whole data set. As discussed above, we assume that the higher the statistical affinity value, the closer the terms' affinity and the smaller the distance between them should be. The most closely related terms (those with the highest affinity values) should be grouped first. Hence, in order to fit the affinity values into the Matlab linkage function, which links the minimum distance first, we computed the distance value of two terms using the formula (1 - Statistical Affinity Value).

5.3.3 Clustering Outputs

The program processed the entire collection and used the Matlab linkage function to cluster the 600 tokens for each of the fifteen schemes. Table 5.5 shows the top ten clustering data outputs, with object numbers replaced by token names, for the scheme Sentence0tdAB_A (see Appendix C for the top 50 clustering outputs for each of the fifteen schemes):

Cluster Index   Object in First Group               Object in Second Group         Distance
601             261; stock_dividend                 346; stock_split               0.62791
602             392; accounts_payable               429; accrued_expense           0.74286
603             102; short                          139; duration                  0.8034
604             114; taxable                        122; deductible                0.80942
605             41; derivative                      46; financial_instrument       0.82476
606             601; stock_dividend; stock_split    269; split                     0.84884
607             10; future                          27; cash_flow                  0.85654
608             553; uncollectible_account          595; unearned_discount         0.85714
609             170; direct_cost                    209; direct_financing_lease    0.8599
610             38; receivable                      70; payable                    0.87065

Table 5.5: Top Ten Clustering Data of Sentence0tdAB_A

As noted above, Matlab grouped together the first pair of related terms - those with the smallest distance between them - to form the first cluster. Because there were 600 terms in the data set, Matlab assigned this new cluster a cluster index starting at 601 (600 + 1). Note the example shown in Table 5.5: Sentence0tdAB_A first grouped the two terms with token IDs 261 and 346 and used the number 601 to represent this newly formed cluster. We can also see that token 261 and token 346 in cluster 601 had the closest proximity (distance value = 0.62791) in the entire 600-token data set of Sentence0tdAB_A. Matlab then continued to form other related terms into new clusters in ascending order of distance value. Let us examine cluster 606, which grouped cluster 601 and token 269. As the Matlab guide illustrates, here token 269 formed a larger cluster with cluster 601, which had already linked tokens 261 and 346. That is, in this step, the three terms with token IDs 261, 346 and 269 were grouped together into cluster 606.

5.4 Evaluation

To justify our methodology and to assess the clustering performance of all fifteen schemes, we invited an accounting expert to evaluate the clustering outputs.
The expert used her domain knowledge of the accounting industry and an external online website (http://www.investorwords.com), which lists thousands of definitions of current authoritative accounting terms, which she could consult to evaluate the automatically extracted concepts from all fifteen schemes. In addition, in order to use another independent accounting thesaurus as a further check on the accounting concepts we produced, we also compared the expert's evaluation results with the Price Waterhouse Thesaurus (1974).

5.4.1 Clustering Terms to Be Evaluated

We know that the distance between objects in a cluster becomes greater as the newly formed cluster number gets larger. That is, in the clustering data tables, only the top clusters in each scheme group closely related accounting concepts with small distances between the terms. For the special scheme DocumentNoRestrictionAB_B, only the top 25 clusters needed to be evaluated, because in this scheme each larger cluster was formed by continuously adding one more term to the previously formed clusters, a process different from the clustering outputs of the other fourteen schemes. Furthermore, its hierarchical tree grew unmanageably high after only a few steps due to noise, namely distances affected by random occurrences; therefore, we believe the top 25 clusters are enough to represent this scheme. For the remaining fourteen schemes, only the top 50 clusters in each scheme needed to be evaluated. In total, there were 725 clusters across the fifteen schemes to be evaluated. Let us examine the example of the top three clusters to be evaluated for the scheme SentenceUpTo5tdAB_A (Table 5.6), where the object numbers in each cluster had been automatically replaced with term names by the program.

Cluster 1 (original cluster index 601): first group stock_dividend (261); second group stock_split (346); 2 terms in total.
Cluster 2 (original cluster index 602): first group accounts_payable (392); second group accrued_expense (429); 2 terms in total.
Cluster 3 (original cluster index 606): first group stock_dividend; stock_split (601); second group split (269); 3 terms in total.

Table 5.6: Sample Evaluation Terms - Top Three Clusters of SentenceUpTo5tdAB_A
Note: the number in parentheses after a cluster, for example (601), is the original cluster index; the number beside a term, such as (261), is the original token ID.

5.4.2 Expert's Evaluation

Since this was only a preliminary study, at this stage we invited only one accounting expert to evaluate the clustering outputs. Furthermore, the final results of this research were based not only on the expert's evaluation, but also on analysis of the clustering data itself and on the Price Waterhouse Thesaurus (1974). Therefore, a single expert was adequate to evaluate the outputs of the clusters produced by each of the fifteen schemes, in order to see in what proportion of its 25 or 50 clusters each scheme automatically grouped together relevant terms from which meaningful accounting concepts could be derived. If identical clusters appeared in the results of two different schemes, the expert could cut and paste the evaluations. By comparing the statistical evaluation results using Excel worksheets, we could summarize the similarities and differences among the fifteen schemes and recognize the optimal schemes.
5.4.2.1 Instructions for the Expert's Evaluation

The expert was provided with detailed instructions (see Appendix D - Section 5.4.2.1) to select the relation types among the terms in each cluster, to provide explanations for her choices, and to score the relevance from 1 to 5 according to how closely related the terms in each cluster were. As introduced earlier in the Literature Review, thesauri normally contain semantic relationships such as Narrow Term (NT), Broad Term (BT), and Related Term (RT). By further classifying additional potential relationship types in accounting, we created the following relation type alternatives for the expert to choose from, and we also gave her space to explain her choices (Table 5.7):

I. Relation Type Alternatives: a) Synonyms; b) Antonyms; c) Broader term; d) Subgroup; e) Distinct but forming a new concept; f) Partial relation; g) Other relation; h) No direct relation.
II. Explanation: (space for the expert's comments)

Table 5.7: Sample Evaluation Relation Type Alternatives

With relation types a) "Synonyms" (same meaning); b) "Antonyms" (opposite meaning); c) "Broader term" (any term's meaning broader than all the others); d) "Subgroup" (if the terms are all related, are they both/all subgroups of another broader concept?); e) "Distinct but forming a new concept" (if they are all distinct, can they together form a new concept?); f) "Partial relation" (if there are more than two terms in the cluster, are only some of them related?); g) "Other relation" (if the relation type is not listed previously); h) "No direct relation" (none of the terms is directly related).

5.4.2.2 Sample Evaluation Results

The expert then assessed each cluster of every scheme in order of cluster number (from cluster 1 to cluster 25 or 50), choosing relation type alternatives, providing explanations, and giving a relevance score for the relations among the terms grouped in that cluster. Table 5.8 shows an example of the expert's evaluation results for the top three clusters from the scheme SentenceUpTo5tdAB_A.

Cluster 1: stock_dividend | stock_split (2 terms). Relation type: e). Explanation: the new concept is "stock"; "stock_dividend" is a distribution of retained earnings while "stock_split" increases the number of shares outstanding. Relevance score: 5.
Cluster 2: accounts_payable | accrued_expense (2 terms). Relation type: d). Explanation: the broader concept is "liability," because "accounts_payable" belongs to "liability," and "accrued_expense" is also "accrued_liability," which is itself a type of "liability." Relevance score: 5.
Cluster 3: stock_dividend; stock_split | split (3 terms). Relation type: f). Explanation: "split" here is "stock_split" in the accounting sense, which means increasing the number of shares outstanding; "split" is not related to "stock_dividend," because "stock_dividend" is the distribution of retained earnings to shareholders. Relevance score: 3.

Table 5.8: Sample Expert's Evaluation Results - Top Three Clusters of SentenceUpTo5tdAB_A
With relation types a) "Synonyms"; b) "Antonyms"; c) "Broader term"; d) "Subgroup"; e) "Distinct but forming a new concept"; f) "Partial relation"; g) "Other relation"; h) "No direct relation".

5.4.2.3 Statistical Analysis of the Expert's Evaluation Results for All Fifteen Schemes

After the expert finished her evaluation, for each relation type column in every scheme we totalled the number of "1s" over all 25 or 50 clusters (i.e., how many clusters the expert believed had that relation type).
The statistical percentage of each relation type was then obtained by dividing this total by the number of clusters evaluated (25 or 50). For example, if in a specific scheme the expert marked two of the 50 clusters as having relation type "a", the statistical percentage for type "a" was 4%. Meanwhile, we divided the sum of all relevance scores by the number of clusters evaluated (25 or 50) to obtain the mean relevance score for each scheme. The expert evaluated each of the fifteen schemes individually and filled out the answers in the evaluation table, and we then calculated the statistics for each of the fifteen schemes (see Appendix D - Section 5.4.2.3). The statistical analysis of the expert's evaluation data for all fifteen schemes is presented in Table 5.9.

I. Statistical Analysis of the Expert's Evaluations of Relation Types a)~h) (%), and III. Mean Score (1-5):

Scheme                                a)   b)   c)   d)   e)   f)   g)   h)   Mean Score
EvaluateSentence0tdAB_A               0%   4%   46%  42%  6%   2%   0%   0%   4.58
EvaluateSentence0tdBA_A               0%   2%   52%  42%  2%   0%   0%   2%   4.40
EvaluateSentence0tdAB_B               0%   2%   44%  28%  6%   18%  0%   2%   4.06
EvaluateSentenceUpTo5tdAB_A           2%   8%   36%  42%  2%   8%   0%   2%   4.46
EvaluateSentenceUpTo5tdBA_A           4%   2%   34%  54%  2%   2%   2%   0%   4.52
EvaluateSentenceUpTo5tdAB_B           0%   0%   46%  12%  4%   36%  0%   2%   4.02
EvaluateSentenceNoRestrictionAB_A     0%   10%  36%  44%  2%   6%   0%   2%   4.46
EvaluateSentenceNoRestrictionBA_A     4%   2%   30%  58%  2%   2%   2%   0%   4.52
EvaluateSentenceNoRestrictionAB_B     0%   0%   38%  8%   6%   42%  0%   4%   3.94
EvaluateParagraphNoRestrictionAB_A    4%   4%   50%  34%  4%   2%   0%   2%   4.22
EvaluateParagraphNoRestrictionBA_A    4%   4%   42%  42%  2%   2%   0%   4%   4.30
EvaluateParagraphNoRestrictionAB_B    0%   0%   40%  14%  0%   46%  0%   0%   3.58
EvaluateDocumentNoRestrictionAB_A     0%   4%   38%  22%  0%   26%  0%   10%  3.60
EvaluateDocumentNoRestrictionBA_A     2%   0%   58%  18%  0%   12%  0%   10%  3.82
EvaluateDocumentNoRestrictionAB_B     0%   0%   20%  0%   4%   76%  0%   0%   3.08

Table 5.9: Statistical Analysis of the Expert's Evaluation Results - All Fifteen Schemes
With relation types a) "Synonyms"; b) "Antonyms"; c) "Broader term"; d) "Subgroup"; e) "Distinct but forming a new concept"; f) "Partial relation"; g) "Other relation"; h) "No direct relation".

5.4.3 Comparing the Expert's Evaluation Results with the Price Waterhouse Accounting Thesaurus

The only existing accounting thesaurus we have seen so far is the one created by Price Waterhouse & Co. in 1974, which Hill, the National Director of Accounting and Auditing Services, called "the first comprehensive Thesaurus of accounting and auditing terminology." This thesaurus includes 3,741 main term entries and uses the unabbreviated form of a term rather than its acronym or abbreviation.
It organizes synonymous or closely related terms using the following abbreviations:

U = Use (as cross reference)
UF = Used For (as cross reference)
BT = Broader Term
NT = Narrower Term
RT = Related Term

Here is an example of terms classified into the same concept, shown in their hierarchical positions:

Accounts payable
  UF Advances payable
  BT Trade accounts payable
  RT Assumed liabilities
     Confirmation
     Creditors
     Cutoff tests
     Freight bill payment services
     Search for unrecorded liabilities
     Voucher system

5.4.3.1 Sample Comparison

After the expert finished her evaluations, we checked all 725 clusters against the Price Waterhouse Thesaurus to count the similar relations assessed by these two methods. For every scheme, we compared the relations of each cluster or concept evaluated by the expert with the corresponding concept listed in the Price Waterhouse Thesaurus. If there were matching relations in the Thesaurus, we marked a "1" for those matching concepts in order to calculate the matches for that scheme. Meanwhile, for purposes of clarity, we also noted how the Price Waterhouse Thesaurus interpreted the similar relations of the concept. Table 5.10 shows the comparison for the top three clusters from the scheme SentenceUpTo5tdAB_A:

Cluster 1: stock_dividend | stock_split (2 terms, relation type e); match = 1; the Thesaurus lists them as Related Terms.
Cluster 2: accounts_payable | accrued_expense (2 terms, relation type d); match = 1; in the Thesaurus, "accrued expenses" Uses "accrued liability," and "liability" is the Broader Term of both "accounts_payable" and "accrued_liability."
Cluster 3: stock_dividend; stock_split | split (3 terms, relation type f); no match found.

Table 5.10: Sample Comparison of the Expert's Evaluation Results with the Price Waterhouse Thesaurus - Top Three Clusters of SentenceUpTo5tdAB_A
With relation types a) "Synonyms"; b) "Antonyms"; c) "Broader term"; d) "Subgroup"; e) "Distinct but forming a new concept"; f) "Partial relation"; g) "Other relation"; h) "No direct relation".

5.4.3.2 Statistical Comparisons of All Fifteen Schemes

For each scheme we compared all relations evaluated by the expert with the Price Waterhouse Thesaurus; again, we obtained the statistical matching percentage values (the total number of matching clusters, divided by the 25 or 50 clusters in that scheme). See Table 5.11 for the statistical comparison results for all fifteen schemes.

% Clusters Containing Relations Matching Relations in the Price Waterhouse Thesaurus

                                 AB_A   BA_A   AB_B
EvaluateSentence0td              36%    36%    20%
EvaluateSentenceUpTo5td          36%    38%    14%
EvaluateSentenceNoRestriction    42%    38%    16%
EvaluateParagraphNoRestriction   20%    40%    14%
EvaluateDocumentNoRestriction    32%    28%    16%

Table 5.11: Comparing the Fifteen Schemes' Expert Evaluation Results with the Price Waterhouse Thesaurus

We are aware that accounting terminology has changed and been updated in the 30 years since the Price Waterhouse Thesaurus was published.
Although the Thesaurus and our study used very different sources of documents to construct the vocabulary collection, from the above data we can still see that the match between the two methods is visible (between 14% and 42% accuracy). Therefore, our technique actually identified some valid accounting concepts that can be verified against the Price Waterhouse Thesaurus, which, as noted, remains an authoritative accounting thesaurus.

5.5 Discussion

By analyzing the clustering data itself, incorporating the statistical analysis of the expert's evaluation results, and comparing them with the Price Waterhouse Thesaurus for all fifteen schemes, some interesting findings, as well as the answers to our previously proposed research questions, emerge:

5.5.1 Can Relevant Concepts Be Extracted Automatically from a Set of Documents in a Given Domain? (Q1.1) What Type of Semantic Relations Can Be Identified in the Extracted Concepts? (Q1.2)

From the statistical analysis of the expert's evaluation results (Table 5.9), we noticed that the maximum percentage of clusters conveying no directly related concepts was only 10%, which occurred at the EvaluateDocumentNoRestriction level. The maximum percentage of unrelated terms in a cluster in any other scheme was only 4% (see Table 5.12).

Statistical % of "No Direct Relation" - Relation Type h), All 15 Schemes

                                 AB_A    BA_A    AB_B
EvaluateSentence0td              h) 0%   h) 2%   h) 2%
EvaluateSentenceUpTo5td          h) 2%   h) 0%   h) 2%
EvaluateSentenceNoRestriction    h) 2%   h) 0%   h) 4%
EvaluateParagraphNoRestriction   h) 2%   h) 4%   h) 0%
EvaluateDocumentNoRestriction    h) 10%  h) 10%  h) 0%

Table 5.12: Statistical Results of the "No Direct Relation" Type for All Fifteen Schemes
Using relation type h) "No direct relation".

Moreover, ten out of the fifteen schemes had mean relevance scores greater than 4, indicating that two-thirds of the schemes grouped at least "somewhat related" concepts. The remaining five schemes' mean relevance scores were greater than 3 but less than 4 (see Table 5.13). This indicated that even the remaining third of the schemes did not, on average, group "unrelated" clusters (for details on how the ranking was defined, see Appendix D - Section 5.4.2.1).

Average Relevance Score in Each Scheme

                                 AB_A   BA_A   AB_B
EvaluateSentence0td              4.58   4.40   4.06
EvaluateSentenceUpTo5td          4.46   4.52   4.02
EvaluateSentenceNoRestriction    4.46   4.52   3.94
EvaluateParagraphNoRestriction   4.22   4.30   3.58
EvaluateDocumentNoRestriction    3.60   3.82   3.08

Table 5.13: Statistical Results of Average Relevance Scores for All Fifteen Schemes

The above analysis showed that all fifteen schemes extracted compound concepts containing related accounting terms, which had already been confirmed by comparing our method with the Price Waterhouse Thesaurus. We can hence state that our methodology is useful for identifying accounting compound concepts with meaningful relations; this answers Q1.1. We further analyzed which were the top two most frequently chosen relation types in the clusters and extracted these figures into Table 5.14. We can see that the largest portion of relation types in every scheme was "Broader term" (one term is a broader term than the other), coinciding with the "Narrow Term (NT)" and "Broad Term (BT)" relationships defined by the prevailing thesauri discussed in the literature.
Another of the most frequently chosen relation types was "Subgroup" (the terms differ from a "Broader term" relation, but they are subgroups of another broader concept), which also coincided with the "Related Term (RT)" relationship identified in most existing thesauri. Because our hierarchical clustering continually added new terms into previously formed clusters, only some of the terms were related (relation type "Partial relation") when many terms were included in a cluster. This answered Q1.2.

The Top Two Most Frequently Chosen Relation Types in Each Scheme

                                 AB_A             BA_A             AB_B
EvaluateSentence0td              c) 46%; d) 42%   c) 52%; d) 42%   c) 44%; d) 28%
EvaluateSentenceUpTo5td          c) 36%; d) 42%   c) 34%; d) 54%   c) 46%; f) 36%
EvaluateSentenceNoRestriction    c) 36%; d) 44%   c) 30%; d) 58%   c) 38%; f) 42%
EvaluateParagraphNoRestriction   c) 50%; d) 34%   c) 42%; d) 42%   c) 40%; f) 46%
EvaluateDocumentNoRestriction    c) 38%; f) 26%   c) 58%; d) 18%   c) 20%; f) 76%

Table 5.14: Statistical Results of the Top Two Most Frequently Chosen Relation Types in All Fifteen Schemes
Relation types c) "Broader term"; d) "Subgroup"; and f) "Partial relation".

5.5.2 What Parameters Can Affect the Quality of the Results? (Q2)

5.5.2.1 Proximity - Sentence, Paragraph and Document Levels (Q2.1)

By examining differences in the expert's statistical evaluation results at the three levels - within a sentence, within a paragraph, and within a document - we found that the three schemes at the sentence level extracted the most related concepts, followed by the paragraph level, with the document level being the least effective. Our results showed the statistical average relevance scores assessed by the expert to be: the three sentence-level AB_A schemes (4.46 ~ 4.58) > EvaluateParagraphNoRestrictionAB_A (4.22) > EvaluateDocumentNoRestrictionAB_A (3.60) (see Table 5.13). Likewise, the same pattern was found for the corresponding BA_A and AB_B schemes. Furthermore, the relevance scores of all three schemes in EvaluateDocumentNoRestriction were less than 4.00, which also indicated that the document level was the least effective textual level for producing valid concepts. From the average relevance scores in Table 5.13 and the relation types in Table 5.14, we also found that, although the different scopes of the sentence, paragraph and document levels extracted varied accounting concepts, the sentence level was somewhat closer to the paragraph level than to the document level.

5.5.2.2 Distance Within the Sentence Level (Q2.2)

We compared the clustering data and counted the percentage of identical clusters between each scheme in EvaluateSentence0td and in EvaluateSentenceUpTo5td and the corresponding scheme in EvaluateSentenceNoRestriction (Table 5.15). This reveals that EvaluateSentenceUpTo5td and EvaluateSentenceNoRestriction generated similar results in clustering accounting terms (88% and 92% identical clusters respectively). The lower identical-cluster percentage (50%) was caused by the asymmetry factor, which will be discussed later. On the other hand, the percentages of identical clusters for EvaluateSentenceUpTo5td were all much higher than those for EvaluateSentence0td (88% > 42%, 92% > 42%, and 50% > 24%). This indicated that, compared with EvaluateSentence0td, the EvaluateSentenceUpTo5td level was much closer to EvaluateSentenceNoRestriction.
% of Identical Clusters Compared with Clusters in EvaluateSentenceNoRestriction

                                  SentenceNoRestriction AB_A   SentenceNoRestriction BA_A   SentenceNoRestriction AB_B
EvaluateSentence0tdAB_A           42%                          N/A                          N/A
EvaluateSentence0tdBA_A           N/A                          42%                          N/A
EvaluateSentence0tdAB_B           N/A                          N/A                          24%
EvaluateSentenceUpTo5tdAB_A       88%                          N/A                          N/A
EvaluateSentenceUpTo5tdBA_A       N/A                          92%                          N/A
EvaluateSentenceUpTo5tdAB_B       N/A                          N/A                          50%

Table 5.15: Identical Cluster Percentages, Comparing the Same Scheme in Sentence0td and SentenceUpTo5td with SentenceNoRestriction Respectively

However, the differences between the clustering data in EvaluateSentence0td and those in EvaluateSentenceNoRestriction were not significant either, because the identical-cluster percentages were not low. This can be further supported by examining the statistical analysis of the expert's evaluation results for EvaluateSentence0td against EvaluateSentenceUpTo5td and against EvaluateSentenceNoRestriction. In Table 5.16 we can see that the relation types selected and the mean relevance scores given by the expert show some variation, but the differences are not significant. Therefore, it is not necessary to further restrict the relative location of the terms within the same sentence to achieve better concepts.

I. Statistical Analysis of the Expert's Evaluations of Relation Types a)~h) (%), and III. Mean Score (1-5):

Scheme                                a)   b)   c)   d)   e)   f)   g)   h)   Mean Score
EvaluateSentence0tdAB_A               0%   4%   46%  42%  6%   2%   0%   0%   4.58
EvaluateSentenceUpTo5tdAB_A           2%   8%   36%  42%  2%   8%   0%   2%   4.46
EvaluateSentenceNoRestrictionAB_A     0%   10%  36%  44%  2%   6%   0%   2%   4.46
EvaluateSentence0tdBA_A               0%   2%   52%  42%  2%   0%   0%   2%   4.40
EvaluateSentenceUpTo5tdBA_A           4%   2%   34%  54%  2%   2%   2%   0%   4.52
EvaluateSentenceNoRestrictionBA_A     4%   2%   30%  58%  2%   2%   2%   0%   4.52
EvaluateSentence0tdAB_B               0%   2%   44%  28%  6%   18%  0%   2%   4.06
EvaluateSentenceUpTo5tdAB_B           0%   0%   46%  12%  4%   36%  0%   2%   4.02
EvaluateSentenceNoRestrictionAB_B     0%   0%   38%  8%   6%   42%  0%   4%   3.94

Table 5.16: Comparing the Statistical Analysis of the Expert's Evaluation Results of the Same Scheme in EvaluateSentence0td, EvaluateSentenceUpTo5td, and EvaluateSentenceNoRestriction
With relation types a) "Synonyms"; b) "Antonyms"; c) "Broader term"; d) "Subgroup"; e) "Distinct but forming a new concept"; f) "Partial relation"; g) "Other relation"; h) "No direct relation".

Similarly, when we contrasted the statistical analysis of the expert's evaluation results in EvaluateSentenceUpTo5td with those in EvaluateSentenceNoRestriction, we found very similar outcomes for both the relation types and the relevance scores marked by the expert (see Table 5.17). This agreed with findings from prior studies that restricting windows to at most five words leads to results almost identical to those obtained by examining entire sentences.

Scheme                                a)   b)   c)   d)   e)   f)   g)   h)   Mean Score
EvaluateSentenceUpTo5tdAB_A           2%   8%   36%  42%  2%   8%   0%   2%   4.46
EvaluateSentenceNoRestrictionAB_A     0%   10%  36%  44%  2%   6%   0%   2%   4.46
EvaluateSentenceUpTo5tdBA_A           4%   2%   34%  54%  2%   2%   2%   0%   4.52
EvaluateSentenceNoRestrictionBA_A     4%   2%   30%  58%  2%   2%   2%   0%   4.52
EvaluateSentenceUpTo5tdAB_B           0%   0%   46%  12%  4%   36%  0%   2%   4.02
EvaluateSentenceNoRestrictionAB_B     0%   0%   38%  8%   6%   42%  0%   4%   3.94

Table 5.17: Comparing the Statistical Analysis of the Expert's Evaluation Results of the Same Scheme in EvaluateSentenceUpTo5td and EvaluateSentenceNoRestriction
With relation types a) "Synonyms"; b) "Antonyms"; c) "Broader term"; d) "Subgroup"; e) "Distinct but forming a new concept"; f) "Partial relation"; g) "Other relation"; h) "No direct relation".
5.5.2.3 Directionality and Asymmetry (Q2.3)

Directionality - When we compared the expert's statistical evaluation results for the same given term (AB_A versus BA_A) at each level, the differences in the relation types and relevance scores assigned by the expert were small (see Table 5.18 below, as well as Table 5.13 above). We thus learned that, given the same term, the directionality of two terms' co-occurrence has no significant impact on the performance of grouping related accounting concepts.

Compare Directionality - I. Statistical Analysis of the Expert's Evaluations of Relation Types a)~h) (%), and III. Mean Score (1-5):

Scheme                                a)   b)   c)   d)   e)   f)   g)   h)   Mean Score
EvaluateSentence0tdAB_A               0%   4%   46%  42%  6%   2%   0%   0%   4.58
EvaluateSentence0tdBA_A               0%   2%   52%  42%  2%   0%   0%   2%   4.40
EvaluateSentenceUpTo5tdAB_A           2%   8%   36%  42%  2%   8%   0%   2%   4.46
EvaluateSentenceUpTo5tdBA_A           4%   2%   34%  54%  2%   2%   2%   0%   4.52
EvaluateSentenceNoRestrictionAB_A     0%   10%  36%  44%  2%   6%   0%   2%   4.46
EvaluateSentenceNoRestrictionBA_A     4%   2%   30%  58%  2%   2%   2%   0%   4.52
EvaluateParagraphNoRestrictionAB_A    4%   4%   50%  34%  4%   2%   0%   2%   4.22
EvaluateParagraphNoRestrictionBA_A    4%   4%   42%  42%  2%   2%   0%   4%   4.30
EvaluateDocumentNoRestrictionAB_A     0%   4%   38%  22%  0%   26%  0%   10%  3.60
EvaluateDocumentNoRestrictionBA_A     2%   0%   58%  18%  0%   12%  0%   10%  3.82

Table 5.18: Directionality - Comparing the Statistical Analysis of the Expert's Evaluation Results of the Scheme AB_A at Five Levels and the Scheme BA_A at Five Levels
With relation types a) "Synonyms"; b) "Antonyms"; c) "Broader term"; d) "Subgroup"; e) "Distinct but forming a new concept"; f) "Partial relation"; g) "Other relation"; h) "No direct relation".

Asymmetry - Table 5.19 demonstrates, however, that the schemes using one or the other of two given terms to predict their affinity actually extracted very dissimilar concepts (see the relation types chosen by the expert). Moreover, the statistical relevance scores of the relations grouped by the schemes given term A (AB_A) were always much higher than those given term B (AB_B) (see Table 5.13 above). As noted above, the tokens were arranged (by ID) according to decreasing frequency in the collection. Hence, the scheme using the more frequent term (term A) as the given term always performed better than the scheme using the less frequent term (term B) throughout the entire collection. This can be interpreted as follows: the more frequently a term appears in a text, the more likely it is that diverse relevant terms will co-occur with it in various proximate contexts. In other words, the more common term seems to be a better "generator" of composite terms. Therefore, using the more frequent term extracts more meaningful and accurate concepts.
Compare Asymmetry - I. Statistical Analysis of the Expert's Evaluations of Relation Types a)~h) (%), and III. Mean Score (1-5):

Scheme                                a)   b)   c)   d)   e)   f)   g)   h)   Mean Score
EvaluateSentence0tdAB_A               0%   4%   46%  42%  6%   2%   0%   0%   4.58
EvaluateSentence0tdAB_B               0%   2%   44%  28%  6%   18%  0%   2%   4.06
EvaluateSentenceUpTo5tdAB_A           2%   8%   36%  42%  2%   8%   0%   2%   4.46
EvaluateSentenceUpTo5tdAB_B           0%   0%   46%  12%  4%   36%  0%   2%   4.02
EvaluateSentenceNoRestrictionAB_A     0%   10%  36%  44%  2%   6%   0%   2%   4.46
EvaluateSentenceNoRestrictionAB_B     0%   0%   38%  8%   6%   42%  0%   4%   3.94
EvaluateParagraphNoRestrictionAB_A    4%   4%   50%  34%  4%   2%   0%   2%   4.22
EvaluateParagraphNoRestrictionAB_B    0%   0%   40%  14%  0%   46%  0%   0%   3.58
EvaluateDocumentNoRestrictionAB_A     0%   4%   38%  22%  0%   26%  0%   10%  3.60
EvaluateDocumentNoRestrictionAB_B     0%   0%   20%  0%   4%   76%  0%   0%   3.08

Table 5.19: Asymmetry - Comparing the Statistical Analysis of the Expert's Evaluation Results of the Scheme AB_A at Five Levels and the Scheme AB_B at Five Levels
With relation types a) "Synonyms"; b) "Antonyms"; c) "Broader term"; d) "Subgroup"; e) "Distinct but forming a new concept"; f) "Partial relation"; g) "Other relation"; h) "No direct relation".

[...]

• ... any two tokens in the order A->B / B->A were next to each other with zero token distance (0td) between them in the same sentence; regardless of how many of these situations occurred in that same sentence, we counted only one per sentence.

6.2.2 Four Cases - Estimating Affinities

In our main research, the three affinity probability estimations we explored - #<A,B>/#A, #<B,A>/#A, and #<A,B>/#B - were sufficient to address our research questions regarding differences in distance, order and asymmetry between two terms. In automatic phrase generation, our focus was to explore the feasibility of using all four cases to automatically form phrases at the Sentence0td level. Therefore we added the fourth case, #<B,A>/#B. In all four cases we defined the statistical affinities in the same way as in the main research. The program then processed the text to calculate the affinities as follows:

• #<A,B>/#A: # of sentences having B AFTER A with 0td / # of sentences having A
• #<B,A>/#A: # of sentences having B BEFORE A with 0td / # of sentences having A
• #<A,B>/#B: # of sentences having A BEFORE B with 0td / # of sentences having B
• #<B,A>/#B: # of sentences having A AFTER B with 0td / # of sentences having B

The program also counted the number of sentences in which each token appeared and then processed all of the tokenized files to compute any two tokens' probabilistic affinity values for all 905 single-token subjects.

6.3 Clustering 905 Single Tokens

In this small project, we used the same Matlab hierarchical clustering techniques as in the main research. The program transformed all the affinity values using the formula (1 - Statistical Affinity Value) to obtain the distance values of any two tokens. These distance values were then entered into the Matlab linkage function to form hierarchical trees for each of the four cases. Table 6.1 shows the top ten clustering data outputs for the case Sentence0tdAB_A (see Appendix E2 for the top 50 clustering outputs for the four cases in the Sentence0td Single Token Cluster Data):

Cluster Index   Object in First Group    Object in Second Group   Distance
906             635; representational    638; faithfulness        0.056818
907             827; safe                834; harbor              0.16667
908             356; health              362; care                0.19312
909             428; joint               440; venture             0.27984
910             123; balance             181; sheet               0.40346
911             377; pro                 453; forma               0.4244
912             414; conceptual          431; framework           0.44304
913             105; foreign             107; currency            0.46311
914             660; growing             686; timber              0.47887
915             33; cash                 73; flow                 0.49396

Table 6.1: Top Ten Automatic Phrase Clustering Data in Sentence0tdAB_A

6.4 Evaluation - 905 Single Token Clustering

To maintain consistency with our main research, in this small project we again evaluated only the top 50 clusters of each case.
Because the clustered terms themselves revealed valuable information, we further evaluated the resulting clusters.

6.4.1 A Sample Evaluation

We created evaluation tables by replacing the token numbers with term names. Then we assessed whether the terms in each cluster actually formed valid phrases. Table 6.2 shows the example of the top five clusters in Sentence0tdAB_A:

Cluster 1: representational | faithfulness (2 terms) - 100% match with the original manual phrases; same order.
Cluster 2: safe | harbor (2 terms) - 67% match; same order; explanation: safe_harbor_leases.
Cluster 3: health | care (2 terms) - 67% match; same order; explanation: health_care_providers.
Cluster 4: joint | venture (2 terms) - 100% match; same order.
Cluster 5: balance | sheet (2 terms) - 100% match; same order.

Table 6.2: Sample Evaluation of 905 Single Token Clusters - Top Five Clusters in Sentence0tdAB_A

As in the main research evaluations, the first column listed the cluster numbers, which also ordered the clusters by ascending distance between their terms. The terms were likewise shown in two groups, and the total number of terms was also listed. In the "1 - Automatic Phrase Evaluation" category, we included the following possible alternatives:

• "Percentage (%) of Matching Original Manual Phrases": we combined the terms of the two groups in the order of the first group and then the second group. For example, in cluster #3 we got "health care" (two words). We then compared these terms, in that order, with the original 1,774-phrase controlled list from our main research (see Experiment - Section 5.1.4 above) to see whether we could find matching phrases there. In the 1,774-phrase list we found a similar phrase, "health care providers" (three words). Only two words matched the original three words of the similar phrase, so the matching percentage was 67% (2/3) and we entered the number "67" in this percentage column. We also indicated the complete original phrase in the "2 - Explanation" column, in this case "health_care_providers" ("_" was used by the program to link the words together as one phrase). Note that because the resulting cluster terms were produced by the program after all plural phrases had already been transformed into singulars, we could ignore the plural or singular pattern of each word when comparing the terms with the original phrase list. Another example is cluster #5: we combined the words into "balance sheet" and found the identical phrase in the 1,774-phrase list - "balance sheet." The matching percentage was therefore 100% (2/2), so we put the number "100" in this column, and there was no need to indicate the original phrase in the Explanation column. Here we counted the terms combined in each cluster in any order, and we calculated what percentage of the terms matched the original phrases. For example, in Sentence0tdBA_A cluster #17, the first group had the term "entry" and the second group had the term "journal." When we combined them in that order we obtained the new phrase "entry journal," which exactly (100%) matched our original phrase "journal entry" (by switching the order).
\u00E2\u0080\u00A2 \"Could Not Match Original, But Can Form a New Accounting Phrase (put \"1\" here)' i f we could not find a match in our original 1,774 phrase list (the matching percentage thus being 0%), we then checked whether combining these terms in the newly formed clusters could generate other meaningful accounting phrases that we could think of yet that were not included in our 1,774 phrases. This analysis was based on our common knowledge without reference to other resources. If such phrases were generated, we put \"1\" in this column, to be added together in the end. Meanwhile, i f the order was different, we then indicated the phrase with the right order. For example, in SentenceOtdAB_A, the cluster #42 included two terms, \"minority shareholder.\" Though we could not find a similar phrase in the list of 1,773 phrases, \"minority shareholder\" all together could compose another valid accounting phrase, and we then entered \"1\" in this column. If the newly formed accounting phrase was not in the right order but still could make sense when we switched the order among the terms, we also could enter \"1\" in this column, but we needed to write down the correct phrase in the Explanation column. As an example, SentenceOtdBA_A the cluster #34 included two terms, \"date effective,\" which we believed could compose a new accounting phrase but not in the right order. However, when we switched their order, we then got a new meaningful accounting phrase -65 -\"effective date,\" and so we entered'T\" in this column and recorded the right phrase, \"effective_date,\" in the Explanation column. \u00E2\u0080\u00A2 \"Ordering Indication (put\" 1\" here for same order)\": i f connecting the terms in the first group and then the second group of each cluster had the same order as in the original 1,774 phrases or had the same order as other accounting phrases we could think of, we then put \"1\" here. For example, in SentenceOtdAB_A cluster #3, we got \"health care,\" which was the same order as the original phrase \"health care providers\" (\"health\" was in the first group, \"care\" was in the second group), though not as complete as the original phrase. We then entered \"1\" in this column. As a further example, in SentenceOtdAB_A the cluster #42 included two terms \"minority shareholder,\" which we believed could compose a new accounting phrase and in the same order. \u00E2\u0080\u00A2 \"Ordering Indication (put\" 1\" here for opposite order)\": i f connecting the terms in the first group and then the second group of each cluster had the opposite order as the original phrase or had the opposite order with other accounting phrases we could think of, then we put \"1\" here. In SentenceOtdBA_A cluster #17, the first group included the term \"entry,\" and the second group included the term \"journal.\" When we connected them in that order, we obtained \"entry journal\" (\"entry\" was in the first group, \"journal\" was in the second group) which was the opposite order to the original phrase \"journal entry,\" though the percentage matching was 100% for this cluster. We then entered \"1\" in this column. 
To note yet another example, in SentenceOtdBA_A cluster #34 included the two terms "date effective," which we believed could compose a new accounting phrase but in the opposite order, because the meaningful accounting phrase should be "effective date."

• "Not an Accounting Phrase (put "1" here)": if the term matching percentage was 0% and no combination of the terms in a cluster could make up another meaningful accounting phrase either, we entered "1" in this column. This meant the terms in the cluster failed to form a phrase.

6.4.2 Statistical Evaluation Results

Based on the sample evaluation input detailed above, we obtained statistical evaluation results for each of the four cases (see Appendix E3 - SentenceOtd Evaluate Automatic Phrases). Table 6.3 summarizes the statistical results for all four cases.

All Cases - Statistical Automatic Phrase Evaluation
(columns: % Matching Original Manual Phrases; % of 100% Matching Phrases; Not Matching Original, But Can Form a New Accounting Phrase (%); Ordering Indication - Same Order (%); Ordering Indication - Opposite Order (%); Not an Accounting Phrase (%))

Automatic Phrase in SentenceOtdAB_A: 61.8%; 32%; 6%; 92%; 0%; 8%
Automatic Phrase in SentenceOtdBA_A: 67.4%; 46%; 6%; 0%; 88%; 12%
Automatic Phrase in SentenceOtdAB_B: 32.8%; 12%; 2%; 56%; 0%; 44%
Automatic Phrase in SentenceOtdBA_B: 59%; 38%; 4%; 0%; 86%; 14%

Table 6.3: Statistical Evaluation of Automatic Phrases for the Four Cases at the SentenceOtd Level

The additional column "Percentage of 100% Matching Phrases" was constructed by adding the percentage of clusters that completely matched original phrases from the 1,774-phrase list (100% matching regardless of order) to the percentage of generated terms that completely matched other accounting phrases (when no similar phrase could be found among the 1,774 original phrases). We then calculated this percentage of matches for each case.

6.5 Automatic Phrase Discussions - 905 Single Token Clustering

When we studied Table 6.3, we derived the following interesting findings:

6.5.1 Our Techniques Can Automatically Identify Some Accounting Phrases

The statistics showed that the terms grouped by our automatically generated clusters could make up many meaningful phrases, because the percentage of matching original phrases plus the percentage of newly formed accounting phrases was not low in these four cases (three out of four cases were above 50%). In addition, only a small number of clusters failed to group accounting phrases, in that the percentage of "Not an Accounting Phrase" was low (all were less than 50%, and three out of four were even less than 20%). Nevertheless, among the top 50 clusters all the numbers in the "Percentage of 100% Matching Phrases" category were less than 50%, which indicated that our techniques automatically grouped many terms into incomplete phrases, though these incomplete phrases could still make some sense.
This can be easily understood: in this rudimentary study the text consisted mostly of single tokens that had been broken down from the original phrases, and we only investigated cases where two tokens were immediately next to each other (in other words, there was no noise between them), so it was very likely that the resulting clusters could identify potentially useful accounting phrases.

Although our proposed techniques were valid both for automatically identifying accounting compound concepts and for automatically grouping accounting phrases, these two effects result from different directions. As many prior studies on automatically deriving accounting concepts have suggested and verified, we also highly recommend including term-phrases in the research vocabulary when extracting automatic accounting concepts. Each extracted accounting concept consists not only of individual related terms but also of more common term-phrases that belong to the same topic. In contrast, the automatic phrase techniques have to combine the terms in each cluster to form term-phrases, and each term-phrase represents only a specific term, not a concept.

6.5.2 Directionality and Asymmetry

• Directionality - When we compared "Automatic Phrase in SentenceOtdAB_A" with "Automatic Phrase in SentenceOtdBA_A" (see Table 6.3), there was no clear indication that the order in which a term appears affects the quality of cluster generation, in that most evaluation statistics were not very different between these two cases. However, the order of the phrases formed by the resulting cluster terms varied systematically with the order of the two tokens A and B used to derive the clusters: SentenceOtdAB_A grouped all same-order accounting phrases, while SentenceOtdBA_A grouped all opposite-order accounting phrases. This can be explained as follows:

• Direction AB - we computed two-term affinities for the order in which token A appeared before token B (order AB) in the original text, which also meant that for this order the first token A's ID was always smaller than the second token B's ID. For example, for token A, "financial," with token ID #14 and token B, "reporting," with token ID #54, we computed the statistical affinity values for these two tokens appearing in the order "financial reporting" in the original text. We then derived clusters for this case. When Matlab linked two objects into a cluster, it always placed the smaller object number (in our project, the smaller ID - token A) into the first group and the larger object number (the larger ID - token B) into the second group. The potential automatic phrase for SentenceOtdAB_A could therefore be obtained in the same correct order AB simply by linking the higher-frequency term (smaller ID - token A), located in the first group, followed by the lower-frequency term (larger ID - token B), located in the second group of each resulting cluster. So in "Automatic Phrase in SentenceOtdAB_A" cluster #48, the automatic phrase "financial reporting" could be obtained in the correct order by directly linking the first-group term "financial" (ID #14, frequency 14,727) to the second-group term "reporting" (ID #54, frequency 5,265). This is why "Automatic Phrase in SentenceOtdAB_A" grouped all same-order accounting phrases in Table 6.3.
• Direction BA - we computed two-term affinities for the order in which token B appeared before token A (order BA) in the original text, which also meant that for this order the first token B's ID was always larger than the second token A's ID. As an example, for token B, "financial," with token ID #14, and token A, "statement," with token ID #7, we computed the statistical affinity values for these two tokens appearing in the order "financial statement" in the original text. We then derived clusters for this case. When Matlab linked two objects into a cluster, it always placed the smaller object number (in our project, the smaller ID - token A) into the first group and the larger object number (the larger ID - token B) into the second group. The potential automatic phrase for SentenceOtdBA_A could thus be obtained in the correct order BA by linking the lower-frequency term (larger ID - token B), located in the second group, followed by the higher-frequency term (smaller ID - token A), located in the first group of each resulting cluster. Hence, in "Automatic Phrase in SentenceOtdBA_A" cluster #25, the automatic phrase "financial statement" could be obtained in the correct order by directly linking the second-group term "financial" (ID #14, frequency 14,727) to the first-group term "statement" (ID #7, frequency 30,033). This explains why "Automatic Phrase in SentenceOtdBA_A" grouped all opposite-order accounting phrases in Table 6.3. Exactly the same directionality patterns occurred for "Automatic Phrase in SentenceOtdAB_B" and "Automatic Phrase in SentenceOtdBA_B."

• Asymmetry - We found that "Automatic Phrase in SentenceOtdAB_A" grouped more meaningful accounting phrases than "Automatic Phrase in SentenceOtdAB_B," because the numbers in "Percentage of Matching Original Manual Phrases," "Percentage of 100% Matching Phrases" and "Not Matching Original, But Can Form a New Accounting Phrase" were all greater in SentenceOtdAB_A. Furthermore, the number in "Not an Accounting Phrase" was smaller in SentenceOtdAB_A than in SentenceOtdAB_B (see Table 6.4). Exactly the same pattern occurred for "Automatic Phrase in SentenceOtdBA_A" and "Automatic Phrase in SentenceOtdBA_B."

All Cases - Statistical Automatic Phrase Evaluation
(columns: % Matching Original Manual Phrases; % of 100% Matching Phrases; Not Matching Original, But Can Form a New Accounting Phrase (%); Ordering Indication - Same Order (%); Ordering Indication - Opposite Order (%); Not an Accounting Phrase (%))

Automatic Phrase in SentenceOtdAB_A: 61.8%; 32%; 6%; 92%; 0%; 8%
Automatic Phrase in SentenceOtdAB_B: 32.8%; 12%; 2%; 56%; 0%; 44%
Automatic Phrase in SentenceOtdBA_A: 67.4%; 46%; 6%; 0%; 88%; 12%
Automatic Phrase in SentenceOtdBA_B: 59%; 38%; 4%; 0%; 86%; 14%

Table 6.4 (rearranged Table 6.3): Asymmetry - Comparing the Evaluation of Automatic Phrases in SentenceOtdAB_A with SentenceOtdAB_B, and in SentenceOtdBA_A with SentenceOtdBA_B

The above contrasts revealed that using term A, the term with the higher frequency in the entire collection, as the given term always yielded better performance in automatically grouping phrases than using term B, the term with the lower frequency. This also agreed with our main research findings.
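For illustration, the phrase-assembly and matching steps described above could be approximated in code as follows. This is a hypothetical sketch rather than the procedure actually used (which was performed manually); the function names, the crude singularisation and the exact matching rule are assumptions of the example:

    def cluster_phrase(first_group_terms, second_group_terms, case):
        """Assemble the candidate phrase from a cluster: first group then second
        group for the AB_* cases, second group then first group for the BA_* cases."""
        if case.startswith('BA'):
            return second_group_terms + first_group_terms
        return first_group_terms + second_group_terms

    def best_match_percentage(candidate, control_phrases):
        """Score like the 'Percentage of Matching Original Manual Phrases' column:
        find a control phrase containing every candidate word (order and
        plural/singular differences ignored) and report how much of that control
        phrase the candidate covers, e.g. 'health care' vs 'health care providers' -> 67."""
        words = {w.lower().rstrip('s') for w in candidate}   # crude singularisation
        best = 0.0
        for phrase in control_phrases:
            control = [w.lower().rstrip('s') for w in phrase.replace('_', ' ').split()]
            matched = sum(1 for w in control if w in words)
            if matched == len(words):                        # every candidate word appears
                best = max(best, matched / len(control))
        return round(100 * best)

    # Example: cluster #3 of SentenceOtdAB_A
    # cluster_phrase(['health'], ['care'], 'AB_A')                         -> ['health', 'care']
    # best_match_percentage(['health', 'care'], ['HEALTH CARE PROVIDERS']) -> 67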
7 CONCLUSIONS AND FUTURE RESEARCH

7.1 Conclusions

Our research was an exploratory study comparing the use of different textual units in automatically identifying compound concepts and automatically forming phrases in a given context. We used the accounting context as a case study. Building on previous research, we developed our own domain-specific preprocessing, automatic indexing and statistical affinity-computing methods. The hierarchical clustering outputs extracted from fifteen different schemes were analyzed and evaluated by an accounting expert and also by reference to the Price Waterhouse Thesaurus. The schemes in our main experiment differed in textual unit size and in the affinity measure used. The outcomes were encouraging and answered our research questions.

Our proposed approach was capable of automatically identifying potential accounting compound concepts (at least 90% of the clusters in most schemes grouped meaningful concepts) and of automatic accounting phrase formation. The expert's evaluation revealed that the most frequent relationships suggested by the concepts were "Broader Term" (one term is broader than the other) and "Subgroup" (the terms are not broader terms of each other, but are subgroups of another broader concept).

We also studied several issues that could affect the quality of the results. Analysis of relationships between terms within sentences, within paragraphs, and within documents generated results of varying quality: the sentence level produced the best results while the document level produced the least usable results. Regarding relationships of terms within the same sentence, the findings of previous studies were also verified by our research: restricting the window to at most five words groups concepts very similar to those formed within the entire sentence. Restricting the process to terms separated by fewer than five other words within a single sentence does not seem to significantly improve clustering performance.

The order in which any particular pair of terms occurred did not exhibit any explicit impact on the automatic concepts generated. In cases where the higher-frequency term (A) appeared before the lower-frequency term (B) in the original text (A->B), the potential phrase could be formed automatically simply by linking the first-group term and then the second-group term of each resulting cluster. On the other hand, when the lower-frequency term (B) appeared before the higher-frequency term (A) in the original text (B->A), linking the second-group term and then the first-group term gave the automatic phrase. Moreover, for any two terms, we found that normalization based on the higher-frequency term (AB/A or BA/A) yielded much better results than normalization based on the lower-frequency term (AB/B or BA/B). This was true both for the automatic identification of compound concepts and for the automatic identification of phrases.

7.2 Contributions

Our research studied a set of techniques for automatic compound-concept identification and automatic phrase generation. While some work in these areas has been done before, the techniques we have employed are novel. Our distance measures were based on affinities, rather than on the similarities studied by most researchers, and therefore our classified concepts identified composite relationships among terms (instead of only synonyms). The thesis explored term proximities in different textual units and the effects of directionality and asymmetry between two terms.
Again, little research had previously been done in this field. Our preliminary study indicated that analysis at the document level, and the use of low-frequency words to generate composite terms, led to poor results in generating meaningful concepts. This finding could direct future research. Our research has demonstrated the realistic possibility of grouping terms into compound concepts, a process that can provide users with assistance in judging how concepts are constructed, and that can thereby assist them in searching for useful information about specific concepts.

7.3 Limitations and Future Research

To obtain a useful index of terms and generate better results, a significant manual effort was involved in our "preprocessing - tokenizing" stage. In particular, this included the identification of domain-specific phrases and domain-dependent stemming (consolidating terms). In the automatic phrase study we also used domain-dependent stemming; in the future, however, more automatic term-identification techniques should be studied to further reduce the human effort required while maintaining the accuracy of the terms selected by the automatic program. Although there is little theoretical literature to support our approach, it would be possible to extend this study to construct a domain-specific thesaurus based on the parameters identified in our research as influencing the extracted concepts.

BIBLIOGRAPHY

Accounting Dictionary - Accounting Glossary - Accounting Terms. [Online] Available: http://www.ventureline.com/glossary.asp

Besancon, R., Rajman, M., & Chappelier, J. (1999). Textual similarities based on a distributional approach. 10th International Workshop on Database & Expert Systems Applications, September 01-03, Florence, Italy.

Callan, J. (1995). Controlled vocabularies and ontologies. Carnegie Mellon University, 95-778 Digital Libraries. [Online] Available: http://hartford.lti.cs.cmu.edu/classes/95-778/Lectures/03-CtrlVocab.pdf

Caraballo, S.A. (1999). Automatic construction of a hypernym-labeled noun hierarchy from text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, College Park, Maryland, USA, 120-126.

Chen, H., & Lynch, K.J. (1992). Automatic construction of networks of concepts characterizing document databases. IEEE Transactions on Systems, Man, and Cybernetics, 22(5), 885-902.

Chen, H., Lynch, K.J., Basu, K., & Ng, D.T. (1993). Generating, integrating, and activating thesauri for concept-based document retrieval. IEEE Expert, Special Series on Artificial Intelligence in Text-Based Information Systems, 5(2), 25-34.

Chen, H., Martinez, J., Kirchhoff, A., Ng, T.D., & Schatz, B.R. (1998). Alleviating search uncertainty through concept associations: automatic indexing, co-occurrence analysis, and parallel computing. Journal of the American Society for Information Science, 49(3), 206-216.

Chen, H., Ng, T.D., Martinez, J., & Schatz, B.R. (1997). A concept space approach to addressing the vocabulary problem in scientific information retrieval: an experiment on the worm community system. Journal of the American Society for Information Science, 48(X), 17-31.

Chen, H., Hsu, P., Orwig, R., Hoopes, L., & Nunamaker, J.F. (1994). Automatic concept classification of text from electronic meetings. Communications of the ACM, 37(10), 56-73.

Chen, H., Schatz, B.R., Yim, T., & Fye, D. (1995). Automatic thesaurus generation for an electronic community system.
Journal of the American Society for Information Science, 46(3), 175-193. - 7 6 -Church, K. W., & Hanks P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16, 22-29. Crouch, C. J. (1990). An approach to the automatic construction of global thesauri. Information Processing and Management. 26(5), 629-640. Curran, J. R., & Moens, M . (2002). Improvements in Automatic Thesaurus Extraction. In Proceedings of the Workshop on Unsupervised Lexical Acquisition Philadelphia, PA, USA, 59 - 67 Dagan, I., Marcus S., & Markovitch S. (1995). Contextual word similarity and estimation from sparse data. Meeting of the Association for Computational Linguistics. [Online] Available: http://acl. Idc. upenn. edu/P/P93/P93-l022.vdf. 164-171. Furnas, G. W., Landauer, T. K., Gomez, L. M . , & Dumais, S. T. (1987). The vocabulary problem in human-system communication. Communications of the ACM, 50(11), 964-971. Gangolly, J., & Wu, Y.F. (2000). On the automatic classification of accounting concepts: preliminary results of the statistical analysis of term-document frequencies. The New Review of Applied Expert Systems and Emerging Technologies, (6), 81-88. Garnsey, M . (2001). The use of latent semantic indexing and agglomerative clustering to automatically classify accounting concepts: a report of the preliminary findings. The New Review of Applied Expert Systems, (7), 129-140. Garnsey, M . (2002). Automatic classification of financial accounting concepts. In A.E . Baldwin and C E . Brown (Eds.) Collected Papers of the Eleventh Annual Research Workshop on: Artificial Intelligence and Emerging Technologies in Accounting, Auditing and Tax, 15-24. Gietz, P. (2001). Report on automatic classification systems for the T E R E N A activity portal coordination. [Online] Available: http://www, daasi. de/reports/Report-automatic-classification.html Grefenstette, G. (1993). Automatic thesaurus generation from raw text using knowledge-poor techniques. 9th Annual Conference of the University of Waterloo, Centre for the New Oxford English Dictionary and Text Research, Oxford. Grefenstette, G. (1994). Exploration in automatic thesaurus discovery. Kluwer Academic Publishers Boston/Dordrecht / London. -11 -Jang, M . , Myaeng, S.H., & Park, S.Y. (1999). Using mutual information to resolve query translation ambiguities and query term weighting. Annual Meeting of the ACL, Proceeding of the 37th Conference on Association for Computational Linguistics, College Park, Maryland, USA, 223-229. Hauck, R., Sewell, R., Ng, D. T., & Chen, H. (2001). Concept-based searching and browsing a geoscience experiment. Journal of Information Science, 27(4), 199-210. K P M G Consulting L L C . (2000). Companies suffer from information overload, according to K P M G consulting knowledge management report. [Online] Available: http://web.lexis-nexis.com/universe/(4127/Q0). Lassi, M . (2002). Automatic thesaurus construction. [Online] Available: http://www.adm.hb.se/personal/mol/gslt/thesauri.pdf Leory, G., & Chen, H. (2001). Meeting medical terminology needs - the ontology-enhanced medical concept mapper. IEEE Transactions on Information Technology in Biomedicine, 5(4), 261-270. Losee, Jr. R . M . (1994). Term Dependence: Truncating the Bahadur Lazarsfeld expansion. Information Processing & Management, 30(2), 293-303. Martin, W. J.R., A l B.P.F., & van Sterkenburg P.J.G. (1983). On the processing of a text corpus: from textual data to lexicographical information. 
Lexicography: Principles and Practice (Applied Language Studies Series), Hartman R.R.K, Ed. London: Academic. Merriam-Webster online dictionary and thesaurus [Online] Available: http://www.m-w.com Miller, G. A. , Beckwith, R., Fellbaum, C , Gross, D., & Miller K. (1993). Introduction to WordNet: an on-line lexical database. [Online] Available: http://wwwl.cs. Columbia. edu/~radev/cs4999/notes/5papers. pdf Milstead, J. L. (2000). About thesauri. [Online] Available: http://www.bay side-indexing. com/Milstead/about, htm Price Waterhouse & Co. (1974). Thesaurus of accounting and auditing terminology. New York: Price Waterhouse & Co. Rasmussen, E. (1992). Clustering algorithms. In W. B. Frakes & R. Baeza - Yates (Eds.) Information Retrieval: Data structures and algorithms. Engelwood Cliffs, NJ : Prentice Hall. - 7 8 -Rungsawang, A . (1998). A distributional semantics based information retrieval system. The National Computer Science and Engineering Conference (NCSEC'98). Salton, G. (1989). Automatic text processing. Reading, MA: Addison-Wesley Publishing Company Inc. Salton, G., & Buckley C. (1991). Automatic text structuring and retrieval-experiments in automatic encyclopedia searching. In G. Salton A. Bookstein, Y. Chiaramella and V.V Raghavan, editors, Proceedings of the Fourteenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, 21-30. SchUtze, H. , & Pedersen, J.O. (1997). A cooccurrence-based thesaurus and two applications to information retrieval. Information Processing and Management, 33(3), 307-318. Statistics Toolbox for Use with Matlab User's Guide (Version 3) Tutorial - Cluster Analysis. (1-53 to 1-67). [Online] Available: http://www, busim. ee. boun. edu. tr/~resources/stats_tb. pdf Stickney, C P . , & Weil, R.L. (1997). Financial accounting an introduction to concepts, methods, and uses (Eighth Edition). The Dryden Press, Harcourt Brace College Publishers. Sowa, J . F. (2000) Concepts in the lexicon: introduction [Online] Available: http://www. jfsowa. com/ontology/lexicon. htm Zhang, R., & Rudnicky, A.I. (2002). Improve latent semantic analysis based language model by integrating multiple level knowledge. Proceedings of ICSLP 2002 (Denver, Colorado), 893-896. - 7 9 -Appendix A: Domain Dependent Preprocessing - Tokenizing 5.1.2 Reformatting the Document Structure Though the database had a filename for every document in the collection, the filename could not reflect its own content effectively. Therefore, a mapping file was created that automatically extracted the topic of each file (for example, A R B 49: Earnings per Share) in the database and linked the file with its associated filename (for example, fars-0008.txt) in the mapping file. Since the paragraph level calculation was one of the core tasks of this study, we had to make sure the program could recognize paragraph sections. The original files had manual line-break-characters in each line within each paragraph that confused the division between paragraphs. In addition, the program could not recognize the phrase tokens if there was an end of line character inside the term phrases. We noticed in the original documents that i f the ends of line characters appeared consecutively it meant the end of a paragraph. Thus, the program got rid of the extra ends of line characters so that there would be only one end of line character for each paragraph. 
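A minimal sketch of this reformatting step, under the assumption that the raw files use ordinary newline characters, is shown below (illustrative only; the thesis's own program is not shown). Runs of two or more end-of-line characters are treated as paragraph boundaries, and the remaining line breaks inside a paragraph are folded into spaces so that term-phrases are no longer split across lines:

    import re

    def normalise_paragraphs(raw_text):
        # Consecutive end-of-line characters mark the end of a paragraph; single line
        # breaks inside a paragraph are folded into spaces, and each paragraph is
        # written back followed by exactly one end-of-line character.
        paragraphs = re.split(r'\n{2,}', raw_text)
        return '\n'.join(' '.join(p.split()) for p in paragraphs if p.strip())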
Since the sentence level calculation was also at the core of this research, we needed to make sure the program could recognize sentences as well. Punctuation marks like "!" and "?" can indicate the end of a sentence. However, a period in a word like "E.g." appearing in the middle of a sentence does not mean the end of a sentence. Therefore, there were two situations in which a period that did not indicate the end of a sentence was automatically removed. First, if the first character of a word was a capital letter and the last character was a period, the token was removed, as for "No.", "Mr." and "Messers." In this way we could deal with all cases where such periods were located in the middle of sentences. Among these cases, some periods were also located at the ends of sentences or paragraphs, but because those cases were not statistically significant and because our method in this study was oriented toward automatic procedures, we simply ignored them. The second situation was that if a period was located inside a word but not as its last character, the word was removed, as in the cases of "1.4", "Ch.3A", and "e.g." There were statistically very few cases in which such words were also the last words of sentences or paragraphs. Guided by the same trade-off strategy and the goal of automatic processing, we removed these words in order to recognize the ends of sentences, as well as paragraphs, automatically.

We then converted all of the words in the text collection (with those periods already removed) into lowercase so as to make the next few procedures (such as removing stop-words) easier. This avoided cases where stop-words in the documents could not be removed because of upper- and lower-case confusion. Similarly, the program also converted the uppercase words from the other two documents we produced manually, the abbreviation list (see List 1 below) and the term-phrase controlled list (see List 2 below), into lowercase. Moreover, punctuation was separated from words so that the program could recognize each punctuation mark individually.

5.1.3 Changing the Short-forms of the Words in the Abbreviation List into Term-phrases

Since there were many abbreviations in the accounting texts, we produced an abbreviation-controlled list by consulting two external sources: Accounting Dictionary - Accounting Glossary - Accounting Terms (http://www.ventureline.com/glossary.asp) and Financial Accounting: An Introduction to Concepts, Methods, and Uses (8th Edition).

AppAList 1: Top 10 out of 39 Manually Produced Abbreviations

ACRS (ACCELERATED COST RECOVERY SYSTEM)
ABC (ACTIVITY BASED COSTING)
AICPA (AMERICAN INSTITUTE OF CERTIFIED PUBLIC ACCOUNTANTS)
AICPASAS (AICPA STATEMENT ON AUDITING STANDARDS)
AICPASOPS (AICPA STATEMENTS OF POSITION)
AMT (ALTERNATIVE MINIMUM TAX)
AROs (ASSET RETIREMENT OBLIGATIONS)
BOM (BILL OF MATERIALS)
CMOs (COLLATERALIZED MORTGAGE OBLIGATIONS)
CPI (COST OF LIVING INDEX)

The list included 39 abbreviations in total and standardized the abbreviations into the term-phrase format. Note that all the letters in this list had already been converted into lowercase in the previous step, so once the program found these abbreviations in the texts, which were also in lowercase, it would automatically transform them into the matching term-phrases connected by the symbol "_". For example, ACRS would be converted to "accelerated_cost_recovery_system".
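For illustration, the two period rules and the abbreviation conversion described above can be sketched as follows. This is a simplified example rather than the program actually used, and the abbreviation map shows only three of the 39 entries from AppAList 1:

    # Three sample entries from AppAList 1; the full list contained 39 abbreviations.
    ABBREVIATIONS = {
        "acrs": "accelerated_cost_recovery_system",
        "abc": "activity_based_costing",
        "aicpa": "american_institute_of_certified_public_accountants",
    }

    def apply_period_rules(word):
        """Drop tokens according to the two period rules, otherwise lowercase the word."""
        if word and word[0].isupper() and word.endswith('.'):
            return None                          # rule 1: "No.", "Mr.", "Messers."
        if '.' in word[:-1]:
            return None                          # rule 2: "1.4", "Ch.3A", "e.g."
        return word.lower()

    def tokenize(words):
        """Apply the period rules, then replace abbreviation short-forms by their
        underscore-joined term-phrases."""
        cleaned = (apply_period_rules(w) for w in words)
        return [ABBREVIATIONS.get(w, w) for w in cleaned if w is not None]

    # tokenize(["The", "ACRS", "method", "No.", "3", "applies", "e.g.", "here"])
    # -> ["the", "accelerated_cost_recovery_system", "method", "3", "applies", "here"]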
5.1.4 Converting Meaningful Words to Term-phrases AppAList 2: First 10 Phrases of the entire 1,774 Phrase-Controlled - 8 2 -A B A N D O N E D PROPERTY A B N O R M A L COST A B N O R M A L COSTS A C C E L E R A T E D COST R E C O V E R Y S Y S T E M A C C E L E R A T E D COST R E C O V E R Y SYSTEMS A C C E L E R A T E D DEPRECIATION A C C E L E R A T E D DEPRECIATIONS A C C O U N T I N G A D J U S T M E N T A C C O U N T I N G A D J U S T M E N T S A C C O U N T I N G C H A N G E 5.1.5 Removing Stop-words AppAList 3: First 10 Stop-words of the entire 728 Stop-Word a able about above according accordingly accounted acquire across actually 5.1.7 Removing Unwanted Tokens AppAList 4: First 10 Tokens of the whole 2, 052Wanted Tokens T O K E N #FREQUENCIES #DOCUMENTS absence 291 130 absences 61 11 absent 94 57 accelerated_cost_recovery_system 21 4 accelerateddepreciation 29 12 accomplished 75 48 account 1242 318 account_receivable 2 2 accountant 50 30 accountants 158 56 - 8 3 -5.1.8 Consolidating Wanted Tokens AppAList 5: First 10 Tokens of the entire 994 Consolidating Wanted Tokens W A N T E D T O K E N S T O B E C O N V E R T E D FORMS OF T O K E N S A F T E R CONVERSION absences absence account_receivable accounts_receivable accountants accountant accounting_adj ustments accounting_adjustment accounting_changes accounting_change accounting_concepts accounting_concept accounting_periods accounting_period accounting_policies accounting_policy accounting_principles accounting_principle accounting_principles_and_methods accounting_principles_and_method 5.1.9 Generating the Final Reduced Token List AppAList 6: First 10 Tokens of the Entire Final 1,344 Tokens T O K E N #FREQUENCIES #DOCUMENTS asset 15878 510 accounting 13950 797 cost 10548 434 amount 10081 561 liability 7141 406 loss 6993 389 tax 6739 261 interest 6630 395 financial_accounting_standards_board 6527 763 entity 6252 369 -84-Appendix B: Computing Term Affinities 5.2.2 Converting 1,344 Tokens to Token IDs AppBList 7: First 10 Tokens of the Entire Final 1,344 Token IDs T O K E N ID #FREQUENCIES #DOCUMENTS asset 1 15878 510 accounting 2 13950 797 cost 3 10548 434 amount 4 10081 561 liability 5 7141 406 loss 6 6993 389 tax 7 6739 261 interest 8 6630 395 financial_accounting_standards_board 9 6527 763 entity 10 6252 369 5.2.3.1 Generating the 600 Token List for Clustering AppBList 8: First 10 Tokens of the Entire Final 600 Token IDs ID T O K E N #FREQUENCIES #DOCUMENTS 1 asset 15878 510 2 accounting 13950 797 3 cost 10548 434 4 loss 6993 389 5 tax 6739 261 6 interest 6630 395 7 financial accounting standards board 6527 763 8 fair value 6121 331 9 financial statement 5495 460 10 future 5456 422 - 8 5 -5.3.1 Hierarchical Clustering Using Matlab The \"Tutorial - Cluster Analysis\" section of Statistics Toolbox for Use with Matlab User's Guide (Version 3, 1-57) (http://www.busim.ee.boun.edu.tr/~resources/stats Jb.pdfl showed an example on how to interpret the Matlab linkage function. \"For example, given the distance vector Y from the sample data set of x and y coordinates, the linkage function generates a hierarchical cluster tree, returning the linkage information in a matrix, Z. Z - linkage(Y) Z = 1.0000 3.0000 1.0000 4.0000 5.0000 1.0000 6.0000 7.0000 2.0616 8.0000 2.0000 2.5000 In this output, each row identifies a link. The first two columns identify the objects that have been linked, that is, object 1, object 2, and so on. The third column contains the distance between these objects. 
For the sample data set of x and y coordinates, the linkage function begins by grouping together objects 1 and 3, which have the closest proximity (distance value = 1.0000). The linkage function continues by grouping objects 4 and 5, which also have a distance value of 1.0000. The third row indicates that the linkage function grouped together objects 6 and 7. If our original sample data set contained only five objects, what are objects 6 and 7? Object 6 is the newly formed binary cluster created by the grouping of objects 1 and 3. When the linkage function groups two objects together into a new cluster, it must assign the cluster a unique index value, starting with the value - 8 6 -m+\, where m is the number of objects in the original data set. (Values 1 through m are already used by the original data set.) Object 7 is the index for the cluster formed by objects 4 and 5. As the final cluster, the linkage function grouped object 8, the newly formed cluster made up of objects 6 and 7, with object 2 from the original data set.\" - 8 7 -Appendix C: Clustering 600 Tokens Data Outputs AppC SentenceOtd Cluster Data: Top 50 Clustering Data of the Entire 600-Token Clustering Data in SentenceOtd Cluster Index SentenceOtd AB A SentenceOtd AB B SentenceOtd BA A 601 261 346; 0.62791 149; 596 0 4; 14; 0.63247 602 392 429; 0.74286 212; 578 0 149; 194; 0.67611 603 102 139; 0.8034 5; 488 0.0625 196; 229: 0.7375 604 114 122; 0.80942 2; 312 0.076923 126; 602; 0.75068 605 41; 46; 0.82476 98 517 0.18182 5; 23; 0.80818 606 601 269; 0.84884 19 562 0.2 7; 0.81854 607 10; 27; 0.85654 46 435 0.21739 24; 66: 0.82776 608 553 595; 0.85714 11 123 0.24922 209; 214; 0.85333 609 170 209; 0.8599 604; 320 0.25 18; 27; 0.85561 610 38; 70; 0.87065 606; 566 0.25 25; 29: 0.86167 611 546 576; 0.875 72; 569 0.25 22; 65; 0.87587 612 106 215; 0.87905 131; 575 0.25 32; 41; 0.87697 613 16; 33; 0.87923 26; 551 0.28571 103; 165; 0.88048 614 30; 32; 0.88738 128; 553 0.28571 86; 116; 0.88343 615 147; 208; 0.88848 105; 280 0.31765 369; 408; 0.88372 616 105; 280; 0.88973 24; 347 0.32 278; 298; 0.89535 617 18; 614; 0.89109 609; 271 0.33333 132; 375; 0.89744 618 39; 57; 0.89262 613; 589 0.33333 113; 258; 0.89899 619 613; 28; 0.8929 80; 588 0.33333 611; 30; 0.90194 620 605; 52; 0.8944 261; 346 0.33333 314; 334: 0.90196 621 34; 603; 0.90447 617; 267 0.34783 609; 119; 0.90512 622 448 533; 0.90476 54; 545 0.375 447; 516; 0.90909 623 5; 25; 0.90662 30; 351 0.39216 249: 622; 0.91011 624 51; 96; 0.90762 174; 525 0.4 400; 512: 0.91177 625 451 498; 0.90909 62 1; 119 0.4005 392; 494; 0.91429 626 617 90; 0.91117 34; 139 0.41132 324: 350; 0.91525 627 394 502; 0.91429 27; 223 0.43698 603; 255; 0.9187 628 20; 69; 0.91476 608; 497 0.45454 49; 121; 0.92078 629 63; 80; 0.91796 40; 501 0.46154 54; 79; 0.92266 630 619 19; 0.92165 610; 254 0.48235 33; 71; 0.92363 631 189 246; 0.92308 7; 291 0.49275 237; 274; 0.92373 632 178 431; 0.92547 625; 546 0.5 89; 168; 0.92818 633 618 81; 0.92647 632; 581 0.5 111; 238; 0.93103 634 223 369; 0.93277 4; 591 0.5 406; 429; 0.93103 635 13; 26; 0.93278 634; 623 0.5 48; 64; 0.93322 636 494 529; 0.93333 6; 599 0.5 625; 511; 0.93333 637 337 372; 0.93617 628; 582 0.5 404; 434; 0.93333 638 174 234; 0.93659 12 579 0.5 621; 612: 0.93441 639 635 35; 0.939 16 568 0.5 352; 460; 0.93478 640 626 58; 0.93936 17 550 0.5 63; 80; 0.9357 641 395 455; 0.93939 630; 25; 0.5 104; 108; 0.93686 -88-642 284; 637 0.93976 641 574 0.5 366; 368: 0.9375 643 639; 85; 0.94053 614 595 0.5 76; 608; 0.94032 644 12; 37; 0.94102 321 561 0.5 604; 264; 0.94038 645 
107; 141 0.94104 638 415 0.51852 1; 34; 0.94094 646 627; 441 0.94286 178 431 0.52 485; 585; 0.941 18 647 124; 152 0.9435 71; 494 0.53333 35; 56; 0.94258 648 11; 123 0.94351 633 68; 0.53719 638; 6.19; 0.94284 649 299; 373 0.94444 616 503; 0.53846 16; 640; 0.94341 650 222; 264 0.94488 635 427; 0.55556 649; 630; 0.9441 7 AppC SentenceUpTo5td Cluster Data: Top 50 Clustering Data of the Entire 600-Token Clustering Data in SentenceUpTo5td Cluster Index SentenceUpTo5td_AB_A SentenceUpTo5td AB B SentenceUpTo5td_BA A 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 261 392 601 114 147 63; 395; 102; 41; 602; 16; 39; 170; 38; 604; 510; 10; 333; 23; 553; 19; 611 106 105 405 18; 622 487 544 546 626 196 346; 429; 269; 122; 208; 80; 455; 139; 46; 406; 33; 57; 209; 70; 126; 521; 27; 610; 25; 595; 28; 606 215: 280 545: 32; 621 537, 592 576 30; 224; 0.62791 0.74286 0.76744 0.76906 0.77695 0.77827 0.78788 0.79773 0.80984 0.82857 0.82909 0.83296 0.83575 0.83658 0.84173 0.84615 0.85191 0.85454 0.8571 0.85714 0.86687 0.86745 0.86825 0.86882 0.87097 0.87252 0.87277 0.875 0.875 0.875 0.87718 0.88125 11; 601 602 603 604 605 606 13 25 72 149 611 388 2; 5; 615; 616; 46; 98; 617; 105; 6; 622; 620; 3; 614 624 623 628 12; 17; 131; 578; 24; 212; 600; 22; 129 168 598 566 569 596 194 589 312 488: 607: 21; 435: 517: 123: 280 562 19; 424. 536: 320: 582 609 50; 579 610 575 0 0 0 0 0 0 0 0 0 0 0 0 0 0.046154 0.0625 0.0625 0.0625 0.17391 0.18182 0.18692 0.18824 0.2 0.2 0.21429 0.22222 0.25 0.25 0.25 0.25 0.25 0.25 0.25 4; 149 209 196 126 170 406 456; 491; 24; 25; 32; 269; 561; 369; 383; 18; 277: 63; 603; 86; 16; 392; 619; 39; 22: 628; 261; 67; 103; 14; 194; 214; 229: 602; 267; 429; 23; 7; 477; 496; 66; 29; 41; 346; 598; 408; 403; 27; 443; 80; 282; 1 16; 621; 494; 614; 57; 65; 35; 615; 98; 165: 0.61043 0.66802 0.69333 0.69375 0.72629 0.74879 0.75862 0.77909 0.78074 0.8 0.8 0.81105 0.81777 0.82036 0.83333 0.83333 0.83721 0.83871 0.83911 0.84 0.84146 0.84507 0.84691 0.85226 0.85714 0.86056 0.86428 0.86434 0.87081 0.87209 0.87289 0.8745 - 8 9 -6 3 3 609 52; 0.88154 134 6 3 4 13; 26; 0.88718 627 6 3 5 273 372 0.8875 626 6 3 6 5; 619 0.8887 26; 6 3 7 299 373 0.88889 128 6 3 8 530 535 0.88889 635 6 3 9 326 337 0.89286 634 6 4 0 617 631 0.89439 101 6 4 1 640 58; 0.89521 71; 6 4 2 417 461; 0.89655 625 6 4 3 634 20; 0.89709 642 6 4 4 456 477 0.9 643 6 4 5 377 519 0.90244 639 6 4 6 383 403 0.90323 636 6 4 7 67; 71; 0.90326 646 6 4 8 641; 90; 0.9034 647 6 4 9 89; 257; 0.90424 648 6 5 0 34; 608; 0.90447 44; 463 0.26667 544; 587; 0.875 497 0.27273 545; 595; 0.875 7; 0.28333 606; 622; 0.87923 551 0.28571 625; 607; 0.88 553 0.28571 223; 227; 0.88235 271 0.28736 31.4; 334; 0.88235 347 0.3 396; 444: 0.88235 510 0.30769 471; 523; 0.88235 447 0.31818 48; 64; 0.88335 593 0.33333 54; 55; 0.88353 629 0.33333 624; 33; 0.88378 590 0.33333 629; 30; 0.88592 633 0.33333 249; 447; 0.88764 613 0.33333 604; 224; 0.89375 107 0.33333 ! 
13; 258; 0.89394 108 0.33333 374; 636; 0.89474 371 0.33333 89; 168; 0.89503 592 0.33333 626; 119; 0.89521 AppC SentenceNoRestriction Cluster Data: Top 50 Clustering Data of the Entire 600-Token Clustering Data in SentenceNoRestriction C l u s t e r I n d e x SentNoRestrict A B A SentNoRestrict A B B SentNoRestrict_ B A A 6 0 1 261 346 0.61628 l; 600 0 4; 14: 0.60814 6 0 2 392 429 0.74286 601 16; 0 149; 194; 0.66802 6 0 3 63; 80; 0.75055 602 22; 0 196; 229: 0.675 6 0 4 114 122 0.75336 603 24; 0 209; 214; 0.69333 6 0 5 601 269 0.76744 604 34; 0 126; 602; 0.72629 6 0 6 147 208 0.77695 605 129 0 170; 267; 0.74879 6 0 7 395 455 0.78788 606 168 0 406; 429; 0.75862 6 0 8 102 139 0.79773 607 384 0 5; 23; 0.77656 6 0 9 41; 46; 0.80761 608 578 0 2; 7; 0.77719 6 1 0 16; 33; 0.82378 609 i i ; 0 456; 477; 0.8 6 1 1 39; 57; 0.82774 610 212 0 491; 496; 0.8 6 1 2 602; 406; 0.82857 611 582 0 32; 41; 0.80729 6 1 3 38; 70; 0.82962 13; 598 0 24; 66; 0.80891 6 1 4 170 209 0.83575 25; 566 0 369; 408; 0.81395 6 1 5 333 612 0.83636 72; 569 0 25; 29; 0.81533 6 1 6 604 126 0.84173 149 596 0 269; 346; 0.81944 6 1 7 510 521 0.84615 616 194 0 277; 443; 0.82667 6 1 8 10; 27; 0.85065 388 589 0 392; 494; 0.82857 6 1 9 23; 25; 0.85083 392 584 0 63; 80; 0.83038 6 2 0 553; 595; 0.85714 2; 312 0.030769 561; 598: 0.83333 621 610; 603; 0.85986 612 488 0.0625 18; 0.83622 622 19; 28; 0.86056 621 5; 0.0625 383; 403; 0.83871 6 2 3 18; 32; 0.86551 622 21; 0.0625 618; 607; 0.84 - 9 0 -624 621 622 0.86783 46; 435 0.17391 86; 116; 0.8427 625 106 215 0.86825 623; 123 0.17445 16; 619; 0.84276 626 105 280 0.86882 98; 517 0.18182 604; 282; 0.84507 627 299 373 0.87037 105; 280 0.18824 621; 612; 0.85231 628 405 545 0.87097 6; 562 0.2 35; 65; 0.85851 629 196 224 0.875 628, 19; 0.2 39; 57; 0.85906 630 326 337 0.875 625 424 0.21429 22; 628; 0.86294 631 452 476 0.875 3; 536 0.22222 54; 55; 0.86442 632 487 537 0.875 630 523 0.25 374; 623; 0.86842 633 544 592 0.875 632 568 0.25 67; 98; 0.86952 634 546 576 0.875 620 320 0.25 261; 616; 0.87209 635 623 30; 0.87524 629 614 0.25 223; 227; 0.87395 636 5; 619; 0.87922 635 50; 0.25 103; 165; 0.8745 637 609; 52; 0.87971 12; 579 0.25 544; 587; 0.875 638 475; 494; 0.88235 17; 615 0.25 545; 595; 0.875 639 13; 26; 0.88244 131 575 0.25 625; 33; 0.87581 640 615; 459 0.88571 134 463 0.26667 606; 626; 0.87923 641 273; 372 0.8875 633 497 0.27273 603; 224; 0.88125 642 627; 383 0.88889 634; 7; 0.28333 635; 614; 0.88235 643 530; 535 0.88889 26; 551 0.28571 314; 334; 0.88235 644 618; 635 0.89026 128 553 0.28571 396; 444; 0.88235 645 644; 58; 0.89026 642 271 0.28736 471; 523: 0.88235 646 639; 20; 0.8912 641 347 0.3 48; 64: 0.8825 647 151; 159; 0.89573 101 510 0.30769 630: 30; 0.88447 648 417; 461; 0.89655 261 346 0.3125 357; 397; 0.88636 649 4; 14; 0.8967 71; 447 0.31818 249; 447; 0.88764 650 67; 71; 0.89989 646; 640 0.33333 89; 168; 0.8895 AppC ParagraphNoRestriction Cluster Data: Top 50 Clustering Data of the Entire 600-Token Clustering Data in ParagraphNoRestriction Cluster Index ParaNoRestrict_AB_A ParaNoRestrict_ A B B ParaNoRestrict_ B A A 601 63; 80; 0.82616 2; 563; 0 510; 521; 0.625 602 369; 408 0.83333 601 3; 0 590; 593; 0.66667 603 359; 492 0.84 602 6; 0 369; 408; 0.76667 604 383; 403 0.84 603 30; 0 63; 80; 0.81788 605 39; 57; 0.84263 604 54; 0 39; 57; 0.85006 606 151; 158 0.85906 605 81; 0 127; 192; 0.85333 607 4; 14; 0.86507 606 86; 0 16; 604; 0.86426 608 223; 602; 0.86735 607 88; 0 20; 26; 0.86567 609 73; 75; 0.86776 608 174; 0 4; 14; 0.86776 610 196; 224 0.86777 609 396; 0 243; 275; 0.87143 611 546; 576 
0.875 610 590; 0 607; J J , 0.87228 612 16; 601 0.87629 611 47; 0 18; 41; 0.87457 613 20; 26; 0.87633 612 50; 0 520; 545; 0.875 614 612; 33; 0.87743 613 593; 0 75; 85; 0.87657 -91 -615 18; 41; 0.87803 614; 40; 0 5; 23; 0.8799 616 5; 23; 0.88213 9; 571; 0 359; 492; 0.88 617 475; 502; 0.88235 616; 62; 0 67; 98; 0.88147 618 416; 510; 0.88462 13; 594, 0 612; 27; 0.88288 619 615; 32; 0.8865 618; 35; 0 261; 346; 0.88406 620 461; 481; 0.88889 619; 85; 0 618; 32; 0.8842 621 619; 27; 0.88981 1; 544 0.16667 196; 224; 0.8843 622 192; 241; 0.89362 615 572 0.25 621; 229; 0.8843 623 67; 98; 0.89482 622 29; 0.25 416: 601; 0.88462 624 114, 122; 0.89891 623 573 0.25 48; 81; 0.88819 625 209, 214; 0.89923 624 574 0.25 73; 614; 0.88856 626 169 184; 0.90071 19; 485 0.3 54; 55: 0.88876 627 13; 613; 0.90178 32; 480 0.3 396; 444; 0.88889 628 616 43; 0.9022 621 625 0.33333 13; 608; 0.88908 629 35; 609; 0.90736 628 160 0.33333 605; 40; 0.89095 630 l ; 3; 0.90841 629 519 0.33333 35; 65; 0.89192 631 614 96; 0.90894 630 587 0.33333 326; 337; 0.89362 632 610 229; 0.90909 631 12; 0.33333 619; 269; 0.89474 633 505 516; 0.90909 632 25; 0.33333 209; 214; 0.89923 634 621 31; 0.90922 633 131 0.33333 515; 580; 0.9 635 630 2; 0.90946 634 135 0.33333 516; 553; 0.9 636 628 10; 0.90984 635 579 0.33333 635; 595: 0.9 637 629 85; 0.91004 636 589 0.33333 1; 0.90332 638 637 65; 0.91008 637 22; 0.33333 366; 368: 0.90476 639 631 28; 0.91065 638 23; 0.33333 151; 158; 0.90604 640 324 350; 0.91111 639 36; 0.33333 86; 116: 0.90775 641 337 352; 0.91177 640 56; 0.33333 24; 66: 0.90816 642 634 46; 0.9119 641 136; 0.33333 223; 603; 0.90816 643 261 269; 0.91304 620 49; 0.33333 304; 347; 0.90909 644 54; 55; 0.91452 643 65; 0.33333 620; 46; 0.90961 645 295 313; 0.91489 644 113 0.33333 114; 122; 0.90984 646 645 640; 0.91489 80; 583 0.33333 615; 43; 0.90984 647 326 641; 0.91489 5; 471 0.35294 155; 237; 0.91257 648 24; 66; 0.91522 642 , 510 0.375 156; 271; 0.91333 649 237; 274; 0.91579 648 , 562 0.4 295; 313; 0.91489 650 639; 132; 0.91667 626 , 27; 0.4 646; 10; 0.91494 AppC DocumentNoRestriction Cluster Data: Top 50 Clustering Data of the Entire 600-Token Clustering Data in DocumentNoRestriction Clus te r Index DocumentNoRestrict_AB A DocumenfNoRestrictAB B DocumenfNoRestrict BA A 601 336; 471; 0 i; 120; 0 336; 471; 0 602 480; 481; 0 601; 2; 0 480; 481; 0 603 602; 495; 0 602; 7; 0 602; 495; 0 604 603; 509; 0 603; 41; 0 603; 509; 0 - 9 2 -605 604; 515; 0 604; 606 605; 539; 0 605; 607 606; 540; 0 606; 608 607; 565; 0 607; 609 2; 7; 0.16437 608; 610 1; 609, 0.16667 609; 611 281; 487 0.2 610, 612 299, 373 0.2 611, 613 601 337 0.2 612 614 613 406 0.2 613 615 614 523 0.2 614 616 615 538 0.2 615 617 67; 71; 0.24138 616 618 617 74; 0.24138 617 619 187 190 0.25 618 620 452 608 0.25 619 621 463 488 0.25 620 622 618 78; 0.27586 621 623 622 98; 0.27586 622 624 121 161; 0.28571 623 625 610 3; 0.29263 624 626 58; 59; 0.3 625 627 625 5; 0.30268 626 628 627 23; 0.31034 627 629 117 118; 0.32143 628 630 63; 80; 0.33333 629 631 236 312 0.33333 630 632 238 267 0.33333 631 633 632 290 0.33333 632 634 616 514 0.33333 633 635 634 544 0.33333 634 636 635 578 0.33333 635 637 636 596 0.33333 636 638 382 , 442 0.33333 637 639 422 , 440 0.33333 638 640 479 510 0.33333 639 641 640 , 485 0.33333 640 642 641 ; 517 ; 0.33333 641 643 642 ; 521 ; 0.33333 642 644 643 ; 585 ; 0.33333 643 645 628 ; 9; 0.341 644 646 623 ; 101; 0.34483 645 647 645 ; 4; 0.34704 646 648 647 ; 6; 0.34937 647 649 18; 41; 0.35484 648 650 39; 40; 0.3625 649 43; 0 604; 515; 0 67; 0 605; 539; 0-72; 0 606; 540; 0 
84; 0 519; 541; 0 91; 0 563; 571; 0 103; 0 578; 596; 0 106, 0 1; ->\u00E2\u0080\u00A2 0.047059 110, 0 4; 7; 0.087404 111 0 611; 612; 0.098039 112 0 613; 3; 0.10138 114 0 67; 74: 0.10345 123 0 614; 6; 0.10886 16; 0 616; 5; 0.12644 127 0 43: 68; 0.16 130 0 242; 248; 0.18182 131 0 41; 1 19; 0.18557 133 0 281; 379; 0.2 134 0 621; 395; 0.2 139 0 622; 455; 0.2' 143 0 623; 487; 0.2 154 0 601; 406; 0.2 i i ; 0 625; 423; 0.2 155 0 626; 461; 0.2 157 0 91; 312; 0.21212 162 0 133; 161; 0.23913 164 0 525; 536; 0.25 168 0 629; 157; 0.27273 169 0 39: 40; 0.2875 175 0 628; 236; 0.29167 183 0 58; 59; 0.3 184 0 63; 80; 0.30864 192 0 615; 98; 0.31034 195 0 617; 23; 0.31801 4; 0 382; 415; 0.33333 9; 0 638; 433; 0.33333 58; 0 639; 442; 0.33333 * 197 o 640; 446; 0.33333 200 o 641; 466; 0.33333 202 ; o 485; 585; 0.33333 204 ; o 486; 500; 0.33333 205 ; o 498; 511; 0.33333 212 ; o 514; 523; 0.33333 213 ; o 561; 586; 0.33333 3; 0 125; 619; 0.3421 214; 0 636; 71; 0.34314 215; 0 123; 282; 0.34375 - 9 3 -Appendix D: Statistical Evaluation Results of All Fifteen Schemes - Clustering 600 Tokens 5.4.2.1 Instructions for Expert's Evaluation Below are the instructions we showed the expert to do the evaluations: Please evaluate the relations of terms in each cluster based on your knowledge of and experience in the accounting industry. For additional reference, you can also check online the most comprehensive financial glossary at http://www, investorwords. com, where you can find quite accurate definitions for 6,000 current financial terms. The terms in each cluster were edited as follows: \u00E2\u0080\u00A2 AH terms in the attached table were accounting terms. \u00E2\u0080\u00A2 The symbol \"_\" was used to connect the term-phrase, which was here treated as a single term. Different terms were separated by \"; \". For example, in cluster No. 1, \"'stock_dividend\" was one term, yet ''stock_dividend; stock_split\" were two different terms separated by \";\". \u00E2\u0080\u00A2 Each cluster consisted of two groups that were separated in two columns. The expert should evaluate the relations among all the terms which reside only in different groups. If a single group within a single cluster contains more than one term, since these terms have been linked together by the previously formed clusters and have already been evaluated by the expert, the expert should therefore not evaluate the terms within the same groups. As an example, see Table 5.8 cluster #3: the first group contains two terms ' 'stock^dividend\" and \"stockjsplit, \" and since they were linked by cluster #1 and had already been evaluated by the expert when evaluating clusterM, -94 -so in cluster#3 the expert should only evaluate all terms residing in the first and second group respectively, that is, evaluate relations between ''stock_dividend\" (in the first group) and \"split\" (in the second group), and also evaluate relations between \"stockjsplit\" (in the first group) and \"split\" (in the second group). \u00E2\u0080\u00A2 The column \"# of terms \" counts the number of terms in two groups of each cluster. \u00E2\u0080\u00A2 In addition, all terms in the attached table were singulars and if originally they were in plural forms they have already been consolidated into their singular forms, for example, \"options -> option. \" So, in our attached table the only term you will see is \"option \" which actually meant either \"options \" or \"option \" in the original text. 
Similarly, the term "financial_statement" originally included both "financial_statement" and "financial_statements."

The evaluation categories were illustrated as follows:

I. Relation Type Alternatives (column I in the spreadsheet): For each relation among the terms, please choose the type of the relation from among the alternatives by entering the number "1" in the cell in the appropriate relation type column.

a) Synonyms?: Are they all synonyms (same meaning)?

b) Antonyms?: Are they all antonyms (opposite meaning)?

c) Broader term?: Is any term's meaning broader than all of the other(s)? Example: "accounting_terminology; terminology."

d) Subgroups?: If none of them is the broader term of the others but these terms are all related, are they both or all subgroups of another broader concept? Example: first, "notes_payable" and "accounts_payable" are related terms, and then they are both subgroups of the broader concept "payables."

e) Though distinct, forms a new concept?: If these terms are both or all distinct (not related), can they together still form a new concept? Example: "sun" and "lotion": though individually "sun" and "lotion" are two distinct things, together they can form a new concept, "sun lotion."

f) Partial relation: If there are more than two terms in the cluster, are only some of them related? For clusters containing more than 2 terms, if not all but several of the terms' relations belong to relation types a) ~ e), you can choose "Partial relation" here as their relation.

g) Other relation: If the relation type is not listed previously.

h) No direct relation: If none of the terms are directly related.

II. Explanation (column II in the spreadsheet): Please describe the relationships among the terms for each relationship type:

a) Describe the meaning of each synonym.
b) Describe the meaning of each antonym.
c) Describe which of the terms is the broader or broadest. Example: "... is the broader term."
d) Describe the broader or broadest concept to which all terms belong. Example: "the broader concept is ...."
e) Describe the new concept that is formed. Example: "The new concept is ...."
f) Describe which terms are related. If only some but not all terms meet relation types a) ~ e), here you should list which terms belong to which relation type. Example: "... is the broader term of ..."
g) Describe the new relationship type that is different from a) ~ f).
h) Describe the reason for no direct relation if it is not very obvious.

III. Relative Relationship Score (1-5) (column III in the spreadsheet): Please give your score from "1" to "5" regarding how closely related the terms in each cluster are, based on the percentage of "how closely related" for clusters containing only two terms, and based on "the percentage of terms related" for clusters containing 3 or more terms, where

"1" - Mostly to completely unrelated: only 0% (inclusive) ~ 20% (exclusive) related.
"2" - Somewhat unrelated: only 20% (inclusive) ~ 45% (exclusive) related.
"3" - Hard to decide: 45% (inclusive) ~ 55% (exclusive) related.
"4" - Somewhat related: 55% (inclusive) ~ 80% (exclusive) related.
"5" - Mostly to completely related: 80% (inclusive) ~ 100% (inclusive) related.
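As a small illustration of the rubric, the mapping from "percentage related" to the 1-5 score can be expressed as follows (a hypothetical helper, not part of the evaluation spreadsheet given to the expert):

    def relatedness_score(percent_related):
        # Thresholds follow the boundaries stated above.
        if percent_related < 20:
            return 1    # mostly to completely unrelated
        if percent_related < 45:
            return 2    # somewhat unrelated
        if percent_related < 55:
            return 3    # hard to decide
        if percent_related < 80:
            return 4    # somewhat related
        return 5        # mostly to completely related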
5.4.2.3 Statistical Analysis of the Expert's Evaluation Results for All 15 Schemes (Provided in CD-ROM) \u00E2\u0080\u00A2 AppD SentenceOtd Evaluation Results: for the Top 50 Statistical Evaluation Results in SentenceOtd \u00E2\u0080\u00A2 AppD SentenceUpTo5td Evaluation Results: for the Top 50 Statistical Evaluation Results in SentenceUpTo5td \u00E2\u0080\u00A2 AppD SentenceNoRestriction Evaluation Results: for the Top 50 Statistical Evaluation Results in SentenceNoRestriction \u00E2\u0080\u00A2 AppD ParagraphNoRestriction Evaluation Results: for the Top 50 Statistical Evaluation Results in ParagraphNoRestriction - 9 7 -AppD DocumentNoRestriction Evaluation Results: for the Top 50 Statistical Evaluation Results DocumentNoRestriction - 9 8 -Appendix E : Automatically Identifying Phrases Appendix E l : Single Word Tokenizing AppElList 1: First 10 Single Words of the entire 1,073 Broken Phrase List ABANDONED ABNORMAL ABSENCE ABSENCES ACCELERATED ACCELERATION ACCEPTED ACCOMPLISHMENTS ACCOUNT ACCOUNTANT AppElList 2: First 10 Words of the entire 1,188 Wanted Single Word List ABANDONED ABNORMAL ABSENCE ABSENCES ACCELERATED ACCELERATION ACCEPTED ACCOMPLISHMENTS ACCOUNT ACCOUNTANT AppElList 3: First 10 Plural Words of the entire 283 Plural Wanted Single Word To Singular List ABSENCES ABSENCE ACCOMPLISHMENTS ACCOMPLISHMENT ACCOUNTANTS ACCOUNTANT ACCOUNTrNGSACCOUNTING ACCOUNTS ACCOUNT ACQUISITIONS ACQUISITION ACTIVITIES ACTIVITY ADJUSTMENTSADJUSTMENT AFFILIATES AFFILIATE AGREEMENTS AGREEMENT - 9 9 -AppElList 4: First 10 Tokens of the entire 905 Final Single Token ID List T O K E N ID # FREQUENCIES # D O C U M E N T S of 1 137347 835 to 2 75746 834 in 3 71086 835 and 4 66441 833 for 5 48864 834 or 6 34588 801 statement 7 30033 802 an 8 22802 789 not 9 21576 803 asset 10 20467 535 Appendix E2: Clustering 905 Single Tokens Data Outputs AppE2Sentence0td 905 Single Tokens Cluster Data: Top 50 clusters of the entire 905 Single Token Clustering Data in SentenceOtd SentOtdAB _A Sent0tdAB_B Sent;0tdBA_A Sent0tdBA_B 906 635 638 0. 056818 3; 879 0 83: 86; 0. 030615 1 ; 887 0 907 827 834 0. 16667 906; 886 0 607 629; 0. 20213 2; 895 0 908 356 362 0. 19312 907; 902 0 510 570; 0. 2807 6; 904 0 909 428 440 0. 27984 5; 903 0 890 898; 0. 33333 9; 899 0 910 123 181 0. 40346 6; 840 0 304 346; 0. 41.09.1. 11 905 0 911 377 453 0. 4244 910; 896 0 17; 45; 0. 46262 13 876 0 912 414 431 0. 44304 9; 904 0 141 202; 0. 4929 25 862 0 913 105 107 0. 46311 912; 905 0 848 878; 0. 5 27 903 0 914 660 686 0. 47887 52; 893 0 774 825; 0. 52 30 896 0 915 33; 73; 0. 49396 123; 181 0. 00263.16 390 448; 0. 53.134 126 824 0 916 62; 101 0. 55368 635; 638 0. 023529 449 489; 0. 57.143 312 860 0 917 233; 335 0. 60447 2; 742 0. 025641 288 298; 0. 57843 316 745 0 918 118 228 0. 61431 377; 453 0. 03125 46: 106; ' 0. 6.138 390 727 0 919 74; 106 0. 64889 1; 542 0. 037879 454 552; 0. 62745 47' 902 0 920 593 611 0. 68932 33; 73; 0. 042204 11.9 128; 0. (34028 658 874 0 921 704 805 0. 69767 118; 228 0. 043702 332 500; 0. 646 716 883 0 922 664 687 0. 70769 89; 515 0. 064286 536 549; 0. 65693 848 878 0 923 520 554 0. 70779 704 805 0. 071429 355 459; 0. 65991 890 898 0 924 408 909 0. 7293 155 506 0. 073.171 22; 23; 0. 70032 510 570 0. 0080645 925 115 260 0. 73136 474 712 0. 078431 107 301; 0. 70968 83 86; 0. 018659 926 155 367 0. 74691 919 648 0. 083333 483 513; 0. 72105 17 45; 0. 025109 927 26; 35; 0. 75974 827 834 0. 090909 127 156; 0. 7271 332 500 0. 027473 928 568 922 0. 76423 926 739 0. 11905 30; 94; 0. 73905 176 644 0. 028169 929 422 430 0. 
78467 928 789 0. 125 520 554; 0. 74026 24- 796 0. 05 930 474 712 0. 78539 160 850 0. 125 n . 1 , 14; 0. 75219 60 804 0. 052632 931 382 395 0. 79825 356 362 0. 14084 84; 200; 0. 75266 912 81.4 0. 058824 932 188 404 0. 80714 255 626 0. 14493 -161 544; 0. 75566 20' 576 0. 063063 933 364 388 0. 80916 289 578 0. 14563 61; 81; 0. 76022 1.10 604 0. 065934 - 100 -934 318 405 0.81022 917; 838 0. 15385 149; .155; 0. 76487 220; 648 0. 069444 935 255 476 0.81139 4; 233 0. 16294 924; 87; 0. 77587 10; 228 0. 071979 936 497 525 0.8125 922; 464 0. 16742 167; 221; 0. 77728 607 629 0. 074074 937 30; 50; 0.81994 62; 777 0. .18518 230; 242; 0. 78074 935 177 0. 076923 938 149 179 0. 82908 166 470 0.18889 340; 373: 0. 78495 107 301 0. 078212 939 343 446 0. 82965 934 457 0. 19512 31: 95; 0. 78637 69; 526 0. 079646 940 280 303 0. 84355 911 874 0.2 80: 236; 0. 79561 15; 241 0. 084408 941 926 506 0. 84362 25; 854 0.2 64; 121: 0. 79781. 906 757 0. 088235 942 112 215 0. 84866 189 880 0. 2 .190; 233; 0. 79847 201 677 0. 090909 943 692 763 0. 84906 935 862 0. 22222 1 1 \u00E2\u0080\u00A2 67; 0. 80067 931 539 0. 092308 944 21; 51; 0. 85964 940 773 0. 22222 287; 407; 0. 80363 939 371 0. 095109 945 820 826 0. 86667 943 701 0. 22642 487; 661; 0. 80612 914 800 0. 095238 946 289 578 0. 86866 115 260 0. 22947 933; 138; 0. 82066 119 490 0. 10615 947 376 434 0. 87 428 440 0. 24242 235; 252; 0. 82609 137 853 0. 11111 948 167 908 0. 87209 944 688 0. 2459 5; 12; 0. 82643 59; 706 0. 11321 949 22; 25; 0. 87331 908 882 0. 25 416; 562; 0. 83092 74; 522 0. 11972 950 426 488 0. 87455 50; 884 0. 25 56; 1.18; 0. 83377 141 202 0. 11982 951 550 759 0. 875 945 722 0. 26087 623: 821; 0. 83544 907 735 0. 13636 952 166 470 0. 87793 951 788 0. 26087 53; 66; 0. 83582 943 167 0. 14043 953 14; 54; 0. 88178 932 476 0. 26238 75; 934; 0. 83906 4; 239 0. 14286 954 183 258 0. 88707 939 572 0. 26667 91; 232; 0. 83946 774 825 0. 14286 955 394 647 0. 89137 36; 484 0. 28934 935; 36; 0. 83971 39: 585 0. 14563 Appendix E3: Single Tokens Evaluation Results (Provided in CD-ROM) AppE3SentenceOtd Evaluate Automatic Phrases: for Top 50 SentenceOtd Evaluate Automatic Phrases of 905 Single Tokens -101 -"@en . "Thesis/Dissertation"@en . "2004-11"@en . "10.14288/1.0091604"@en . "eng"@en . "Business Administration - Management Information Systems"@en . "Vancouver : University of British Columbia Library"@en . "University of British Columbia"@en . "For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use."@en . "Graduate"@en . "Using term proximity measures for identifying compound concepts : an expolatory study"@en . "Text"@en . "http://hdl.handle.net/2429/15815"@en .