Using term proximity measures for identifying compound concepts: an exploratory study (Yin, Nawei, 2004)

USING TERM PROXIMITY MEASURES FOR IDENTIFYING COMPOUND CONCEPTS: AN EXPLORATORY STUDY

by

NAWEI YIN

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN BUSINESS ADMINISTRATION
in
THE FACULTY OF GRADUATE STUDIES
DIVISION OF MANAGEMENT INFORMATION SYSTEMS
SAUDER SCHOOL OF BUSINESS

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
August 2004
© Nawei Yin, 2004

Library Authorization

In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Name of Author: Nawei Yin
Title of Thesis: Using Term Proximity Measures for Identifying Compound Concepts: An Exploratory Study
Degree: Master of Science in Business Administration
Year: 2004
Department: Division of Management Information Systems, Sauder School of Business, The Faculty of Graduate Studies
The University of British Columbia, Vancouver, BC, Canada

Abstract

With the rapid development of information technology, individuals using the technology are liable to be overwhelmed by the excessive amounts of information available when conducting online (local or remote) document searches. It is important therefore that users specify the correct search terms. However, a user does not always know which terms to use and often the same idea can be described by different terms. Constructing lists of possible search terms for different domains would require a very substantial effort by experts in each domain. To alleviate these problems, automated techniques can be valuable to extract concepts and meaningful phrases for specific domains.

This work is an exploratory study of automated extraction of compound concepts from a collection of documents in a specific domain. The concept-extraction methods used in this study employed clustering techniques based on distance measures that reflect term affinity statistics rather than techniques based on similarity measures adopted in most previous works. The study compared the effects of different methods of calculating affinities, depending on the sizes of textual units where terms co-occur and on directionality and asymmetry between terms. The accounting context was used as a case study to provide the data. An accounting expert evaluated the resulting clusters produced by the clustering program.

As demonstrated by our results, the method identified meaningful accounting compound concepts and phrases.
The research also indicated which affinity types generated better results. For example, affinities based on occurrence of terms within a document produced the poorest results. There was a significant manual effort involved in "preprocessing" the data prior to compound concept identification. However, we believe the techniques explored might be useful for users to search relevant information within individual domains and can be extended to support the construction of domain-specific thesauri.

Table of Contents

Abstract
Table of Contents
List of Tables
Acknowledgments
Section 1  Introduction
  1.1 Motivation
  1.2 Thesis Framework
Section 2  Literature Review
  2.1 Introduction to Thesaurus Construction
    2.1.1 Thesauri
    2.1.2 Manual Thesauri
    2.1.3 Automatic Thesauri
  2.2 Automatic Techniques Guiding This Research
    2.2.1 Document Collection
    2.2.2 Object Filtering
    2.2.3 Automatic Indexing
    2.2.4 Co-occurrence Analysis
    2.2.5 Evaluation
Section 3  Our Approach and Research Questions
Section 4  Our Affinity Measures
  4.1 Term Affinity Statistics
  4.2 Five Textual Units
    4.2.1 Sentence0td Textual Unit
    4.2.2 SentenceUpTo5td Textual Unit
    4.2.3 SentenceNoRestriction Textual Unit
    4.2.4 ParagraphNoRestriction Textual Unit
    4.2.5 DocumentNoRestriction Textual Unit
  4.3 Estimating Fifteen Schemes' Affinity Values
Section 5  Experiment
  5.1 Domain Dependent Preprocessing - Tokenizing
    5.1.1 File Extraction
    5.1.2 Reformatting the Document Structure
    5.1.3 Changing the Short-Forms of the Words in the Abbreviation List into Term-Phrases
    5.1.4 Converting Meaningful Words to Term-Phrases
    5.1.5 Removing Stop-Words
    5.1.6 Producing a Full Token List
    5.1.7 Removing Unwanted Tokens
    5.1.8 Consolidating Wanted Tokens
    5.1.9 Generating the Final Reduced Token List
  5.2 Computing Term Affinities
    5.2.1 Removing Unwanted Punctuations
    5.2.2 Converting 1,344 Tokens to Token IDs
    5.2.3 600 Tokens' Affinity Values
      5.2.3.1 Generating the 600 Token List for Clustering
      5.2.3.2 Summary of the Steps to Obtain the 600 Tokens
      5.2.3.3 The Token IDs of Any Two Terms
      5.2.3.4 An Example of a Dummy File to Illustrate Affinity Calculations
      5.2.3.5 The Program's Affinity Calculation of Fifteen Schemes
  5.3 Clustering 600 Tokens
    5.3.1 Hierarchical Clustering Using Matlab
    5.3.2 Affinity Values Fit into Matlab
    5.3.3 Clustering Outputs
  5.4 Evaluation
    5.4.1 Clustering Terms to Be Evaluated
    5.4.2 Expert's Evaluation
      5.4.2.1 Instructions for Expert's Evaluation
      5.4.2.2 Sample Evaluation Results
      5.4.2.3 Statistical Analysis of Expert's Evaluation Results for All Fifteen Schemes
    5.4.3 Comparing the Expert's Evaluation Results with Price Waterhouse Accounting Thesaurus
      5.4.3.1 Sample Comparison
      5.4.3.2 Statistical Comparisons of All Fifteen Schemes
  5.5 Discussion
    5.5.1 Can Relevant Concepts Be Extracted Automatically from a Set of Documents in a Given Domain? (Q1.1) What Type of Semantic Relations Can Be Identified in the Extracted Concepts? (Q1.2)
    5.5.2 What Parameters Can Affect the Quality of the Results? (Q2)
      5.5.2.1 Proximity - Sentence, Paragraph and Document Levels (Q2.1)
      5.5.2.2 Distance Within the Sentence Level (Q2.2)
      5.5.2.3 Directionality and Asymmetry (Q2.3)
Section 6  Automatically Identifying Phrases (Q3)
  6.1 Single-Word Tokenizing
    6.1.1 Broken Phrases List
    6.1.2 Wanted Single Words
    6.1.3 Converting Plural Wanted Single Words into Singulars
    6.1.4 905 Final Single Tokens
  6.2 905 Single Tokens' Affinity Calculations
    6.2.1 One Level - Sentence0td
    6.2.2 Four Cases - Estimating Affinities
  6.3 Clustering 905 Single Tokens
  6.4 Evaluation - 905 Single Token Clustering
    6.4.1 A Sample Evaluation
    6.4.2 Statistical Evaluation Results
  6.5 Automatic Phrase Discussions - 905 Single Tokens Clustering
    6.5.1 Our Techniques Can Automatically Identify Some Accounting Phrases
    6.5.2 Directionality and Asymmetry
Section 7  Conclusions and Future Research
  7.1 Conclusions
  7.2 Contributions
  7.3 Limitations and Future Research
Bibliography
Appendices
  Appendix A  Domain Dependent Preprocessing - Tokenizing
  Appendix B  Computing Term Affinities
  Appendix C  Clustering 600 Tokens Data Outputs
  Appendix D  Statistical Evaluation Results of All Fifteen Schemes - Clustering 600 Tokens
  Appendix E  Automatically Identifying Phrases
    Appendix E1: Single Word Tokenizing
    Appendix E2: Clustering 905 Single Tokens Data Outputs
    Appendix E3: Single Tokens Evaluation Results (Provided in CD-ROM)

List of Tables

Table 3.1 Summary of Our Approach
Table 4.1 Fifteen Schemes' Affinity Estimations
Table 5.1 Summary of the Steps to Obtain the 600 Tokens
Table 5.2 Example of Comparing Two Token IDs
Table 5.3 Dummy File - Two Tokens' Co-occurrences in ParagraphNoRestriction
Table 5.4 Dummy File - Two Tokens' Statistical Affinity Values in ParagraphNoRestriction
Table 5.5 Top Ten Clustering Data of Sentence0tdAB_A
Table 5.6 Sample Evaluation Terms - Top Three Clusters of SentenceUpTo5tdAB_A
Table 5.7 Sample Evaluation Relation Type Alternatives
Table 5.8 Sample Expert's Evaluation Results - Top Three Clusters of SentenceUpTo5tdAB_A
Table 5.9 Statistical Analysis of the Expert's Evaluation Results - All Fifteen Schemes
Table 5.10 Sample Comparison of Expert's Evaluation Results with the Price Waterhouse Thesaurus - Top Three Clusters of SentenceUpTo5tdAB_A
Table 5.11 Comparing Fifteen Schemes of the Expert's Evaluation Results with the Price Waterhouse Thesaurus
Table 5.12 Statistical Results of "No direction relation" Type for All Fifteen Schemes
Table 5.13 Statistical Results of Average Relevance Scores for All Fifteen Schemes
Table 5.14 Statistical Results of the Top Two Most Frequently Chosen Relation Types in All Fifteen Schemes
Table 5.15 Identical Clusters Percentages by Comparing the Same Scheme in Sentence0td and SentenceUpTo5td with SentenceNoRestriction Respectively
Table 5.16 Comparing the Statistical Analysis of the Expert's Evaluation Results of the Same Scheme in EvaluateSentence0td with EvaluateSentenceUpTo5td, and with EvaluateSentenceNoRestriction
Table 5.17 Comparing the Statistical Analysis of the Expert's Evaluation Results of the Same Scheme in EvaluateSentenceUpTo5td with EvaluateSentenceNoRestriction
Table 5.18 Directionality - Comparing the Statistical Analysis of the Expert's Evaluation Results of the Scheme AB_A at Five Levels and the Scheme BA_A at Five Levels
Table 5.19 Asymmetry - Comparing the Statistical Analysis of the Expert's Evaluation Results of the Schemes AB_A and AB_B at Five Levels and the Schemes BA_A and BA_B at Five Levels
Table 6.1 Top Ten Automatic Phrase Clustering Data in Sentence0tdAB_A
Table 6.2 Sample Evaluation of 905 Single Token Clusters - Top Five Clusters in Sentence0tdAB_A
Table 6.3 Statistical Evaluation of Automatic Phrases for Four Cases at Sentence0td Level
Table 6.4 Asymmetry - Comparing Evaluation of Automatic Phrases in Sentence0tdAB_A with Sentence0tdAB_B, and in Sentence0tdBA_A with Sentence0tdBA_B

Acknowledgments

I would like to thank Professor Yair Wand for instructing and supervising me throughout this study, and Professor Carson Woo and Professor Jacob Steif for their comments on the development of this thesis. I want to express my sincere appreciation to Ph.D. candidate Ofer Arazy for his advice and extensive support during the entire process of developing this research. I also would like to thank Ms. Yongwei Yin for her expert judgement of the experimental results, and Mr. Steve Doak for editing my thesis. Furthermore, I sincerely appreciate the continuous encouragement and assistance I have received from my husband, Mr. Jianwen Zhang, and my mother, Ms. Xiuge Sun.

1. INTRODUCTION

1.1 Motivation

With the recent rapid development of advanced technology, people nowadays can easily access and locate information they need by searching online, or through local database systems. Users, however, might be overwhelmed by the excessive amounts of information available from these various sources, and they may feel confused about how to effectively retrieve the needed information. This problem was labelled information overload by Chen et al. (1995). Another common predicament searchers encounter is a vocabulary problem: people often use different terms to describe the same concept. Furnas et al. (1987) noted that when two people spontaneously made a word choice for objects from various domains, the probability that they chose the same term was lower than 20%. Subsequently, Chen et al. (1995, p.177) argued, "Due to the unique backgrounds, training and experiences of different users, the chance of two people using the same term to describe a concept is quite low and even the same person may use different terms to describe the same concept at different times (due to the learning process and the evolution of concepts)."

These problems exist in many industries, including financial accounting. KPMG Consulting LLC (2000) claimed that over two-thirds of firms in various fields in a survey of 423 organizations were overwhelmed by the information in their systems, and 50% of the organizations complained of difficulty when attempting to locate information (Garnsey, 2002). Because financial standards change over time and also because these standards apply to various types of organization, financial information has become more complex. This is why searchers such as accountants, financial workers, business professionals and other general users with diversified backgrounds and various searching goals are usually unfamiliar with the varied terms that can represent the same concept in accounting (Garnsey, 2002).
To handle these problems, researchers have developed various automatic thesaurus-generation methods to identify related concepts in different applications, such as Information Retrieval (IR), Latent Semantic Indexing (LSI), text mining and knowledge discovery, among others. The traditional definition of "thesaurus" in the Merriam-Webster Online Dictionary is "a: book of words or information about a particular field or set of concepts; especially: a book of words and their synonyms; b: a list of subject headings or descriptors usually with a cross-reference system for use in the organization of a collection of documents for reference and retrieval" (http://www.m-w.com). As Milstead et al. (1993) have noted, a thesaurus is characterised both as a tool for writers "to help select the best word to convey a specific nuance of meaning," and as an indexing system that can serve as "an information storage and retrieval tool: a listing of words and phrases authorized for use in an indexing system, together with relationships, variants and synonyms, and aids to navigation through the thesaurus." WordNet, an on-line lexical database, has become one of the most popular machine-readable thesauri. For more information about WordNet, see the introduction by Miller et al. (1993).

A thesaurus can lead searchers to concepts associated with an initial term. Regarding the various definitions of concept in the lexicon of different languages, see the review by Sowa (2000). Grefenstette (1994, p.24) explained a generalized condition for developing a thesaurus: "one of the aspects of language variability is that many different words can be used to describe the same concept, and here we have indications that an automatic means of discovering the words associated with a concept is possible." Leroy and Chen (2001, p.263) pointed out that "terms and concepts are different entities. A concept is the underlying meaning of a set of terms. As such, each concept can be expressed by many different terms. For example, the concept cancer has 20 terms associated with it, two of which are malignant tumor and malignant tumoral disease." Within the field of accounting, Garnsey (2002) suggested, "if clusters of related accounting terms/phrases can be successfully constructed for accounting, it should be possible to give users of accounting information domain-specific knowledge. Eventually, this knowledge may be integrated into a retrieval system to improve the efficiency of searches for information about specific accounting topics."

In conducting our literature survey, we have noticed that studies of automatic identification of accounting concepts are few in number and are only preliminary forays into the field. Therefore, we focus our research on constructing a feasible approach to automatically grouping accounting terms and phrases into different compound concepts. A compound concept (hereinafter referred to as a concept, for simplicity) in this thesis refers to a set of related terms in a specific domain. Hence, when a user wants to describe a concept, besides those words the user is able to think of, he or she can select terms belonging to that same concept suggested by the automatic mechanism. The approach proposed in our research, if practical and useful, could lead to an automatic tool that enhances users' searching performance.

1.2 Thesis Framework

In the following section, we review previous research performed in related fields and position our research therein.
Following this, our research approach is introduced, and the research questions that delineate our study's scope are presented in Section Three. In Section Four, we introduce our own statistical affinity measures to address the research questions. An experiment on automatically extracting accounting concepts is described in Section Five, and data analysis is conducted to derive our experimental results. In Section Six, we describe another study on automatic accounting phrase generation and discuss the findings. Finally, we draw conclusions, summarize the contributions we have made, and remark on directions for future research.

2. LITERATURE REVIEW

In this section, we summarize our survey of general thesaurus construction and of the existing techniques to automatically construct thesauri, which have influenced our research position and our approach to automatically extracting accounting concepts.

2.1 Introduction to Thesaurus Construction

2.1.1 Thesauri

Grefenstette (1993) believed that a domain-specific thesaurus could identify important concepts in the domain hierarchically and could suggest alternative words as well as phrases to describe the same concept in the domain. Schütze and Pedersen (1997, p.308) explained that "a thesaurus is a data structure that defines semantic relatedness between words." Furthermore, according to Gietz (2001), "a thesaurus is a collection of relevant terms ordered in a hierarchy of super ordinate and subordinate concepts and homonyms."

2.1.2 Manual Thesauri

The most traditional way to construct a thesaurus involves manually constructing it using a semantic mapping table. However, this is expensive and time-consuming, inasmuch as it requires extensive involvement of domain experts. As well, manual construction is only possible in a specific domain when repeated use of the thesaurus exceeds the construction cost (Schütze and Pedersen, 1997). Since constructing a manual thesaurus incurs significant human and time costs, researchers have studied various approaches to building system-generated automatic thesauri.

2.1.3 Automatic Thesauri

We need to first investigate the methodologies of automatically constructing a thesaurus. Previous studies have already developed linguistic and statistical measures related to automatic thesaurus construction. For instance, in the area of linguistics, Curran and Moens (2002) noted that some systems extract related terms that appear together in particular contexts by recognising linguistic patterns (e.g. X, Y and other Zs) which link synonyms and hyponyms. Regarding statistical measures, Grefenstette (1994, p.23) claimed that most of the semantic extraction work "was based on the statistics of co-occurrence of words within the same window of text, where a window can be a certain number of words, sentences, paragraphs or an entire document". Grefenstette (1994, p.26) then demonstrated that Church and Hanks (1990) "use textual windows to calculate the mutual information between a pair of words"; they employ an information-theoretic definition of mutual information over a corpus, under which word pairs with high mutual information are usually semantically related. Jang et al. (1999) also described the techniques of using mutual information statistics to identify the lexical relations between pairs of words. However, these works did not address compound concept identification.
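To make the mutual-information statistic referenced above concrete, the following is a minimal sketch of pointwise mutual information for word pairs that co-occur within a fixed textual window. It is an illustration only, not the procedure used by Church and Hanks (1990) or by this thesis; the window size, tokenization and function name are assumptions.

    import math
    from collections import Counter

    def pmi_scores(sentences, window=5):
        # sentences: list of token lists; window: assumed co-occurrence span
        word_counts, pair_counts = Counter(), Counter()
        total_words = total_pairs = 0
        for tokens in sentences:
            word_counts.update(tokens)
            total_words += len(tokens)
            for i, w in enumerate(tokens):
                for v in tokens[i + 1:i + 1 + window]:
                    pair_counts[(w, v)] += 1
                    total_pairs += 1
        scores = {}
        for (w, v), n_wv in pair_counts.items():
            p_pair = n_wv / total_pairs
            p_w = word_counts[w] / total_words
            p_v = word_counts[v] / total_words
            # high PMI suggests the pair co-occurs more often than chance
            scores[(w, v)] = math.log2(p_pair / (p_w * p_v))
        return scores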
One approach to constructing an automatic thesaurus is to reuse existing online lexicographic databases, such as WordNet. Other attempts to incorporate existing thesauri were explained in detail by Chen et al. (1997). However, generic thesauri like WordNet are not specialized enough for domain-specific databases (Caraballo, 1999). Chen et al. (1997, pp.20-21) reviewed in detail several algorithmic approaches to automatic thesaurus generation developed in numerous investigations, and concluded that most methodologies compute coefficients of "relatedness" between terms using statistical co-occurrence measures such as cosine, Jaccard and Dice similarity functions (Chen and Lynch, 1992; Crouch, 1990; Rasmussen, 1992; Salton, 1989). The field that most extensively utilizes thesaurus construction is Information Retrieval (IR). The goal of IR is to develop systems that can retrieve all documents relevant to a user's query, while retrieving only documents containing relevant information. Because our research focus is domain-specific, we reviewed past approaches primarily in the context of domain-specific automatic thesaurus generation. We also referred to other generic automatic thesaurus techniques that we thought useful and similar to our approach; these relevant techniques will be discussed in the next section.

The University of Arizona Artificial Intelligence Lab, headed by Dr. Hsinchun Chen, conducted the most prominent series of studies to automatically develop a domain-specific thesaurus. As Chen et al. (1998) summarized, in their previous research they generated domain-specific thesauri in different domains such as Russian computing (Chen and Lynch, 1992; Chen et al., 1993), business (Chen et al., 1994), and molecular biology (Chen et al., 1995). More recently, Hauck et al. (2001) conducted an experiment in geoscience, and Leroy and Chen (2001) studied medical terminology. Because their studies and techniques have been cited in many papers on constructing automatic thesauri and have also directed our research, we will introduce their essential techniques in detail in the following section.

As for our research interest, the accounting domain, we have identified only three relevant papers classifying accounting concepts. Gangolly and Wu (2000) conducted preliminary research using statistical analysis of term-document frequencies to automatically classify accounting concepts. They used Indexicon, an indexing utility, to preprocess the text and produce terms. An agglomerative nesting algorithm was then adopted to derive clusters of concepts. This was only a preliminary investigation, and therefore their objective was only a rudimentary exploratory data analysis of financial accounting standards. Hence, we could not find detailed descriptions to guide our research. Although they found that some rudimentary clusters were differentiable and could extend to further research, the authors did not interpret the resulting clusters in detail. Nevertheless, their finding that accounting terms can be classified by hierarchical clustering encouraged us, and Gangolly and Wu's work motivated us to seek other related techniques that could be used to develop our own method to automatically classify accounting concepts.

Garnsey (2001, 2002) investigated the feasibility of statistical methods using the frequencies of particular words within documents. Latent Semantic Indexing (LSI) and agglomerative clustering were used to derive clusters of related accounting concepts. This research found related terms included in the resulting clusters. We thought some techniques adopted in this study, such as accounting terms/phrases identification and evaluation, were very valuable for our research. We will discuss these techniques in detail in the following section.
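For reference, the co-occurrence-based similarity coefficients cited earlier in this section (cosine, Jaccard and Dice) can be computed from the sets of documents in which two terms occur. This is a generic, simplified sketch using binary occurrence sets; it is not the exact formulation used in any of the studies reviewed above, and the function name and inputs are assumptions.

    import math

    def similarity_coefficients(docs_a, docs_b):
        # docs_a, docs_b: iterables of document ids containing term A / term B
        a, b = set(docs_a), set(docs_b)
        common = len(a & b)
        cosine = common / math.sqrt(len(a) * len(b)) if a and b else 0.0
        jaccard = common / len(a | b) if (a | b) else 0.0
        dice = 2 * common / (len(a) + len(b)) if (a or b) else 0.0
        return {"cosine": cosine, "jaccard": jaccard, "dice": dice}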
2.2 Automatic Techniques Guiding This Research

Chen et al. (1995, 1997) developed techniques for domain-specific automatic thesaurus construction, which we have treated as a guideline in our research to automatically identify accounting concepts. To obtain a useful domain-specific automatic thesaurus, Chen et al. suggested three criteria: a complete document collection capturing full knowledge in that domain, an appropriate co-occurrence function, and user-directed interaction. In this section, the techniques related to our research will be introduced in the following order: document collection, object filtering, automatic indexing, co-occurrence and cluster analysis, and evaluation. Meanwhile, the derived techniques that can be used in our approach will be proposed and justified when discussing each step.

2.2.1 Document Collection

Chen et al. (1995, 1997) emphasized that acquiring a complete and up-to-date collection of documents from representative and authoritative domain sources was the key to creating a domain-specific thesaurus. In the accounting domain, Gangolly and Wu (2000) and Garnsey (2001, 2002) chose the FARS (Financial Accounting Research System) database as their document collection. Garnsey (2002) pointed out that FARS, "with its key word search feature, is an improvement over the print version of GAAP. However, it does not address the fact that users may be unfamiliar with the vocabulary and, therefore, do not input the required terms to achieve adequate retrieval." Garnsey's (2002) research objective was to provide sets of related accounting concepts that were derived from the terms actually used in the authoritative literature in the FARS database. Our research also selected the accounting literature from the FARS database as our collection to work with.

2.2.2 Object Filtering

Chen et al. (1997) claimed that domain-specific controlled lists of keywords in databases (for example, the subject indexes at the back of a textbook) can help identify key search vocabularies to improve information retrieval. Chen et al. used four different sources from the biological sciences to compile a vocabulary list, including researcher names, gene names, experimental methods and subject descriptors. They identified terms that matched with terms in the known vocabulary, labelling this process object filtering. Garnsey (2002) adopted this method and filtered terms using a combination of GAAP guide indexes from two accounting texts, and intermediate and advanced textbooks. To make the list more complete, in several cases additional phrases were added to the indexes (for example, "public traded company").

In our study we used the following sources to guide the vocabulary composition: Accounting Dictionary - Accounting Glossary - Accounting Terms (http://www.ventureline.com/glossary.asp) and Financial Accounting: An Introduction to Concepts, Methods, and Uses, 8th Edition (Stickney and Weil, 1997).

2.2.3 Automatic Indexing

Referring to Salton's (1989) blueprint method for automatic indexing, Chen et al.
(1995) implemented automatic indexing techniques in this order: "word identification" to identify words in each document without considering punctuation and case; and "stop-wording" to create a domain-specific stop-word list by removing non-semantic words (such as "on", "in", "at" and "there") as well as verbs that were too general and irrelevant to represent the meaning of the document. Though standard "stemming" was adopted in the beginning, the researchers later realized drawbacks to the method and removed the stemming process to "avoid creating noise and ungrammatical phrases. For example, CLONING will not be stemmed as CLONE (one is a process, the other is the output)." Chen et al. added that "term-phrase formation" was then performed by the system to form phrases containing up to three adjacent words. For example, "DAUER", "LARVA", "FORMATION", "DAUER LARVA", "LARVA FORMATION", and "DAUER LARVA FORMATION" were generated from the three adjacent words "DAUER LARVA FORMATION." Chen et al. (1995) referred to these phrases simply as "terms."

Garnsey (2002) performed word identification and stop-wording in automatic indexing by following the procedures used by Chen et al. (Chen and Lynch, 1992; Chen et al., 1998; Chen et al., 1995). Stemming was not used due to the fact that, in accounting, different forms of a word may have very different meanings; for example, "warranty" means "product guarantee", whereas "warrant" means "certificate representing stock rights." Instead, a limited number of closely related terms were combined (for example, singular and plural, past and present tense). This coincided with the work of Chen et al. (1995), which found that stemming produces noise and creates ungrammatical phrases. Subsequently, stop words, non-semantic-bearing words, adjectives, adverbs, pure verbs (such as "belong" and "solve") and non-accounting terms (such as "army" and "standard") were removed. This process was similar to that used by Chen et al. (1997), where a domain-specific stop-word list for biology containing about 600 very general molecular biology terms (for example, "gene", "process", and "mutation") was created to remove general terms which were considered irrelevant in the thesaurus. High-frequency words, which were too general to discriminate content, were also eliminated. Consistent with the work of Chen et al. (1998), terms which did not occur in at least three documents were eliminated. As well, low-frequency words that do not contribute to content were also removed.

The automatic indexing step in our research was processed similarly to both Garnsey (2002) and Chen et al. (1995), including term-phrase formation. We know that automatic indexing has timesaving and cost-saving advantages over manual indexing, but is less accurate. Callan (1995) declared that several experiments have demonstrated that "a combination of manual and automatic indexing is superior to either alone." Thereafter, there was some amount of manual work involved when selecting and identifying terms in our study, as is detailed later in the Experiment section of this paper.
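As a small illustration of the term-phrase formation step described above (phrases built from up to three adjacent words, as in the DAUER LARVA FORMATION example), the sketch below enumerates the candidate phrases for a run of adjacent words. The function name and interface are assumptions; it is not the implementation used by Chen et al. or in this thesis.

    def adjacent_phrases(words, max_len=3):
        # enumerate every run of 1..max_len adjacent words as a candidate phrase
        phrases = []
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrases.append(" ".join(words[i:i + n]))
        return phrases

    # adjacent_phrases(["dauer", "larva", "formation"]) returns:
    # ["dauer", "larva", "formation", "dauer larva", "larva formation",
    #  "dauer larva formation"]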
2.2.4 Co-occurrence Analysis

Chen and Lynch (1992, p.887) summarized that "virtually all techniques for automatic thesaurus generation are based on the statistical co-occurrence of word types in text." Chen et al. (1995), on the other hand, claimed the most popular technique for constructing automatic thesauri is to compute probabilities of terms co-occurring in all documents constituting a data collection, a process known as co-occurrence analysis. The first stage in many cluster analyses is to convert the raw data into a matrix, usually with similarity, dissimilarity or distance measures. The output of a cluster analysis is a number of groups, or clusters, of individuals. Lassi (2002) reviewed the co-occurrence analysis developed by Chen et al. (1995, 1997), which first computes each term's document frequency (the number of documents that a word appears in) and term frequency (the number of times a word occurs in a document). The inverse document frequency was then computed, assigning higher weights to multiple-word terms than to single-word terms, because multiple-word terms usually carry more precise semantic meaning than single words do. This co-occurrence measure was based, however, on document-level term co-occurrence analysis and was criticized by Schütze and Pedersen (1997).

Schütze and Pedersen (1997, p.309) developed a lexical co-occurrence method where "two terms lexically co-occur if they appear in the text within some distance of each other (typically a window of k words). Qualitatively, the fact that two words often occur close to each other is more likely to be significant than the fact that they occur in the same documents, especially if documents are long." Further, they noted that "if the goal is to capture information about specific words, we believe that lexical co-occurrence is the preferred basis for statistical thesaurus construction." In Latent Semantic Indexing (LSI), document-by-term matrices are used in attempts to discover the relationships between terms and documents. Although lexical co-occurrence thesauri are closely related to LSI, Schütze and Pedersen (1997, p.310) worked with a "term-by-term matrix", which is more efficient. They computed a symmetric term-by-term matrix C where "the element Cij records the number of times that words i and j co-occur in a window of size k." The lexical co-occurrence method focuses on term representations independently with respect to local contexts rather than documents, whereas LSI only computes document representations. In addition, the region over which LSI co-occurrence is defined is the document, but Schütze and Pedersen (1997) proposed to assess a region in a window of k words because they believed that local contexts of co-occurrence counts are better than document-based counts. Since in our research we wanted to investigate the relationships between terms rather than documents, we also deployed term-by-term matrices to compute term co-occurrences.
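The term-by-term matrix described by Schütze and Pedersen can be sketched as follows: C[i, j] counts how often terms i and j co-occur within a window of k tokens. This is an illustrative reconstruction, not their implementation or the one used in this thesis; the vocabulary, window size and interface are assumptions.

    import numpy as np

    def term_by_term_matrix(sentences, vocabulary, k=5):
        # symmetric matrix C: C[i, j] = number of times terms i and j co-occur
        # within a window of k tokens (k is an illustrative choice)
        index = {term: i for i, term in enumerate(vocabulary)}
        C = np.zeros((len(vocabulary), len(vocabulary)), dtype=int)
        for tokens in sentences:
            positions = [(index[t], pos) for pos, t in enumerate(tokens) if t in index]
            for a, (i, p1) in enumerate(positions):
                for j, p2 in positions[a + 1:]:
                    if p2 - p1 <= k:
                        C[i, j] += 1
                        C[j, i] += 1
        return C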
" A s n o t e d a b o v e , t h e y d e f i n e d a t h e s a u r u s as " a d a t a structure that defines s e m a n t i c relatedness b e t w e e n w o r d s . " W e b e l i e v e , t h o u g h , that the r e l a t i o n s b e t w e e n w o r d s are not o n l y s y n o n y m s , but a l s o c a n be  narrow term, broad term  and other potential composite relationships. Furthermore, w e do not think similarity functions c a n describe the p h e n o m e n a w h e n related terms appear i n p r o x i m i t y but not i n s i m i l a r c o n t e x t s . I n o t h e r w o r d s , i f t w o t e r m s , r e g a r d l e s s o f w h e t h e r t h e y are s i m i l a r o r semantically related terms, co-occur w i t h i n a text's scope, even though they do not have s i m i l a r s u r r o u n d i n g c o n t e x t s , t h e n c o n c e p t s c a n s t i l l be d e r i v e d f r o m the s e m a n t i c relatedness ( k n o w n as  affinity) b e t w e e n  these terms. C o n s e q u e n t l y , w e n e e d to d e v e l o p a n  affinity m e a s u r e to study their distance f r o m e a c h other. T h i s w a s one o f o u r k e y tasks w h e n c o n d u c t i n g the present research.  2.2.5  Evaluation  A f t e r c o - o c c u r r e n c e a n a l y s i s , the contents i n the r e s u l t i n g clusters w e r e t h e n e v a l u a t e d . F o l l o w i n g s i m i l a r e v a l u a t i o n m e t h o d s as t h o s e d e v e l o p e d b y C h e n et a l . Garnsey's  (2002) r e s e a r c h  (1995, 1998), i n  thirty terms w e r e r a n d o m l y c h o s e n f r o m the c l u s t e r c o n c e p t s  t h e y b e l o n g to a n d the i n d i v i d u a l s w e r e i n v i t e d to evaluate i f e a c h t e r m w a s relevant to other t e r m s o c c u r r i n g i n the s a m e cluster. T h i s e v a l u a t i o n w a s u s e d to d e t e r m i n e w h e t h e r c l u s t e r i n g c o u l d classify terms that w e r e related to e a c h other. T h e i n d i v i d u a l assessors evaluated strength o f relevance b e t w e e n terms u s i n g the three assessments  "Somewhat Relevant"  and  "Very Relevant")  (Irrelevant, "  p r o v i d e d b y the researcher. H o w e v e r , the  s p e c i f i c r e l a t i o n t y p e s o u r r e s e a r c h i s i n t e r e s t e d i n , s u c h as  - 14-  narrow term  or  broad term,  were not identified in the evaluation process. With this goal, in our study we consulted an expert in accounting to identify and assess relation types generated by each cluster concept.  3.  OUR APPROACH AND RESEARCH QUESTIONS  In detailing the literature and our proposed positions, our objective in this admittedly preliminary research is to develop our own approach to automatically extracting valid compound concepts and to examine some possible factors that might affect clustering performances. We are interested in the following research questions:  Q l . l : Can relevant concepts be extracted automatically from a set of documents in a given domain?  Q1.2: What type of semantic relations can be identified in the extracted concepts? A thesaurus normally denotes the semantic relationships between one term and another such as a Narrow Term (NT), a Broad Term (BT), or a Preferred Term (USE) (Lassi, 2002).  Furthermore, a Related Term (RT) relationship is used to indicate related terms that  cannot be represented by either broader or narrower semantic relationship. Broader and narrower terms form the hierarchical relationships mentioned by the literature. 
Therefore, we investigated each concept, to determine what kind of relations in particular could be identified among the terms related to this concept.  - 15 -  Q2: What parameters can affect the quality of the results?  We were also curious whether the following parameters, which almost no prior studies had examined, would be able to influence the resulting concepts: Q2.1  Proximity - sentence, paragraph and document levels  Salton and Buckely (1991, p.23) declared "Each available text is broken down into individual text units - for example, text sections, text paragraphs, and individual sentences." Rungsawang (1997) mentioned that word co-occurrences could be measured within "local" contexts (within sentences and paragraphs) and "global" contexts (the entire document). Like Schutze and Pedersen (1997), who suggested computing the number of times a word co-occurs with other words in a document, in a chapter or in a window of a number of words, we too were interested in computing the co-occurrences within the different local contexts. Zhang and Rudnicky (2002) investigated how multiple-levels of documents, paragraphs and sentences affect the derived semantic information. However, in the literature from previous studies, we could not find any comparisons across different levels of text-blocking. Therefore in this study, we wondered whether different contexts for proximities between terms, particularly at the levels of documents, paragraphs and sentences, would extract very diversified concepts? If so, the next objective would be to determine which level operated most accurately and effectively. Q2.2  Distance within the sentence level  Dagan et al. (1995) employed the relation of co-occurrence of two words within a limited distance in a sentence. According to Martin et al. (1983), restricting a window to at most five words accounts for 98% of word relations within a single sentence. Experiments conducted by Losee (1994) also showed that identifying term dependence in text units of  - 16-  no more than five words increased the degree of information retrieval performance, but "more dependence information results in relatively little increase in performance." Hence, we wonder whether even within the same sentence, the distance between two terms (how closely they are located in relation to each other) will affect the relevance in the resulting concepts. Q2.3  Directionality and asymmetry  Many past attempts we have examined have treated the context as "a bag of words" in that they ignored the word order, resulting in information loss (see Zhang and Rudnicky, 2002). Hence, it is of interest to study pairs of terms that co-occur, to assess the following issues: (1) whether order-dependent co-occurrence statistics might affect the results - (this is referred to as directionality); and (2) whether basing the co-occurrence measure on the more frequently-occurring word might affect the results (this is referred to as asymmetry).  Q3: Is our approach useful to automatic formation of term-phrases?  Although term-phrases have been extensively used in automatic indexing to form index terms, few works have clearly demonstrated how to use single-word terms to automatically form multiple-word phrases. Thus, we wanted to apply our approach to explore whether term-phrases could be automatically formed.  In this research, we followed the techniques we discussed in the literature review above for automatic indexing with some changes in the domain-specific prepossessing stage. 
We then developed our own term co-occurrence statistical affinity measures to explore the effects of different parameters on the outputs. Our proposed approach is summarized in Table 3.1, with comparisons to approaches used in other studies.

An accounting expert scrutinized the extracted concepts after we had generated our results. The research outcomes were based on both the expert's evaluation and our analysis of the outputs.

Document Collection
  Literature: FARS database (see Section 2.2.1)
  Compared to previous studies: Same
  Our techniques: FARS database (see also Section 5.1.1)

Object Filtering
  Literature: Various external sources used to make a vocabulary list (see Section 2.2.2)
  Compared to previous studies: Similar
  Our techniques: Our external sources (see also Section 2.2.2)

Automatic Indexing
  Literature: Word identification, stop-wording, no stemming, term-phrase formation (see Section 2.2.3)
  Compared to previous studies: Similar and extended
  Our techniques: Similar aspects (see Sections 5.1.4-5.1.7); no stemming applied; consolidation of wanted tokens used instead (see Section 5.1.8)

Co-occurrence Analysis
  Literature: Popular similarity measures grouping similar words/synonyms that co-occur with similar neighbors (see Section 2.2.4)
  Compared to previous studies: New: (1) we studied proximity, not similar neighbors; (2) we grouped words to form compound concepts, not only synonyms
  Our techniques: Our own statistical affinity measure (affinity based on co-occurrence count, normalized by term occurrence; see Section 4.1); affinity measure converted to a distance measure (see Section 5.3.2)

Clustering
  Literature: Hierarchical clustering (see Section 2.1.3)
  Compared to previous studies: Similar
  Our techniques: Hierarchical clustering in Matlab (see Section 5.3.1)

Evaluation - relevance
  Literature: Categorization of identified concepts as "Irrelevant", "Somewhat Relevant", "Very Relevant" (see Section 2.2.5)
  Compared to previous studies: Similar and extended
  Our techniques: Relevance score 1-5, from "mostly unrelated" to "mostly related" (see Section 5.4.2.1)

Evaluation - semantic relations
  Literature: Narrow Term (NT), Broad Term (BT), Related Term (RT) (see Section 3, Q1.2)
  Compared to previous studies: Extended
  Our techniques: "Broader Term" used to represent NT and BT, plus additional relation type alternatives to represent Related Terms (RT) (see Section 5.4.2.1)

Sentence, Paragraph and Document Level
  Literature: Within each individual sentence, paragraph or document (see Sections 2.1.3, 2.2.4 and 3, Q2.1)
  Compared to previous studies: Extended and new
  Our techniques: We studied not only each individual sentence, paragraph and document, but also compared them (this is new; see Section 4.2)

Within a Sentence
  Literature: Five-word distance within a sentence (see Section 3, Q2.2)
  Compared to previous studies: Extended
  Our techniques: Different distances between two words within a sentence (see Section 4.2)

Directionality
  Literature: Many previous studies ignored the word order (see Section 3, Q2.3)
  Compared to previous studies: Somewhat new
  Our techniques: Compared the effects of different word orders (see Section 4.1)

Asymmetry
  Literature: No previous studies found (see Section 3, Q2.3)
  Compared to previous studies: New
  Our techniques: Compared the effect of normalization (see Section 4.1)

Table 3.1: Summary of Our Approach
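As Table 3.1 notes, the affinity values are converted to a distance measure and then clustered hierarchically; in this thesis that step is carried out in Matlab (Sections 5.3.1-5.3.2). Purely as an illustration of the idea, an analogous step in Python with SciPy might look like the sketch below. The conversion distance = 1 - affinity, the "average" linkage and the number of clusters are assumptions for illustration, not the thesis's actual settings.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def cluster_terms(affinity, term_names, n_clusters=10):
        # affinity: symmetric matrix with values in [0, 1]; higher = closer terms
        distance = 1.0 - np.asarray(affinity, dtype=float)
        np.fill_diagonal(distance, 0.0)
        condensed = squareform(distance, checks=False)  # condensed distance vector
        tree = linkage(condensed, method="average")     # agglomerative clustering
        labels = fcluster(tree, t=n_clusters, criterion="maxclust")
        clusters = {}
        for name, label in zip(term_names, labels):
            clusters.setdefault(label, []).append(name)
        return clusters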
4. OUR AFFINITY MEASURE

4.1 Term Affinity Statistics

As we explained in the literature review section above, our research proposed an affinity measure to estimate co-occurrence probabilities between any two terms (for example, term A and term B), based on their relative distance within fixed textual scopes: within a single document, within a paragraph, and within a sentence. Similar to approaches adopted by Besancon et al. (1999) and Dagan et al. (1995), in our investigation we set the affinity between two terms (terms are identified as meaning-bearing single words or phrases) as the relative directional occurrence of the two words within a textual unit, given the frequency of one of them. The statistics of the co-occurrence of a pair of terms were used to estimate the general probability of their co-occurrence. For the sake of brevity, we refer to these measures as statistical or probabilistic co-occurrence measures.

To address the research questions of directionality (term order) and asymmetry, four estimations, explained below, were developed using the sample terms "reaction" (term A) and "nuclear" (term B). The numbers shown in the examples below were invented in order to illustrate how to estimate the four different affinity values: they do not represent real cases.

• Pr(A->B/A) ≈ #<A, B>/#A: In the entire collection, given that the current term A appears within the sentences, paragraphs or documents, what is the probability that the other term B will co-occur AFTER term A in a specific predefined textual unit? For example, consider term A "reaction" and term B "nuclear": if in the entire collection, given 150 sentences, paragraphs or documents carrying the current term A "reaction", there are 30 sentences, paragraphs or documents having term B "nuclear" co-occurring after the given term A "reaction" in a specific textual unit, then Pr(A->B/A) is estimated by #<reaction, nuclear>/#reaction = 30/150. This indicates that in the entire collection, given that the current term A "reaction" appears within the sentences, paragraphs or documents, the probability that term B "nuclear" will co-occur after term A "reaction" in a specific predefined textual unit is 20%.

• Pr(B->A/A) ≈ #<B, A>/#A: In the entire collection, given that the current term A appears within the sentences, paragraphs or documents, what is the probability that the other term B will co-occur BEFORE term A in a specific predefined textual unit? For example, consider term A "reaction" and term B "nuclear": if in the entire collection, given 150 sentences, paragraphs or documents carrying the current term A "reaction", there are 70 sentences, paragraphs or documents having term B "nuclear" co-occurring before the given term A "reaction" in a specific textual unit, then Pr(B->A/A) is estimated by #<nuclear, reaction>/#reaction = 70/150. This indicates that in the entire collection, given that the current term A "reaction" appears within the sentences, paragraphs or documents, the probability that term B "nuclear" will co-occur before term A "reaction" in a specific predefined textual unit is 47%.

• Pr(A->B/B) ≈ #<A, B>/#B: In the entire collection, given that the current term B appears within the sentences, paragraphs or documents, what is the probability that the other term A will co-occur BEFORE term B in a specific predefined textual unit? For example, consider term A "reaction" and term B "nuclear": if in the entire collection, given 100 sentences, paragraphs or documents carrying the current term B "nuclear", there are 30 sentences, paragraphs or documents having term A "reaction" co-occurring before the given term B "nuclear" in a specific textual unit, then Pr(A->B/B) is estimated by #<reaction, nuclear>/#nuclear = 30/100.
This indicates that in the entire collection, given that the current term B "nuclear" appears within the sentences, paragraphs or documents, the probability that term A "reaction" will co-occur before term B "nuclear" in a specific predefined textual unit is 30%.

• Pr(B->A/B) ≈ #<B, A>/#B: In the entire collection, given that the current term B appears within the sentences, paragraphs or documents, what is the probability that the other term A will co-occur AFTER term B in a specific predefined textual unit?

In this study, we wanted to explore combinations of different distances, order and asymmetry between two terms. In order to make the research feasible and simple, we reduced the number of combinations to the first three estimates: #<A, B>/#A, #<B, A>/#A, and #<A, B>/#B. We believe these three cases are sufficient to address both issues of order and asymmetry. By definition, the higher the estimated probabilities are, the more likely it is that the two terms will appear together, which also means their affinity is more proximate. Moreover, the closer together they are, the more probable it is that they could belong to the same concept.

4.2 Five Textual Units

As discussed above, the predefined context scopes for our research were within a sentence, within a paragraph, and within a document, with regard to each of which different relations for indexing can be investigated. We also noted that Martin et al. (1983) indicated that restricting a window to at most five words accounts for 98% of word relations within a single sentence; therefore, in our research the cut-off level used within the sentence level was less than or equal to five terms intervening in the same sentence. In other words, we examined the neighborhood of each word w within a span of five words (-5 words and +5 words around w). There were three representative cases within the sentence level that we were interested in investigating: two terms next to each other (no term existing between them - in this case the two terms should have the highest affinity); two terms with up to five terms between them; and two terms co-occurring in a sentence regardless of how many other terms exist between them. That is, we further divided the same sentence level into three textual units: zero token difference (Sentence0td), up to five token difference (SentenceUpTo5td), and entire sentences disregarding token differences (SentenceNoRestriction).

We probed the affinities of a total of five textual units, with different proximity between the terms in our experiment: Sentence0td, SentenceUpTo5td, SentenceNoRestriction, ParagraphNoRestriction and DocumentNoRestriction. Using the affinity measure we developed, the following explains how to calculate two terms' affinity values in each of the five textual units in the direction of A->B (A is before B) or B->A (A is after B) in our experiment:

• Within the Same Sentence

4.2.1 Sentence0td Textual Unit: as long as any two tokens in the order of A->B / B->A were next to each other with zero token distance (0td) between them in the same sentence, regardless of how many of these patterns were in that same sentence, we counted only 1 per sentence.

4.2.2 SentenceUpTo5td Textual Unit: as long as any two tokens in the order of A->B / B->A were located in the same sentence with five tokens or fewer between them (UpTo5td), regardless of how many of these patterns were in the same sentence, we counted only 1 per sentence.
4.2.3 SentenceNoRestriction Textual Unit: as long as any two tokens in the order of A->B / B->A were located in the same sentence, with any number of tokens between them (NoRestriction - no token restriction), regardless of how many of these patterns were in the same sentence, we counted only 1 per sentence.

• Within the Same Paragraph

4.2.4 ParagraphNoRestriction Textual Unit: as long as any two tokens in the order of A->B / B->A were located in a single paragraph but across different sentences, with any number of sentences between them (NoRestriction - no sentence restriction), regardless of how many of these patterns were in that same paragraph, we counted only 1 per paragraph.

• Within the Same Document

4.2.5 DocumentNoRestriction Textual Unit: as long as any two tokens in the order of A->B / B->A were located in a single document, across different paragraphs with any number of paragraphs between them (NoRestriction - no paragraph restriction), regardless of how many of these patterns were in that same document, we counted only 1 per document.

4.3 Estimating Fifteen Schemes' Affinity Values

For each of the five textual units there were three affinities we wanted to estimate: #<A, B>/#A, #<B, A>/#A, and #<A, B>/#B, with values ranging from 0 to 1. All fifteen schemes' affinities can be estimated by the formulas listed in Table 4.1.

Sentence0td
  #<A, B>/#A: number of sentences having B AFTER A with 0td / number of sentences carrying A
  #<B, A>/#A: number of sentences having B BEFORE A with 0td / number of sentences carrying A
  #<A, B>/#B: number of sentences having A BEFORE B with 0td / number of sentences carrying B

SentenceUpTo5td
  #<A, B>/#A: number of sentences having B AFTER A with UpTo5td / number of sentences carrying A
  #<B, A>/#A: number of sentences having B BEFORE A with UpTo5td / number of sentences carrying A
  #<A, B>/#B: number of sentences having A BEFORE B with UpTo5td / number of sentences carrying B

SentenceNoRestriction
  #<A, B>/#A: number of sentences having B AFTER A with no token restriction / number of sentences carrying A
  #<B, A>/#A: number of sentences having B BEFORE A with no token restriction / number of sentences carrying A
  #<A, B>/#B: number of sentences having A BEFORE B with no token restriction / number of sentences carrying B

ParagraphNoRestriction
  #<A, B>/#A: number of paragraphs having B AFTER A but across different sentences, with no sentence restriction / number of paragraphs carrying A
  #<B, A>/#A: number of paragraphs having B BEFORE A but across different sentences, with no sentence restriction / number of paragraphs carrying A
  #<A, B>/#B: number of paragraphs having A BEFORE B but across different sentences, with no sentence restriction / number of paragraphs carrying B

DocumentNoRestriction
  #<A, B>/#A: number of documents having B AFTER A but across different paragraphs, with no paragraph restriction / number of documents carrying A
  #<B, A>/#A: number of documents having B BEFORE A but across different paragraphs, with no paragraph restriction / number of documents carrying A
  #<A, B>/#B: number of documents having A BEFORE B but across different paragraphs, with no paragraph restriction / number of documents carrying B

Table 4.1: Fifteen Schemes' Affinity Estimations
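As an illustration of how the ratios in Table 4.1 turn raw counts into the three affinity estimates for one textual unit, consider the sketch below. The labels AB_A, BA_A and AB_B mirror the scheme names used later in the thesis; the data structures and function name are assumptions, not the thesis's own program.

    def affinity_estimates(pair_counts, unit_counts):
        # pair_counts[(a, b)]: number of textual units (sentences, paragraphs or
        # documents, under one of the five co-occurrence rules) in which b occurs
        # AFTER a; unit_counts[a]: number of units carrying a.
        affinities = {}
        for (a, b), n_ab in pair_counts.items():
            n_ba = pair_counts.get((b, a), 0)
            affinities[(a, b)] = {
                "AB_A": n_ab / unit_counts[a],  # Pr(A->B / A)
                "BA_A": n_ba / unit_counts[a],  # Pr(B->A / A)
                "AB_B": n_ab / unit_counts[b],  # Pr(A->B / B)
            }
        return affinities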
5. EXPERIMENT

To answer our research questions, we conducted an experiment using a computerized program to automatically extract accounting concepts. The following sections describe our experiment process in detail.

5.1 Domain Dependent Preprocessing - Tokenizing

Domain-dependent tokenizing is a process of breaking text in a specific domain into individual domain-specific meaning-carrying units. It includes extraction of tokens as well as elimination of non-domain-related tokens. Our tokenizing techniques employed an automatic indexing process similar to the one used by Chen et al. (1992, 1994, 1997, 1998). We performed the steps listed below in sequence. In the end, a specific token list and tokenized texts were generated, ready for use in the later stages of the experiment.

5.1.1 File Extraction

Like Garnsey (2002), in our research we selected the accounting literature from the FARS 4.2 (Financial Accounting Research System) database, which is current through June 15, 2002. Our entire research collection comprised 835 documents, which were automatically extracted from the following categories in the FARS database:

- Committee on Accounting Procedure Accounting Research Bulletins (ARB)
- Accounting Principles Board Opinions (APB)
- AICPA Accounting Interpretations (AIN)
- FASB Statements of Financial Accounting Standards (FAS)
- FASB Interpretations (FIN)
- FASB Technical Bulletins (FTB)
- FASB Statements of Financial Accounting Concepts (CON)
- FASB Emerging Issues Task Force Abstracts (EITF) - the sections "Introduction" and "Full Text of EITF Abstracts (including Appendix D)"

5.1.2 Reformatting the Document Structure

The program first processed the original document collection in three steps: (1) the topics of each file in the original database (for example, ARB 49: Earnings per Share) were connected with their associated filenames (for example, fars-0008.txt) (see Appendix A - Section 5.1.2); (2) the periods (dots) that did not designate punctuation ending a sentence, such as in "Ch.3A", were removed (see Appendix A - Section 5.1.2); and (3) all words were converted into lowercase (see Appendix A - Section 5.1.2).

5.1.3 Changing the Short-Forms of the Words in the Abbreviation List into Term-Phrases

In this step, we transformed all abbreviations into the term-phrase format by creating an abbreviation-controlled list (see Appendix A - Section 5.1.3).

5.1.4 Converting Meaningful Words to Term-Phrases

For a specific domain thesaurus, Salton (1989) as well as Chen et al. (1992, 1994, 1995, 1997, 1998) recommended and implemented the technique of forming term-phrases by combining adjacent words. Garnsey (2002) also identified additional accounting phrases and added them into the accounting indexes. Without these phrases combined from multiple words, individual words and tokens cannot convey accurate meanings in accounting applications. Furthermore, the majority (82.5%) of the entries listed in the "Topical Index" (titles of each document) section of FARS consist of term-phrases, while only 17.5% of them are single terms; we believe the term-phrases carry more accurate meanings than the single terms. We also used two other external sources, Accounting Dictionary - Accounting Glossary - Accounting Terms
For instance, if the program detected single words appearing consecutively, such as "stock option and stock purchase plan," it then would link them automatically into one meaningful token, "stock_option_and_stock_purchase_plan." Similarly, the program would link the plural form, "stock option and stock purchase plans," from the text into one token, "stock_option_and_stock_purchase_plans." In our prior step we had already connected abbreviations into a single term-phrase format (for example, the abbreviation " A C R S " had been converted into "accelerated_cost_recovery_system"), however sometimes the text carried individual separate words of the abbreviations instead of their abbreviated forms. In the case of this particular situation and in order to cover all possibilities, our phrase-controlled list also included the probable singular and plural forms of these words, for example including both "accelerated cost recovery system" and "accelerated cost recovery systems." Then the program checked the text and converted the terms into "accelerated_cost_recovery_system" and "accelerated_cost_recovery_systems" respectively i f they found matching terms. (In the remainder of this paper, we use "terms" or "tokens" to mean either "single word terms" or "term-phrases.") Therefore, in this step,  -28-  the program connected adjacent words as single phrases and the tokenized files were then comprised of phrases and other single terms. 5.1.5  Removing Stop-Words  As noted above, Chen et al. (1997) generated a biology domain-specific stop-word list by removing non-significant terms in that domain. Similarly, to arrive at practical index terms in our research, we defined an accounting domain-specific list of stop-words by adding 184 additional stop-words that are irrelevant to the accounting domain into the existing S M A R T stop-words list, which already contained 5 4 4 stop-words. For instance, some adverbs that have little effect on accounting (such as "approximately" and "eventually") as well as some nouns that are too general to contribute to the accounting domain (such as "background", "conclusion", and "football") were added. Then we obtained a new comprehensive list of 728 stop-words (see Appendix A - Section 5.1.5, List 3), which was more tailored to the accounting domain. Consequently, the program simply used our domain specific stop-words list to scan the tokenized files exported from the last step carrying phrases, as well as many other single terms, to remove those significantly general and irrelevant words. 5.1.6  Producing a Full Token List  After the stop-words had been eliminated from the text, the program extracted the remaining unique tokens from the collection while incrementing the token's frequency each time it encountered the given token. Finally, the program produced 14,437 tokens with information on the number of frequencies as well as on the number of documents appearing in the text collection.  -29-  5.1.7  Removing Unwanted Tokens  The 14,437 tokens were then processed as follows. We sorted them first by the tokens' frequencies in the entire collection, second by the number of documents in which a given token appeared, and finally by alphabetical order. Following similar procedures as those presented by Garnsey (2002), where non-semantic words, adjectives, adverbs, verbs and non-accounting terms were eliminated, we manually checked the first 10,000 tokens to further eliminate the non-accounting related tokens. 
As well, very low frequency words were directly eliminated in both Garnsey (2002) and Chen et al. (1998). Consequently, the 10,001st to 14,437th tokens in our list, owing to their extremely low frequencies in the entire collection (each appeared only once), were treated as unwanted tokens and discarded immediately. The resulting vocabulary then included 2,052 wanted tokens (see Appendix A - Section 5.1.7, List 4). The program scanned the documents and kept only these wanted tokens in the text.

5.1.8 Consolidating Wanted Tokens

As discussed above in the literature review section, both Chen et al. (1995) and Garnsey (2002) omitted the stemming process to avoid creating noise and ungrammatical phrases, because in specific domains stemming can cause loss of information. Similarly, standard stemming was not performed in our research, because different stems sometimes refer to different concepts in the accounting domain. For example, "taxable" is different from "tax," inasmuch as "taxable income" means "the amount of income subject to income taxes" whereas "tax" means "fee charged (levied) by a government." Hence, in this step, we combined some terms into their common forms only when necessary. We manually checked the 2,052 wanted tokens and created a list consolidating 994 tokens (see Appendix A - Section 5.1.8, List 5) by converting every plural token to its singular form (for example, "yields" was converted to "yield") and by combining past and present tenses of a number of tokens into their most representative roots (such as "taxed" and "taxing," both converted to "tax"). On the other hand, we did not merge tokens when we believed that the different forms conveyed different meanings, such as "account," "accountant," and "accounting." Thus, the program processed the files exported from the prior step, which contained only wanted tokens, to further consolidate some tokens according to the list.

5.1.9 Generating the Final Reduced Token List

The program rescanned the documents to obtain new token frequencies and document counts, since the content of the files had changed after consolidation. We eventually obtained a final list of 1,344 unique tokens from the system (see Appendix A - Section 5.1.9, List 6), ordered by the tokens' descending frequencies within the entire collection. The resulting text contained 835 tokenized files.

We then used this text, comprising 1,344 different domain-specific meaning-bearing units, to represent the entire collection, with the tokens' original relative distances maintained. This domain-dependent text was then processed further, as will be explained in the following sections, to extract accounting concepts at different textual levels.

In summary, besides the generic process for automatic indexing, in the preprocessing stages of generating domain-dependent indexing we introduced significant domain knowledge, such as by creating the abbreviation list, the term-phrase controlled list and additional domain-specific stop-words, by manually picking out the wanted tokens, and by consolidating closely related terms instead of applying regular stemming. In the end, we obtained exported tokenized files with 1,344 distinct tokens located in their original positions.

5.2 Computing Term Affinities

In this section, we describe how to compute the fifteen schemes' probabilistic co-occurrences as their affinity values, based on our discussion in the Approach section above.
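Before turning to those steps, a minimal sketch may help make the counting concrete. It illustrates, for the SentenceNoRestriction case of Table 4.1, how the three estimates #<A, B>/#A, #<B, A>/#A and #<A, B>/#B can be derived from per-sentence counts; the data layout and names are our own assumptions rather than the actual program used in this study:

    # Simplified sketch of the sentence-level counting behind Table 4.1
    # (SentenceNoRestriction).  Each sentence is a list of token IDs;
    # a given pattern is counted at most once per sentence.

    from collections import defaultdict

    def sentence_counts(sentences, subject_tokens):
        """sentences: iterable of lists of token IDs; subject_tokens: set of IDs."""
        n_with = defaultdict(int)      # number of sentences carrying each token
        n_after = defaultdict(int)     # number of sentences with B somewhere AFTER A
        for sent in sentences:
            present = set(sent) & subject_tokens
            for tok in present:
                n_with[tok] += 1
            first = {t: min(i for i, x in enumerate(sent) if x == t) for t in present}
            last = {t: max(i for i, x in enumerate(sent) if x == t) for t in present}
            for a in present:
                for b in present:
                    if a != b and last[b] > first[a]:
                        n_after[(a, b)] += 1   # B appears after A at least once
        return n_with, n_after

    def affinities(n_with, n_after, a, b):
        """Return #<A,B>/#A, #<B,A>/#A and #<A,B>/#B for tokens a and b."""
        ab = n_after[(a, b)]
        ba = n_after[(b, a)]
        return ab / n_with[a], ba / n_with[a], ab / n_with[b]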
5.2.1 Removing Unwanted Punctuation

Before scanning the collection to compute affinity values, we needed to further process the 835 tokenized files produced in the last step, which still contained all of their initial punctuation. We wanted, however, to keep only the three punctuation marks that represent the end of a sentence or paragraph: ".", "!" and "?". These three punctuation marks are important indicators for calculating the number of sentences that include each token at the sentence level. Therefore, in this step, the program removed all punctuation except ".", "!" and "?".

5.2.2 Converting 1,344 Tokens to Token IDs

To facilitate further processing, the program assigned each of the final 1,344 tokens a unique token ID, in the same order as in the tokenizing stage (see Appendix B - Section 5.2.2, List 7). We then replaced all tokens in the tokenized files with their matching token IDs. The text now contained only tokens represented by these 1,344 different IDs and was ready for calculating token affinities.

5.2.3 600 Tokens' Affinity Values

5.2.3.1 Generating the 600 Token List for Clustering

We believed that the tokenized files from the last stage, containing 1,344 distinct tokens, were sufficient to represent the content of the original collection. It was therefore sufficient to study any two tokens' affinity within a structural context comprising only these 1,344 tokens. Using a similar approach, Gangolly and Wu (2000) restricted the index terms to 93 out of their previously acquired 983 terms because of the limitations of data visualization with a large index. To ensure visualization of the data, Garnsey (2001) also generated a final matrix of only 118 terms from an initial 676 terms. In order to concentrate our preliminary study on the relations of only those terms of interest, and also for purposes of visualization, from the 1,344 tokens we manually chose only 600 - the most common and representative terms in accounting according to our knowledge - to be the research subject terms investigated in the resulting clusters. When doing so, we were careful to retain words that might not have an accounting meaning in themselves (for example, "call") but which might appear in accounting terms (for example, in "call options"). These 600 token subjects were then used in the rest of this research to compute affinities and to perform clustering analysis.

Although the remaining 744 tokens' affinities were not studied to form clusters, the 600 selected tokens still kept their original locations in the 1,344-token structure - the context including all 1,344 tokens did not change. In other words, in order to trace any two tokens' relative distances in the context of 1,344 tokens, the program still kept the 1,344 token IDs for marking the tokens' positions in the tokenized files. The program then assigned another set of IDs to these 600 final token subjects (see Appendix B - Section 5.2.3.1, List 8), different from the 1,344-token structure IDs. The new set of 600 token IDs was ordered first by the tokens' descending frequencies in the whole collection, second by the descending number of documents the tokens appear in, and last by ascending alphabetical order.
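A minimal sketch of this ID assignment, with an assumed (term, frequency, document count) record layout, is:

    # Tokens are sorted by descending collection frequency, then by descending
    # document count, then alphabetically, and numbered from 1 upward.

    def assign_token_ids(tokens):
        """tokens: list of (term, frequency, n_documents) tuples."""
        ordered = sorted(tokens, key=lambda t: (-t[1], -t[2], t[0]))
        return {term: i + 1 for i, (term, _, _) in enumerate(ordered)}

    # Example using the counts shown in Table 5.2 below:
    sample = [("asset", 15878, 510), ("accounting", 13950, 797),
              ("cost", 10548, 434), ("loss", 6993, 389), ("tax", 6739, 261)]
    print(assign_token_ids(sample))
    # {'asset': 1, 'accounting': 2, 'cost': 3, 'loss': 4, 'tax': 5}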
We can see that since these 600 tokens were chosen from the 1,344 tokens, the tokens' frequencies and the numbers of documents in the 600 token list were the same as those in 1,344 token list, with only the token IDs being different. 5.2.3.2 S u m m a r y o f t h e Steps to O b t a i n the 6 0 0 T o k e n s  Table 5.1 summarizes the progress thus far in our experiment to obtain the resulting 600 tokens. Step  Description  Method  # Tokens in Results  1  Initial Preprocsssing  Computer  N/A  2  Forming Term-phrases: see Section 5.1.4; Removing Stop-words: see Section 5.1.5  Manual (/, 774 controlled-phrases: "accounts payable " -> "accounts_payable "; 728 domainspecific stop-words: "approximately", football") + Computer  14,437 tokens  3  Manual (manually checked the first 10,000 tokens to 2,052 wanted Removing Unwanted tokens Tokens: see Section 5.1.7 further eliminate non-accounting related tokens, discarded thel0,001th to 14,437th tokens that appeared only once) + Computer  4  Notes: No Stemming because different stems refer to 1,344 final Consolidating Wanted tokens Tokens: see Section 5.1.8 different concepts Manual (consolidated 994 tokens by: converting plurals to singulars: "yields" -> "yield"; combining past and present tenses to most representative roots: "taxed" and "taxing" -> "tax. This step kept other different forms with varied meanings: "account", "accountant", and "accounting") + Computer  5  Generating 600 Tokens: see Section 5.2.3.1  600 subject Manual (chose only 600 - the most representing terms tokens in accounting according to our knowledge to be the subject terms to form clusters. This step retained words which had no accounting meaning in themselves: "call", but might appear in accounting terms:" call options ") + Computer  Table 5.1: Summary of the Steps to Obtain the 600 Tokens  -34-  5.2.3.3 The Token IDs of Any Two Terms  To study affinities between any two terms A and B, in this study we always treated the token with the smaller number ID as token A , and the other with a larger ID number as token B. For example, consider two ID numbers "4" and "5." We treated "4" as token A , and "5" as token B. So #<A, B>/#A should be #<4, 5>/#4, #<B, A>/#A should be #<5, 4>/#4, and #<A, B>/#B should be #<4, 5>/#5. Since the program assigned IDs to tokens first by their descending frequencies in the whole collection, second by descending number of documents that the tokens appeared in, and last by ascending alphabetical order, the "#Frequencies" or "#Documents" (if the first column, "#Frequencies," was used constantly) of the smaller ID, token A , would thus be higher than those of the larger ID, token B (see an example in Table 5.2). ID  TOKEN  #FREQUENCIES  #DOCUMENTS  1  asset  15878  510  2  accounting  13950  797  3  cost  10548  434  4  loss  6993  389 261  5  6739 tax Table 5.2: Example o f Comparing T w o Token IDs  5.2.3.4 A n Example of a Dummy File to Illustrate Affinity Calculations  Before we implemented the program to calculate the real affinity values, we created three dummy files using five tokens, ranging from number 1 to number 5, as examples for the three levels: within a sentence, within a paragraph and within a document. Note that the numbers 1 to 5 in the dummy files only represented the token's name and that we randomly located them in the texts; therefore, these numbers (1 through 5) were not assigned in any order. Thus, for example, token l ' s frequency may or may not be greater than the other tokens' frequencies in the dummy files. 
In each dummy file, we manually  -35 -  computed statistical affinity values and then compared each figure with outputs from the program so that we could make sure we got the correct results from the program when it processed the larger volume of real documents. Below is the dummy file for the ParagraphNoRestriction level to illustrate how we calculated the three probabilistic affinity patterns for any two tokens that were our research subjects: <Document 1>: 1 24 5 3 1. 3 1 42! 4 1.2 4. . 5 2 1 1. 3 2 5. ! 234 3. .7 13124 3. 2 5 4 3 4 3 2 4! 3 1 5 <Document 2>: 2 4 1.3 2 3 4 5! 1 32. ? 43 5 1? ! 2 3. 2. <Document 3>: 42 14 1.2 5 3. ?4. 1 2 5! 3? 3 1 ! 4 2! . ? 3 1 .? <Document 4>: 3. ! . 1 4. ?23 1.  .44  We only studied 600 tokens' affinities our of the 1,344 various token context; in this preliminary test, we only selected only three tokens (tokens "1," "3" and "5") as our subjects out of the total five token context. In other words, these three tokens were still in a fixed context comprised of a total of five tokens, so their co-occurrences (see Table 5.3) and statistical affinity values (see Table 5.4) would not change: Two Tokens Whole paragraph Two Tokens Whole paragraph Two Tokens Whole paragraph  l->2  l->3  l->4  l->5  3->4  3->5 2  4  7  3->l  3->2  4  5->l 2  5->2  5->3  5->4  4 Table 5.3: Dummy File - Two Tokens' Co-occurrences in ParagraphNoRestriction  -36-  Token 1 and Token 3:  Probabilistic Affinity Pattern Probabilistic Affinity Value  #<A, B>/#A  #<B, A>/#A  #<A, B>/#B  l->3/#l  3->l/#l  l->3/#3  0.7 (7/10)  0.4 (4/10)  0.7 (7/10)  l->5/#l  5->l/#l  l->5#5  0.4 (4/10)  0.2 (2/10)  0.5714286 (4/7)  3->5/#3  5->3/#3  3->5/#5  Token 1 and Token 5:  Probabilistic Affinity Pattern Probabilistic Affinity Value Token 3 and Token 5:  Probabilistic Affinity Pattern Probabilistic Affinity Value  0.4 0.2857143 0.2 (2/7) (4/10) (2/10) Table 5.4: Dummy File - Two Tokens' Statistical Affinity Values in ParagraphNoRestriction  ParagraphNoRestriction #<A, B>/#A: # of paragraphs having B A F T E R A but across different sentences with no sentence restriction / # of paragraphs carrying A #<B. A>/#A: # of paragraphs having B B E F O R E A but across different sentences with no sentence restriction / # of paragraphs carrying A #<A,  B>/#B:  # of paragraphs having A B E F O R E B but across different sentences with no sentence  restriction / # of paragraphs carrying B  5.2.3.5 The Program's Affinity Calculation of Fifteen Schemes  As illustrated in Table 4.1, because the sentence level analysis required information about the number of sentences including each token, while the ParagraphNoRestriction condition also required the number of paragraphs that each token appeared in, the program counted them and added these numbers with two more columns added to the 600 tokens list. Then the program processed all of the tokenized files, and, for each of the fifteen schemes, the program computed each pair of terms' statistical affinity values for all 600 token subjects.  -37-  5.3  Clustering 600 Tokens  Grefenstette (1993) declared that a domain specific thesaurus presents "a hierarchical view of the important concepts in the domain, as well as suggesting alternative words and phrases that may be used to describe the same concept in the domain." The clustering technique we adopted in this study was the hierarchical clustering produced by Matlab, which links together pairs of terms that are in close proximity. 
The "Tutorial - Cluster Analysis" section of Statistics Toolbox for Use with Matlab User's Guide (Version 3, 1-54) (http://www, busim. ee. boun. edu. tr/-resources/statsJb.pdf) defines and explains the Matlab linkage function, which uses "the distance information to determine the proximity of objects to each other. As objects are paired into binary clusters, the newly formed clusters are grouped into larger clusters until a hierarchical tree is formed." We then used the "linkage function" in Matlab to form our clusters. 5.3.1  Hierarchical Clustering Using Matlab  As noted above, the linkage function of Matlab identifies the distance values, and links pairs of objects that are close together into binary clusters. It then creates larger clusters by linking those newly formed clusters to other clusters until all the objects are linked together in a hierarchical tree (for example, see Appendix B - Section 5.3.1). 5.3.2  Affinity Values Fit into Matlab  Matlab first groups the objects that have the closest proximity to each other. In other words, the first two terms that are linked together by Matlab always have the smallest relative distance from each other in the whole data set. As discussed above, we assume that the higher the statistical affinity value is, the closer the terms' affinities are and the  -38-  smaller the distance between them should be. The most closely related terms (with the highest affinity values) should be grouped first. Hence, in order to fit the affinity values into the Matlab linkage function which links minimum distance first, we computed the distance value of two terms using the formula (1 - Statistical Affinity Value). 5.3.3  Clustering Outputs  The program processed the entire collection and used the Matlab linkage function to cluster the 600 tokens for each of the fifteen schemes. Table 5.5 shows the top ten clustering data outputs, which had been replaced by token names in the scheme of SentenceOtdAB_A (see Appendix C for the top 50 clustering outputs for each of the fifteen schemes): Cluster Index  Object in First Group  Object in Second Group  Distance  601  261 ;stock_dividend  346;stock_split  0.62791  602  392;accounts_payable  429;accrued_expense  0.74286  603  102;short  139;duration  0.8034  604  114;taxable  122;deductible  0.80942  605  41;derivative  46;financial_instrument  0.82476  606  601; stock_dividend;  269;split  0.84884  607  10;future  27;cash_flow  0.85654  608  5 5 3 ;unco 1 lectib le_account  595;unearned_discount  0.85714  609  170;direct_cost  209;direct_financing_lease  0.8599  610  38;receivable  70;payable  0.87065  stock_split  Table 5.5: Top Ten Clustering Data of SentenceOtdABA  As noted above, Matlab grouped together the first set of related terms with the least distance between them to form the first cluster. Because there were a total of 600 terms in the data set, Matlab assigned a cluster index number starting with 601 (600+1) to this new cluster. Note the example shown in Table 5.4: Sentence0tdAB_A first grouped two terms with token IDs of 261 and 346 and used the number 601 to represent this newly formed cluster. We can also see that token 261 and token 346 in cluster 601 had the closet  -39-  proximity (distance value = 0.62791) in the entire 600 token data set of SentenceOtdAB_A., Matlab continuously formed other related terms into new clusters in ascending order of their distance value. Let us examine cluster 606, which grouped clusters 601 and 269. 
As Matlab guide illustrated, here token 269 formed a larger cluster with cluster 601, which had already linked tokens 261 and 346. That is, in this step, these three terms of tokens 261, 346 and 269 were grouped together into cluster 606.  5.4  Evaluation  To justify our methodology and to assess the clustering performances of all fifteen schemes, we invited an accounting expert to evaluate the clustering outputs. The expert used her domain knowledge of the accounting industry and an external online website (http://www, investorwords. com) which listed thousands of definitions of current authoritative accounting terms that the expert could consult to evaluate automatic concepts from all fifteen schemes. In addition, in order to borrow another independent accounting thesaurus to further check the accounting concepts produced by us, we also compared the expert's evaluation results with the Price Waterhouse Thesaurus (1974). 5.4.1  C l u s t e r i n g T e r m s to B e E v a l u a t e d  We know that the distance between objects in the cluster becomes greater as the newly formed cluster number gets larger. That is, in the clustering data tables, only the top clusters in each scheme can group closely related accounting concepts with small distances between them to each other. Regarding the special scheme DocumentNoRestrictionAB_B, only the top 25 clusters needed to be evaluated. Because in this scheme each larger cluster was formed by continuously adding one more term to  -40-  the previous newly-formed clusters, the process was different from the clustering outputs used in the other fourteen schemes. Furthermore, the hierarchical tree grew unmanageably high after only a few steps due to noise, namely distances affected by random occurrences; therefore, we believe the top 25 clusters would be good enough to represent this scheme. For the remaining 14 schemes, only the first top 50 clusters in each scheme needed to be evaluated. In total, there were 725 clusters of fifteen schemes to be evaluated. Let us examine the example of the first top three clusters to be evaluated for the scheme SentenecUpTo5tdAB_A (Table 5.6), where the object number in each cluster had been automatically replaced with term names by the program.  Cluster No.  Terms in the Cluster Separated by ";" Total # of terms Terms in the first group  Terms in the second group  1 (601}  stockdividend  stocksplit (346)  2  2 mi}  accountspayable (392)  accruedexpense (429)  2  3  mm  (26])  stockdividend; stocksplit (601)  split (269)  3  Table 5.6: Sample Evaluation Terms - Top Three Clusters of SentenecUpTo5tdAB_A Note: The number, for example, (601). means the original cluster index; the number beside the term, such as (261) means the original token ID.  5.4.2  Expert's Evaluation  Since this was only a preliminary study, in this stage we invited only one accounting expert to evaluate the clustering outputs. Furthermore, the final results of this research were based not only on the expert's evaluation, but also on the analysis of the clustering data itself and the Price Waterhouse Thesaurus (1974). Therefore, a single expert was adquete to evaluate the outputs of the clusters produced by each of the fifteen schemes in  -41 -  order to see what portion of every 25 or 50 clusters each scheme automatically groups together relevant terms so that meaningful accounting concepts can be derived. If identical clusters appeared in the results of two different schemes, the expert could cut and paste the evaluations. 
B y comparing the statistical evaluation results using Excel worksheets, we can summarize the similarities and differences among the fifteen schemes, and be able to recognize the optimal schemes. 5.4.2.1 I n s t r u c t i o n s f o r E x p e r t ' s E v a l u a t i o n  The expert was provided with detailed instructions (see Appendix D - Section 5.4.2.1) to select the relation types among the terms in each cluster, to provide explanations about her choices, and to score the relevance ranging from 1 to 5 by how closely related the terms in each cluster were. As we introduced earlier in the Literature Review, that thesauri normally contain semantic relationship as Narrow Term (NT), Broad Term (BT), and Related Term (RT). By further classifying more potential relationship types in accounting, we created the following relation type alternatives that the expert could choose from. We also gave her space to explain her choices (Table 5.7):  -42-  11. Explanation  I. Relation Type Alternatives  No Distinct, a new direct Broader Partial j Other term Subgroup concept relation ! relation relation Syn Ant d) e) c) b) 0 g) !  Table 5.7: Sample Evaluation Relation Type Alternatives  With relation types a) "Synonyms" (same meaning); b) "Antonyms" (opposite meaning); c) "Broader term" (any term's meaning broader than all the others); d) "Subgroup" (if the terms are all related, are they both/all subgroups of another broader concept?); e) "Distinct but forming a new concept" (if they are all distinct, can they together form a new concept?); f) "Partial relation" (if there are more than two terms in the cluster, are only some of them related?); g) "Other relation" (if relation type is not listed previously); h) "No direct relation" (none of the terms is directly related).  5.4.2.2 Sample Evaluation Results  The expert then assessed each cluster of every scheme in order of cluster number (from cluster 1 to cluster 25 or to cluster 50) to choose relation type alternatives, provide explanations, and give the relevance score on the terms' relations grouped by that cluster. Table 5.8 shows an example of the expert's evaluation results regarding the top three clusters from the scheme SentenecUpTo5tdAB_A.  -43 -  Terms in the Cluster Separated tt.it  by  Relation  Total #of terms  Cluster No. Terms in the first Terms in the second group group 1  stockdividend  2  stocksplit  Type  II. Explanation  III. Relevanc e Score (1-5)  a)~h) e)  The new concept is "stock". "stock_dividend" is distribution of retained earnings while  5  "stock_split" is increasing the number of shares outstanding  2  accountspayable  3  stockdividend; stocksplit  accruedexpense  split  2  3  d)  f)  Broader concept is "liability" because "accounts_payabl e" belongs to "liability". "accrued_expense " is also "accruedjiability " which is also a type of "liability"  5  "split" is her "stock_split" in the accounting sense and, which means the increasing shares of outstanding, "split" is not related to "stock_dividend" because "stockdividend" is the distribution of retained earnings to shareholders  3  Table 5.8: Sample Expert's Evaluation Results - Top Three Clusters of SentenceUpTo5tdAB_A With relation types a) "Synonyms"; b) "Antonyms"; c) "Broader term"; d) "Subgroup"; e) "Distinct but forming a new concept"; f) "Partial relation"; g) "Other relation"; h) "No direct relation".  
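The next subsection aggregates these per-cluster judgments into per-scheme statistics; a minimal sketch of that aggregation, assuming each evaluated cluster is stored as a set of chosen relation types plus a relevance score, is:

    # Share of evaluated clusters carrying each relation type a)-h),
    # and the mean relevance score for one scheme.  The record layout is assumed.

    from collections import Counter

    def summarize_scheme(evaluations):
        """evaluations: list of (relation_types, score), one entry per cluster,
        e.g. [({'e'}, 5), ({'d'}, 5), ({'f'}, 3)] for the three clusters in Table 5.8."""
        n = len(evaluations)
        counts = Counter()
        for types, _ in evaluations:
            counts.update(types)
        percentages = {t: counts[t] / n for t in "abcdefgh"}
        mean_score = sum(score for _, score in evaluations) / n
        return percentages, mean_score

    pct, mean = summarize_scheme([({'e'}, 5), ({'d'}, 5), ({'f'}, 3)])
    print(pct['d'], round(mean, 2))   # one cluster of three has type d); mean score 4.33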
-44-  5.4.2.3 Statistical Analysis of Expert's Evaluation Results for A l l Fifteen Schemes  After the expert finished her evaluation, for each relation type column in every scheme we totalled the number of " l s " in all 25 or 50 clusters (how many clusters the expert believed had this relation type). The statistical percentage of each relation type was acquired by using the totals divided by the number of clusters (25 or 50) that had been evaluated. For example, in a specific scheme, the expert chose two clusters having relation type "a" in all 50 clusters; thus, the statistical percentage figure was 4% for type "a". Meanwhile, we totalled all of the relevance scores divided by the number of clusters (25 or 50) that had been evaluated to obtain the mean relevance score for each scheme. The expert then evaluated every scheme individually for all fifteen schemes and filled out the answers in the evaluation table. Then we calculated statistics for each of the fifteen schemes (see Appendix D - Section 5.4.2.3). The statistical analysis of the expert's evaluation data for all fifteen schemes is presented in Table 5.9.  -45-  I. Statistical Analysis of the Expert's Evaluations of Relation Types a)~h)  (%)  AH 15 Schemes  a)  b)  O  EvaluatcSentenccOtclAB A  0%  4%  46%  EvaluateSentenceOtdBA  A  0%  :  KvaluateSentenceOtdAB_B  0%  >  EvaluhicSetttencelJpTo5hlBA  4%  4  1  g)  42% '  6%  2% .  42%  2\,  0%  44%  28%  i  -36%  42%  2%  2 % ' ' 34°o  54%'  2%  - EvaluateSentenceUpTo5ldAB A ; „ -.  e)  d)  0%  III. Mean Score (1-5)  h)  iiflm  4.58  2%  4.40 •  0%  :  4.06  8% ,'  0%  2%  2%  2%,  2%  ()•'<•  0%  20%  ' 0% .  -4.46 4.52  I'.valuutcScntenccUp l Y o t d A B _ B  o%  ()%•  4<>%  12%  4%  36%  EvaluateSentenceNoRestriction A B A  0%  10%  36%  44%  2%  6%  EvaluateSentenceNoRestriction  4%  2%  3ir>,  5VX.  •2%  •2%  '2%  EvaiualeSentenccNoRcstrictionAB_B  ()'\,  ()"..  18°..  8".,  6%  -14" o  ()".,  4%  3.94  EvaluateParagraphNoRestrictionAB_A  4%  4%  50%  34%  4%  2%  0%  2%  4.22  EvaluatePdragraphNoRestriction  4%  4%  42%  42%  2%  2%  0%  4%  4.30  EvaluateParagraphNoRestrictionAB_B  0%  0%  40%  14%  0%  46%  0%  0%  3.58  EvaluateDocumentNoRostrictioiiAB A  0%  4%  38%  22%  0%  26" o  0%  10%  3 60  EvaluateDocumentNoKeslriciii'iiB \  \  2%  0°-o  58%  18%  0%  12%  0%  10°n  3.82  l.vdluateDocumentNoRestiictionAB  B  0%  0%  20%  0%  4%  76"-..  0%  0%  3 08  BA A  BA_A  0% '  •I.0J 2%  4.52  Table 5.9: Statistical A n a l y s i s o f the Expert's Evaluation Results - A l l Fifteen Schemes W i t h relation types a) "Synonyms"; b) " A n t o n y m s " ; c) "Broader term"; d) "Subgroup"; e) "Distinct but forming a new concept"; f) "Partial relation"; g) "Other relation"; h) " N o direct relation".  5.4.3  Comparing the Expert's Evaluation Results With Price Waterhouse  Accounting Thesaurus The only existing accounting thesaurus we have seen so far is the one created by Price Waterhouse & Co. in 1974, which Hill, the National Director of Accounting and Auditing Services, called "the first comprehensive Thesaurus of accounting and auditing terminology." This thesaurus includes 3,741 main term entries and uses the unabbreviated  -46-  4.46  form of a term rather than its acronym or abbreviation. 
It disposes of synonymous or closely related terms using the following abbreviations: U = Use  (as cross reference)  UF = Used For  (as cross reference)  BT = Broader Term NT = Narrower Term RT = Related Term  Here is an example of terms classified into the same concept shown in their hierarchical positions: Accounts payable UF Advances payable BT Trade accounts payable RT Assumed liabilities Confirmation Creditors Cutoff tests Freight bill payment services Search for unrecorded liabilities Voucher system  5.4.3.1 Sample Comparison After the expert finished her evaluations, we checked all 725 clusters against the Price Waterhouse Thesaurus to count those similar relations assessed by these two methods. For every scheme, we compared the relations of each cluster or concept evaluated by the expert with the corresponding concept listed in the Price Waterhouse Thesaurus. If there were some matching relations in the Thesaurus, we then marked number of "Is" for those matching concepts in order to calculate the matches for that scheme. Meanwhile, for  -47-  purposes of clarity we also explained how the Price Waterhouse Thesaurus interpreted similar relations of the concept. Table 5.10 shows the comparison example of the top three clusters from the scheme SentenecUpTo5tdAB_A: Comparison with the W a t e r h o u s e Thesaurus  T e r m s i n the C l u s t e r S e p a r a t e d by " ; "  Cluster No.  Total #of terms  Price  Relation Type a)~h)  Put"1" here i f y o u can find a match, otherwise do n o t h i n g  If m a t c h i n g , then identify the relationship d e s c r i b e d in the Price Waterhouse Thesaurus they are Related Terms  T e r m s i n the first group  T e r m s i n the second g r o u p  1  stock_dividend  stocksplit  2  e)  1  2  accountspayable  accruedexpense  2  d)  I--.:  "accrued expenses" Uses "accrued liability'' and "liability" is the Broader Term of " accounts_payable" and "accrued_liability"  3  stockdividend; stocksplit  3  split  0  Table 5.10: Sample Comparison of Expert's Evaluation Results with the Price Waterhouse Thesaurus - Top Three Clusters of SentenecUpTo5tdAB_A With relation types a) "Synonyms"; b) "Antonyms"; c) "Broader term"; d) "Subgroup"; e) "Distinct but forming a new concept"; f) "Partial relation"; g) "Other relation"; h) "No direct relation".  5.4.3.2 Statistical Comparisons of All Fifteen Schemes  For each scheme we compared all relations evaluated by the expert with the Price Waterhouse Thesaurus: again, we obtained the statistical matching percentage values (total number of matching clusters, compared against either 25 or 50 clusters in that scheme). See Table 5.11 for statistical comparison results for all fifteen schemes.  -48-  % Clusters Containing Relations Matching with Relations in the Price Waterhouse Thesaurus  AB B  AB A  BA A  EvaluateSentenceOtd  36%  36%  ':  20%  EvalualeSentenceUpToStd  36%  38%  *  14%  EvaluateSentenceNoRestriction  42%  38%  ;CI6%.».  EvaluateParagraphNoRestriction  20%  40%  14".,  *'!  1 (>•'.. 32% . • 28% EvaluateDocumentNoRestriction Table 5.11: Comparing Fifteen Schemes of the Expert's Evaluation Results with the Price Waterhouse Thesaurus  We are aware that accounting terminology has been changed and updated in the 30 years since the Price Waterhouse Thesaurus was published. 
Although the Thesaurus and our study adopted very different sources of documents to construct the vocabulary collection, from the above data we still can find that the matching between these two methods is visible (between 14% and 42% accuracy). Therefore, we can see that our technique actually identified some valid accounting concepts that can be verified by the Price Waterhouse Thesaurus, which as noted, remains an authoritative accounting thesaurus.  5.5  Discussion  By analyzing the clustering data itself, incorporating the statistical analysis of the expert's evaluation results and comparing them with the Price Waterhouse Thesaurus for all fifteen schemes, some interesting findings as well as the answers to our previously proposed research questions emerge:  -49-  5.5.1  Can Relevant Concepts be Extracted Automatically from a Set of Documents in a Given Domain? ( Q l . l ) What Type of Semantic Relations Can Be Identified in the Extracted Concepts? (Q1.2)  From the statistical analysis of the expert's evaluation results (Table 5.9), we noticed the maximum percentage of clusters conveying no direct related concepts was only 10%, which was at the EvaluateDocumentNoRestriction level. The maximal percentage of unrelated terms in a cluster in any other scheme was only 4% (see Table 5.12). Statistical % of "No Direct Relation" Relation Type "h)", all 15 Schemes  AB A  BA A  AB B  EvaluateSentenceOtd  h) 0%  h) 2% ..  11)2",.  EvalualeSeiitencellpToStd  h) 2%  h)0%  hi 2'!-,,  EvaluateSentenceNoRestriction  h) 2%  h) 0%  h)4%  Ii) 2%  h) 4%  h) ()"..,  E v a 1 u a tePa r aar a p h N o Res t r i ct i o n  .  hi ()".. h) 10% h) 10% EvaluateDocumentNoRestriction Table 5.12: Statistical Results of "No direct relation" Type for A l l Fifteen Schemes Using relation type h) "No direct relation".  Moreover, ten out of the total fifteen schemes had mean relevance scores greater than 4, indicating that two-thirds of the schemes grouped at least "somewhat related" concepts. The remaining five schemes' mean relevance scores were greater than 3 but less than 4 (see Table 5.13). This indicated neither of the remaining third of the schemes grouped "unrelated" clusters (for details regarding how the ranking was defined, see Appendix D Section 5.4.2.1).  -50-  Average Relevance Score in Each Scheme  AB A  BA A  \ B It  EvaluateSentenceOtd  4.58  4.40  *(«)  EvaluateSentencelipToStd  4.46  4.52  •'.AY--  EvaluateSentenceNoRestriction  4.46  4.52 4.30  EvaluateParaeraphNoRestriction  3 58  3.82 3 60 3.08 EvaluateDocumentNoRestriction Table 5.13: Statistical Results of Average Relevance Scores for All Fifteen Schemes  The above analysis proved that all fifteen schemes extracted compound concepts containing related accounting terms, which had already been confirmed by comparing our method with the Price Waterhouse Thesaurus. We hence can declare: our methodology is useful for identifying accounting compound concepts with meaningful relations. As well, we answered Q 1.1.  We further analyzed what were the top two most frequently chosen relation types included in the clusters and extracted these figures in Table 5.14. We can see the largest portion of relation type in every scheme was "Broader term" (one term is broader term than the other), coinciding with the "Narrow Term (NT)" and "Broad Term (BT)" relationships defined by prevailing thesauri discussed in the literature. 
Another of the most frequently chosen relation types was "Subgroup" (these are different from "Broader term," but they are subgroups of other broader concepts), which also coincided with the "Related Terms (RT)" relationship identified in most existing thesauri. Because our hierarchical clustering continually added new terms into the previously formed clusters, only some of the terms were related (relation type "Partial relation") when many terms were included in the cluster. This answered Q1.2.  -51 -  The Top Two Most Frequently Chosen Relation Types in Each Scheme  A B A -../-..  EvaluateSentenceOtd  c) 46%: d) 42%  EvaluateSentenceUpToStd  c) 36%; d) 42%  EvaluateSentenceNoRestriction  c) 36%: d) 44%  EvaluateParagraphNoRestrictinn  BAA  AB B  c)-52%; d) 42%  c) 1 l%:dP.8%  c)34%;d)54%  c)4d%: i") 3(>%  c)'30%; d)58%  O ~8%: l'i44%  c) 50%; d) 34% , , c) 42%; d) 42%  ci 10'... 1") 46' »  ; =  ±  c)38%;f)26% c)58%; d) 18% c) 20%Yt)'76% EvaluateDocumentNoRestriction Table 5.14: Statistical Results of the Top Two Most Frequently Chosen Relation Types in All Fifteen Schemes Relation types c) "Broader term"; d) "Subgroup"; and f) "Partial relation" 5.5.2  W h a t P a r a m e t e r s C a n Affect the Q u a l i t y o f the R e s u l t s ? ( Q 2 )  5.5.2.1 P r o x i m i t y - S e n t e n c e , P a r a g r a p h A n d D o c u m e n t L e v e l s ( Q 2 . 1 )  By examining differences in the statistical expert's evaluation results at three levels within a sentence, within a paragraph, and within a document - we detected that all three schemes belonging within the sentence level extracted the most related concepts, followed by the paragraph level, with the document level being the least effective.  Our results revealed the statistical average relevance scores as assessed by the expert for three schemes EvaluateSentenceOtdAB_A (4.46 ~ 4.68) > EvaluateParagraphNoRestrictionAB_A (4.22) > EvaluateDocumentNoRestrictionAB_A (3.60) (see Table 5.13). Likewise, the same findings were detected from the corresponding schemes of B A _ A and A B _ B . Furthermore, the relevance scores of all three schemes in EvaluateDocumentNoRestriction were all less than 4.00, which also indicated that the document level was the least effective textual level for producing valid concepts.  From the average relevance scores in Table 5.13 and the relation types in Table 5.14, we also found that though different scopes of sentence, paragraph and document level  -52-  extracted varied accounting concepts, the sentence level was a little closer to the paragraph level than to the document level. 5.5.2.2 Distance Within the Sentence Level (Q2.2)  We compared clustering data and counted the percentage of identical clusters in the same scheme in EvaluateSentenceOtd and in EvaluateSentenceUpTo5td with in the scheme EvaluateSentenceNoRestriction respectively (Table 5.15). This reveals that EvaluateSentenceUpTo5td and EvaluateSentenceNoRestriction generated similar results in clustering accounting terms (88% and 92% identical clusters respectively). The low level of identical clusters percentage (50%) was caused by the asymmetry factor that will be discussed later on. On the other hand, the percentages of identical clusters in EvaluateSentenceUpTo5td were all much higher than those in EvaluateSentenceOtd (88% > 42%, 92% > 42%, and 50% > 24%). This indicated that compared with EvaluateSentenceOtd, EvaluateSentenceUpTo5td level was much closer to EvaluateSentenceNoRestriction.  
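A minimal sketch of how such an identical-cluster percentage can be computed - treating each cluster as the set of terms it groups, which is our own reading of the comparison - is:

    def identical_cluster_pct(scheme_x, scheme_y):
        """scheme_x, scheme_y: lists of clusters, each cluster a set of terms."""
        y_clusters = {frozenset(c) for c in scheme_y}
        matches = sum(1 for c in scheme_x if frozenset(c) in y_clusters)
        return matches / len(scheme_x)

    # Toy example: two of the three clusters coincide, giving 67%.
    otd = [{"stock_dividend", "stock_split"},
           {"accounts_payable", "accrued_expense"},
           {"short", "duration"}]
    no_restriction = [{"stock_dividend", "stock_split"},
                      {"accounts_payable", "accrued_expense"},
                      {"receivable", "payable"}]
    print(round(identical_cluster_pct(otd, no_restriction), 2))   # 0.67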
% of Identical Clusters Compared with Clusters in EvaluateSentenceNoRestriction EvaluateSentenceOtd A B A  SentenceNoRestriciton SentenceNoRestriciton SentenceNoRestriciton AB A BA A AB B .  42%  N/A  EvaluateSentenceOtd'BA A  N/A  42%  EvaluateSeiileneeOtdAB B  N/A  N/A  EvaluateSentencel)pToStdAB A  88%  N/A  E v a 1 u a leS en ie n ce U pToStd B A A  N/A  92%  N/A  EvaiuateSentencelipToStdAB B  N/A  N/A  50%  24"-,,  Table 5.15: Identical Clusters Percentages by Comparing the Same Scheme in SentenceOtd and SentenceUpTo5td with SentenceNoRestriction Respectively  However, the differences between clustering data in EvaluateSentenceOtd and those in EvaluateSentenceNoRestriction were not significant either, because the identical  -53 -  percentages were not low. This can be further supported by examining the statistical analysis of the expert's evaluation results between EvaluateSentenceOtd with EvaluateSentenceUpTo5td and with EvaluateSentenceNoRestriction. In Table 5.16 we can see that the selected relation types and mean relevance scores marked by the expert conveyed some variances, but these were not significant. Therefore, it is not necessary to further modify the relative location of the terms in relation to each other within the same sentence level to achieve better concepts. I. Statistical Analysis of the Expert's Evaluations of Relation Types a)~h)  III.  Score (1-5)  (%)  All 15 Schemes EvaluateSentenceOtd A B A  a)  b)  0%  4%  EvaluateSentenceNoRestriction A B A  0%  Ev(iluateSi'.ntenceOtdBA_A EvaluatcSenknceUpToStdBA  d)  «)  0  g)  h)  .42%  6%  2%  0%  0%  4.58  - 36%  42%  2%  • 8%  0%  ?%  4.46  -10%  ' 36% •  44%  .2%  6%,  . .0%  2%  4.46.:  2%  52%  8%  FvaliiatcScnlcnceUpToStdAB A  A  EvaluateSentenceNoRestriction BAA  •  1  EvaluateSenienceNoRestrictionAB _M  46% •  42% •  0% -  3-1%  1 V.ll.i.'lcS. IICIILCOIU \\\ U EvaluateSentenceUpro5tdAB_B  0  - •  •  4%.  2%  30%~  5S%  (  (  44%  28% .  (  0%  46%  12%)  0%  0%  38%  8%  2%  0%  '  .:  2%  0%  2%  0%  4.52  2%  4.06  18  0%  •  36%  0%  6"..  42%  (1%  4%  Table 5.16: Comparing the Statistical Analysis of the Expert's Evaluation Results of the Same Scheme in EaluateSentenceOtd with EaluateSentenceUpTo5td, and with EvaluateSentenceNoRestriction With relation types a) "Synonyms"; b) "Antonyms"; c) "Broader term"; d) "Subgroup"; e) "Distinct but forming a new concept"; f) "Partial relation"; g) "Other relation"; h) "No direct relation".  Similarly, when we contrasted the statistical analysis of the expert's evaluation results in EvaluateSentenceUpTo5td with those in EvaluateSentenceNoRestriction, we found very  -54-  t:40  similar outcomes for both the relation types and the relevance scores marked by the expert (see Table 5.17). This agreed with the findings by prior studies that performances of restricting windows to at most five words lead to almost identical results as examinations of entire sentences. I. Statistical Analysis of the Expert's Evaluations of  HI. Score (1-5)  Relation Types a)~h) (%)  AH 15 Schemes  b)  c)  d)  e)  0  R)  h)  EvaluatoSentenceUpTo5ldAB A  2%  . 8%  36%  42% .  2% •  • 8%  0%  2%  EvaluateSentenceNoRestrictionABA  ()"..  It)".,  .36%  44% •  2%  6%  0%  2%  2%  - 34%  54%  .  '  .2",,  30%  58%  :%  • 0%  46%  12%  4%  <)%  38%  8%  6%  LvafuateSenrenceL'pTo5(JBA EvuluuU'Si'nlcn<.i'\iiFit'\lrKiiun  a)  A•• B 1 1  4%  EvaluateScntenccUpToMdAB B  (  EvaluateSentenceNoRestrictionAB_B  (»"o  1  '  4.46 '  '  4.52  2"..  
0%  36%* •  0% -  2%  42%  0%  4%  Table 5.17: Comparing the Statistical Analysis of the Expert's Evaluation Results of the Same Scheme in EaluateSentenceUpTo5td with EvaluateSentenceNoRestriction With relation types a) "Synonyms"; b) "Antonyms"; c) "Broader term"; d) "Subgroup"; e) "Distinct but forming a new concept"; f) "Partial relation"; g) "Other relation"; h) "No direct relation".  5.5.2.3 Directionality A n d Asymmetry (Q2.3) Directionality - When we only compared the statistical expert's evaluation results for the same order ( A B _ A or B A _ A ) in each scheme, differences in the relation types and the relevance scores evaluated by the expert were small (see Table 5.18 below, as well as Table 5.13 above). We thus learned that given the same term, the directionality of two terms' co-occurrences has no significant impact on the performance of grouping-related accounting concepts.  -55-  4.46  if 4.02  I. Statistical Analysis of the Expert's Evaluations of Relation Types a)~h)  (%)  Compare Directionality  a)  c)  d)  e)  0  EvaluateSentenceOtd A B_ A  0%  4%  •w>%  42%  6%  2%  EvaluateSeiuencdtlldBAA  0%  2%  J2%  42%  2%  0"y„  III. Score (1-5)  h) 6%  0%  4.58 ~  2%  4.40  t  L.valuaicScntcnccl!pTo5ld.-\B A  2%  8%  36%  42%  2%  8%  EvaluateSentenceUpToStdBA A  4%  2%  34%  5-/%  2%  2%  M%  o%  I0"o  36%  44%  '2%", :  6%  ; 2%  4.46  EvaluateSentenceNoRestriction BA_A  4%  2%  30%  .55%  2% .  2%>  2%  0%,  4.52  EvaluateParagraphNoRestrictionAB A  4%  4%  48%  34%  4%  2%  0%  2%  4.22  EvaluateParagraphNoRestriction BA_A  4%  42%  42%  2%  2%  0%  4%>  4.30  '•38%  22°,.  0%  26%,  0%  10%,  3 60  58%  18%  0% '  12%.  ()" n  10%  3 82  EvaluateSentenceNoRestrictionAB  A  rvdluiilcDuuimcnlNoKcsti ictionAB  v  \  LwludteDocumentNoRosti icuonBA A  ....  llBllllfBill o% .4% 2'!'o  >i - • > i  0% |  0%,  4.46  , ' of the Table 5.18: Directionality - Comparing the Statistical Analysis of ••»' the Expert's Evaluation m Results 1  Scheme A B _ A at Five Levels And the Scheme B A _ A at Five Levels With relation types a) "Synonyms"; b) "Antonyms"; c) "Broader term"; d) "Subgroup"; e) "Distinct but forming a new concept"; f) "Partial relation"; g) "Other relation"; h) "No direct relation".  Asymmetry - Table 5.19 demonstrated, however, that different schemes of using either of any two given terms to predict their affinity actually extracted very dissimilar concepts (see relation types chosen by the expert) accordingly. Besides, the statistical relevance scores of relations grouped by the schemes of given term A (AB_A) was always much higher than given term B ( A B B ) (see Table 5.13 above). As noted above, the tokens were arranged (by ID) according to decreasing frequency in the collection. Hence, the concepts from the scheme using a term with a higher frequency (term A) as the given term always  -56-  involved better performance than the scheme using a term with a lower frequency (term B) throughout the entire collection. It can be further interpreted that the more frequently a term appears in a text, the more possible that diversified relevant terms will co-occur in various proximate contexts. In other words, the more common term seems to be a better "generator" of composite terms. Therefore, using the more frequent term can extract more meaningful and accurate concepts. 1. Statistical Analysis of the Expert's Evaluations of Relation Types  III. 
Score  (1-5)  a)~h)  (%) Compare Asymmetry  a)  b)  \  0°.,  4%  EvaluateSentenceOtdABB  0%  KvaluateSentenceUpTo5tdAB A  2%  EvaluateSentenceUpTo5tdAB_B  0%  1 v.ilualcSi-nlciKvOui \ B  c)  d)  e)  0  g)  h)  16"..  42",,  V„  2%  0%  0%  4.58  44" o  28%  6%  18%  0%  2%  4.06  8%  36% •  42%  2%  8%~  0%  2%  4.46  0°..  46%  12",,  4%  '36%  0%  2% '  4 02  •  -  EvaluateSentenceNoRestriction A B _ A  0%  10%  . 36%  44%  2%  6%  0%  2%  4.46  EvaluateSentenceNoRestrictionAB B  ()"o  0",.  . 38%  8",,  6%  42%  0%  4%  3.94  Eval uate ParagraphN oRestrict i on A B _ A  4%  4%  48%  34%  4%  2%  0%,  2%  4.22  EvaluateParagraphNoRestrictionAB B  0%  0%  40%  14%  0%  46%  0%  0%  3.58  ,,0%  10",.  5 fm  0%  3.08  • lA<ilii.<.:^l)tKuiiicni\()KcsiiicliiniAB  ()"'(.  \  EvalualcDocumenlNoRestriclionAB B ,=  -  :  4%  58"-,  0% | 0% I  I  • 20% I  L  22",,  0%  0%  •4%  ' j  ,:26% _J  76%  I  0%  1  Table 5.19: Asymmetry - Comparing the Statistical Analysis of the Expert's Evaluation Results of the Scheme A B A at Five Levels And the Scheme A B B at Five Levels With relation types a) "Synonyms"; b) "Antonyms"; c) "Broader term"; d) "Subgroup"; e) "Distinct but forming a new concept"; t) "Partial relation"; g) "Other relation"; h) "No direct relation".  This further substantiated that A B _ B was less reliable than A B _ A , explaining why in Section 5.5.2.2 we detected only 52% (26 out of 50 clusters) in SentenceUpTo5tdAB_B  -57-  that were identical with those existing in the top 50 clusters of SentenceNoRestrictionAB_B, although we found there should have been no obvious differences in the extracted concepts within a restricting distance of up to five words (SentenceUpTo5td) and the entire sentence (SentenceNoRestriction). Furthermore, for the scheme of DocumentNoRestrictionAB_B, we noticed that the clustering outputs made no sense at all because in each step the clustering mechanism in this scheme just kept adding one more term to the previously newly formed clusters. The expert also gave the lowest relevance score for this scheme compared with the other 14 schemes. These findings also proved that it is not a good practice to use the low frequency term (AB_B) to group concepts. In addition, because the document level was the least effective level when compared with the sentence and the paragraph level, therefore DocumentNoRestrictionAB_B was the most poorly performing scheme of the all fifteen schemes.  In summary, the schemes based on using high frequency words (AB/A or B A / A ) with up to five words between them, within a single sentence, produced the best results. However, using the low frequency words (AB/B) and instances of words co-occurring within a document resulted in very poor results.  6  AUTOMATICALLY IDENTIFYING PHRASES (Q3)  As noted above, in our main research, our vocabulary consisted of many term-phrases connecting single words into phrases that we believed could convey more accurate  -58-  meanings in the accounting domain. Despite techniques for including many term-phrases suggested and adopted by many researchers, we were curious about whether or not our own techniques could automatically form phrases. We thus conducted an additional smallscale project at only the SentenceOtd level to investigate possible ways of linking adjacent single words into meaningful accounting phrases in the resulting concepts. 
In order to obtain output clusters that grouped single words in the end, we still used the same program at every stage, although the content of the input files to the program may have been different due to modifications in some steps as below.  6.1 Single-Word Tokenizing  Initially, we used the same 835 documents from the FARS database in the domain dependent prepossessing stage from our previous main research activities. These imported files had already been processed through the following steps in our main research: file extraction, filename mapping, removing unusual characters, reformatting document structure, removing dots, changing all the words in the text into lowercase, and separating punctuation. We then modified the rest of the tokenizing steps used in the main research and tailored them to the task of extracting individual tokens with no phrases included. The changes are explained in the following sections. 6.1.1  Broken Phrases List  For the purpose of conducting a rudimentary study to test if an automatic way to link single words into phrases exists, we focused our experimental work on the single words that were composed of the 1,774 phrases we had manually created in our main research (see Experiment - Section 5.14 above). We manually separated every term included in  -59-  each of these 1,774 phrases into an equivalent number of single words. For instance, the phrase " A C C E L E R A T E D COST R E C O V E R Y S Y S T E M S " was broken into four single words: " A C C E L E R A T E D , " "COST," " R E C O V E R Y , " and " S Y S T E M S . " Similarly, the phrase "OFF B A L A N C E SHEET RIGHTS A N D OBLIGATIONS" was broken into six single words: "OFF", " B A L A N C E " , "SHEET", "RIGHTS", " A N D " , and "OBLIGATIONS." Consequently, there were a total of 1,073 resulting single words, which were retained at this step and sorted alphabetically (see Appendix E l - List 1). 6.1.2  Wanted Single Words  To approximate the clustering outputs consisting of the final 600 tokens for further analysis, we added the words that were among those 600 tokens but were not among the 1,073 broken single words. We thus obtained a wanted single words list consisting of 1,188 single words (see Appendix E l - List 2). Similarly to how we conducted the main research, all the words in this list were then converted into lowercase by the program. The program then scanned the collection and kept only these wanted tokens in the text. 6.1.3  Converting Plural Wanted Single Words into Singulars  The 1,188 single wanted words comprised words in plural or singular formats, and we created a list that converted 283 plural single words into their singulars (see Appendix E l - List 3). Again, the program processed the text and transformed these plural tokens into singulars accordingly. 6.1.4  905 Final Single Tokens  After the above processing steps, the program eventually extracted 905 distinct single tokens from the text while increasing its count of each token's frequency each time the same token was encountered. We then acquired a final list of 905 single tokens (see  -60-  Appendix E l - List 4 ) , which was sorted by descending order token frequency in the text collection. In the end, the tokenized files were represented by these 905 different tokens with their original structures maintained.  
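A minimal sketch of this single-word preparation (splitting the controlled phrases into unique words, as in Section 6.1.1, and applying a plural-to-singular mapping, as in Section 6.1.3) is given below; the mapping shown is a toy stand-in for the manual 283-entry list:

    def break_phrases(controlled_phrases):
        """Split every controlled phrase into its constituent single words."""
        words = {w.lower() for p in controlled_phrases for w in p.split()}
        return sorted(words)

    # Toy subset of the plural-to-singular list (the real list was built manually).
    singular_map = {"systems": "system", "obligations": "obligation", "rights": "right"}

    def singularize(words, mapping):
        return sorted({mapping.get(w, w) for w in words})

    broken = break_phrases(["ACCELERATED COST RECOVERY SYSTEMS",
                            "OFF BALANCE SHEET RIGHTS AND OBLIGATIONS"])
    print(broken)
    # ['accelerated', 'and', 'balance', 'cost', 'obligations', 'off',
    #  'recovery', 'rights', 'sheet', 'systems']
    print(singularize(broken, singular_map))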
6.2 905 Single Tokens' Affinity Calculations

Since our study was only a preliminary exploration, we investigated the simplest cases, where tokens were located next to each other, giving the highest chance of linking them into phrases. In our research, that was the SentenceOtd level, in all four cases.

6.2.1 One Level - SentenceOtd

SentenceOtd: as long as any two tokens in the order of A->B / B->A were next to each other, with a token distance of zero (Otd), in the same sentence, regardless of how many of these situations occurred in that same sentence, we counted only one per sentence.

6.2.2 Four Cases - Estimating Affinities

In our main research, the three affinity probability estimations we explored, #<A, B>/#A, #<B, A>/#A, and #<A, B>/#B, were sufficient to address our research questions regarding differences in distance, order and asymmetry between two terms. In automatic phrase generation, our focus was to explore the feasibility of using all four cases to automatically form phrases at the SentenceOtd level. Therefore we added the fourth case, #<B, A>/#B. In all four cases we defined the statistical affinities the same way as in the main research. The program then processed the text to calculate the affinities as follows:

•  #<A, B>/#A: # of sentences having B AFTER A with Otd / # of sentences having A
•  #<B, A>/#A: # of sentences having B BEFORE A with Otd / # of sentences having A
•  #<A, B>/#B: # of sentences having A BEFORE B with Otd / # of sentences having B
•  #<B, A>/#B: # of sentences having A AFTER B with Otd / # of sentences having B

The program also counted the number of sentences that each token appeared in and then processed all of the tokenized files to compute any two tokens' probabilistic affinity values for all 905 single token subjects.

6.3 Clustering 905 Single Tokens

In this small project, we used the same Matlab hierarchical clustering techniques as in the main research. The program transformed all the affinity values using the formula (1 - Statistical Affinity Value) to obtain the distance values between any two tokens. These distance values were then entered into the Matlab linkage function to form hierarchical trees for each of the four cases. Table 6.1 shows the top ten clustering data outputs for the case SentenceOtdAB_A (see Appendix E2 for the top 50 clustering outputs for the four cases in the SentenceOtd Single Token Cluster Data):

Cluster Index   Object in First Group     Object in Second Group    Distance
906             635; representational     638; faithfulness         0.056818
907             827; safe                 834; harbor               0.16667
908             356; health               362; care                 0.19312
909             428; joint                440; venture              0.27984
910             123; balance              181; sheet                0.40346
911             377; pro                  453; forma                0.4244
912             414; conceptual           431; framework            0.44304
913             105; foreign              107; currency             0.46311
914             660; growing              686; timber               0.47887
915             33; cash                  73; flow                  0.49396

Table 6.1: Top Ten Automatic Phrase Clustering Data in SentenceOtdAB_A

6.4 Evaluation - 905 Single Token Clustering

To maintain consistency with our main research, in this small project we again evaluated only the top 50 clusters of each case. Because the clustered terms themselves revealed valuable information, we further evaluated the resulting clusters.

6.4.1 A Sample Evaluation

We created evaluation tables by replacing the token numbers with term names.
6.4 Evaluation - 905 Single Token Clustering

To maintain consistency with our main research, in our small project we again evaluated only the top 50 clusters of each case. Because the clustered terms themselves revealed valuable information, we further evaluated the resulting clusters.

6.4.1 A Sample Evaluation

We created evaluation tables by replacing the token numbers with term names. Then we assessed whether the terms in each cluster actually formed valid phrases. Table 6.2 shows, as an example, the top five clusters in SentenceOtdAB_A:

Cluster No. | Terms in the first group | Terms in the second group | Total # of terms | Percentage (%) of Matching Original Manual Phrases | Could Not Match Original, But Could Form a New Accounting Phrase ("1") | Ordering Indication ("1" for same order) | Ordering Indication ("1" for opposite order) | Not an Accounting Phrase ("1") | Explanation
1 | representational | faithfulness | 2 | 100 | | 1 | | |
2 | safe | harbor | 2 | 67 | | 1 | | | safe_harbor_leases
3 | health | care | 2 | 67 | | 1 | | | health_care_providers
4 | joint | venture | 2 | 100 | | 1 | | |
5 | balance | sheet | 2 | 100 | | 1 | | |

Table 6.2: Sample Evaluation of 905 Single Token Clusters - Top Five Clusters in SentenceOtdAB_A

As in the main research evaluations, the first column listed the cluster numbers, which ordered the clusters in ascending order of the distance between their terms. The terms were again separated into two groups, and the total number of terms was also listed. In the "1 - Automatic Phrase Evaluation" category, we included the following possible alternatives:

• "Percentage (%) of Matching Original Manual Phrases": we combined the terms of the two groups in the order of the first group and then the second group. For example, in cluster #3 we got "health care" (two words). We then compared these terms, in that order, with the original 1,774 phrase control list in our main research (see Experiment - Section 5.14 above) to see whether we could find matching phrases there. In the 1,774 phrase list, we found a similar phrase, "health care providers" (three words). Only two words matched the original three words of the similar phrase, so the matching percentage was 67% (2/3) and we entered the number "67" in this percentage column. We also needed to indicate the complete original phrase in the column "2 - Explanation," in this case "health_care_providers" ("_" was used by the program to link the phrases together as one phrase). Note that because the program had already transformed all plural forms into singulars before producing the cluster terms, when we compared the terms with the original phrase list we could ignore whether each word was plural or singular. Another example is cluster #5: we combined the words into "balance sheet" and found the identical phrase in the 1,774 phrase list, "balance sheet." Therefore the matching percentage was 100% (2/2), so we put the number "100" in this column, and there was no need to indicate the original phrase in the Explanation column. Here we counted the terms combined in each cluster in any order, and we calculated what percentage of the original phrase's terms they could match. For example, in SentenceOtdBA_A cluster #17, the first group had the term "entry" and the second group had the term "journal." When we combined them in that order we obtained the new phrase "entry journal," which exactly (100%) matched our original phrase "journal entry" (by switching the order). (An illustrative sketch of this matching computation follows the list of evaluation categories below.)
• "Could Not Match Original, But Can Form a New Accounting Phrase (put "1" here)": if we could not find a match in our original 1,774 phrase list (the matching percentage thus being 0%), we then checked whether combining the terms in the newly formed cluster could generate another meaningful accounting phrase that we could think of but that was not included in our 1,774 phrases. This analysis was based on our common knowledge, without reference to other resources. If such a phrase was generated, we put "1" in this column, to be added together in the end. Meanwhile, if the order was different, we indicated the phrase with the right order. For example, in SentenceOtdAB_A, cluster #42 included the two terms "minority" and "shareholder." Though we could not find a similar phrase in the list of 1,774 phrases, the two terms together could compose another valid accounting phrase, "minority shareholder," and we therefore entered "1" in this column. If the newly formed accounting phrase was not in the right order but could still make sense when we switched the order of the terms, we also entered "1" in this column, but we needed to write down the correct phrase in the Explanation column. As an example, in SentenceOtdBA_A cluster #34 included the two terms "date" and "effective," which we believed could compose a new accounting phrase, but not in the right order. However, when we switched their order we obtained a new meaningful accounting phrase, "effective date," and so we entered "1" in this column and recorded the right phrase, "effective_date," in the Explanation column.

• "Ordering Indication (put "1" here for same order)": if connecting the terms of the first group and then the second group of a cluster produced the same order as in the original 1,774 phrases, or the same order as another accounting phrase we could think of, we put "1" here. For example, in SentenceOtdAB_A cluster #3 we got "health care," which was in the same order as the original phrase "health care providers" ("health" was in the first group, "care" in the second group), though not as complete as the original phrase. We therefore entered "1" in this column. As a further example, in SentenceOtdAB_A cluster #42 included the two terms "minority" and "shareholder," which we believed could compose a new accounting phrase in the same order.

• "Ordering Indication (put "1" here for opposite order)": if connecting the terms of the first group and then the second group of a cluster produced the opposite order to the original phrase, or the opposite order to another accounting phrase we could think of, we put "1" here. In SentenceOtdBA_A cluster #17, the first group included the term "entry" and the second group included the term "journal." When we connected them in that order, we obtained "entry journal" ("entry" was in the first group, "journal" in the second group), which was the opposite order to the original phrase "journal entry," though the matching percentage was 100% for this cluster. We therefore entered "1" in this column. To note yet another example, in SentenceOtdBA_A cluster #34 included the two terms "date" and "effective," which we believed could compose a new accounting phrase but in the opposite order, because the new meaningful accounting phrase should be "effective date."

• "Not an Accounting Phrase (put "1" here)": if the term matching percentage was 0% and no combination of the terms in the cluster could make up another meaningful accounting phrase either, we entered "1" in this column. This also meant that the terms in this cluster failed to form a phrase.
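The matching-percentage column described above can also be expressed as a small computation. The following is a rough Python sketch of one way to compute it, assuming a candidate phrase formed by joining a cluster's two groups and a control list of underscore-linked phrases; it reflects our reading of the procedure (word order is scored separately in the ordering columns) rather than the exact program used, and the control list shown is hypothetical.

def matching_percentage(candidate_words, phrase_list):
    # Best percentage of a control-list phrase's words covered by the candidate terms,
    # ignoring word order (order is recorded in the separate ordering columns).
    candidate = set(w.lower() for w in candidate_words)
    best = 0.0
    for phrase in phrase_list:
        words = phrase.lower().split("_")
        best = max(best, len(candidate & set(words)) / len(words))
    return round(100 * best)

# Hypothetical control list mirroring the examples discussed above.
control = ["health_care_providers", "balance_sheet", "journal_entry"]
print(matching_percentage(["health", "care"], control))     # 67
print(matching_percentage(["balance", "sheet"], control))   # 100
print(matching_percentage(["entry", "journal"], control))   # 100 (opposite order noted separately)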
6.4.2 Statistical Evaluation Results

Based on the sample evaluation input detailed above, we obtained statistical evaluation results for each of the four cases (see Appendix E3 - SentenceOtd Evaluate Automatic Phrases). Table 6.3 summarizes the statistical results for all four cases.

Case | % Matching Original Manual Phrases | % of 100% Matching Phrases | Not Matching Original, But Can Form a New Accounting Phrase (%) | Ordering Indication (% same order) | Ordering Indication (% opposite order) | Not an Accounting Phrase (%)
SentenceOtdAB_A | 61.8 | 32 | 6 | 92 | 0 | 8
SentenceOtdBA_A | 67.4 | 46 | 6 | 0 | 88 | 12
SentenceOtdAB_B | 32.8 | 12 | 2 | 56 | 0 | 44
SentenceOtdBA_B | 59 | 38 | 4 | 0 | 86 | 14

Table 6.3: Statistical Evaluation of Automatic Phrases for Four Cases at the SentenceOtd Level

The additional column "Percentage of 100% Matching Phrases" was constructed by adding the percentage of clusters that completely matched the original 1,774 phrases (100% matching regardless of the order) to the percentage of generated terms that completely matched other accounting phrases (where the terms had no similar phrase in the 1,774 original phrase list). We then calculated this percentage of matches for each case.

6.5 Automatic Phrase Discussion - 905 Single Token Clustering

When we studied Table 6.3, we derived the following interesting findings:

6.5.1 Our Techniques Can Automatically Identify Some Accounting Phrases

The statistics showed that the terms grouped by our automatically generated clusters can make up many meaningful phrases, because the percentage of matching original phrases plus the percentage of newly formed accounting phrases was not low in these four cases (three out of four cases were above 50%). In addition, only a small number of clusters failed to group accounting phrases, in that the percentage of "Not an Accounting Phrase" was low (less than 50% in all cases, and even less than 20% in three out of four). We could still tell, however, that among the top 50 clusters all the numbers in the "Percentage of 100% Matching Phrases" category were less than 50%, which indicated that our techniques automatically grouped many terms into incomplete phrases, though these incomplete phrases could still make some sense. This can be easily understood: in this rudimentary study the text carried mostly single tokens that had been broken down from the original phrases, and we only investigated cases where two tokens were immediately next to each other (in other words, there was no noise between them), so it was very likely and reasonable that the resulting clusters could identify potentially useful accounting phrases.

Although our proposed techniques were valid both for automatically identifying accounting compound concepts and for automatically grouping accounting phrases, these two effects arise in different ways. As many prior studies on deriving accounting concepts automatically have suggested and verified, we also highly recommend the inclusion of term-phrases in the research vocabulary when extracting automatic accounting concepts.
Each extracted accounting concept consists not only of individual related terms but also of more common term-phrases that belong to the same topic. In contrast, the automatic phrase techniques have to combine the terms in each cluster to form term-phrases, and each term-phrase only represents a specific term, which is not a concept.

6.5.2 Directionality and Asymmetry

• Directionality - When we compared "Automatic Phrases SentenceOtdAB_A" with "Automatic Phrases SentenceOtdBA_A" (see Table 6.3), there were no clear indications that the order in which a term appears affects the quality of cluster generation, in that most evaluation statistics were not very different between these two cases. However, the order of the phrases formed by the resulting clustered terms varied significantly according to the order of the two tokens A and B that was used to derive the clusters: SentenceOtdAB_A grouped all of its accounting phrases in the same order, whereas SentenceOtdBA_A grouped all of its accounting phrases in the opposite order. This can be easily explained (see also the illustrative sketch at the end of this section):

• Direction AB - we computed the two-term affinities in the order of token A appearing before token B (order AB) in the original text, which also meant that for this order the first token A's ID was always smaller than the second token B's ID. For example, for token A, "financial," with token ID #14, and token B, "reporting," with token ID #54, we computed the statistical affinity values for these two tokens as they appeared in the order "financial reporting" in the original text. Later, we derived clusters for this case. When Matlab linked two objects into a cluster, it always placed the smaller object number (in our project, the smaller ID - token A) into the first group and the larger object number (in our project, the larger ID - token B) into the second group. Therefore the potential automatic phrase for SentenceOtdAB_A could be obtained in the same, correct order AB simply by directly linking the higher frequency term (smaller ID - token A), located in the first group, followed by the lower frequency term (larger ID - token B), located in the second group of each resulting cluster. So in "Automatic Phrases SentenceOtdAB_A" cluster #48, the automatic phrase "financial reporting" could be obtained in the same, right order by directly linking the first group term "financial" (ID #14, frequency 14,727) to the second group term "reporting" (ID #54, frequency 5,265). This is why "Automatic Phrases SentenceOtdAB_A" grouped all same order accounting phrases in Table 6.3.

• Direction BA - we computed the two-term affinities in the order of token B appearing before token A (order BA) in the original text, which also meant that for this order the first token B's ID was always larger than the second token A's ID. As an example, for token B, "financial," with token ID #14, and token A, "statement," with token ID #7, we computed the statistical affinity values for these two tokens as they appeared in the order "financial statement" in the original text. Later, we derived clusters for this case.
When Matlab linked two objects into a cluster, it always placed the smaller object number (in our project, the smaller ID - token A) into the first group and the larger object number (in our project, the larger ID - token B) into the second group. The potential automatic phrase for SentenceOtdBA_A could thus be obtained in the same, right order BA by directly linking the lower frequency term (larger ID - token B), located in the second group, followed by the higher frequency term (smaller ID - token A), located in the first group of each resulting cluster. Hence, in "Automatic Phrases SentenceOtdBA_A" cluster #25, the automatic phrase "financial statement" could be obtained in the same, right order by directly linking the second group term "financial" (ID #14, frequency 14,727) to the first group term "statement" (ID #7, frequency 30,033). This explains why "Automatic Phrases SentenceOtdBA_A" grouped all opposite order accounting phrases in Table 6.3. Exactly the same directionality patterns occurred for "Automatic Phrase in SentenceOtdAB_B" and "Automatic Phrase in SentenceOtdBA_B".

• Asymmetry - We found that "Automatic Phrase in SentenceOtdAB_A" grouped more meaningful accounting phrases than "Automatic Phrase in SentenceOtdAB_B," because the numbers in "Percentage of Matching Original Manual Phrases," "Percentage of 100% Matching Phrases" and "Not Matching Original, But Can Form a New Accounting Phrase" were all greater in SentenceOtdAB_A. Furthermore, the number in "Not an Accounting Phrase" was smaller in SentenceOtdAB_A than in SentenceOtdAB_B (see Table 6.4). Exactly the same patterns occurred for "Automatic Phrase in SentenceOtdBA_A" and "Automatic Phrase in SentenceOtdBA_B."

Case | % Matching Original Manual Phrases | % of 100% Matching Phrases | Not Matching Original, But Can Form a New Accounting Phrase (%) | Ordering Indication (% same order) | Ordering Indication (% opposite order) | Not an Accounting Phrase (%)
SentenceOtdAB_A | 61.8 | 32 | 6 | 92 | 0 | 8
SentenceOtdAB_B | 32.8 | 12 | 2 | 56 | 0 | 44
SentenceOtdBA_A | 67.4 | 46 | 6 | 0 | 88 | 12
SentenceOtdBA_B | 59 | 38 | 4 | 0 | 86 | 14

Table 6.4 (rearranged Table 6.3): Asymmetry - Comparing the Evaluation of Automatic Phrases in SentenceOtdAB_A with SentenceOtdAB_B, and in SentenceOtdBA_A with SentenceOtdBA_B

The above contrasts revealed that using term A, which had the higher frequency, as the given term always yielded better performance in automatically grouping phrases than using term B, which had the lower frequency in the entire collection. This also agreed with our main research findings.
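The directionality rules just described can be captured in a few lines of code. The following is a small, hypothetical Python sketch (not part of the thesis's programs) of how a phrase is read off a cluster that joins two original tokens, given that smaller object numbers correspond to higher-frequency tokens and are placed in the first group.

def phrase_from_cluster(first_obj, second_obj, id_to_token, direction):
    # first_obj carries the smaller ID (higher-frequency token A); for clusters joining two
    # original tokens, direction "AB" reads first-then-second and "BA" reads second-then-first.
    a, b = id_to_token[first_obj], id_to_token[second_obj]
    return a + " " + b if direction == "AB" else b + " " + a

id_to_token = {7: "statement", 14: "financial", 54: "reporting"}   # IDs taken from the examples above
print(phrase_from_cluster(14, 54, id_to_token, "AB"))   # "financial reporting" (SentenceOtdAB_A case)
print(phrase_from_cluster(7, 14, id_to_token, "BA"))    # "financial statement" (SentenceOtdBA_A case)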
7 CONCLUSIONS AND FUTURE RESEARCH

7.1 Conclusions

Our research was an exploratory study comparing the use of different textual units in automatically identifying compound concepts and automatically forming phrases in a given context. We used the accounting context as a case study. Building on previous research, we developed our own domain-specific preprocessing, automatic indexing and statistical affinity-computing methods. The extracted hierarchical clustering outputs from fifteen different schemes were analyzed and evaluated by an accounting expert and also by reference to the Price Waterhouse Thesaurus. In our main experiment, we compared results across textual units of differing sizes using affinity measures.

The outcomes were encouraging and answered our research questions. Our proposed approach was capable of automatically identifying potential accounting compound concepts (at least 90% of the clusters in most schemes grouped meaningful concepts) and of automatically forming accounting phrases. The expert's evaluation revealed that the most frequent relationships suggested by the concepts were "Broader Term" (one term is broader than the other) and "Subgroup" (the terms are not in a broader/narrower relation to each other, but they are subgroups of another, broader concept).

We also studied several issues that could affect the quality of the results. Analysis of relationships between terms within sentences, within paragraphs, and within documents generated results of varied quality: the sentence level produced the best results, while the document level produced the least usable results. Regarding relationships of terms within the same sentence, our research also verified the findings of previous studies: restricting the window to at most five words groups concepts that are substantially similar to those formed using the entire sentence. In other words, restricting the process to terms separated by fewer than five other words within a single sentence does not seem to significantly improve clustering performance.

The order in which any particular pair of terms occurred did not exhibit any explicit impact on the automatic concepts thereby generated. In cases where the higher frequency term (A) appeared before the lower frequency term (B) in the original text (A->B), the potential phrase could be automatically formed simply by directly linking the first group term and then the second group term in each resulting cluster. On the other hand, when the lower frequency term (B) appeared before the higher frequency term (A) in the original text (B->A), linking the second group term and then the first group term yielded the automatic phrase.

Moreover, for any two terms, we found that normalization based on the higher frequency term (AB/A or BA/A) yielded much better results than normalization based on the lower frequency term (AB/B or BA/B). This was true both for the automatic identification of compound concepts and for automatic phrase formation.

7.2 Contributions

Our research studied a set of techniques for automatic compound-concept identification and automatic phrase generation. While some work in these areas has been done before, the techniques we have employed are novel. Our distance measures were based on affinities, rather than on the popular similarity measures studied by most researchers, and therefore our classified concepts identified composite relationships among terms (instead of only synonyms).

The thesis explored term proximities in different textual units and the effects of directionality and asymmetry between two terms. Again, little research had previously been done in this field. Our preliminary study indicated that analysis at the document level, and using low frequency words to generate composite terms, led to poor results in generating meaningful concepts. This finding could direct future research.

Our research has demonstrated the realistic possibility of grouping terms into compound concepts, a process that can provide users with assistance in judging how concepts are constructed, and that can thereby assist them in searching for useful information about specific concepts.
7.3 Limitations and Future Research

To obtain a useful index of terms and generate better results, a significant manual effort was involved in our "preprocessing - tokenizing" stage. In particular, this included the identification of domain-specific phrases and domain-dependent stemming (consolidating terms). In the automatic phrase study we likewise used domain-dependent stemming; in the future, however, more automatic term-identification techniques should be studied to further reduce the human effort required while maintaining the accuracy of the terms selected by the automatic program. Although there is little theoretical literature to support our approach, it would be possible to extend this study to construct a domain-specific thesaurus based on the parameters identified in our research as influences on the extracted concepts.

BIBLIOGRAPHY

Accounting Dictionary - Accounting Glossary - Accounting Terms. [Online] Available: http://www.ventureline.com/glossary.asp

Besancon, R., Rajman, M., & Chappelier, J. (1999). Textual similarities based on a distributional approach. 10th International Workshop on Database & Expert Systems Applications, September 01-03, Florence, Italy.

Callan, J. (1995). Controlled vocabularies and ontologies. Carnegie Mellon University, 95-778 Digital Libraries. [Online] Available: http://hartford.lti.cs.cmu.edu/classes/95-778/Lectures/03-CtrlVocab.pdf

Caraballo, S.A. (1999). Automatic construction of a hypernym-labeled noun hierarchy from text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, College Park, Maryland, USA, 120-126.

Chen, H., & Lynch, K.J. (1992). Automatic construction of networks of concepts characterizing document databases. IEEE Transactions on Systems, Man, and Cybernetics, 22(5), 885-902.

Chen, H., Lynch, K.J., Basu, K., & Ng, D.T. (1993). Generating, integrating, and activating thesauri for concept-based document retrieval. IEEE Expert, Special Series on Artificial Intelligence in Text-Based Information Systems, 5(2), 25-34.

Chen, H., Martinez, J., Kirchhoff, A., Ng, T.D., & Schatz, B.R. (1998). Alleviating search uncertainty through concept associations: automatic indexing, co-occurrence analysis, and parallel computing. Journal of the American Society for Information Science, 49(3), 206-216.

Chen, H., Ng, T.D., Martinez, J., & Schatz, B.R. (1997). A concept space approach to addressing the vocabulary problem in scientific information retrieval: an experiment on the worm community system. Journal of the American Society for Information Science, 48(1), 17-31.

Chen, H., Hsu, P., Orwig, R., Hoopes, L., & Nunamaker, J.F. (1994). Automatic concept classification of text from electronic meetings. Communications of the ACM, 37(10), 56-73.

Chen, H., Schatz, B.R., Yim, T., & Fye, D. (1995). Automatic thesaurus generation for an electronic community system. Journal of the American Society for Information Science, 46(3), 175-193.

Church, K.W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16, 22-29.

Crouch, C.J. (1990). An approach to the automatic construction of global thesauri. Information Processing and Management, 26(5), 629-640.

Curran, J.R., & Moens, M. (2002). Improvements in automatic thesaurus extraction. In Proceedings of the Workshop on Unsupervised Lexical Acquisition, Philadelphia, PA, USA, 59-67.

Dagan, I., Marcus, S., & Markovitch, S. (1995).
Contextual word similarity and estimation from sparse data. Meeting of the Association for Computational Linguistics. [Online] Available: http://acl. Idc. upenn. edu/P/P93/P93-l022.vdf. 164-171. Furnas, G. W., Landauer, T. K . , Gomez, L. M . , & Dumais, S. T. (1987). The vocabulary problem in human-system communication. Communications of the ACM, 50(11), 964-971. Gangolly, J., & Wu, Y.F. (2000). On the automatic classification of accounting concepts: preliminary results of the statistical analysis of term-document frequencies. The New Review of Applied Expert Systems and Emerging Technologies, (6), 81-88. Garnsey, M . (2001). The use of latent semantic indexing and agglomerative clustering to automatically classify accounting concepts: a report of the preliminary findings. The New Review of Applied Expert Systems, (7), 129-140. Garnsey, M . (2002). Automatic classification of financial accounting concepts. In A.E. Baldwin and C E . Brown (Eds.) Collected Papers of the Eleventh Annual Research Workshop on: Artificial Intelligence and Emerging Technologies in Accounting, Auditing and Tax, 15-24. Gietz, P. (2001). Report on automatic classification systems for the T E R E N A activity portal coordination. [Online] Available: http://www, daasi. de/reports/Reportautomatic-classification.html Grefenstette, G. (1993). Automatic thesaurus generation from raw text using knowledgepoor techniques. 9th Annual Conference of the University of Waterloo, Centre for the New Oxford English Dictionary and Text Research, Oxford. Grefenstette, G. (1994). Exploration in automatic thesaurus discovery. Kluwer Academic Publishers Boston/Dordrecht / London.  -11 -  Jang, M . , Myaeng, S.H., & Park, S.Y. (1999). Using mutual information to resolve query translation ambiguities and query term weighting. Annual Meeting of the ACL, Proceeding of the 37 Conference on Association for Computational Linguistics, College Park, Maryland, USA, 223-229. th  Hauck, R., Sewell, R., Ng, D. T., & Chen, H . (2001). Concept-based searching and browsing a geoscience experiment. Journal of Information Science, 27(4), 199-210. K P M G Consulting L L C . (2000). Companies suffer from information overload, according to K P M G consulting knowledge management report. [Online] Available: http://web.lexis-nexis.com/universe/(4127/Q0). Lassi, M . (2002). Automatic thesaurus construction. http://www.adm.hb.se/personal/mol/gslt/thesauri.pdf  [Online]  Available:  Leory, G., & Chen, H . (2001). Meeting medical terminology needs - the ontologyenhanced medical concept mapper. IEEE Transactions on Information Technology in Biomedicine, 5(4), 261-270. Losee, Jr. R . M . (1994). Term Dependence: Truncating the Bahadur Lazarsfeld expansion. Information Processing & Management, 30(2), 293-303. Martin, W. J.R., A l B.P.F., & van Sterkenburg P.J.G. (1983). On the processing of a text corpus: from textual data to lexicographical information. Lexicography: Principles and Practice (Applied Language Studies Series), Hartman R.R.K, Ed. London: Academic. Merriam-Webster online dictionary and thesaurus [Online] Available: http://www.mw.com Miller, G. A., Beckwith, R., Fellbaum, C , Gross, D., & Miller K . (1993). Introduction to WordNet: an on-line lexical database. [Online] Available: http://wwwl.cs. Columbia. edu/~radev/cs4999/notes/5papers. pdf Milstead, J. L. (2000). About thesauri. indexing. com/Milstead/about, htm  [Online] Available:  http://www.bayside-  Price Waterhouse & Co. (1974). Thesaurus of accounting and auditing terminology. 
New York: Price Waterhouse & Co. Rasmussen, E. (1992). Clustering algorithms. In W. B. Frakes & R. Baeza - Yates (Eds.) Information Retrieval: Data structures and algorithms. Engelwood Cliffs, N J : Prentice Hall. -78-  Rungsawang, A . (1998). A distributional semantics based information retrieval system. The National Computer Science and Engineering Conference (NCSEC'98). Salton, G. (1989). Automatic text processing. Reading, MA: Addison-Wesley Publishing Company Inc. Salton, G., & Buckley C. (1991). Automatic text structuring and retrieval-experiments in automatic encyclopedia searching. In G. Salton A. Bookstein, Y. Chiaramella and V.V Raghavan, editors, Proceedings of the Fourteenth Annual International ACMSIGIR Conference on Research and Development in Information Retrieval, 21-30. SchUtze, H., & Pedersen, J.O. (1997). A cooccurrence-based thesaurus and two applications to information retrieval. Information Processing and Management, 33(3), 307-318. Statistics Toolbox for Use with Matlab User's Guide (Version 3) Tutorial - Cluster Analysis. (1-53 to 1-67). [Online] Available: http://www, busim. ee. boun. edu. tr/~resources/stats_tb. pdf Stickney, C P . , & Weil, R.L. (1997). Financial accounting an introduction to concepts, methods, and uses (Eighth Edition). The Dryden Press, Harcourt Brace College Publishers. Sowa, J . F. (2000) Concepts in the lexicon: introduction [Online] http://www. jfsowa. com/ontology/lexicon. htm  Available:  Zhang, R., & Rudnicky, A.I. (2002). Improve latent semantic analysis based language model by integrating multiple level knowledge. Proceedings of ICSLP 2002 (Denver, Colorado), 893-896.  -79-  Appendix A : Domain Dependent Preprocessing - Tokenizing  5.1.2  Reformatting the Document Structure  Though the database had a filename for every document in the collection, the filename could not reflect its own content effectively. Therefore, a mapping file was created that automatically extracted the topic of each file (for example, A R B 49: Earnings per Share) in the database and linked the file with its associated filename (for example, fars-0008.txt) in the mapping file.  Since the paragraph level calculation was one of the core tasks of this study, we had to make sure the program could recognize paragraph sections. The original files had manual line-break-characters in each line within each paragraph that confused the division between paragraphs. In addition, the program could not recognize the phrase tokens if there was an end of line character inside the term phrases. We noticed in the original documents that if the ends of line characters appeared consecutively it meant the end of a paragraph. Thus, the program got rid of the extra ends of line characters so that there would be only one end of line character for each paragraph.  Since the sentence level calculation was also at the core of this research, we needed to make sure the program could recognize sentences as well. Punctuation marks like  "!"  and "?" can indicate the end of sentence. However, the period in a word like "E.g." when it appeared in the middle of a sentence did not mean the end of sentence. Therefore, there were two situations where the period that did not indicate the end of sentence was  -80-  automatically removed. First, if the first character of a word was a capital letter and the last character was a period, then the token would be removed, such as for "No.", "Mr." And "Messers." Thus, we could deal with all cases where the periods were located in the middle of sentences. 
Among these cases, though some periods were also located at the ends of sentences or paragraphs, because those cases were not statistically significant and because our method in this study was oriented toward automatic procedures, we simply ignored them. The second situation was that if a period was located inside a word but not as its last character, the word would be removed, as in the cases of "1.4", "Ch.3A", and "e.g." There were statistically very few cases, though, where such words were also the last words in their sentences or paragraphs. Similarly guided by this trade-off strategy and by automatic practice, we had to remove these words in order to automatically recognize the ends of sentences as well as of paragraphs.

We then converted all of the words in the text collection, with periods already removed, into lowercase so as to make the next few procedures (such as removing stop-words) easier to carry out. This was to avoid cases where stop-words in the documents could not be removed because of upper- and lower-case confusion. Similarly, the program also converted the uppercase words from the other two documents we manually produced, the abbreviation list (see List 1 below) and the term-phrases controlled list (see List 2 below), into lowercase. Moreover, punctuation was also separated from words so that the program could recognize the punctuation individually.

5.1.3 Changing the Short-forms of the Words in the Abbreviation List into Term-phrases

Since there were many abbreviations in the accounting texts, we produced an abbreviation-controlled list by consulting two external sources, Accounting Dictionary - Accounting Glossary - Accounting Terms (http://www.ventureline.com/glossary.asp) and Financial Accounting: An Introduction to Concepts, Methods, and Uses (8th Edition).

AppAList 1: Top 10 out of 39 Manually Produced Abbreviations
ACRS (ACCELERATED COST RECOVERY SYSTEM)
ABC (ACTIVITY BASED COSTING)
AICPA (AMERICAN INSTITUTE OF CERTIFIED PUBLIC ACCOUNTANTS)
AICPASAS (AICPA STATEMENT ON AUDITING STANDARDS)
AICPASOPS (AICPA STATEMENTS OF POSITION)
AMT (ALTERNATIVE MINIMUM TAX)
AROs (ASSET RETIREMENT OBLIGATIONS)
BOM (BILL OF MATERIALS)
CMOs (COLLATERALIZED MORTGAGE OBLIGATIONS)
CPI (COST OF LIVING INDEX)

The list included 39 abbreviations in total and standardized the abbreviations into the term-phrase format. Note that all the letters in this list had already been converted into lowercase in the previous step, so once the program found these abbreviations in the texts, which were also in lowercase, it would automatically transform them into the matching term-phrases connected by the symbol "_". For example, ACRS would be converted to "accelerated_cost_recovery_system".
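The abbreviation replacement step described in 5.1.3 above amounts to a simple dictionary lookup over the token stream. Here is a minimal Python sketch, assuming the text has already been lowercased and punctuation separated; the three mappings shown are taken from AppAList 1 above, and the example sentence is hypothetical.

abbreviations = {
    "acrs": "accelerated_cost_recovery_system",
    "abc": "activity_based_costing",
    "aicpa": "american_institute_of_certified_public_accountants",
}

def expand_abbreviations(tokens, abbrev_map):
    # Replace each abbreviation token with its underscore-linked term-phrase.
    return [abbrev_map.get(tok, tok) for tok in tokens]

print(expand_abbreviations("depreciation under acrs rules".split(), abbreviations))
# ['depreciation', 'under', 'accelerated_cost_recovery_system', 'rules']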
5.1.4  Converting Meaningful Words to Term-phrases  AppAList 2: First 10 Phrases of the entire 1,774 Phrase-Controlled  -82-  A B A N D O N E D PROPERTY A B N O R M A L COST A B N O R M A L COSTS A C C E L E R A T E D COST RECOVERY SYSTEM A C C E L E R A T E D COST R E C O V E R Y SYSTEMS A C C E L E R A T E D DEPRECIATION A C C E L E R A T E D DEPRECIATIONS ACCOUNTING ADJUSTMENT ACCOUNTING ADJUSTMENTS ACCOUNTING CHANGE  5.1.5  Removing Stop-words  AppAList 3: First 10 Stop-words of the entire 728 Stop-Word a able about above according accordingly accounted acquire across actually  5.1.7  Removing Unwanted Tokens  AppAList 4: First 10 Tokens of the whole 2, 052Wanted Tokens TOKEN  #FREQUENCIES  #DOCUMENTS  absence  291  130  absences  61  11  absent  94  57  accelerated_cost_recovery_system  21  4  accelerateddepreciation  29  12  accomplished  75  48  account  1242  318  account_receivable  2  2  accountant  50  30  accountants  158  56  -83-  5.1.8  Consolidating Wanted Tokens  AppAList 5: First 10 Tokens of the entire 994 Consolidating Wanted Tokens WANTED TOKENS TO BE CONVERTED  F O R M S OF T O K E N S A F T E R C O N V E R S I O N  absences  absence  account_receivable  accounts_receivable  accountants  accountant  accounting_adj ustments  accounting_adjustment  accounting_changes  accounting_change  accounting_concepts  accounting_concept  accounting_periods  accounting_period  accounting_policies  accounting_policy  accounting_principles  accounting_principle  accounting_principles_and_methods  accounting_principles_and_method  5.1.9  Generating the Final Reduced Token List  AppAList 6: First 10 Tokens of the Entire Final 1,344 Tokens TOKEN  #FREQUENCIES  #DOCUMENTS  asset  15878  510  accounting  13950  797  cost  10548  434  amount  10081  561  liability  7141  406  loss  6993  389  tax  6739  261  interest  6630  395  financial_accounting_standards_board  6527  763  entity  6252  369  -84-  Appendix B: Computing Term Affinities  5.2.2  Converting 1,344 Tokens to Token IDs  AppBList 7: First 10 Tokens of the Entire Final 1,344 Token IDs TOKEN  ID  #FREQUENCIES  #DOCUMENTS  asset  1  15878  510  accounting  2  13950  797  cost  3  10548  434  amount  4  10081  561  liability  5  7141  406  loss  6  6993  389  tax  7  6739  261  interest  8  6630  395  financial_accounting_standards_board  9  6527  763  entity  10  6252  369  5.2.3.1 Generating the 600 Token List for Clustering AppBList 8: First 10 Tokens of the Entire Final 600 Token IDs #FREQUENCIES  #DOCUMENTS  asset  15878  510  accounting  13950  797  3  cost  10548  434  4  loss  6993  389  5  tax  6739  261  6  interest  6630  395  7  financial accounting standards board  6527  763  8  fair value  6121  331  9  financial statement  5495  460  10  future  5456  422  ID  TOKEN  1 2  -85-  5.3.1  Hierarchical Clustering Using Matlab  The "Tutorial - Cluster Analysis" section of Statistics Toolbox for Use with Matlab User's Guide (Version 3, 1-57) (http://www.busim.ee.boun.edu.tr/~resources/stats Jb.pdfl showed an example on how to interpret the Matlab linkage function. "For example, given the distance vector Y from the sample data set of x and y coordinates, the linkage function generates a hierarchical cluster tree, returning the linkage information in a matrix, Z. Z - linkage(Y) Z= 1.0000 3.0000 1.0000 4.0000 5.0000 1.0000 6.0000 7.0000 2.0616 8.0000 2.0000 2.5000 In this output, each row identifies a link. The first two columns identify the objects that have been linked, that is, object 1, object 2, and so on. 
The third column contains the distance between these objects. For the sample data set of x and y coordinates, the linkage function begins by grouping together objects 1 and 3, which have the closest proximity (distance value = 1.0000). The linkage function continues by grouping objects 4 and 5, which also have a distance value of 1.0000. The third row indicates that the linkage function grouped together objects 6 and 7. If our original sample data set contained only five objects, what are objects 6 and 7? Object 6 is the newly formed binary cluster created by the grouping of objects 1 and 3. When the linkage function groups two objects together into a new cluster, it must assign the cluster a unique index value, starting with the value  -86-  m+\, where m is the number of objects in the original data set. (Values 1 through m are already used by the original data set.) Object 7 is the index for the cluster formed by objects 4 and 5. As the final cluster, the linkage function grouped object 8, the newly formed cluster made up of objects 6 and 7, with object 2 from the original data set."  -87-  Appendix C : Clustering 600 Tokens Data Outputs  AppC SentenceOtd Cluster Data: Top 50 Clustering Data of the Entire 600-Token Clustering Data in SentenceOtd  Cluster Index  SentenceOtd AB A  601 602 603 604 605 606 607 608 609 610  261 392 102 114  611  612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641  41; 601 10; 553 170 38; 546 106 16; 30; 147; 105; 18; 39; 613; 605; 34; 448 5; 51; 451 617 394 20; 63; 619 189 178 618 223 13; 494 337 174 635 626 395  346; 429; 139; 122; 46; 269; 27; 595; 209; 70; 576; 215; 33; 32; 208; 280; 614; 57; 28; 52; 603; 533; 25; 96; 498; 90; 502; 69; 80; 19; 246; 431; 81; 369; 26; 529; 372; 234; 35; 58; 455;  0.62791 0.74286 0.8034 0.80942 0.82476 0.84884 0.85654 0.85714 0.8599 0.87065 0.875 0.87905 0.87923 0.88738 0.88848 0.88973 0.89109 0.89262 0.8929 0.8944 0.90447 0.90476 0.90662 0.90762 0.90909 0.91117 0.91429 0.91476 0.91796 0.92165 0.92308 0.92547 0.92647 0.93277 0.93278 0.93333 0.93617 0.93659 0.939 0.93936 0.93939  SentenceOtd AB B  SentenceOtd BA A  149; 212;  4;  14;  0.63247  149;  194;  0.67611  196;  229:  0.7375  126;  602;  0.75068  5;  23;  0.80818  7;  0.81854  24;  66:  0.82776  209;  214;  0.85333  5; 2; 98 19 46 11 604;  596 578 488 312 517 562 435 123  0 0 0.0625 0.076923 0.18182 0.2 0.21739 0.24922  320  0.25  18;  27;  0.85561  606;  566  0.25  25;  29:  0.86167  72;  569  0.25  22;  65;  0.87587  131; 26; 128; 105; 24; 609; 613; 80; 261; 617; 54; 30; 174; 62 1;  575 551 553 280 347 271 589 588 346 267 545 351 525 119 139 223 497 501 254 291 546 581 591 623 599 582 579 568 550 25;  0.25 0.28571 0.28571 0.31765 0.32 0.33333 0.33333 0.33333 0.33333 0.34783 0.375 0.39216 0.4  32;  41;  0.87697  103;  165;  0.88048  86;  116;  0.88343  34; 27; 608; 40; 610; 7; 625; 632; 4; 634; 6; 628; 12 16 17  630;  -88-  369;  408;  0.88372  278;  298;  0.89535  132;  375;  0.89744  113;  258;  0.89899  611;  30;  0.90194  314;  334:  0.90196  609;  119;  0.90512  447;  516;  0.90909  249:  622;  0.91011  400;  512:  0.91177  0.4005  392;  494;  0.91429  0.41132 0.43698 0.45454 0.46154 0.48235 0.49275 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5  324:  350;  0.91525  603;  255;  0.9187  49;  121;  0.92078  54;  79;  0.92266  33;  71; 274;  0.92363  237; 89;  168;  0.92818  111; 406;  238;  0.93103  429;  0.93103  48;  64;  0.93322  625;  511;  0.93333  404;  434;  0.93333  621;  612:  0.93441  352;  
460;  0.93478  63;  80;  104;  108;  0.92373  0.9357 0.93686  642 643 644 645 646 647 648 649 650  284; 639; 12; 107; 627; 124; 11; 299; 222;  637 85; 37; 141 441 152 123 373 264  0.93976 0.94053 0.94102 0.94104 0.94286 0.9435 0.94351 0.94444 0.94488  641 614 321 638 178  574 595 561 415 431  71; 633 616 635  494  68; 503; 427;  0.5 0.5 0.5 0.51852 0.52 0.53333 0.53719 0.53846 0.55556  366;  368:  0.9375  76;  608;  0.94032  604;  264;  0.94038  1;  34;  0.94094  485;  585;  0.941 18  35;  56;  0.94258  638;  6.19;  0.94284  16;  640;  0.94341  649;  630;  0.9441 7  AppC SentenceUpTo5td Cluster Data: Top 50 Clustering Data of the Entire 600-Token Clustering Data in SentenceUpTo5td Cluster Index  SentenceUpTo5td_AB_A  601  261 392 601 114 147 63; 395; 102; 41; 602; 16; 39; 170; 38; 604; 510;  602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632  10; 333; 23; 553; 19; 611 106 105 405 18; 622 487 544 546 626 196  346; 429; 269; 122; 208; 80; 455; 139; 46; 406; 33; 57; 209; 70; 126; 521; 27; 610; 25; 595; 28; 606 215: 280 545: 32; 621 537, 592 576 30; 224;  0.62791 0.74286 0.76744 0.76906 0.77695 0.77827 0.78788 0.79773 0.80984 0.82857 0.82909 0.83296 0.83575 0.83658 0.84173 0.84615 0.85191 0.85454 0.8571 0.85714 0.86687 0.86745 0.86825 0.86882 0.87097 0.87252 0.87277 0.875 0.875 0.875 0.87718 0.88125  SentenceUpTo5td AB B  11; 601 602 603 604 605 606 13  25 72 149 611 388 2; 5; 615; 616; 46; 98; 617; 105; 6; 622; 620; 3; 614 624 623 628 12; 17; 131;  -89-  578; 24; 212; 600; 22; 129 168 598 566 569 596 194 589 312 488: 607: 21; 435: 517: 123: 280 562 19; 424. 536: 320: 582 609 50; 579 610 575  0 0 0 0 0 0 0 0 0 0 0 0 0 0.046154 0.0625 0.0625 0.0625 0.17391 0.18182 0.18692 0.18824 0.2 0.2 0.21429 0.22222 0.25 0.25 0.25 0.25 0.25 0.25 0.25  SentenceUpTo5td_BA A 4;  14;  0.61043  149  194;  0.66802  209  214;  0.69333  196  229:  0.69375  126  602;  0.72629  170  267;  0.74879  406  429;  0.75862  23;  0.77909  7;  0.78074  456;  477;  491;  496;  0.8  24;  66;  0.81105  25;  29;  0.81777  32;  41;  0.82036  269;  346;  0.83333  561;  598;  0.83333  369;  408;  0.83721  383;  403;  0.83871  18;  27;  0.83911  277:  443;  0.8  0.84  63;  80;  0.84146  603;  282;  0.84507  86;  1 16;  0.84691  16;  621;  0.85226  392;  494;  0.85714  619;  614;  0.86056  39;  57;  0.86428  22:  65;  0.86434  628;  35;  0.87081  261;  615;  0.87209  67;  98;  0.87289  103;  165:  0.8745  633  609  634  13; 273  635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650  5; 299 530 326 617 640 417 634 456 377 383 67; 641; 89; 34;  52; 26; 372 619 373 535 337 631 58; 461; 20; 477 519 403 71; 90; 257; 608;  0.88154 0.88718 0.8875 0.8887 0.88889 0.88889 0.89286 0.89439 0.89521 0.89655 0.89709 0.9 0.90244 0.90323 0.90326 0.9034 0.90424 0.90447  134 627 626 26; 128 635 634 101 71; 625 642 643 639 636 646 647 648 44;  463 497 7; 551 553 271 347 510 447 593 629 590 633 613 107 108 371 592  0.26667 0.27273 0.28333 0.28571 0.28571 0.28736 0.3 0.30769 0.31818 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333  544; 545; 606; 625; 223; 31.4; 396; 471; 48; 54; 624; 629; 249; 604; ! 
13; 374; 89; 626;  587; 595; 622; 607; 227; 334; 444: 523; 64; 55; 33; 30; 447; 224; 258; 636; 168; 119;  0.875 0.875 0.87923 0.88 0.88235 0.88235 0.88235 0.88235 0.88335 0.88353 0.88378 0.88592 0.88764 0.89375 0.89394 0.89474 0.89503 0.89521  AppC SentenceNoRestriction Cluster Data: Top 50 Clustering Data of the Entire 600-Token Clustering Data in SentenceNoRestriction Cluster Index 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623  SentNoRestrict A B A 0.61628 261 346 392 429 0.74286 80; 0.75055 63; 0.75336 114 122 0.76744 601 269 208 0.77695 147 0.78788 395 455 102 139 0.79773 0.80761 41; 46; 0.82378 33; 16; 0.82774 39; 57; 602; 406; 0.82857 0.82962 38; 70; 0.83575 170 209 612 0.83636 333 604 0.84173 126 521 0.84615 510 0.85065 27; 10; 0.85083 23; 25; 553; 595; 0.85714 610; 603; 0.85986 0.86056 28; 19; 0.86551 32; 18;  SentNoRestrict A B B 600 0 l; 0 601 16; 602 22; 0 24; 0 603 604 0 34; 605 129 0 0 606 168 384 0 607 578 0 608 609 0 ii; 0 610 212 0 611 582 598 0 13; 566 0 25; 0 72; 569 149 596 0 194 0 616 0 388 589 584 0 392 312 0.030769 2; 612 488 0.0625 621 0.0625 5; 622 21; 0.0625  -90-  SentNoRestrict_ B A A 0.60814 14: 0.66802 194; 0.675 229: 214; 0.69333 602; 0.72629 0.74879 267; 0.75862 429;  4; 149; 196; 209; 126; 170; 406; 5; 2; 456; 491; 32; 24; 369; 25; 269; 277; 392; 63; 561; 18; 383; 618;  23; 7; 477; 496; 41; 66; 408; 29; 346; 443; 494; 80; 598: 403; 607;  0.77656 0.77719 0.8 0.8 0.80729 0.80891 0.81395 0.81533 0.81944 0.82667 0.82857 0.83038 0.83333 0.83622 0.83871 0.84  624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650  621 106 105 299 405 196 326 452 487 544 546 623 5; 609; 475; 13; 615; 273; 627; 530; 618; 644; 639; 151; 417; 4; 67;  622 215 280 373 545 224 337 476 537 592 576 30; 619; 52; 494; 26; 459 372 383 535 635 58; 20; 159; 461; 14; 71;  0.86783 0.86825 0.86882 0.87037 0.87097 0.875 0.875 0.875 0.875 0.875 0.875 0.87524 0.87922 0.87971 0.88235 0.88244 0.88571 0.8875 0.88889 0.88889 0.89026 0.89026 0.8912 0.89573 0.89655 0.8967 0.89989  46; 623; 98; 105; 6; 628, 625 3; 630 632 620 629 635 12; 17; 131 134 633 634; 26; 128 642 641 101 261 71; 646;  435 123 517 280 562 19; 424 536 523 568 320 614 50; 579 615 575 463 497 7; 551 553 271 347 510 346 447 640  0.17391 0.17445 0.18182 0.18824 0.2 0.2 0.21429 0.22222 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.26667 0.27273 0.28333 0.28571 0.28571 0.28736 0.3 0.30769 0.3125 0.31818 0.33333  86; 16; 604; 621; 35; 39; 22; 54; 374; 67; 261; 223; 103; 544; 545; 625; 606; 603; 635; 314; 396; 471; 48; 630: 357; 249; 89;  116; 619; 282; 612; 65; 57; 628; 55; 623; 98; 616; 227; 165; 587; 595; 33; 626; 224; 614; 334; 444; 523: 64: 30; 397; 447; 168;  0.8427 0.84276 0.84507 0.85231 0.85851 0.85906 0.86294 0.86442 0.86842 0.86952 0.87209 0.87395 0.8745 0.875 0.875 0.87581 0.87923 0.88125 0.88235 0.88235 0.88235 0.88235 0.8825 0.88447 0.88636 0.88764 0.8895  AppC ParagraphNoRestriction Cluster Data: Top 50 Clustering Data of the Entire 600-Token Clustering Data in ParagraphNoRestriction Cluster Index 601 602 603 604 605 606 607 608 609 610 611 612 613 614  ParaNoRestrict_AB_A 0.82616 63; 80; 0.83333 369; 408 0.84 359; 492 0.84 383; 403 0.84263 57; 39; 0.85906 151; 158 0.86507 14; 4; 0.86735 223; 602; 0.86776 75; 73; 0.86777 196; 224 0.875 546; 576 0.87629 601 16; 0.87633 26; 20; 0.87743 612; 33;  ParaNoRestrict_ B A A  ParaNoRestrict_ A B B 563;  0  510;  521;  0.625  601  3;  0  590;  593;  0.66667  602  6; 30; 54; 81; 86; 88; 174; 396; 
590; 47; 50; 593;  0  369;  408;  0.76667  0  63;  80;  0.81788  0 0 0 0 0 0 0 0 0  39;  57;  0.85006  127;  192;  0.85333  16;  604;  0.86426  0  75;  85;  2;  603 604 605 606 607 608 609 610 611 612 613  -91 -  20;  26;  0.86567  4;  14;  0.86776  243;  275;  0.87143  607;  JJ,  0.87228  18;  41;  0.87457  520;  545;  0.875 0.87657  627  13;  628  616  629  35;  41; 23; 502; 510; 32; 481; 27; 241; 98; 122; 214; 184; 613; 43; 609;  630  l;  3;  631  614  632 633  610 505  96; 229; 516;  634  621  635  644  630 628 629 637 631 324 337 634 261 54;  645  295  646  645  647  326  313; 640; 641;  648  24;  66;  649  237;  274;  650  639;  132;  615  18;  616  5;  617 618  475; 416;  619  615;  620  461; 619; 192; 67; 114, 209, 169  621 622 623 624 625 626  636 637 638 639 640 641 642 643  31; 2; 10; 85; 65; 28; 350; 352; 46; 269; 55;  0.87803 0.88213 0.88235 0.88462 0.8865 0.88889 0.88981 0.89362 0.89482 0.89891 0.89923 0.90071 0.90178 0.9022 0.90736 0.90841 0.90894 0.90909 0.90909 0.90922 0.90946 0.90984 0.91004 0.91008 0.91065 0.91111 0.91177 0.9119 0.91304 0.91452 0.91489 0.91489 0.91489 0.91522 0.91579 0.91667  614; 9; 616; 13; 618; 619; 1; 615 622 623 624 19; 32; 621 628 629 630 631 632 633 634 635 636 637 638 639 640 641 620 643 644 80; 5;  642, 648, 626 ,  40; 571; 62; 594, 35; 85; 544 572 29; 573 574 485 480 625 160 519 587 12; 25; 131 135 579 589 22; 23; 36; 56; 136; 49; 65; 113 583 471 510 562 27;  5; 359; 67; 612; 261; 618; 196; 621; 416: 48; 73; 54; 396; 13; 605; 35; 326; 619; 209; 515; 516; 635; 1; 366; 151; 86; 24; 223; 304; 620; 114; 615; 155; 156; 295; 646;  0  0 0 0 0 0 0.16667 0.25 0.25 0.25 0.25 0.3 0.3 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.35294 0.375 0.4 0.4  23; 492; 98; 27; 346; 32; 224; 229; 601; 81; 614; 55: 444; 608; 40; 65; 337; 269; 214; 580; 553; 595: 368: 158; 116: 66: 603; 347; 46; 122; 43; 237; 271; 313; 10;  0.8799 0.88 0.88147 0.88288 0.88406 0.8842 0.8843 0.8843 0.88462 0.88819 0.88856 0.88876 0.88889 0.88908 0.89095 0.89192 0.89362 0.89474 0.89923 0.9 0.9 0.9 0.90332 0.90476 0.90604 0.90775 0.90816 0.90816 0.90909 0.90961 0.90984 0.90984 0.91257 0.91333 0.91489 0.91494  AppC DocumentNoRestriction Cluster Data: Top 50 Clustering Data of the Entire 600-Token Clustering Data in DocumentNoRestriction  Cluster Index  DocumentNoRestrict_AB A  601  336; 480; 602; 603;  602 603 604  471; 481; 495; 509;  0 0 0 0  DocumenfNoRestrictAB B i; 601; 602; 603;  -92-  120; 2; 7; 41;  0 0 0 0  DocumenfNoRestrict BA A 336; 480; 602; 603;  471; 481; 495; 509;  0 0 0 0  605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650  604; 605; 606; 607;  515; 539; 540; 565;  2; 1; 281; 299, 601 613 614 615 67; 617 187 452 463 618 622 121 610 58; 625 627 117 63; 236 238 632 616 634  7; 609, 487 373 337 406 523 538  635 636 382 , 422 , 479 640 , 641 ; 642 ; 643 ; 628 ; 623 ; 645 ; 647 ; 18; 39;  71; 74; 190 608 488 78; 98; 161; 3; 59; 5; 23; 118; 80; 312 267 290 514 544 578 596 442 440 510 485 517; 521 ; 585; 9; 101; 4; 6; 41; 40;  0 0 0 0 0.16437 0.16667 0.2 0.2 0.2 0.2 0.2 0.2 0.24138 0.24138 0.25 0.25 0.25 0.27586 0.27586 0.28571 0.29263 0.3 0.30268 0.31034 0.32143 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.341 0.34483 0.34704 0.34937 0.35484 0.3625  604; 605; 606; 
607; 608; 609; 610, 611, 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649  43; 67; 72; 84; 91; 103; 106, 110, 111 112 114 123 16; 127 130 131 133 134 139 143 154 ii; 155 157 162 164 168 169 175 183 184 192 195 4; 9; 58; 197 200 202 ; 204 ; 205 ; 212 ; 213 ; 3; 214; 215;  -93 -  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 o o o o o o o 0 0 0  604; 605; 606; 519; 563; 578; 1;  515; 539; 540; 541; 571; 596;  4; 611; 613; 67; 614; 616; 43: 242; 41; 281; 621; 622; 623; 601; 625; 626; 91; 133; 525; 629; 39: 628; 58; 63; 615; 617; 382; 638; 639; 640; 641; 485; 486; 498; 514; 561; 125; 636; 123;  7; 612;  ->•  3; 74: 6; 5; 68; 248; 1 19; 379; 395; 455; 487; 406; 423; 461; 312; 161; 536; 157; 40; 236; 59; 80; 98; 23; 415; 433; 442; 446; 466; 585; 500; 511; 523; 586; 619; 71; 282;  0 00 0 0 0 0.047059 0.087404 0.098039 0.10138 0.10345 0.10886 0.12644 0.16 0.18182 0.18557 0.2 0.2 0.2' 0.2 0.2 0.2 0.2 0.21212 0.23913 0.25 0.27273 0.2875 0.29167 0.3 0.30864 0.31034 0.31801 0.33333 0.33333 0.33333 * 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.33333 0.3421 0.34314 0.34375  Appendix D: Statistical Evaluation Results of All Fifteen Schemes - Clustering 600 Tokens  5.4.2.1 Instructions for Expert's Evaluation  Below are the instructions we showed the expert to do the evaluations: Please evaluate the relations of terms in each cluster based on your knowledge of and experience in the accounting industry. For additional reference, you can also check online the most comprehensive financial glossary at http://www, investorwords. com, where you can find quite accurate definitions for 6,000 current financial terms. The terms in each cluster were edited as follows: •  AH terms in the attached table were accounting terms.  •  The symbol "_" was used to connect the term-phrase, which was here treated as a single term. Different terms were separated by "; ". For example, in cluster No. 1, "'stock_dividend" was one term, yet ''stock_dividend; stock_split" were two different terms separated by ";".  •  Each cluster consisted of two groups that were separated in two columns. The expert should evaluate the relations among all the terms which reside only in different groups. If a single group within a single cluster contains more than one term, since these terms have been linked together by the previously formed clusters and have already been evaluated by the expert, the expert should therefore not evaluate the terms within the same groups. As an example, see Table 5.8 cluster #3: the first group contains two terms ' 'stock^dividend" and "stockjsplit, " and since they were linked by cluster #1 and had already been evaluated by the expert when evaluating clusterM,  -94-  so in cluster#3 the expert should only evaluate all terms residing in the first and second group respectively, that is, evaluate relations between ''stock_dividend" (in the first group) and "split" (in the second group), and also evaluate relations between "stockjsplit" (in the first group) and "split" (in the second group). •  The column "# of terms " counts the number of terms in two groups of each cluster.  •  In addition, all terms in the attached table were singulars and if originally they were in plural forms they have already been consolidated into their singular forms, for example, "options -> option. 
" So, in our attached table the only term you will see is "option " which actually meant either "options " or "option " in the original text. Similarly, the term "financialjstatement" actually originally included ' financial_statement" and "financial statements. ".  The evaluation categories were illustrated as follows: I.  Relation Type Alternatives (column I in the spreadsheet): For each relation among the terms, please choose the type of the relation from among the alternatives by entering the number "1" in the cell in its appropriate relation type column. a) Synonyms? : Are they all synonyms (same meaning)? b) Antonyms? : Are they all antonyms (opposite meaning)? c)  Broader term? : Is any term's meanins broader than all of the other (s)? Example: "accountingJerminology; terminology. "  d) Subgroups? : If none of them is the broader term of others but these terms are all related, are they both or all subgroups of another broader concept? Example:  -95-  first "notes_payable " and "accounts_payable " are related terms and then they are both subgroups of the broader concept "payables. " e)  Though distinct, forms a new concept? : If these terms are both or all distinct (not related), can they tosether still form a new concept? Example: "sun" and "lotion ": though individually "sun " and "lotion " are two distinct things, together they can form a new concept, "sun lotion. "  f)  Partial relation: If there are more than two terms in the cluster, are only some of them related? For the clustering containing more than 2 terms, if not all but several terms' relations belong to the relation type a) ~ e), you can choose "Partial relation " here as their relation.  g)  Other relation: If the relation type is not listed previously.  h) No direct relation: If none of the terms are directly related. II. Explanation (column II in the spreadsheet): Please describe the relationships among the following terms for each relationship type: a)  Describe the meanins of each synonym  b) Describe the meanins of each antonym c)  Describe which of the terms is the broader or broadest. Example: "... is the broader term "  d) Describe the broader or broadest concept to which all terms belong. Example: "the broader concept is .... " e)  Describe the new concept that is formed. Example: "The new concept is.... "  -96-  f)  Describe -which terms are related. If only some but not all terms meet the relation type a) ~ e), here you should list which terms belong to which relation type. Example: "... is the broader term of..."  g) Describe the new relationship type that is different from a) ~f). h) Describe the reason for no direct relation if it is not very obvious. III. Relative Relationship Score (1-5) (column III in the spreadsheet): Please give your score from "1" to "5" regarding how closely related are the terms are in each cluster, based on the percentage of "how closely related" for the clusters containing only two terms, and based on "the percentage of terms related" for the clusters containing 3 or more terms, where "1" - Mostly to completely unrelated: only 0% (inclusive) ~ 20% (exclusive) related. "2 " - Somewhat unrelated: only 20% (inclusive) ~ 45% (exclusive) related. "3 " - Hard to decide: 45% (inclusive) ~ 55% (exclusive) related. "4 " - Somewhat related: 55% (inclusive) ~ 80% (exclusive) related. "5" - Mostly to completely related: 80% (inclusive) ~ 100% (inclusive) related.  
5.4.2.3 Statistical Analysis of the Expert's Evaluation Results for All 15 Schemes (Provided on CD-ROM)

•  AppD Sentence0td Evaluation Results: for the Top 50 Statistical Evaluation Results in Sentence0td

•  AppD SentenceUpTo5td Evaluation Results: for the Top 50 Statistical Evaluation Results in SentenceUpTo5td

•  AppD SentenceNoRestriction Evaluation Results: for the Top 50 Statistical Evaluation Results in SentenceNoRestriction

•  AppD ParagraphNoRestriction Evaluation Results: for the Top 50 Statistical Evaluation Results in ParagraphNoRestriction

•  AppD DocumentNoRestriction Evaluation Results: for the Top 50 Statistical Evaluation Results in DocumentNoRestriction

Appendix E: Automatically Identifying Phrases

Appendix E1: Single Word Tokenizing

AppE1List 1: First 10 single words of the entire 1,073 Broken Phrase List
ABANDONED
ABNORMAL
ABSENCE
ABSENCES
ACCELERATED
ACCELERATION
ACCEPTED
ACCOMPLISHMENTS
ACCOUNT
ACCOUNTANT

AppE1List 2: First 10 words of the entire 1,188 Wanted Single Word List
ABANDONED
ABNORMAL
ABSENCE
ABSENCES
ACCELERATED
ACCELERATION
ACCEPTED
ACCOMPLISHMENTS
ACCOUNT
ACCOUNTANT

AppE1List 3: First 10 plural words of the entire 283 Plural Wanted Single Word To Singular List
ABSENCES -> ABSENCE
ACCOMPLISHMENTS -> ACCOMPLISHMENT
ACCOUNTANTS -> ACCOUNTANT
ACCOUNTINGS -> ACCOUNTING
ACCOUNTS -> ACCOUNT
ACQUISITIONS -> ACQUISITION
ACTIVITIES -> ACTIVITY
ADJUSTMENTS -> ADJUSTMENT
AFFILIATES -> AFFILIATE
AGREEMENTS -> AGREEMENT

AppE1List 4: First 10 tokens of the entire 905 Final Single Token ID List

TOKEN        ID    # FREQUENCIES    # DOCUMENTS
of            1    137347           835
to            2    75746            834
in            3    71086            835
and           4    66441            833
for           5    48864            834
or            6    34588            801
statement     7    30033            802
an            8    22802            789
not           9    21576            803
asset        10    20467            535
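AppE1List 4 above pairs each token with an ID, its total frequency, and the number of documents containing it. The following is a minimal sketch of how such a list could be assembled; it assumes `documents` is a list of already-tokenized documents and that IDs are assigned in decreasing order of total frequency, an assumption consistent with the ordering visible above rather than a description of the study's actual code.

    from collections import Counter

    def build_token_id_list(documents):
        """Build (token, id, total frequency, document frequency) records.

        `documents` is a list of token lists; IDs are assigned here in order
        of decreasing total frequency (an assumption, matching AppE1List 4).
        """
        term_freq = Counter()
        doc_freq = Counter()
        for tokens in documents:
            term_freq.update(tokens)
            doc_freq.update(set(tokens))  # count each token once per document

        records = []
        for rank, (token, tf) in enumerate(term_freq.most_common(), start=1):
            records.append((token, rank, tf, doc_freq[token]))
        return records

    # Tiny illustrative corpus (not the thesis collection).
    docs = [["of", "asset", "of"], ["of", "statement"], ["asset", "of"]]
    for token, token_id, tf, df in build_token_id_list(docs):
        print(token, token_id, tf, df)
    # of 1 4 3
    # asset 2 2 2
    # statement 3 1 1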
Appendix E2: Clustering 905 Single Tokens Data Outputs

AppE2Sentence0td 905 Single Tokens Cluster Data: Top 50 clusters of the entire 905 Single Token Clustering Data in Sentence0td

[Cluster data table omitted: the extracted text interleaves the Sent0tdAB_A, Sent0tdAB_B, Sent0tdBA_A, and Sent0tdBA_B listings (cluster identifiers and merge distances), and the original column alignment is not recoverable here.]

Appendix E3: Single Tokens Evaluation Results (Provided on CD-ROM)

AppE3Sentence0td Evaluate Automatic Phrases: for the Top 50 Sentence0td Evaluate Automatic Phrases of 905 Single Tokens
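Each row of the Appendix E2 listings appears to record one agglomerative merge: a new cluster ID (starting at 906, after the 905 single-token IDs), the two clusters being merged, and the distance at which they merge; the token groups produced this way are the candidate phrases evaluated in Appendix E3. The sketch below shows one way such merge records can be expanded back into token groups. The record format is inferred from the listings, the first two records reuse IDs and distances visible in the original table, and the token names are purely hypothetical.

    def expand_clusters(merge_records, n_tokens, token_names):
        """Expand agglomerative merge records into the token groups they define.

        merge_records: (new_id, left_id, right_id, distance) tuples, where IDs
        1..n_tokens denote single tokens and higher IDs denote earlier merges
        (the format inferred from the Appendix E2 listings).
        """
        members = {i: [token_names.get(i, f"token_{i}")]
                   for i in range(1, n_tokens + 1)}
        for new_id, left_id, right_id, _distance in merge_records:
            members[new_id] = members[left_id] + members[right_id]
        return members

    # Hypothetical token names; the IDs and distances mirror the first two rows
    # of the Sent0tdAB listing, but the words attached to them are invented.
    names = {635: "stock", 638: "dividend", 827: "note", 834: "payable"}
    records = [
        (906, 635, 638, 0.056818),  # first new cluster after the 905 single tokens
        (907, 827, 834, 0.16667),
    ]
    groups = expand_clusters(records, n_tokens=905, token_names=names)
    print(groups[906])  # ['stock', 'dividend'] -> candidate phrase "stock_dividend"
    print(groups[907])  # ['note', 'payable']   -> candidate phrase "note_payable"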
