Using semantic document representation to increase performance in information retrieval Rongione, Nicholas 1999

Using Semantic Document Representations to Increase Performance in Information Retrieval

by Nicholas Rongione
B.Sc., Villanova University, USA, 1995

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia
October, 1999
© Nicholas Rongione, 1999

In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the Head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Computer Science
The University of British Columbia
2366 Main Mall
Vancouver, BC, Canada V6T 1Z4

Abstract

The first computational Information Retrieval projects were straightforward encodings of the card catalogues and look-up tables that had existed before. Soon after, these electronic indices evolved into more advanced structures, primary among them the Inverted Index (II) and the Vector Model (VM). These new structures expanded the domain of indexing science to large sets of texts heterogeneous with respect to content and length. This thesis presents an overview and critique of these techniques. It is found that these word driven methods are limited because they deal statistically with linguistic phenomena that resist that type of analysis. It goes on to suggest that semantic analysis of the target documents is needed to go beyond the 'keyword barrier' of traditional methods.
After a discussion of three ways to encode the semantic content of sentences, Jackendoff's Lexical Conceptual Structure (LCS) is selected as the most appropriate for this domain. The FINDER semantic information retrieval system, which uses the LCS representation for target documents and user queries, is described. This description, along with example searches, supports the claim that a semantic information retrieval system has practical advantages over one that uses traditional methods.

Contents

Abstract
Contents
Acknowledgments
1 Introduction
  1.1 The Domain of Interest
  1.2 A Cognitivist Approach
2 Related Work 1: Word Driven Methods
  2.1 Introduction
  2.2 Overview of Word Driven Methods
    2.2.1 Inverted Index
    2.2.2 Vector Models
  2.3 Summary and Critique of Word Driven Methods
3 Related Work 2: Approaches to Representing Meaning
  3.1 Introduction
    3.1.1 Meaning Representation Criteria for a Cognitivist Project
    3.1.2 Meaning Representation Criteria for IR
    3.1.3 Choosing a Meaning Representation
  3.2 T-Theories and Meaning
    3.2.1 Tarski's View of Truth
    3.2.2 Legacies of T-theories for Meaning Representation
    3.2.3 Using T-theories to Give a Formal Account of Meaning
    3.2.4 Assessing T-theories Relative to the Goals of IR and Cognitive Science
  3.3 Conceptual Dependency Graphs
    3.3.1 Motivations for Hypothesizing a Conceptual Structure and Canonical Meaning Representation
    3.3.2 Outline of Schank's Approach
    3.3.3 Strengths of CD Theory
    3.3.4 Weaknesses of CD Theory
  3.4 Lexical Conceptual Structure
    3.4.1 Defining Assumptions of LCS
    3.4.2 Developing the LCS Notation
  3.5 Conclusion
4 The "FINDER" IR System
  4.1 Overview
    4.1.1 Search Interface Client
    4.1.2 The Semantic Encoder
    4.1.3 Handling User Queries
5 Examples
  5.1 Examples
    5.1.1 Polysemy
    5.1.2 Synonymy
    5.1.3 Semantic Structure
6 Conclusion
  6.1 Future Directions
Bibliography

Acknowledgments

I would like to thank Professor Richard Rosenberg for his work as my research supervisor. He alternated letting me wander and guiding me at what I think were the right times. I owe a great debt to the research community in the LCI lab and to Professors Henry Davis and Andrew Irvine. Any understanding of linguistics and philosophy I have accumulated is due to their efforts. Lastly, I would like to thank my parents, who taught me that learning is fun, and Constance, who has subsequently tolerated all my learning.

NICHOLAS RONGIONE
The University of British Columbia
October 1999

Chapter 1 Introduction

This paper begins by tracing the development of Information Retrieval (IR) techniques from the early 1950s to the present time. In Chapter 2, the most important extensions to the core techniques are discussed. Next, an assessment of the current state of this technology is offered, and it is concluded that the current situation suffers from not using a representation for the target documents[1] that encodes semantic information. In Chapter 3 of the paper I describe Jackendoff's lexical conceptual structure (LCS) representation for sentence meaning and contrast it with important alternatives. This theory has advantages over others because it is well grounded in research from linguistics and it matches the needs of IR well. In Chapter 4, the FINDER IR system, which uses the LCS representation for documents, is described. In Chapter 5, I work out examples supporting the view that a system using the LCS representation will have better precision and accuracy[2] than systems that use non-semantic document representations. Chapter 6 is a summary of the project, and areas for future work are suggested.
1.1 The Domain of Interest

Information retrieval systems are built to operate over many domains. Often the elements of the domain have a predictable format and may even have been produced with indexing guidelines in mind. For example, systems that search for precedent cases for law firms can count on the case summaries being in a certain format. Other systems that search over a set of scientific documents are (usually) dealing with texts that have already been abstracted and have had keywords assigned from a canonical set (by human readers). There are other domains where the format is less specified but the content of the documents is limited and well defined. In all of these cases, information retrieval techniques specific to the situation can, and probably should, be used.

The domain of interest here is different; it is the situation where the document set is heterogeneous with respect to format, content, and length. An important set of this sort is the odd lot of documents which form a group only in that they comprise the World Wide Web (WWW). This domain is interesting for two reasons. Firstly, it is one for which people want a fast, precise, accurate, and flexible tool to search over. Secondly, by considering this less tractable problem, the project of IR becomes difficult enough to suggest the use of techniques from Artificial Intelligence like text understanding and knowledge representation. In more limited domains, well-tuned domain specific heuristics are often sufficient.

[1] 'Target documents' are just those that comprise the search space.
[2] Precision and accuracy are defined below.

1.2 A Cognitivist Approach

I pursue the goal of providing a practical IR system for a heterogeneous document set while taking what I will call a cognitivist approach. For as long as people have thought about intelligence, they have used machines as metaphors for talking about the brain [matl89].
First wax tablets, then mechanical looms, and later clockworks were thought to be good models for our mental processes. Recently, people have been making a stronger claim, saying that the brain is a type of computer. Investigating that claim, which can be called the cognitive hypothesis, is what the cognitivist approach consists of.[3] This is not to say that all researchers deliberately test this hypothesis when they talk about intelligent behavior. However, we can call work cognitive if it seeks to give a formal description for thought, or if it equates thought with the manipulation of symbolic objects in a procedural way. This is because a natural way to test the correctness and scope of those sorts of theories is to try to implement them on a computer. As projects in linguistics, psychology, philosophy and computer science all fit this description, cognitive studies are interdisciplinary. It is important to note that these theories are interesting only when they are predictive and explanatory of the empirical data of intelligent human behavior, not simply because they are formalizable.

To give some examples, I mention three projects of note. The first project is Chomsky's [chom57] theory of generative syntax. In hypothesizing that our ability to produce and recognize an infinite number of sentences is the result of a finite combinatorial mechanism, Chomsky is making a cognitive claim. In his theory, symbols are operated on successively to account for our tacit understanding of syntax. The structures that he uses to explain observed linguistic phenomena embody theories about language in an abstract sense, but they are also offered as an account of things existing in the human brain.[4] The second project is that of Shepard and Metzler [shep71]. In it the authors tested how long it took for subjects to judge that two drawings of a three dimensional object were depicting the same thing, seen from different viewpoints.
They found that the time needed to make this determination was precisely correlated with the number of degrees that the object in one image needed to be rotated so that it would be in the same position as the object in the second image. This study gives important evidence for the hypothesis that mental operations are similar to operations we would perform on physical objects. Ultimately, this is a restatement of the cognitive hypothesis. The last project I will mention is the theory of semantics given by Larson and Segal [lars95]. In it they attempt to account for the judgments of sentence meaning made by speakers of a language. This project is in many ways parallel to Chomsky's syntax project in that it accounts for an infinite ability with finite combinatorial means. The authors emphasize a functional characterization of these abilities. Furthermore, they claim that the functions used are computable functions. This is another restatement of the cognitive hypothesis.

[3] For a more thorough and philosophically supported statement of this hypothesis, which I have given in its barest terms, see Larson and Segal [lars95].
[4] Perhaps in a different but still recognizable form.

In designing this project I have been motivated by practical and cognitive concerns. To be practical, a project must suggest methods that will lead to better performance, according to established measures, than the methods that have come before. To be cognitive, the assumptions, techniques, and formalisms used in the project must, at least, not contradict the empirical data of intelligent human behavior. In fact, the project should draw its inspiration from non-contradicting cognitive theories, from whatever disciplines, that have a well established ability to account for and predict empirical psychological data. I will make frequent reference to both of these concerns, the cognitive and the practical, throughout this thesis.[5]
[5] The question begged at this point is "Why should we think cognitive approaches will be practical?". I leave this question to the side, only remarking that human cognition seems like a plausible guide in areas where humans excel and non-cognitive computational approaches have fared poorly. Text understanding, of which this project is a species, is this sort of area.

Chapter 2 Related Work 1: Word Driven Methods

2.1 Introduction

There is work from philosophy, linguistics and computer science that is relevant to this project. The following section reviews some of just those computer science projects that set out to provide an information retrieval method over unstructured texts. By unstructured texts I mean texts that don't share a common structure and whose individual structure is unknown to us. By information retrieval task I mean the job of selecting documents that are pertinent to a user's query from a large set. Although other systems are mentioned, the ones that receive emphasis are those that are completely automated. These systems have the practical advantage of not depending on expensive human labor. Also, they are more interesting theoretically because they push the burden of 'knowing', at least somewhat, onto the machine.

Current word driven methods are found to be inadequate because they do not represent the meanings of the target documents. Because of this, they miss documents that are on topic because those documents use different words than the query, and they return documents that are not on point because those documents coincidentally use the same words. It is hypothesized that translating the target documents from lists of sentences to lists of representations of the meanings of those sentences will allow for better performance. With that in mind, the second related work chapter reviews approaches to representing meaning.
2.2 Overview of Word Driven Methods

Most, if not all, of the information retrieval systems used today rely on inverted indices (II) or vector models (VM) or a combination of them. I will first describe those two important techniques and then I will mention some extensions. There are two basic tasks in information retrieval. The first is document pre-processing, where the documents are indexed and classified with the goal of building a database that will support user queries. This can be seen as providing a representation for documents. The second task is search (or retrieval), where a set of documents that is hoped to be relevant to the user's query is somehow obtained.

These goals were first approached with the tools that had already been developed in library science. As such, the early work was done with encodings of keyword in context (KWIC) indexes[6] and supported brittle Boolean queries [spar97]. The field took its first large step forward when H.P. Luhn gave evidence [luhn58] for the idea that the keywords themselves for the KWICs could be arrived at computationally and that even document abstracting could be done in the same way. Luhn's scheme used word frequency characterizations to select the keywords from target documents. Luhn pushed his technique further, using it to create abstracts. His method was to select sentences that had a certain density of the (computationally found) keywords and to use them to cobble together an abstract. From there it was a smaller step to develop more modern inverted indexes. The next notable advance in the development of modern IR came with Salton's SMART system [salt75], in which he developed the Vector Model of document retrieval for IR. Arguably, there has not been another significant advance since that time [bren96], [frak92]. Essentially, II and VM models are two possible ways to represent documents. They are described below.
Later in the thesis, when I claim that there is a limit to performance given these methods, I am claiming that overall results can only be as good as the representation that is used.

[6] These are indices where human selected keyword lists were created for each document. These are drawn from a predetermined set. The KWIC is an alphabetized list of these words with references to documents. A perhaps better known version of a KWIC uses the title words as the keys.

2.2.1 Inverted Index

An II, in its simplest form, represents the set of target documents as a matrix. The rows are labeled with all the terms in the language[7] and the columns are labeled with references to each document. So the (m, n) entry in the matrix, where m = a term and n = a document reference, is 1 if word m occurs in document n and 0 if it doesn't. The following is an example where, instead of a '1', the position of the term in the document is stored. That extension allows phrase searches, as words in arbitrary sequences can be checked for. Also, it adds the ability to search for a term 'near' another term. Reading a row out of the matrix gives a list of all the documents that contain the term that that row is labeled with. Reading a column gives the sorted list of the terms that the document contains.[8]

Domain of unstructured texts:

  doc 1: the cat and the hat are back
  doc 2: Gretzky scored a hat trick and beat the Devils.

Inverted Index (partial):

  Term  | Doc #1 | Doc #2
  ------+--------+-------
  and   |  (3)   |  (6)
  cat   |  (2)   |
  hat   |  (5)   |  (4)

[7] This is usually interpreted to mean all the terms used in the documents.
[8] This foreshadows the Vector Model interpretation of documents.

In the case of the Web, the references to the documents, here '#1' and '#2', would be replaced with URLs. There are many ways to refine and extend this structure. These include:

1. Not indexing a 'stop list' of small words that carry no content.
2. Not sticking strictly to the words that are in the document when it is encoded.
Examples include: adding synonyms, adding stems only, and not adding words that are too rare, instead adding broader terms.[9]

3. Using a term weighting scheme and including the weight of each term in the matrix entries.

Straightforward set union and intersection operators can be used to find all the documents that contain all of the words in a given query, or all the documents with some words but without others. Retrieving the set of result documents is typically fast because the number of documents that contain a given term is usually far less than the number of documents in the database [maul86].[10]

The inverted index is the back end data structure that serviced most early Boolean query systems. Those systems allow queries of keywords strung together with Boolean connectives. They are fast, precise and well understood. However, IIs can be criticized even without bringing up cognitivist concerns about their representation. One disadvantage of the II system is that, given the location in the matrix of a relevant document, nothing is known about the position of other relevant documents. That is a serious flaw because having the relationship between documents recoverable from the representation of target documents underwrites many desirable IR functions. These include:

1. ranking results
2. controlling result set size
3. using relevance feedback
4. organizing the documents into a database where like documents can be retrieved together.

The Boolean approaches that IIs most naturally support do badly with all of those tasks.

[9] These ideas foreshadow my suggested move toward a conceptual representation for documents. These methods are partly conceptual in that they use heuristics to abstract away from the surface form of documents.
[10] Union operations take constant time, and intersections will be no worse than O(n lg n), where n is the average number of documents in which a term appears. The size of the index will be O(the number of indexing terms times the number of documents).
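The positional inverted index and the set based Boolean retrieval described above can be sketched in Python. The thesis itself contains no code, so this is an illustrative reconstruction; the function names and data layout are my own choices, with positions stored 1-indexed to match the example table.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} (positions 1-indexed, as in the
    example above). Storing positions instead of a bare '1' enables phrase
    and 'near' searches."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def boolean_and(index, terms):
    """All documents containing every query term, via set intersection."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "the cat and the hat are back",
    2: "Gretzky scored a hat trick and beat the Devils",
}
index = build_inverted_index(docs)
print(index["hat"])                        # {1: [5], 2: [4]}
print(boolean_and(index, ["hat", "and"]))  # {1, 2}
```

A 'near' or phrase operator would then be a second pass over the intersected set, checking position differences within each surviving document.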
Of course, Boolean systems can be pushed to get better results. Trained experts can formulate the queries, as they used to do, or more functionality could be obtained by enhancing the index with weights and other statistics. However, none of those things can be accomplished with the ease and efficiency that the vector model, described below, makes possible.

2.2.2 Vector Models

Gerald Salton pioneered the field of contemporary information retrieval. He did the early work on vector models and, over the course of his career, investigated almost every conceivable refinement to that technique and others. What follows is a discussion of the fundamental aspects of VM theory and a few core extensions. I describe this in some depth as I adapt these techniques for use by the FINDER system.

The main idea in vector models is to interpret each document as a vector of its sorted terms and to interpret the set of target documents as a vector space. A vector space w1 × w2 × w3 × ... × wn is created, where n is the number of words in the language and each wi is one of those words. Each document vector has a 1 in the wi position if the word wi, from the list w1, w2, w3, ..., wn of the n words in the language, occurs in the document. Otherwise, there is a 0 in that position. A user query gets represented in the same way.[11] Any two documents can be compared and given a similarity statistic by comparing the vectors that represent them. Intuitively reasonable ways to compare vectors are the cosine of the angle formed by the vectors and/or their inner product. Which one is chosen depends on whether the aim is to emphasize the magnitude or the direction of the vectors. Using this model one can organize the target documents into a tree that allows for fast retrieval with more flexibility and power than IIs.

The Document Cluster Tree

The best way to describe this structure is to explain how it is created.
First a threshold value C is chosen.[12] Then the following algorithm is used:

1. Select an unprocessed vector V.
2. Compare this vector to all existing centroid vectors using cosine.[13]
3. If V is within C of one or more centroids:
   i. Add V to the cluster of vectors that is associated with the centroid it is closest to.
   ii. Recompute the centroid for that cluster. (The centroid is some measure of the central tendency of the cluster; it can be a vector that represents the mean, mode, etc.)
4. If V is not within C of any centroid:
   i. Create a new cluster.
   ii. Set the initial centroid of that cluster equal to V.

This basic scheme is easily extended[14] to use layers of super-centroid vectors to allow logarithmic time access to the reasonably sized document clusters at the leaves [salt83].

[11] These vectors are essentially the same as the column vectors of an II. The advantages of VMs are realized by emphasizing the vector space interpretation.
[12] Experience and/or experimentation informs this choice.
[13] Or whatever comparison function you have chosen.
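A minimal sketch of this clustering loop, assuming cosine comparison and mean centroids. The flat cluster list (without the super-centroid layers) and the example threshold are simplifications of the scheme above, not details taken from the thesis:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def cluster(vectors, C):
    """Single-pass threshold clustering (steps 1-4 above), mean centroids."""
    clusters = []  # each cluster: {"centroid": [...], "members": [...]}
    for v in vectors:                                               # step 1
        scored = [(cosine(v, c["centroid"]), c) for c in clusters]  # step 2
        sim, best = max(scored, key=lambda s: s[0], default=(0.0, None))
        if best is not None and sim >= C:       # step 3: join nearest cluster
            best["members"].append(v)
            n = len(best["members"])
            best["centroid"] = [sum(col) / n for col in zip(*best["members"])]
        else:                                   # step 4: start a new cluster
            clusters.append({"centroid": list(v), "members": [v]})
    return clusters

docs = [[1, 1, 0], [1, 0.9, 0], [0, 0, 1]]
groups = cluster(docs, C=0.8)
print(len(groups))  # the two similar vectors share a cluster: 2
```

Descending a tree of such centroids, rather than scanning one flat list, is what gives the logarithmic access time mentioned above.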
As may have been noticed, a possible drawback to this otherwise robust method is that it doesn't actually guarantee that the target documents it deems relevant will actually contain all the query words. However, this is not necessarily bad. For example, consider three sets of terms, A, B and C and imagine that there is a topic that is typically talked about using terms from (A and (B or C)). If the query contains terms from B only, the system can still return documents that use terms from A and C because documents using terms from A and C will be close to documents using terms from A and B in the vector space. That is good because they are on the same topic. Extensions Salton described automated methods for most aspects of information retrieval systems. A particularly neat enhancement suggested by Salton is called the dynamic document space. It is yet another idea that is easily implemented given the vector representation of documents. In the case where there is relevance feedback16 from users, the target vector can be shifted towards or away from the query that it was deemed relevant to by adjusting 1 4 See chapter 4 for a more precise statement of this algorithm. 1 5 See Strzalkowski [strz96] 11 the weights. Future queries that are similar to the one in question will retrieve documents more or less easily depending on the relevance judgments of past users. Salton achieved some success with this method in test domains but it remains unexplored on a large scale. Ideas like this one are not tied necessarily to the vector model but the vector model provides the most intuitive way of realizing them, hence its prominent place in the field17. In both industry and academia other extensions are attempted. Strzalkowski [strz96], for example, thinks that enhancing the inverted index with weights18 assigned in a principled manner can yield better results. Other researchers don't try to modify the mechanisms of VMs, they add another layer on top. 
This approach is driven by the observation that results sets often contain relevant documents; it is just that they usually contain a lot of irrelevant documents or noise. These systems are often a sort of filter. One researcher, Etzioni [etzi96] calls this "moving up the information food chain". One of his systems, Ahoy! uses the low-level search engines as tools to accomplish a goal for the human user. It finds home pages. First, it queries the Web's main search engines with the information it is given. It does not return the resulting glut of information though. Instead, having learned based on previous experience what form home pages addresses often conform to, it throws out URLS that are unlikely to be relevant. Another system, On Point [coop97] is a sort of expert system, creating good Boolean queries from a user's natural language input. The queries are incrementally less restrictive, and the first time a sufficient number of results is returned by the traditional search engine it interfaces with, they are provided to the end user. User judgments as to which documents in the initial result set are actually relevant to the query. 1 7 At this point I should mention that there is no reason not to use both techniques at once. Since the vectors in the V M are the columns of the II, the V M tree can be built on top of the II. There are innumerable ways to tune these systems to meet the demands of real time retrieval. I have ignored these concerns in favor of presenting the data structures that are needed to begin tuning. This dual procedure relates documents to one another in meaningful sets and also supports Boolean and phrase order features. Many commercial search engines seem to be employing both simultaneously. 1 8 He calculates these weights based on linguistic features in the text and lexicon and in this way his methods are different than the weighting schemes mentioned above. 
2.3 Summary and Critique of Word Driven Methods

The methods above have been successful, with variations of the vector model being the most used currently. These methods actually provide surprisingly good results in other domains, even those that have generally been thought to require more elaborate processing and representation. For example, in one study, it was found that grading freshman English essays with a vector model was as consistent and accurate as using human TAs. The success of these techniques in information retrieval and elsewhere seems to imply that a huge amount of information about a document is contained in the unstructured words or lists of words that are used in that document. That seems like a fair enough statement. And if true, it would be part of a case for not looking any further for an IR method. But is it true? I claim that it is only contingently true. I think it is easy to see why. To begin, an 'A' essay will get the same 'A' with the words scrambled if it is graded with a vector method. Also, an 'A' essay with all of the content carrying nouns and verbs translated to less used synonyms will get a low grade. What is happening is that the program is taking advantage of the empirical fact that knowing the vocabulary of a subject is highly correlated with learning its conceptual content. But those two things are not the same. One could know the words but misunderstand them, or know the concepts but, for whatever reason, use words for them that are less typical (or descriptive phrases could be used). It seems that any approach that is based only on words as atomic symbols[19] will never overcome these limitations. This means that no matter how much interesting processing is done to the words, it is just the case that the words, considered by themselves, only carry so much of the meaning of a text. Nothing more can be squeezed out of them.[20]
Even without considering those higher level problems, there are problems with words themselves that hamstring word driven methods. Consider synonymy and polysemy. To use Frege's example in an unintended context: a search on "the evening star" could be fruitless simply because all of the information is in documents whose authors referred to Venus as "the morning star". Another problem in traditional IR occurs when a word has more than one meaning. For example, a search including the word 'Java' could return information on coffee beans, a country, and a programming language, and there is no way to limit the results correctly. (See [maul86] for a more complete discussion.)

[19] The techniques below arguably still use words, but as entry points into articulated cognitive structures.
[20] Again, I admit that systems with good performance can and have been built in this way; it is just part of my claim that they work contingently, or almost by accident.

These ills are compounded when an even larger problem is considered, which is that word driven methods fail to encode any of the relationships that hold between the entities in a sentence. The study of both syntax and semantics, almost regardless of particular approach, reveals a predicate/argument type structure for language that encodes the relations between words or between concepts. This structure is missed by word driven methods. To take a simple example, queries for "college juniors" will not be able to be distinguished from queries for "junior colleges" using traditional methods.[21] To see that this problem is not easily solved, one only needs to consider the case where no ordering of the words that are actually in the document is part of a well defined query criterion. An example is searching for documents with a certain theme.[22] Ultimately, the reason that automated IR and its extensions are dissatisfying[23] is that they lack a meaningful representation for the target documents.
Using clever heuristics, weights, and vector interpretations, one can edge away from a strict adherence to the exact terms in the documents, but there is still a large gap between that and representing the concepts the documents express. The remainder of the thesis deals with the challenge of developing a meaningful document representation that can be computed and used for automated IR.

[21] This criticism is not entirely accurate, as one could search for the phrases themselves in this case.
[22] Because of these shortcomings researchers have often tried to build more structure on top of IIs and VMs. Extensions include NLP front ends that form expert Boolean queries, linguistically motivated weighting schemes, and others. It is also important to point out that fully automated IR is not clearly the dominant option even today, as indexes and hybrid systems abound.
[23] Witness the popularity of Yahoo, an old fashioned index, over AltaVista, a high powered automatic IR system.

Chapter 3 Related Work 2: Approaches to Representing Meaning

3.1 Introduction

Researchers from the various disciplines of Linguistics, Philosophy and Computer Science all attempt to give an adequate formal description for language. A complete theory includes an account of the meaning of language. Despite their common goals, the theories offered by the above mentioned three fields are often quite different. These differences arise because there are different criteria that one might use to characterize the search for an adequate formal description of language. One point that separates the approaches is the presence or absence of concern with psychological reality. Computer Scientists are often unconcerned, favoring approaches that work, even if only in small test domains. I suggest that we should avoid non-cognitive approaches (those unconcerned with psychological reality) for the following reasons:

1. The number of possible knowledge representations is large.
Taking psychological data into account constrains the number of representations under consideration.

2. Because human cognitive processes interface with each other, there is reason to believe that a system that mirrors one sub-part of human cognitive ability will be able to integrate smoothly with other parts built later.

3. To evaluate the Cognitivist Program, described in the introduction, it is necessary to hypothesize theories that are consistent with the data of empirical investigation and then test them by creating systems and evaluating their performance. Ad-hoc approaches do not inform the study of human cognition in this way.

3.1.1 Meaning Representation Criteria for a Cognitivist Project

Many meaning representations are possible. One could use scripts or frames like Schank's [scha72] to encode the meaning of entire documents or episodes, or use d-structure or LF from syntactic theory to encode the meaning of sentences. Other possibilities include semantic nets, logical formalisms, and probabilistic representations. Here I will only be interested in systems that are cognitively licensed. By that I mean systems that are adaptations of a theory that attempts to provide explanatory and predictive power for human linguistic ability. For cognitivist theories, measuring them against the prospect of implementation is a valid assessment of worth. The specific behavior that our theory must explain is linguistic. Specific examples of the phenomena it must explain will be discussed below, but briefly, this account must at least be compatible with these observations concerning human language:

1. Sentences mean things about the world. That is, there is a connection between language and the world.[24]

2. Humans can speak and understand an unbounded number of sentences.

3. Humans can speak and understand more than one human language while seeming to think in neither.[25]
3.1.2 Meaning Representation Criteria for IR

While deciding to consider only those theories that are informed by philosophy and linguistics and that are consistent with empirical studies of the nature of human cognitive processes constrains our selection of a meaning representation considerably, it does not narrow the field completely. To select from among the remaining alternatives, we should keep the criteria of this specific project in mind. A successful meaning representation for IR is just one that supports an IR system that achieves greater accuracy and precision than advanced vector model systems, without sacrificing performance and flexibility. The representation must support a system that:

1. Is usable over a set of documents with heterogeneous length and content (like the WWW).

2. Supports the construction of a data structure that allows for searches that execute as quickly as those supported by inverted indexes and vector spaces.

3. Has greater accuracy than VM systems. Relative to users' queries, accuracy is the percentage of relevant documents in a database that the system actually returns.

4. Has greater precision than VM systems. Relative to users' queries, precision is the percentage of returned documents that are relevant.

Footnote 24: This can be cast in different terms, as Jackendoff does, but this intuition must be addressed.

Footnote 25: Here I am repeating claims made in [mart96] and [dorr92]. This claim is not uncommon in the literature and may be partially implied by any project that seeks to describe a language-independent representation for thought. It is contestable, though, as most authors offer only introspection for support.

3.1.3 Choosing a Meaning Representation

Jackendoff's Lexical Conceptual Structure (LCS) meaning representation is the one that is used in this project. It meets the demands of both the information retrieval task and the general cognitive project.
To support this choice, it has to be made clear why other, perhaps more obvious, choices were passed over. Here I am talking about two classes of influential alternatives: Davidson-inspired theories of meaning[26] and Schank-inspired theories of meaning. Since LCS theory not only contrasts with those theories but is informed by them, it is fitting to give a brief account of each and then to present Jackendoff's work.

Footnote 26: Which are, in turn, inspired by Tarski.

3.2 T-theories and Meaning

Many current approaches to representing the meaning of a sentence hinge on the idea that a formal structure that encodes the meaning of a sentence can be computed systematically from the syntactic structure of the sentence. (See [hobb93] and [alle95] for good examples.) This intuitively plausible method has roots in the theory of truth presented by Tarski in [tars44], hereafter called a T-theory. As such, these methods inherit the strengths and weaknesses of Tarski's program.

3.2.1 Tarski's View of Truth

Tarski presented his work against a backdrop of persisting positivist skepticism towards metaphysical terms like 'truth'. His goal was to provide a formal account for the term 'true' that would be acceptably precise and material. In his words, the discipline of semantics should be "sober and modest" [tars44], seeking only to show how the truth of sentences is dependent on their structures and to reveal that certain problems in analysis arise simply for want of a meta-language. He proposed that any adequate theory of truth would have to be able to prove sentences of form (T):

(T) X is true if and only if p.

To explain this sentence schema, let us consider a concrete instance of it:

'Snow is white' is true if and only if snow is white.

The quoted sentence on the left hand side (LHS) of the biconditional is the name of a sentence in what is called the object language. The object language is just that, the object of inquiry.
Another way to interpret the LHS of the biconditional is as the structural description of the sentence [lars95]. The rest of the clause, including the right hand side (RHS) of the biconditional, is a sentence of the meta-language, that is, the language used to talk about the object language. Since, in this case, the meta-language contains the object language, the RHS of the biconditional alone is a sentence in both languages. Now we can see that 'X' is the name of a sentence (and its structural, that is syntactic, description) and 'p' is equivalent to the sentence itself.[27] The notion of truth is then taken to be built by the conjunction of all of the possible (T) sentences in a language.[28]

Footnote 27: Again, if the meta-language contains the object language then 'p' is just the sentence itself.

Footnote 28: Since there are an infinite number of sentences in interesting languages, this raises a logical question about the interpretation of a conjunction of an infinite number of clauses, but it is one that Tarski mentions he is not interested in [tars44].

This introduction to T-theory has been abstract. To see how a T-theory actually works, consider this simple example for the test language L.

Syntax of L:
(1) S -> Jack cries
(2) S -> Jill cries
(3) S -> S and S

Semantic Rules for L:
terminal nodes in structure
(1) 'Jack cries' is true iff Jack cries.
(2) 'Jill cries' is true iff Jill cries.
non-terminal nodes in structure
(3) [s S1 and S2] is true iff S1 is true and S2 is true.

Production Rules for L:
(SE) Substitution of Equivalents:
Given F(alpha), alpha iff beta,
the inference to F(beta) is valid.

(UI) Universal Instantiation:
Given For any S, F(S),
the inference to F(alpha) is valid.

Given the syntax and semantics above, it is easy to see that with a structural description of the sentence 'Jack cries and Jill cries' (say, in the form of a labeled bracketing such as [s [s Jack cries] and [s Jill cries]]), one can prove biconditionals of the hoped-for form:

[s [s Jack cries] and [s Jill cries]] is true iff Jack cries and Jill cries.[29]

Using this theory Tarski achieved two of his purposes. First, he could say something relating form to meaning for every well-formed expression in a language. Notice that the expression has to be well formed, that is, completely rule-following in its structure, for a T-theory to work on it. In Tarski's view, this excluded natural languages as an unproblematic candidate for this kind of analysis. Secondly, it unraveled problems that had been vexing philosophers at the time that resulted from using semantically closed languages.[30] Using closed languages, which are self-referential with impunity, it was difficult to identify the cause of paradox in sentences such as: 'The sentence on the 12th line of page twenty-two of this paper is not true.'

Footnote 29: One could be forgiven for thinking that this looks like a trivial result. For a discussion of its non-triviality see [lars95], [davi70], [tars44]. One key to its non-triviality is that the important information is revealed by considering the whole derivation or proof of a T sentence, not by just considering its final form (Davidson points this out). That is particularly the case when the object language is contained in the meta-language, as the T sentences are homophonic. If the meta-language is a logical formalism then the T sentences are superficially more interesting, at least for computer scientists (as we will see below).

Footnote 30: If a language "...
contains, in addition to its expressions, also the names of these expressions, as well as semantic terms such as the term "true" referring to sentences of this language... all sentences which determine adequate usage of this term can be asserted in the language" [tars44], it is called semantically closed.

From such a sentence one could derive: 's' is true iff 's' is false. (Here 's' is a replacement for the sentence in question.) For a full account of this problem see [tars44]. The important thing here for the future development of theories of meaning is the introduction of the meta/object language distinction that allowed this problem to be solved.

3.2.2 Legacies of T-theories for Meaning Representation

The legacy of T-theories accounts for the strengths and weaknesses of many current meaning representation techniques. As such, the following are three important points to note:

1. The structure of the LHS of the biconditional has no necessary relation to the structure on the RHS.

2. The grounding of the system in some sort of word-world relation (like reference, meaning, or a more colloquial use of truth) is left somewhat open-ended.

3. The applicability of T-theories depends on the formalizability of the object language.

To see that the first point is true, consider our language L with a different and plausible syntax, as such:

s -> [s conj]
conj -> and s

In this case our example sentence would receive this structural description:

[s Jack cried [conj and [s Jill cried]]]

With this reasonable replacement of semantic rule (3):

[s S1 [conj and S2]] is true iff S1 is true and S2 is true

we would still be able to prove biconditionals of our desired form (T). The important difference is that during the proof procedure a structure would be built on the RHS that was not the same as the structure on the LHS.
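The T-theoretic machinery for the toy language L is small enough to sketch in code. The following toy evaluator is only a sketch: the model assigning truth to the atomic sentences is an assumed stand-in for the world, and the tuple encoding of phrase markers is illustrative.

```python
# A toy interpreter for the test language L, following semantic rules (1)-(3).
# Terminal sentences are looked up in a model (a hypothetical "world");
# [S S1 and S2] is true iff both conjuncts are true.

# Hypothetical model assigning truth to the atomic sentences of L.
MODEL = {"Jack cries": True, "Jill cries": False}

def is_true(node):
    """Evaluate a structural description of an L-sentence.

    A node is either a string naming an atomic sentence (rules 1-2)
    or a tuple ('and', left, right) corresponding to rule (3).
    """
    if isinstance(node, str):                  # terminal node
        return MODEL[node]
    op, left, right = node                     # non-terminal node
    assert op == "and"
    return is_true(left) and is_true(right)    # semantic rule (3)

def t_sentence(node):
    """Render the T-biconditional proved for a structural description."""
    def flatten(n):
        if isinstance(n, str):
            return n
        return f"{flatten(n[1])} and {flatten(n[2])}"
    return f"'{flatten(node)}' is true iff {flatten(node)}"

# [s [s Jack cries] and [s Jill cries]]
example = ("and", "Jack cries", "Jill cries")
```

Since the meta-language here contains the object language, the proved biconditionals are homophonic; running the evaluator over the alternative bracketing discussed above would only require a second tuple shape, while the RHS sentence proved would be the same string, which is the structural-divergence point at issue.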
This is seen even more clearly if we consider a T-theory that uses a quantificational logic as the meta-language and follows a Russellian analysis of definite descriptions. For example, here is the T sentence for 'The daughter of Sara cried':

'The daughter of Sara cried' is true iff ∃x : daughterOfSara(x) ∧ cried(x) ∧ (∀y : daughterOfSara(y) → x = y)

We see here that the structures on the right and left hand sides are clearly divergent. The RHS introduces conjunctions and existential quantifiers that are not present in the LHS.

The second point is self-evident when attention is focused on meaning rules (1) and (2) from our sample language L. Here the sentences 'Jack cries' and 'Jill cries' are unanalyzed primitives.[31] Thus, this T-theory contributes to the understanding of the truth of compound entities but, at some point, analysis stops. In a more thorough system the unanalyzed primitives correspond roughly to the words in the language, but the limitation remains.[32] Since these theories don't nail down a strong theory about how the words stand with respect to the world, they are open to the criticism that they have left an important aspect of 'truth' unaccounted for.

Footnote 31: Tarski develops a fairly technical notion of satisfiability (his term) to partially ground his notion of truth, but it is too philosophical to be of interest here.

Footnote 32: Researchers have tried to interpret these symbols in various ways with set-theoretic models and appeals to reference, but none of these attempts have been wholly satisfactory [alle95].

The third point is also of interest. As is clear, the object language needs to be a formal language for the machinery of T-theory to work on it. As such, using a T-theory to help build a meaning representation carries with it the assumption that natural languages are formal languages. This assumption is made in a modified form by many modern linguists and computational linguists.
Typically wrapped in their claim that the generative study of syntax is the study of the implicit competence of speakers (contrasted with their performance) is the claim that this implicit competence is fully formalizable [chom57].

3.2.3 Using T-theories to Give a Formal Account of Meaning

It would be fair to wonder at this point why it is interesting to talk about truth when what we want is a theory of, and representation for, meaning. The reason for this is that well-behaved-looking T-theories often give exactly the object language/meta-language pairs that we would hope the phrase 'means that' would yield. Witness:

'snow is white' is true iff snow is white.
'grass is green' is true iff grass is green.

convert favorably to

'snow is white' means that snow is white.
'grass is green' means that grass is green.

The intuition is that we can use all of the machinery given to us by T-theories by simply replacing 'is true iff' with 'means that', and we will have a working theory of meaning. When Davidson proposed this [davi70], what have become standard objections were raised. An important one is that there is little assurance that a T-theory will not prove T-sentences like:

(1) 'snow is white' is true iff snow is white and 2 + 2 = 4.
(2) 'snow is white' is true iff grass is green.

Davidson acknowledges that these are potentially real problems. He even stresses that if a full system could be built that consistently paired truths with truths and falsehoods with falsehoods but derived (T) sentences like (1) and (2), then there would be no grounds from which to object to it as a good theory of truth or meaning [davi70]. This result is clearly intolerable. Philosophers are somewhat inconsolable on this and other points concerning the use of T-theories to underwrite theories of meaning [visi99].
Davidson, however, hypothesizes that our intuitions here are good and that the restriction that these T-sentences must exist in a full T-theory (one that can prove correct biconditionals for all the sentences in the language) insures that we would not have to accept T-sentences like (1) and (2). So a theory that could also handle indexicals in sentences such as 'That is white' and 'That is snow' would not prove untenable results like those in (2) above.

Many computational linguists have been happy to step in at this point and adopt this approach to meaning representation without further thought given to the assumptions about language that it requires or about its limitations. This is understandable, as the machinery of T-theories provides a satisfactory account of at least these important features of language:

1. That we can speak and understand novel sentences.

2. That we can speak and understand an infinite number of sentences.

3. That the meaning and structure of language are related.

4. That sentences have logico-semantic relations with other sentences. For example, the sentence "Bill persuaded Tom to jump" may imply "Tom jumped".

The first two points are both concerned with the generative character of language. That is, language is infinite while our means to understand it is finite. These features of language strongly imply that a theory of meaning must be compositional, accounting for the meaning of sentences based on the meanings of their parts. T-theories have this character. The third and fourth points are also strengths of T-theories. If the meta-language used is a formalism over which inference procedures have been defined, automated reasoning can proceed over the RHSs of the biconditionals.

3.2.4 Assessing T-theories Relative to the Goals of IR and Cognitive Science

We must, however, recognize the limitations of T-theories, if only to be sure of what work they leave undone.
Traditionally, semantic theory has also been concerned with facts about language such as these:

1. Sentences 'mean' things. That is, a sentence like 'Snow is white' has a relationship to facts in the world.

2. Sentences are sometimes ambiguous. For example, 'Mad dogs and Englishmen stay out in the noonday sun' has at least two meanings.

3. It is easy to construct anomalous sentences like the famous "Colorless green ideas sleep furiously" which are, nevertheless, still sentences.

T-theory-underwritten theories of meaning leave these problems to other levels of analysis. Reference, or grounding of terminal symbols, is not approached. Ambiguity is attributed to structural or lexical ambiguity, which are not taken to be part of semantics [lars95]. That is, the T-theory begins analysis with a structural description and lexical items inserted, and once those choices are made, ambiguity is no longer possible [lars95]. Lastly, T-theories have nothing to say about anomalous sentences as in 3 because they embody no theory of human conceptual structure that would, for example, disallow green things to also be colorless. They will happily remark that "'Colorless green ideas sleep furiously' is true iff colorless green ideas sleep furiously".[33]

Footnote 33: The attempted solution to this problem is often to enhance the syntax with selectional restrictions that encode facts about our legal conceptual formations. It continues to be controversial whether or not this is appropriate for the syntax [haeg94].

None of these limitations stop T-theories from being useful for IR. In that they conform to modern syntactic theory, which assumes that speaker competence is formalizable, they are consistent with linguistic and cognitive theory. I mentioned them just to point out what T-theories don't do, but in these cases it seems reasonable to let them off the hook.
They fall short, though, in capturing facts about language that have to do with thematic and conceptual semantic relations, such as sentence synonymy and thematic parallels. For example, there seems to be some shared element of meaning not only between "Jack ran to the store" and "Jack walked to the store" but also between "Jack went from London to Paris" and "His mood went from good to bad". T-theories don't capture this, as they don't capture even the clearer relation between "The ball was hit by the bat" and "The bat hit the ball". These problems might not be unsolvable within the context of a modified T-theory but, as they are conventionally constructed, T-theories don't address these features of language.

A potentially more serious cognitivist problem for T-theories is that they leave syntax unmotivated. This is not to say that it is ignored. The problem is that the semantic structures built during derivations depend on the syntactic structures systematically, but no necessary relationship between the structures is specified. This contradicts empirical data from the study of language acquisition, where many researchers have concluded that unless there is a constrained relationship between syntax and semantics, language would be unlearnable [jack83].[34]

The general form of Tarski's program is quite good and is used, in an important sense, by any compositional system that systematically links syntactic and semantic structure.[35] In those regards Jackendoff follows Tarski in his development of LCS.

Footnote 34: This is not to say that the situation is hopeless for T-theoretic conceptions of meaning. In Knowledge of Meaning, Larson and Segal hope to meet some of these concerns. They reformulate the biconditionals to include a Frege-inspired truth value for each constituent of sentences and for sentences themselves. This truth value changes the structure of their semantic rules, allowing them to place the restriction on them that they are strictly local and purely interpretive.
This has the effect of constraining the semantic forms tightly enough by the syntactic forms that they satisfy the empirical expectation that these forms be related, while leaving enough flexibility for it to be productive to do semantic analysis. However, questions about human conceptual structure are left open.

Because of the concerns noted above, along with the fact that the meta-language often used in the formulation of T-theories is first-order logic, I do not adopt them for use in this project. A brief look at Schank's CDG theory is in order now, as it emphasizes an analysis of conceptual structure and in that way answers some of the concerns left unaddressed by T-theory-motivated systems.

3.3 Conceptual Dependency Graphs

3.3.1 Motivations for Hypothesizing a Conceptual Structure and Canonical Meaning Representation

As productive as Tarski-inspired semantics are, they say nothing about a great many important and obvious facts about language. Issues such as these (some of which were mentioned above) go unanalyzed:

1. Sentences are sometimes synonymous.

2. Sentences' meanings are related. For example, there seems to be some shared element of meaning not only between "Jack ran to the store" and "Jack walked to the store" but also between "Jack went from London to Paris" and "His mood went from good to bad".

3. Sentences are often ambiguous with respect to the meaning of lexical items, the reference of pronouns, and other matters when considered out of context.

4. Sentences imply more information than they strictly reveal. For example, 'John flew in from Boston' could reasonably be understood to be a sentence concerning an airplane.

Footnote 35: Another good example of a system that owes this debt to Tarski is Woods'. His Lunar system employs a process that "can be conveniently thought of as a process that applies to parse trees produced by a parser
to assign semantic interpretations to nodes in the tree. The basic interpretation process is a recursive procedure that assigns an interpretation to a node of the tree as a function of its syntactic structure and the interpretations of its constituents." This is a good example to mention because it shows how T-theoretic approaches have had wide influence, and it also shows how one can add another level of interpretation onto the semantic forms of the RHS. Woods uses a notational variant of first-order logic for his RHS and then gives a procedural interpretation for all of the predicates, while the terms are interpreted as arguments for those procedures. In this way, English is converted, through an intermediate step of semantic analysis, into a database query language. [wood78]

As mentioned above, ambiguity and default assumptions (the problem of the airplane in point 4) are simply not thought to be part of semantics in Tarskian theories. The status of meaning relations is less clear. One way to reveal relations like synonymy between sentences is to use a canonical meaning representation language. Then sentences like 'John flew from New York to Boston', 'John took a plane to Boston from New York', and 'John went from New York to Boston by plane' are all encoded into one form, and the fact that they mean the same thing is clear. [maud89] Although it is not typically done, there is no reason why a Tarskian system couldn't use a canonical representation as its meta-language to address this issue.

To reveal that sentences have thematic similarities requires something more than just a canonical representation. It is necessary to construct a theory of the structure of human concepts. If one concept for non-mechanized human self-transport has been posited and various instances of it are differentiated by argument values in the structure, then the similarity between "Jack walked to the store" and "Jack ran to the store" would be very clear.
Again, there is no reason, in principle, why a Tarskian system could not use a meta-language that embodies a hypothesis about the structure of human concepts. The practical barrier to this, in addition to simple tradition, is that on the Tarskian view syntax is a stark business where the lexical items convey very little information about themselves other than what structural combinations they can exist in. That is, the lexicon doesn't encode information about how its members project and participate in semantic structures.

3.3.2 Outline of Schank's Approach

To address these problems, Schank and other researchers ([pust96], [jack83], [jack90], [fill68]) look to the lexicon and to theories of conceptual structure. Schank's view is that a syntactic description of language doesn't capture its most important or essential characteristics. What is needed is a semantic description. He says: "The key issue from my point of view, then -and my philosophy has not changed on this- was the creation of expectations about what slots need to be filled in a conceptualization and the embodiment of those expectations to guide both parsing and generation."

A typical CD parser scans sentences from left to right, ignoring syntactic structure, until it finds a word of a category that he calls structure building (these are usually verbs). These words carry with them an argument structure. The sentence is then scanned again, this time searching for words whose characteristics allow them to be arguments for the predicate that the structure-building word represents. His lexicon encodes all of the slots that need to be filled for each verb. For example, the verb 'give' needs to have a giver, a receiver, and an object. Once this verb is encountered, the sentence is processed again, this time in search of words that could fill those slots. Selectional restrictions are employed to decide whether or not a certain word could fill a certain slot.
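The scanning-and-slot-filling procedure just described can be sketched as a toy program. The lexicon entries, feature names, and selectional restrictions below are illustrative assumptions, not Schank's actual inventory:

```python
# A toy sketch of CD-style slot filling: find the structure-building word,
# then rescan for words whose features satisfy its slot restrictions.

# Hypothetical word lexicon with semantic features.
LEXICON = {
    "john": {"features": {"animate", "human"}},
    "mary": {"features": {"animate", "human"}},
    "book": {"features": {"physical-object"}},
}

# Structure-building words carry an argument structure: each slot names
# the selectional restriction its filler must satisfy.
VERBS = {
    "gave": {"primitive": "*atrans*",
             "slots": {"actor": "animate", "to": "animate",
                       "object": "physical-object"}},
}

def cd_parse(sentence):
    """First pass: find a structure-building word; second pass: fill slots."""
    words = sentence.lower().split()
    verb = next((w for w in words if w in VERBS), None)
    if verb is None:
        return None                     # no structure-building word found
    frame = {"<=>": VERBS[verb]["primitive"]}
    unused = [w for w in words if w in LEXICON]
    for slot, restriction in VERBS[verb]["slots"].items():
        for w in unused:
            if restriction in LEXICON[w]["features"]:
                frame[slot] = w         # first word satisfying the restriction
                unused.remove(w)
                break
    return frame
```

For "John gave Mary the book" this yields a frame with *atrans* as the primitive and john, mary, and book in the actor, recipient, and object slots. Note that the sketch deliberately reproduces the behavior criticized later: because it ignores word order and syntax, it will happily "parse" scrambled input so long as the words satisfy the slot restrictions.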
Using this method, all three of the sentences about John flying to Boston above would be encoded into this form [maul86]:

(cd (actor (*john*))
    (<=> (*ptrans*))
    (object (*john*))
    (inst (*airplane*))
    (from (*new-york*))
    (to (*boston*)))

3.3.3 Strengths of CD Theory

The first strength of this system is that, as demonstrated above, it is canonical. Schank's other major contribution is that his system posits conceptual structure. This allows meaning similarities to be encoded efficiently. For example, part of the representation for crime[36] is:

(*crime* (agent (*human*))
         (victim(s) (*human*)))

Footnote 36: This concept of crime says that crimes are committed by humans against humans.

(The *'s signal that the symbol is itself a concept with its own structure. For example, analyzing the concept *human* reveals that humans are things, animate, mammals, and intentional.) The representation for juvenile crime is:

(*crime* (agent (*human* (age (*less-than* 18))))
         (victim(s) (*human*)))

If some new form of crime comes to be known by the system, there is no need to re-encode that the agent of a crime must be human or that crimes have agents. As with 'juvenile crime', both semantic facts follow from clear inheritance relations that the shared 'crime' structure allows. In the same way, meanings that are sub-species of a shared root concept are systematically related. By having a limited number of primitive concepts (Schank's original system had twelve atoms from which all verbs were constructed), the thematic relations between meanings are made clear.

In summary, having a theory of conceptual structure and a canonical meaning representation addresses the problems mentioned above. This allows a CD system to:

1. Support translation. (The idea is that CD theory is canonical with respect to human concepts generally, so it is canonical across languages as well as within them.)

2. Reveal synonymy: 'the bat hit the ball' and 'the ball was hit by the bat' get the same representation.

3.
Reveal thematic relations: the meaning overlap of words like 'buy' and 'sell' is clearly described and transparently observed.

Because of these strengths, many systems continue to take inspiration from Schank's work or to resemble it while being independently motivated.[37]

Footnote 37: See [mart88], [li98], [li99], [wilk75] and [hind91] for examples.

3.3.4 Weaknesses of CD Theory

Disregard of Syntax

To see that syntax is necessary to do semantic interpretation, we can compare Schank's work to that of Noam Chomsky. Schank and Chomsky agree that a significant part of the meaning of a sentence is encoded directly in its form. They disagree on how much of the structure of an utterance must be revealed for us to interpret its meaning. In Aspects of the Theory of Syntax [chom65], Chomsky develops a system of selectional restrictions and strict sub-categorizations that is not really so different from the slots and slot restrictions in Schank. The lexical entries include a set of features that an entry either has or lacks. For example:

(boy, [+N, +Det _, +Count, +Animate, +Human, ...])
(frighten, [+V, + _ NP, +[+Abstract] Aux _ Det [+Animate], +Object-deletion, ...])

(the '_' represents the item itself in a sequence)

The important difference with Chomsky's restrictions is that they are embedded in a theory that analyzes syntactic structure (see Section 3.4 for more on that syntax). This allows for the interpretation of sentences with difficult syntax and for the identification of badly formed sentences. For example, "Sincerity may frighten the boy" is licensed because the features of 'boy' show that it is animate and the features of 'sincerity' show that it can appear in a context with an animate noun in a certain position relative to it. In contrast, Schank's system will accept ungrammatical input, and it has trouble with complex or unanticipated syntactic constructions. He will interpret strings of words that are not syntactically well formed if the words and slots fit together the right way.
So 'Frighten may sincerity the boy' would probably work out. As the syntactic constructions get more complex, the problem worsens.

Lack of a Principled Limit on the Lexical Projection of Structure

In CD formalism, as much default and implied information as possible is coaxed out of utterances. Sentences like "I want money" are represented including an unbound variable representing the originator of the money. So "I want money" is understood as "I want money to be transferred from some entity to me". Accordingly, the entity, unmentioned in the original sentence, shows up in his representation. This is an advantage in that it primes inferences about other discourse entities, the meaning of subsequent utterances, etc. The problem is that Schank does not describe any limit on how much structure should be projected. Schank himself says that in certain cases:

"... it should be obvious that we can never finish diagramming a given conceptualization. For the sentence [John ate the ice cream with a spoon], for example, we might have 'John ingested the ice cream by transing the ice cream on a spoon to his mouth, by transing the spoon to the ice cream, by grasping the spoon ... and so on ... we shall retain the ability to retrieve these instruments should we find this necessary" [scha73]

This problem not only inflates representations, it often means that CD systems over-interpret. In [ries82], 'Pope's death shakes Western Hemisphere' is interpreted to mean:

There was an earthquake in the Western Hemisphere.
The problem gets worse when we think more about what it means to project so much structure and default information. If no limit is enforced, then CD theory is burdened with being a theory of total cognition. And this is, to some extent, what Schank tried to provide, letting his structures grow until they included elaborate scripts and representations of long-term memory. These large scripts typically make the system less automated, as they are written and maintained by humans. The other disadvantage of these scripts is that they make the systems that use them brittle, as performance suffers in cases that are not handled by a pre-existing script. This alone disqualifies CD for use as a meaning representation language for the heterogeneous documents of the World Wide Web.

Poorly Motivated Conceptual Structure

Even if we were to allow that it might not be so bad to accept ungrammatical input,[38] the disregard of syntactic structure remains a problem from a cognitive point of view. The data that linguistic theory hopes to describe constitutes some of the most well studied and systematic evidence we have about how a mechanism of the brain works. For example, observe what words the pronoun 'one' can replace in the following pairs:

a large dog
a large one

an old white house
an old one

the car with the white roof in the garage
the one in the garage

If we were to treat sentences as if they had no structure, as Schank does, we might conclude from the above data that 'one' can replace any nominal element in a sentence.

Footnote 38: Indeed, it is a problem for fully formalized grammars of English that they are too restrictive with respect to speakers' performance. That is, we often speak and write ungrammatical sentences, seemingly indifferent to the fact that linguists have hypothesized that our tacit knowledge of language (our competence) is perfectly ordered.
But cases such as these demonstrate that this is not correct:

the king of Sweden
*the one of Sweden

the side of the hill
*the one of the hill

(* indicates a construction that is ungrammatical)

Analysis of this type of data has led linguists such as Chomsky and Jackendoff to propose that noun phrase structure conforms to this template from X-bar theory (described in 3.4.2):

[NP (Det) [NBAR (AP) [NBAR N (PP)] (PP)]]

(I expanded the template a little from its most general configuration to make it clear how it applies to the examples above. For a full account of this example and X-bar theory see [cowp92].) The behavior of 'one' can then be described easily by saying that it can only replace NBAR nodes in the final phrase marker. Without this analysis there is no way that a system can hope to properly interpret such pronouns. Syntax occupies a special place in cognitive science as it gives a direct, empirically verifiable look at how the brain is working. Given that other results [wexl73] from linguistics indicate that semantic and syntactic structure must be similar for language to be learnable, we should treat any non-falsified theory of syntax as a reasonable guide for developing a semantic theory. We can see evidence for this even in our simple example concerning the behavior of 'one'. It implies that a system that maps syntactic structure into conceptual structure should not posit conceptual items that correspond to a noun without also corresponding to its sister under an NBAR node in the syntactic tree. The evidence from syntax indicates that ignoring that sister leaves the concept incomplete. Disregarding this kind of information leaves Schank's CD theory unconstrained and incompatible with other cognitive results. The general point I am making is that hypotheses about conceptual structure must be empirically grounded.
That is, they should be supported by observations we have about how the brain works.

3.4 Lexical Conceptual Structure

T-theories are desirable because they systematically relate syntax to semantics, they have an articulated predicate-argument structure that supports inference, and they have a tightly constrained formalism. CD theory's strengths lie in its ability to encode thematic parallels in meaning, in its canonical representation, and in the fact that it posits a primitive level of organization for our thoughts, which provides us with a cognitive hypothesis to evaluate. However, in addition to not dealing well with the other's areas of strength, both theories lack a good integration of the facts of syntax. Jackendoff's Lexical Conceptual Structure (LCS) provides a synthesis of the strengths of these approaches and integrates syntax.

3.4.1 Defining Assumptions of LCS

Jackendoff's theory of LCS contrasts with both of the previous proposals and, at the same time, shares many of their goals and techniques. One way to begin the description of LCS is to look closely at the assumptions it makes that are different from those of the theories reviewed above. The important assumptions of this type are:

1. The assumption that syntactic structure and semantic structure are not just systematically related but that they are also similar.
2. The assumption that the objects of conceptual structure do not correspond to objects or types in the natural world, but to objects and types that exist in our internal mental schema.

Both of these points need elaboration. To support the first point, Jackendoff reasons that if the goal of language is to communicate information, it follows that it would function in as efficient a manner as possible. Consequently, if it is claimed that syntax and semantics have very different structures, then the difficult burden of explaining why language should have the structure that it does is inherited.
That is, why would a sentence have a form that is so different from the form of the thing that it expresses? Having a complicated and subtle syntax that doesn't directly reflect the structure of the meaning of the sentences seems unlikely. Jackendoff cites the results of [wexl73], who conclude that human language syntax would be unlearnable if it weren't reflected in its semantics. Jackendoff calls limiting consideration of meaning representations to those that are constrained by syntax obeying the grammaticality constraint. The second point is partially philosophical but has an important practical consequence. Jackendoff believes that we do not experience the world directly; instead, raw stimuli are processed by our brain and understood in terms of innate concepts. It is difficult (or perhaps impossible) for us to think outside of the terms and categories that this innate structure makes available to us. Jackendoff also assumes that our language reflects this innate conceptual structure. The important consequence of this is that we can use data from studies of language, vision, and cognition generally to propose the existence of items in our conceptual ontology. This gives a principled way to motivate the primitives of LCS and avoids a number of philosophical problems concerning the relationship of language to the objective world.[39]

3.4.2 Developing the LCS Notation

Verbs

Most theories interpret verbs as having predicate structure, T-theories and CD theory included. Jackendoff accepts this intuition. In fact, he, following Chomsky and other linguists, sharpens this idea with syntactic analysis. The point of their analysis is that one can characterize the semantic arguments to the verb syntactically. In the language of the Government-Binding approach to syntax [chom81] there are two main principles at work in this process.

X-bar theory

X-bar theory constrains the trees that describe the internal structure of sentences.
Lexical items from any of the major categories, Noun, Verb, Adjective, Adverb, or Preposition, can head phrases of the corresponding type. To head a phrase means that the lexical item projects structure above it that maximizes its modifiers. The general template is as follows, where X is a variable spanning the major lexical categories:

[major phrasal category [X2 [X1 [X <lexical item>]]]]

Concrete examples are as follows:

(1) The man put the book on the table for no reason
[S [NP the man] [V2 [V1 [V put] [NP the book] [PP on the table]] [PP for no reason]]]

(2) the nice pictures of Sally from Texas
[NP [Art the] [N2 [AP nice] [N1 [N pictures] [PP of Sally]] [PP from Texas]]]

(3) down the road near the tavern
[PP [P2 [P1 [P down] [NP the road]] [PP near the tavern]]]

Notice that the major phrasal category corresponding to a verb is a sentence. The rule[40] describing the structure projected by heads is:

Xn -> (C1) ... (Cj) Xn-1 (Cj+1) ... (Ck)
X0 -> lexical item
n <= 3

The Ci's can be any major phrasal category; that is, all non-head material must be a maximal projection. At this point we can see the beginnings of the syntactic characterization of semantic arguments to the verb. Consider (1) from above. 'S' is the maximal projection of V (here 'put'). The main point of X-bar theory, for our purposes here, is that lexical heads

Footnote 39: On Jackendoff's account we need only be concerned with matters in the conceptual world that our cognitive apparatus overlays on the natural world.

Footnote 40: Chomsky's formulation of X-bar theory is not the same as Jackendoff's, which is given here. Jackendoff's theory results in verbs subcategorizing their subjects, which is important for Jackendoff's semantics. In Chomsky's system this is not so. Instead, an unvoiced lexical element INFL (from 'inflection') is hypothesized as the head of sentences. Chomsky also adds an additional rule: X1 -> (C1) ... (Cj) X1 (Cj+1) ... (Ck).
Chomsky doesn't have as fine-grained an analysis of which syntactic positions are arguments of the verb and which are merely restrictive modifiers. Jackendoff disagrees with this formulation but does not think that his semantics is ultimately incompatible with it [jack83].

govern all the sisters of the Xn's, in that they project the structure that supports them. But it doesn't fit with our intuitions about the necessary parts of the meaning of 'put' that the PP 'for no reason' is an argument to it. This problem is a major concern, as there is no theoretical limit on how many modifiers of this sort (called restrictive modifiers, described below) can be present in a sentence. This situation is addressed by θ-theory.

θ-theory

According to θ-theory, part of what it means to know a verb is to know what θ-roles[41] it assigns. The verb 'sold' assigns three θ-roles: agent, theme, and goal. Not all of these roles need to be obligatory but, if they are present in the sentence, then they are arguments to the verb. For example, the 'goal' of 'sold' is optional:

Mary sold the book.
Mary sold the book to John.

Here "to John" bears the optional θ-role. In contrast, a restrictive modifier, one that is not an argument to the verb, does not bear a θ-role. The PP 'for no reason' in (1) above is such a modifier. Thus far all that has been done is to add to the richness of the lexical entries of verbs and to accept that added structure as a reasonable part of the syntax. What is needed is a syntactic description of where in the structure predicted by X-bar theory the θ-role-bearing maximal projections will occur. Jackendoff's thesis is that all θ-role-bearing phrases must be sister to X2 or X, which are called the subject and complement positions respectively.[42]

Footnote 41: θ-roles are a grammatical abstraction of thematic roles.

Footnote 42: This evidence is based on similar arguments to those given about the behavior of the pronoun 'one' above.
That is, he shows that this hypothesis explains the empirical data about how constituents undergo syntactic movement [jack77]. Also, he draws on evidence about the possible order of modifiers, citing the fact that most speakers find "John put the book for no reason on the table" to be ungrammatical. The conclusion is that the arguments and the verb must form a constituent without restrictive modifiers separating them.

First Approximation to LCS

With θ-theory and X-bar theory in place, we can see how to lift preliminary semantic structures off of syntactic structures. Given a syntactic description like:

[S [NP the man] [V2 [V1 [V put] [NP the book] [PP on the table]] [PP for no reason]]]

and a θ-grid[43] for 'put' of:

(agent, theme, location)

we know to look in subject and complement position for phrases to fill those roles, and that we can treat phrases in other positions differently. Other grammatical considerations tell us that the agent is in subject position (because no passive transformation has operated on the sentence) and that the theme is adjacent to the verb.[44] These syntactic characterizations of the arguments to the verb's semantic structure are embodied in the 'linking rules' of Dorr's UNITRAN system [dorr92], which I follow. This leaves us with the preliminary form:

[put('the man', 'the book', 'on the table') restrictive modifier('for no reason')]

Footnote 43: A θ-grid is just that part of a lexical entry that lists the θ-roles a verb assigns.

Footnote 44: For this sentence we also know that the theme will occur adjacent to the verb because it is expressed by a noun phrase. All noun phrases need to be assigned case, and this can only be done by adjacent verbal elements (V, PP, and INFL if your syntax includes it). This part of syntactic theory is summarized as the Case Filter.
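The reading-off of arguments just described can be sketched in code. The following is a toy illustration only: the phrase-marker encoding (nested tuples) and the hand-written θ-grid are my own assumptions for the example, not Dorr's actual UNITRAN rule set.

```python
# Toy linking sketch: fill a verb's theta-grid from the subject and complement
# positions of an X-bar phrase marker, and set restrictive modifiers aside.

# Phrase marker for 'The man put the book on the table for no reason':
parse = ("S",
         ("NP", "the man"),                  # subject position
         ("V2",
          ("V1", ("V", "put"),
           ("NP", "the book"),               # complements: sisters of V inside V1
           ("PP", "on the table")),
          ("PP", "for no reason")))          # restrictive modifier: outside V1

# Hypothetical theta-grid lexicon (agent realized in subject position,
# remaining roles read off the complements in order).
THETA_GRID = {"put": ("agent", "theme", "location")}

def link(tree):
    """Return (verb, role->phrase mapping, restrictive modifiers)."""
    _, subject, v2 = tree
    v1 = v2[1]
    verb = v1[1][1]
    complements = [phrase for _, phrase in v1[2:]]
    modifiers = [phrase for _, phrase in v2[2:]]
    roles = THETA_GRID[verb]
    filled = dict(zip(roles, [subject[1]] + complements))
    return verb, filled, modifiers

print(link(parse))
# ('put', {'agent': 'the man', 'theme': 'the book', 'location': 'on the table'},
#  ['for no reason'])
```

The point the sketch makes is the one argued above: once X-bar theory fixes where subjects, complements, and restrictive modifiers sit, the semantic argument structure falls out by simple positional reading, with no semantic inference needed.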
[haeg94]

Predicate Structure for Other Lexical Categories

The syntactic justification for treating verbs as having predicate structure, as a long tradition in philosophy [mart95], [gamu91] and our intuition would have us do, hinges in part on the fact that they project structure and govern the constituents of the S-phrase (sentence) that they project. By analogy, not only verbs but all major lexical elements should be treated as predicates in the semantic structure, because they have the same syntax as verbs. (That all lexical items from the major categories have the same syntax is a primary revelation of X-bar theory.) For example, the interpretation of this noun phrase:

'the nice pictures of Sally'
[NP [Art the] [N2 [AP nice] [N1 [N pictures] [PP of Sally]]]]

is properly:

[pictures('of Sally') restrictive modifier('nice')]

and 'down the road near the tavern'

[PP [P2 [P1 [P down] [NP the road]] [PP near the tavern]]]

is:

[down('the road') restrictive modifier('near the tavern')]

That is, nouns and prepositions should be interpreted as predicates. In fact, Jackendoff finds that lexical heads from all major grammatical categories should be interpreted as predicates that have 0 or more argument positions. This position is relatively unique to Jackendoff's approach. Other techniques treat nouns as referring constants, and prepositions are often evidenced only implicitly.

The LCS Ontology

The final questions to consider are which elements of the formalism can refer and to what can they refer. A typical interpretation is that noun phrases refer. That is, they pick out objects that exist in the world. A standard interpretation of 'The man put the book on the table' is:

∃x (Man(x) ∧ ∃y (Book(y) ∧ ∃z (Table(z) ∧ PutOn(x,y,z))))

Notice that there is one existential quantifier and variable for each noun phrase, signaling that those elements refer. But for Jackendoff, existence in the objective world is not an issue.
He wants to know what sorts of objects exist in our conceptual ontology; that is, the things that exist in our innate conceptual schema. For him, part of what is missed in the above analysis is that we behave as if locations (among other things) exist and that we use different types of phrases to refer to them. Consider these two uses of 'put':

I put the car [where the truck stopped].
She had to put the book [lower].

In these examples, the location is expressed by a sentence and an adjective phrase respectively. In fact, Cowper reports on page 63 that "any category that can express location can occur with 'put'" [cowp92]. Jackendoff takes linguistic evidence of the sort given below (pragmatic anaphora) to further support his dual position that there are many types of objects in the conceptual world and that all maximal projections can refer to them.

These questions:

1. What did you buy? [THING]
2. Where is my coat? [PLACE]
3. What happened next? [EVENT]

can be answered with these phrases:

1. A fish. (a noun phrase)
2. In the bathtub. (a prepositional phrase)
3. Billy fell. (a sentence)

Drawing on this evidence and evidence from other cognitive systems, Jackendoff proposes the existence of at least THINGS, PLACES, DIRECTIONS, EVENTS, MANNERS, ACTIONS, and AMOUNTS. Furthermore, he claims that the conceptual category that a lexical head maps into is part of the meaning of that head, not a consequence of its grammatical category. These considerations lead Jackendoff to conclude that in LCS not just nouns but all maximal projections refer to objects in the conceptual ontology.

Final LCS Representation

Under this analysis, part of knowing a lexical item is knowing not only its θ-grid but also the ontological type of the arguments it specifies. So for 'put' we know not only that it takes (agent, theme, location) but that they are of type (THING, THING, PLACE). Additionally, we know what ontological type of thing the maximal projection of the word picks out.
In the case of 'put' this is an EVENT.[45] We are now in a position to give the final LCS representation for 'The man put the book on the table':

[EVENT 'put' ([THING 'the man'], [THING 'the book'], [PLACE 'on' ([THING 'the table'])])] [46]

The primary strength of LCS is its ability to recover an appropriate argument structure for the semantic dependencies in a sentence using just a well-described syntax.

3.5 Conclusion

The LCS representation given here has advantages over the considered alternatives while retaining their strengths. Quantificational logic formulations such as are typical[47] of T-theories, like:

∃x (Man(x) ∧ ∃y (Book(y) ∧ ∃z (Table(z) ∧ PutOn(x,y,z))))

are lacking in at least three ways. First, the predicate PutOn has subsumed the preposition, adding bulk and complication to the representation. Now 'put in', 'put out', etc. all need distinct predicates, and the common semantic content of each, the contribution of 'put', is obscured. Second, only noun phrases are allowed to refer to objects in the ontology, and this contradicts normal language use. Lastly, all resemblance to the syntactic form has been lost. In contrast, LCS adheres to the syntactic form and referring use of language. It preserves the same embedding structure as the grammar, thus making it a more plausible cognitive theory.

Footnote 45: Overall, this analysis is similar to the one Davidson gives to motivate the addition of events into his ontology [davi67]. Jackendoff could be understood as extending his reasoning and in doing so positing a fuller ontology.

Footnote 46: The LCS form I have given here is actually still an approximation. In pure LCS there are no lexical items (here they remain and are in single quotes), as they are mapped into canonical symbols. More of the formalism will be introduced below; here I dealt only with what was needed to make Jackendoff's theoretical points.
Footnote 47: Most formulations would also evidence the influences of Davidson and Russell [russ05], [davi67]. A Russellian interpretation would treat all of the noun phrases as definite descriptions. For Russell, a definite description like "the author of Waverley" has at least this much structure: there is an x and it is not always false that Wrote(x, Waverley), and 'if x and y wrote Waverley then x and y are identical' is always true. This interpretation is motivated by Russell's concern with the analysis of sentences like "Scott is the author of Waverley". He wants definite descriptions to make a claim for each way in which a statement containing them should reasonably be considered false. So our sentence could be false if no one wrote Waverley or if more than one person wrote Waverley. (It could be false in other ways but they correspond to the semantic contributions of the rest of the sentence.) Davidson would add the existence of an EVENT. Neither addition affects the criticism offered here, so I ignored them for clarity.

The LCS approach and CD theory both have canonical representations and a primitive conceptual structure. LCS is preferable, though, because its semantic primitives are motivated by empirical evidence and because the way that the syntax informs semantics is not ignored. LCS meets the criteria for a cognitivist project. That is, for the reasons described in the previous section it is compatible with the empirical data of linguistics and is therefore a valid hypothesis about how the brain works. The other criterion that our meaning representation must meet is that it be well suited for the IR task over a domain of heterogeneous documents. This is true for three reasons. First, it operates sentence-wise, not needing human-created and tuned scripts to interpret documents.[49] Second, since it follows the syntax closely, it is not prone to over-interpret.[50]
And lastly, as we will see in the next chapter, it allows the creation of data structures that support efficient querying and retrieval.

Footnote 48: I have not given examples of this property. For examples of this sort see [dorr87], who uses LCS as an interlingua for machine translation.

Footnote 49: Here I am passing off a limitation as a strength. LCS is a framework for representing sentence meanings only, and that happens to be a nice fit with this IR task. Because it is a good theory, LCS could form the first level of processing in a system that did interpret document meanings, but that task is simply not in its scope.

Footnote 50: This is actually false. LCS can over-interpret, but to a lesser extent. What I have presented here is a subset of LCS theory. In the full account, more interpretation that is based less concretely on syntax is developed. The 'put' EVENT is defined in terms of two more primitive events, the CAUSE EVENT and the GO EVENT. The resulting structure for our example sentence is:

[EVENT CAUSE ([THING 'the man'], [EVENT GO ([THING 'the book'], [PLACE on ([THING 'the table'])])])]

Jackendoff gives an elaborate motivation for this structure and its continued relation to empirical data. These more advanced structures have proved useful for those researchers who use the LCS formalism for inference, translation, and discourse understanding. These extensions to the well-motivated core of LCS don't have a stable definition yet. I have avoided them to make the exposition about the grammaticality constraint as clear as possible. I add these elements to the notation in the Examples section.

Chapter 4

The "Finder" IR System

4.1 Overview

This section describes how the LCS framework can be put to use in an IR system. Up to this point I have stressed advantages of LCS from a theoretical cognitivist perspective.
That is, I have demonstrated that the framework fits with empirical data about cognition,[51] embodies a testable hypothesis about that data, and could plausibly be integrated into more ambitious text understanding systems because of its cognitive character. In this chapter and the next, the case will be made for the practical value of LCS for IR. Here I describe how LCS can be used to create data structures that support IR. In the next chapter I support the hypothesis that the resulting system would perform better in that task than traditional methods. What follows is a description[52] of the Finder IR system for use over large sets of heterogeneous text like the WWW.

Footnote 51: Mostly from linguistics.

Footnote 52: This is a description of the system as I envision it being implemented. The examples reported in the next section were done using Randy Sharp's graciously supplied parser, from which I could hand-generate LCS structures. The LCS linking rules that I used as my guide in this endeavor were Dorr's [dorr92].

[Figure: The Finder IR System. The search interface (a web page) submits the user's query phrases to the Semantic Encoder, where an English sentence is input to an X-bar syntax parser, the resulting phrase marker is passed to the LCS converter (backed by the LCS lexicon), and an LCS encoding of the sentence is output. The query LCS is sent to the database of 'semantic documents' (lists of LCS forms, stored in a document cluster tree); the semantic document comparison algorithm, which computes a scalar measure of similarity between two semantic documents, is used to descend the tree to the relevant leaf node, and likewise to add new semantic documents to the database. The comparison algorithm then ranks the result set at the relevant leaf node, and the relevant document addresses are returned to the user as a results web page.]

The main components of the system are:

A.
The Search Interface Client (SIC).
B. The Finder Server, which consists of:
1. A database of semantic documents. This database is a document cluster tree.
2. The Semantic Encoder, which translates English sentences into LCSs. This component has these sub-components:
i. An X-bar-conforming syntactic parser
ii. A phrase-marker-to-LCS converter
iii. An LCS lexicon that stores the LCS structure for each word.

I describe each in turn.

4.1.1 Search Interface Client

The SIC is what the user sees on the Finder web page. It has a simple interface that asks for phrases or sentences that have to do with the topic that the user is interested in. Here the example text 'The speedy destruction of Alexandria' has been used:

Finder Document Retrieval System
Please enter one or more phrases or sentences that are relevant to your topic: (separate with colons)
The speedy destruction of Alexandria [SEARCH]

The SIC sends the user query to the Finder server. There it is translated into an LCS form by the Semantic Encoder (SE).

4.1.2 The Semantic Encoder

The first thing that the SE does is parse[53] an input phrase, sentence, or word. (In the case of a word, the Lexical Conceptual Structure associated with it is projected but all the arguments are empty.) For our example, the LCS lexicon entry for 'destruction' is:

destruction:
  category: noun
  sub-categorization: _PP
  (θ-roles, LCS positions, restriction): ((patient), complement, (THING ∨ PLACE))
  ontological type: EVENT

The category and sub-categorization fields are the traditional syntactic characterization of a lexical item. They are present in the LCS lexicon because it is shared between the syntactic parser and the LCS converter. The (θ-role, LCS position, restriction) triplets encode the information that is needed to link the syntax and the semantics. Optional θ-roles are put in parentheses.
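An entry of this shape might be stored as a simple mapping. The sketch below is a hypothetical in-memory layout, not the thesis's actual data format: the field names follow the entry shown above, and the type restriction is modeled as a set of allowed ontological types.

```python
# Hypothetical in-memory LCS lexicon entry for 'destruction'.
# Field names and the set-based restriction are illustrative assumptions.
LEXICON = {
    "destruction": {
        "category": "noun",
        "subcategorization": "_PP",
        # one record per theta-role: parenthesized roles are optional
        "theta_triplets": [
            {"role": "patient", "optional": True,
             "position": "complement", "restriction": {"THING", "PLACE"}},
        ],
        "ontological_type": "EVENT",
    },
}

def check_argument(word, role, arg_type):
    """Return True if arg_type may fill the given theta-role of word."""
    entry = LEXICON[word]
    for t in entry["theta_triplets"]:
        if t["role"] == role:
            return arg_type in t["restriction"]
    return False

print(check_argument("destruction", "patient", "PLACE"))   # True
print(check_argument("destruction", "patient", "MANNER"))  # False
```

Sharing one structure like this between the parser and the LCS converter reflects the design point made above: the syntactic fields (category, sub-categorization) and the semantic fields (θ-triplets, ontological type) live in a single lexicon.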
The restriction is similar to, and sometimes partially redundant with, the sub-categorization, but it doesn't exclude exactly the same set of phrases in the given position. The syntactic description for the example phrase is:

[Figure: Example parse of 'The speedy destruction of Alexandria', with 'speedy' in restrictive-modifier position and 'of Alexandria' in complement position.]

This phrase marker is then processed by the phrase-marker-to-LCS converter, which produces a predicate-argument encoding of the semantic dependencies in the sentence. The work of this process is done almost entirely by the syntax. All that remains is to read the arguments from the positions that they are predicted to appear in. As mentioned above, I use the linking rules given in [dorr92] for this purpose. The converter also adds labels to the bracketed structure it creates that correspond to the ontological categories that each predicate in the structure maps into. In this case the LCS produced is:

[EVENT destruction ([PLACE of ([PLACE Alexandria])]) RM ([MANNER QUICKLY])] [54]

The semantic encoder is also used to convert raw target documents into semantic documents. The SE converts the raw document from a list of sentences into a list of corresponding LCS forms. Unparsable sentences are skipped.

Footnote 53: What method is used to parse the input is not important. What is essential is that the final phrase marker be licensed by a form of X-bar theory and θ-theory that is compatible with the realization of LCS that is being used. Dorr has provided such a system for use in her machine translation systems [dorr87], [dorr88], [dorr92]. In the examples I give here I follow Dorr, but have converted the syntax back to Jackendoff's notation because that is the one that I used to describe LCSs originally. (She uses Chomsky's.)

Footnote 54: A typographic convention I have been using for the LCSs is that symbols of the canonical LCS formalism are in all capitals, whereas words from the actual sentence follow normal capitalization rules.
The 'RM' (meaning restrictive modifier) component of the notation follows Jackendoff's treatment, which he calls provisional. It stretches, but obeys, the grammaticality constraint, and deals with the fact that words have open-ended polyadicity despite the fact that only a few arguments are strictly part of their meaning. For a full account of the details of LCS see [jack83] and [jack90].

The Semantic Document Database

A Semantic Document (SD) is a list of LCS forms. The goal is to organize these documents into a tree like the one described in Section 2.2.2, with an algorithm similar to the one given there. With respect to creating a document cluster tree, there are significant differences between the vector document representation and the LCS document representation. They are:

1. There is no easy way to 'average' a set of SDs to create a centroid LCS.
2. There is no clean mathematical similarity measure that can be computed between two SDs.

The project is still possible, though, even considering these differences. What is needed is a method for building the tree that chooses centroids without averaging, and a definition of a similarity measure. First, I will assume that we have a similarity measure and describe the algorithm for constructing the SDD. Then, I will describe the similarity measure.

The SDD Construction Algorithm

The algorithm is initialized by finding two SDs for which similarity(SD1, SD2) > some threshold value. These are used to create the initial tree.

Each node in the tree has:
a pointer to the left sub-tree's root node
a pointer to the right sub-tree's root node
a 'centroid' SD called C
a set of semantic documents, empty unless the node is a leaf, called S

Assume the following:

closest(SD: A, *node: P, int: Val)
A is a Semantic Document. P is a pointer to a node in the SDD tree. Return value: a pointer to the child of P whose 'centroid' LCS is closest to A.
This is computed by comparing the values of the similarity measure between A and the 'centroids' of the children of P. The value of the similarity measure between A and the closer child is placed in Val.

splitter(SDSet: Fat, SDSet: Slim1, SDSet: Slim2)
This sub-routine splits a large set of SDs into two smaller sets.

Threshold = t
A constant, chosen based on experience and tuning, that gives a quantitative meaning to the idea that one SD is close to another. That is, their similarity measure is less than Threshold.

SizeLimit = 10
A constant that limits the number of SDs that are allowed to be grouped at a leaf node in S before the splitter routine is called.

The algorithm then proceeds:

1. Select an unprocessed SDnew.
2. Set node pointer Current = root.
3. Compute Closest = closest(SDnew, Current, Val).
4. If (Val < Threshold) then
     If (Closest is a leaf) then
       add SDnew to Closest.S. If that action makes count(S) > SizeLimit then GOTO BRANCH.[56]
       Else GOTO 1.
     Else descend to Closest. (Set Current = Closest and GOTO 3.)
   Else
     If leaf?(Closest) then
       Create 2 new nodes A and B:
       A: left = right = null; C = SDnew; S = {SDnew}.
       B: left = Closest; right = A; C = SDnew.
       Either Current.left or Current.right == Closest; whichever does, set it to point at B.
       GOTO 1.
     Else descend to Closest. (Set Current = Closest and GOTO 3.)

This algorithm builds a binary tree with clusters of no more than SizeLimit (here 10) documents at each leaf. As the tree is constructed, the leaves are pushed away from the root and more internal nodes are added. In this way, the tree can be built incrementally[55]

Footnote 55: I describe the cases where the two nodes are both leaves and where neither of them is. The one-leaf, one-node cases are straightforward extensions.

Footnote 56: BRANCH is a sub-routine that uses splitter, mentioned above. It compares all the elements in a set of SDs and finds the two that are the most dissimilar. It then creates two new nodes in the tree using those SDs as the C's.
These are made the left and right children of the current node (which is no longer a leaf). The members of S are re-assigned to these two children based on which they are closer to.

In this way the tree can be built incrementally, without knowledge of how many documents need to be represented. The centroid vectors of section 2.2.2 are replaced with the C vectors, which are pseudo-centroids. The process of creating the tree guarantees that they will be similar to the documents clustered around them, but they may be towards the 'edge' of the set. As the set sizes are small this should not be a problem; however, it is certainly a matter for testing.

The tree only stores the addresses of the documents in question, except in the case of the SDs that are Cs. When the system needs to access the documents it must fetch them and reprocess them into SDs. This makes BRANCH an expensive procedure. It is done off-line, though, and not often once a large tree is in place.

The time cost of constructing the tree is linear in the number of documents to represent. That the constants are large should not pose a problem, as construction is executed off-line. The space needed for the tree is also linear with respect to the number of documents represented. Since this is the same as what inverted indices and vector models require, I assume that this is reasonable.

The Similarity Measure

I now turn to a description of how to compute the similarity between two SDs. Remember that an SD is a list of LCS forms. To compute the similarity between SD1 and SD2, the similarity measure compares each LCS form in SD1 to each form in SD2. This algorithm is O(n^2) in the number of LCS forms. However, it is linear when one of the SDs contains only one form, which is typically the case when a query LCS is being compared to C SDs. The similarity measure is equal to the sum of all of the similarities between the LCS forms.
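To make the construction procedure concrete, here is a rough Python sketch of the tree insertion, the BRANCH split, and the query-time descent described under Handling User Queries below. All names are hypothetical (the thesis gives only pseudocode), SDs are modeled as plain strings, and I assume a similarity function where a higher score means two SDs are closer, so the Threshold test reads 'at least t'.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

SD = str  # stand-in: a Semantic Document is really a list of LCS forms

@dataclass
class Node:
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    c: Optional[SD] = None                     # the pseudo-centroid C
    s: List[SD] = field(default_factory=list)  # S: non-empty only at leaves

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

THRESHOLD = 3    # tuning constant: minimum similarity to count as 'close'
SIZE_LIMIT = 10  # maximum leaf cluster size before BRANCH is invoked

def initial_tree(sd1: SD, sd2: SD) -> Node:
    """Seed the tree with two sufficiently similar SDs as leaf centroids."""
    return Node(left=Node(c=sd1, s=[sd1]), right=Node(c=sd2, s=[sd2]))

def insert(root: Node, new: SD, sim: Callable[[SD, SD], int]) -> None:
    current = root
    while True:
        # closest(): child of `current` whose centroid is most similar to `new`
        closest = max((current.left, current.right), key=lambda n: sim(new, n.c))
        val = sim(new, closest.c)
        if val >= THRESHOLD:                     # `new` is close to this cluster
            if closest.is_leaf():
                closest.s.append(new)
                if len(closest.s) > SIZE_LIMIT:
                    branch(closest, sim)
                return
            current = closest                    # descend and repeat
        else:                                    # `new` is far from this cluster
            if closest.is_leaf():
                # graft: new internal node B replaces `closest`, with a fresh
                # leaf A that holds only `new`; both A and B get C = new
                a = Node(c=new, s=[new])
                b = Node(left=closest, right=a, c=new)
                if current.left is closest:
                    current.left = b
                else:
                    current.right = b
                return
            current = closest                    # descend and repeat

def branch(leaf: Node, sim: Callable[[SD, SD], int]) -> None:
    """BRANCH: split an over-full leaf around its two most dissimilar members."""
    members = leaf.s
    _, c1, c2 = min((sim(a, b), a, b)
                    for i, a in enumerate(members) for b in members[i + 1:])
    left, right = Node(c=c1), Node(c=c2)
    for m in members:                            # re-assign to the closer centroid
        (left if sim(m, c1) >= sim(m, c2) else right).s.append(m)
    leaf.left, leaf.right, leaf.s = left, right, []

def query(root: Node, q: SD, sim: Callable[[SD, SD], int]) -> List[SD]:
    """Descend with the query as SDnew; return the cluster at the reached leaf."""
    current = root
    while not current.is_leaf():
        current = max((current.left, current.right), key=lambda n: sim(q, n.c))
    return current.s
```

With a toy similarity measure (shared characters), inserting a document close to an existing centroid grows that leaf, while a distant document grafts a new branch above it.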
The similarity score between two LCS forms[57] is calculated by the following rule. Starting from the outermost predicate and working in:

+0 points for having the same ontological type[58].
+1 point for having the same restrictive modifier (RM).
+1 point for having the same primitive.
+2 points if the primitives are the same and they have the same field.
+1 point if the primitives differ, but have the same field.
+3 points if the primitive, field, and lexical item are the same[59].
+1 point for each argument to the primitive that has a counterpart (is of the same type and position) in the other LCS.

The scheme is recursive, evaluating the arguments mentioned in the last line according to the same measure. Lastly, if one LCS is a potential sub-part of the other LCS, that LCS and the relevant sub-part are compared using the scheme. An LCS is a potential sub-part of another LCS if it matches one of its arguments by having the same ontological type, primitive and field. This added measure handles the case where one LCS is contained deep within the other, which would be missed using the flat grading scheme from the top level only.

This grading scheme is admittedly ad-hoc. It is an area for future research to develop a more principled approach to making similarity judgments. It may have practical value nonetheless, which awaits empirical evaluation. Here, and in the following section, I provide examples designed to suggest its plausibility.

[57] A few more remarks on the syntax of LCSs are in order at this point. A verb like 'was' maps into an LCS primitive that picks out something of ontological type STATE. The ontological type is sub-scripted on the square brackets. A primitive of the type STATE is BE. Primitives are in all capitals and are not sub-scripted. Primitives are organized in the lexicon into fields. The field a primitive is in predicts restrictions on its argument structure. The field is sub-scripted in capitals on the primitive, for example BE_IDENTITY.

Example Use of the Similarity Measure

Consider the three sentences:

1. John was very happy.
2. John was so very happy.
3.
John ate very happily.

And their corresponding LCS forms:

(1) [STATE BE_IDENTITY was ([THING John], [PROPERTY HAPPY ([INTENSIFIER VERY])])]
(2) [STATE BE_IDENTITY was ([THING John], [PROPERTY HAPPY ([INTENSIFIER SO ([INTENSIFIER VERY])])])]
(3) [EVENT ate ([THING John], [THING ?x][60]), RM([MANNER HAPPILY ([INTENSIFIER VERY])])]

The similarity measure computes the following values:

similarity of:
(1) to (2): 13
(2) to (3): 4
(3) to (1): 4

These values correspond with our intuitions concerning which of the pairs of sentences are most closely related in meaning. One might also note that word driven systems would not have much with which to distinguish among these three sentences if the keywords used were 'John' and 'happy'. The LCS driven method distinguishes these sentences easily[61]. More examples are given in the Examples section below.

[58] There are very few ontological types, so giving points here is too permissive.

[59] In pure LCS notation there are no lexical items left. I leave nouns and verbs in the notation. The nouns are left because a canonical representation for every thing in the world of discourse has not been provided. If it had, we could map all references to J.F.K. into one thing, but lacking that I keep the nouns that don't subcategorize (vs. 'destruction', which does) in the representation. The verbs are left in the notation because LCS is mostly concerned with predicate argument structure. As such its treatment of tense seems to me to be lacking. As a stop-gap I leave the lexical verbs in the representation to provide inflectional information.

[60] This is my notation for an unbound variable, which occurs when an optional argument to a predicate is absent.
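The point scheme lends itself to a direct recursive implementation. The sketch below is one possible cumulative reading of the rules (all class and field names are hypothetical; intensifier arguments and the potential-sub-part comparison are omitted for brevity, and retained nouns without primitives are credited with the lexical-item bonus):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LCS:
    otype: str                          # ontological type, e.g. 'STATE', 'THING'
    primitive: str = ""                 # e.g. 'BE', 'GO' (empty for bare nouns)
    fld: str = ""                       # lexical field, e.g. 'IDENTITY'
    lex: str = ""                       # retained lexical item, e.g. 'was'
    rm: List[str] = field(default_factory=list)   # restrictive modifiers
    args: List["LCS"] = field(default_factory=list)

def score(a: LCS, b: LCS) -> int:
    """Apply the point scheme from the outermost predicate inward."""
    pts = 0                                       # +0 for same ontological type
    pts += len(set(a.rm) & set(b.rm))             # +1 per shared RM
    if a.primitive and a.primitive == b.primitive:
        pts += 1                                  # same primitive
        if a.fld and a.fld == b.fld:
            pts += 2                              # same primitive and field
            if a.lex and a.lex == b.lex:
                pts += 3                          # primitive, field and lexical item
    elif a.fld and a.fld == b.fld:
        pts += 1                                  # fields match, primitives differ
    elif not a.primitive and a.lex and a.lex == b.lex:
        pts += 3                                  # retained noun matches (cf. 'Steve Carlton')
    # +1 per argument with a counterpart (same type, same position), recursively
    for x, y in zip(a.args, b.args):
        if x.otype == y.otype:
            pts += 1 + score(x, y)
    return pts
```

For instance, two identical 'John was ...' forms collect the predicate, argument, and noun bonuses, while changing the noun in one argument removes its lexical bonus.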
[61] We can see that this heuristic is just a hack, though, when we consider that no points were added to the similarities due to the occurrence of the MANNER HAPPILY and the PROPERTY HAPPY. Although it is right that we judge (1) and (2) to be less like (3) and more like each other, it seems wrong to miss the fact that happy and happily are related. This is a difficult issue, though, which needs further study. Note that not even relying on the lexicon will help, as sometimes words from the same root are not closely related. (For example, fun vs. funny.)

4.1.3 Handling User Queries

The process for returning documents relevant to a user's query is similar to that used to create the tree. The tree is descended in the same way as described in the construction algorithm, using the query LCS as SDnew. When a leaf is reached the document addresses there are returned. This process is logarithmic in the number of documents stored in the tree[62]. It is possible to use the similarity measure to rank the results at this point. However, the idea is that the result sets are small and precise, so ranking is not necessary. Ranking would also be expensive, as the documents themselves would have to be retrieved, reprocessed and then compared to the query LCS, all while the user waited.

[62] That this is logarithmic in the number of documents represented in the tree is not guaranteed. For a discussion see Future Directions below.

Chapter 5

Examples

A thorough test of the proposed system would, of course, consist of a broad comparison to other (particularly word-driven) methods for information retrieval. Lacking a complete implementation, that has not been possible. However, we can return to the issues raised in the analysis of word driven methods above, and give examples that suggest the potential advantages of this program.

Revisiting the LCS Notation

Preliminary to seeing more LCS forms we should remember, in summary, the grammaticality constraint.
For Jackendoff, it can be summarized by these two rules:

1. All major phrasal projections must correspond to an item from one of the ontological categories of conceptual structure. (Denoted in LCS by square brackets.)
2. All lexical heads must be interpreted in LCS as 0 or more place predicates.

Notice that these constraints do not restrict the formalism from predicting more structure than exists in the syntax of a given sentence. Consider the LCS representation for "Joe buttered the bread", which is:

[EVENT CAUSE ([THING x]k, [EVENT GO ([THING JOE]k, [THING BUTTER], [PATH TO ([THING BREAD])])])]

Notice that although this form doesn't mirror the syntax as tightly as the examples I was careful to choose above, it still obeys the grammaticality constraint. It reveals that Jackendoff's theory includes the idea that words, as part of their meanings, convey information about intentionality and causation, and that this is reflected in the semantic structures associated with them. Another thing to notice is that there are no lexical items left in the notation. That is because they are all converted to canonical forms. As mentioned above, I leave the lexical items in the representation in addition to the canonical forms as a way to get some tense and quantification information that Jackendoff concedes [jack83] LCS doesn't encode well. This is also a concession to the fact that a canonical set of symbols for all the physical objects in the world, and a mapping into them, would be difficult to provide. My version of the above example is then:

[EVENT CAUSE ([THING x]k, [EVENT GO buttered ([THING Joe]k, [THING BUTTER], [PATH TO ([THING bread])])])]

5.1 Examples

In section 2.3 the claim was made that word driven methods don't deal well with features of natural language such as polysemy, synonymy, and the structural relations between entities in sentences. Here I give examples that suggest that the LCS representation does better.
(Note that I have left the field subscripts out here, since they don't come into play in these examples.) Each example gives a keyword and phrasal query that are meant to be equivalent with respect to the user's intentions.

5.1.1 Polysemy

This example gives two target documents that keyword systems would not distinguish between. Then the LCS analysis is given, showing that if these documents were ranked by FINDER the more pertinent document would be returned. What is at issue is the fact that the word 'Java' has more than one unrelated meaning.

queries:

keyword query: Java computer language
phrasal query for FINDER: 'the Java computer language'

target documents:

(1) Java is the new computer language for the web.
(2) Java has two computers in total.

query LCS:

[THING language, RM([Java]), RM([computer])]

target document LCSs:

(1)' [STATE BE is ([THING Java], [THING language, RM([new]), RM([computer])])]
(2)' [STATE HAS has ([THING Java], [THING computers, RM([AMOUNT two])]), RM([in total])]

similarity measure scores:

query to (1)': 2; +1 for same sub-part, +1 for same RM
query to (2)': 0

5.1.2 Synonymy

This example gives two documents that a keyword system would not find if it insisted that all the query words be present in the result documents. The FINDER system would find these clearly relevant documents. What is at issue is the fact that the verbs used in the documents are synonymous, which is not recognized by the keyword system.

queries:

keyword query: Steve Carlton pitching
phrasal query for FINDER: 'Steve Carlton pitching'

target documents:

(1) Carlton tossed three balls.
(2) Steve hurled six fast balls over home plate.

query LCS:

[EVENT CAUSE ([THING x]k, [EVENT GO pitching ([THING Steve Carlton]k, [THING ?
X], [LOCATION/PATH ?Y])])]

target document LCSs:

(1)' [EVENT CAUSE ([THING x]k, [EVENT GO tossed ([THING Steve Carlton]k, [THING balls, RM([AMOUNT three])], [LOCATION/PATH ?X])])]
(2)' [EVENT CAUSE ([THING x]k, [EVENT GO hurled ([THING Steve Carlton]k, [THING balls, RM([fast]), RM([AMOUNT six])], [LOCATION/PATH over ([PLACE home plate])])])]

similarity measure scores:

query to (1)': 11; +1 for outer primitive, +2 for primitive and field (location), +2 for corresponding arguments, +6 in total for matching 'Steve Carlton'
query to (2)': 11; +1 for outer primitive, +2 for primitive and field (location), +2 for corresponding arguments, +6 in total for matching 'Steve Carlton'

5.1.3 Semantic Structure

This example again gives two target documents that a keyword system would not distinguish between. The FINDER system would return the more relevant document. What is at issue is how the words used in the documents participate in the semantic structures of the documents.

queries:

keyword query: junior college
phrasal query for FINDER: 'junior college'

target documents:

(1) The best junior college ...
(2) College junior wins award.

query LCS:

[THING college, RM([junior])]

target document LCSs:

(1)' [THING college, RM([junior]), RM([INTENSIFIER best])]
(2)' [EVENT win ([THING junior, RM([college])], [THING award])]

similarity measure scores:

query to (1)': 7; +1 for RM match, +6 in total for matching 'college'
query to (2)': 0

What has been shown by the above examples is that the FINDER system can make some distinctions that a word driven method misses. What has not been shown is that this would necessarily result in a system with higher precision and accuracy than the methods described in Chapter 2. This is an area for future research.
Chapter 6

Conclusion

The goals of this project were to provide a practical and cognitive method for improving the precision and accuracy of the IR task over heterogeneous domains like the WWW. To meet the cognitive goals of the project the LCS representation for sentence meaning was selected. It was shown to have a sound theoretical and empirical basis by comparing it to alternatives. That this approach is practical has been supported by the description of the data structures that it supports. It has been shown that these structures would support querying while using space and time resources of the same order of magnitude as established methods. What remains for future work is to confirm these claims concretely with a full-scale implementation.

6.1 Future Directions

This project is very preliminary, and as such there are many ways beyond implementing it as described to continue it. Here I give a brief account of some of the more pressing issues.

Grammatical Input Required

FINDER requires that the user input be grammatical. This is a consequence of the LCS formalism itself. It is characteristic of most modern linguistic theory, which seeks to model speaker competence. This might not be a problem for small user queries but, to be processed, target document sentences also need to be grammatical. An IR system might work well enough just skipping the ungrammatical sentences. However, the question remains as to how we should use the LCS theory for computational semantics projects in general if there is a gap between the theory and speaker performance that can't be accounted for.

Document Cluster Tree

The algorithm described in section 4.1.2 for building the document cluster tree might result in a tree that is very tall and narrow. For the system to perform as claimed it must be an empirical fact that, in practice, these trees are always relatively full. Alternatively, a tree spreading algorithm that periodically makes the tree fatter must be designed.
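One simple way to check this empirically (a hypothetical diagnostic, not part of the thesis) is to compare a tree's actual height to the ideal height for its number of leaves; a ratio near 1.0 means the tree is relatively full, while a large ratio flags the tall, narrow case:

```python
import math

# Minimal binary-tree node; only the shape matters for this check.
class Node:
    def __init__(self, left=None, right=None):
        self.left, self.right = left, right

def height(n):
    if n is None:
        return 0
    return 1 + max(height(n.left), height(n.right))

def leaves(n):
    if n is None:
        return 0
    if n.left is None and n.right is None:
        return 1
    return leaves(n.left) + leaves(n.right)

def degeneracy(root):
    """Ratio of actual height to the ideal height of a full binary tree
    with the same number of leaves. Near 1.0: relatively full."""
    n_leaves = leaves(root)
    ideal = math.ceil(math.log2(n_leaves)) + 1 if n_leaves > 1 else 1
    return height(root) / ideal
```

Run periodically off-line, such a ratio could trigger the tree spreading algorithm mentioned above.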
Similarity Measure

The similarity measure used to judge relatedness is not well motivated. Through tuning and experience it might become a good heuristic. This is not in keeping with the cognitive motivations for choosing the LCS representation, though. What is needed is an implementation of a cognitive theory about how the brain makes similarity and thematic judgments, and how this corresponds to operations over these structures.

Canonical Representation

The canonical representation proposed by Jackendoff leaves some questions unresolved. The system is so canonical that, as we saw above, the representation for 'put' and 'buttered' ends up being the same in LCS. (So 'John buttered the toast' is the same as 'John put butter on the toast'.) This might seem acceptable for this example, but the chance to lose some part of the meaning exists. I try to overcome this by leaving the lexical items in the representation. (Dorr [dorr92] addresses this problem by greatly expanding the MANNER category.) Future work is needed though, as this is not an acceptable solution for other domains. In translation, for example, the LCS is supposed to be an interlingua for all languages, which it obviously cannot be if it has lexical items from the languages in it.

Context and Document Meaning

The LCS representation is used to encode sentence meaning. When compared to keyword techniques the relations it encodes are semantically rich. However, when humans read and understand texts they do not just understand a list of the context-less meanings of the sentences. To advance, this approach needs to be integrated into further levels of processing that take context into account and encode document meanings instead of sentence meanings. This step must be made judiciously, and not before sentence meaning has been satisfactorily dealt with, or we will be in danger of reinventing conceptual dependency graphs and scripts.

Bibliography

[alle95] J. Allen. Natural Language Understanding.
The Benjamin Cummings Publishing Company, Inc., Redwood City, CA, 1995. [bren96] E. Brenner. Beyond Boolean—New Approaches to Information Retrieval. The National Federation of Abstracting and Information Services, Philadelphia, PA, 1996. [chom57] N. Chomsky. Syntactic Structures. Mouton Publishers, Paris, 1957. [chom65] N. Chomsky. Aspects of the Theory of Syntax. The MIT Press, Cambridge, 1965. [coop97] E. Cooper and K. Hammond. "The OnPoint System: a Meta-Search Engine that Answers Natural Language Questions". Technical report, University of Chicago, Chicago, IL, 1997. [cowp92] E. Cowper. A Concise Introduction to Syntactic Theory: The Government-Binding Approach. The University of Chicago Press, Chicago, 1992. [davi67] D. Davidson. "The Logical Form of Action Sentences" in Essays on Actions and Events, D. Davidson, ed., Clarendon Press, Oxford, 1980. [davi70] D. Davidson. "Semantics for Natural Languages" in Inquiries into Truth and Interpretation, D. Davidson, ed., Clarendon Press, Oxford, 1984. [dorr87] B. Dorr. "UNITRAN: An Interlingual Machine Translation System". A.I. Memo No. 998, MIT A.I. Laboratory, 1987. [dorr88] B. Dorr. "A Lexical Conceptual Approach to Generation for Machine Translation". A.I. Memo No. 1015, MIT A.I. Laboratory, 1988. [dorr92] B. Dorr. "The Use of Lexical Semantics in Interlingual Machine Translation". Journal of Machine Translation, 7:3, pp. 135-193, 1992. [etzi96] O. Etzioni. "Moving up the information food chain: Deploying softbots on the world wide web". AAAI, 1996. [fill68] C. Fillmore. "The Case for Case" reprinted in E. Bach and R.T. Harms (eds.), Universals in Linguistic Theory, New York: Holt, Rinehart and Winston, pp. 1-88, 1968. [frak92] W. Frakes and R. Baeza-Yates, eds. Information Retrieval: data structures and algorithms. Prentice Hall, Englewood Cliffs, New Jersey, 1992. [freg92] G. Frege. "On Sense and Nominatum" reprinted in Martinich, A.P. (ed.), The Philosophy of Language, Oxford University Press, 1892.
[gamu91] L.T.F. Gamut. Logic, Language, and Meaning: Volume 1, Introduction to Logic. The University of Chicago Press, Chicago, IL, 1991. [gazd89] G. Gazdar and C. Mellish. Natural Language Processing in Prolog. Addison-Wesley Publishing Company, Wokingham, England, 1989. [haeg94] L. Haegeman. Introduction to Government and Binding Theory. Blackwell Publishers Ltd., Oxford, England, 1994. [hind91] D. Hindle and M. Rooth. "Structural ambiguity and lexical relations". Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics '91, pp. 229-236, 1991. [hobb93] J. Hobbs, M. Stickel, D. Appelt, and P. Martin. "Interpretation as abduction". Artificial Intelligence 63, 1-2:69-142, 1993. [jack77] R. Jackendoff. X-bar Syntax: A Study of Phrase Structure. The MIT Press, Cambridge, Mass., 1977. [jack83] R. Jackendoff. Semantics and Cognition. The MIT Press, Cambridge, Massachusetts, 1983. [jack90] R. Jackendoff. Semantic Structures. The MIT Press, Cambridge, Massachusetts, 1990. [lars95] R. Larson and G. Segal. Knowledge of Meaning. The MIT Press, Cambridge, Massachusetts, 1995. [li98] H. Li and N. Abe. "Generalizing Case Frames Using a Thesaurus and the MDL Principle". Computational Linguistics, Vol. 24, No. 2, pp. 217-244, 1998. [li99] H. Li and N. Abe. "Learning Dependencies Between Case Frame Slots". Computational Linguistics, Vol. 25, No. 2, pp. 283-291, 1999. [luhn58] H.P. Luhn. "The Automatic Creation of Literature Abstracts". IBM Journal of Research and Development, Vol. 2, No. 2, pp. 159-165, April 1958. [mart96] A.P. Martinich, ed. The Philosophy of Language. Oxford University Press, New York, NY, 1996. [mart88] J. Martin. "A Computational Theory of Metaphor" reprinted at http://www.ncstrl.org/, document id: ncstrl.ucb/CSD-88-465, 1988. [matl89] M.W. Matlin. Cognition. Harcourt Brace Jovanovich College Publishers, Philadelphia, PA, 1996. [maul86] M.L. Mauldin. "Information Retrieval by Text Skimming". Ph.D. Thesis.
Carnegie Mellon University, Computer Science Department. [pust96] J. Pustejovsky. The Generative Lexicon. The MIT Press, Cambridge, Massachusetts, 1996. [ries82] C. Riesbeck. "Realistic Language Comprehension", in Strategies for Natural Language Processing, W. Lehnert and M. Ringle, eds., Lawrence Erlbaum Associates, Hillsdale, NJ, 1982, pp. 37-54, ch. 2. [russ05] B. Russell. "On Denoting" reprinted in Martinich, A.P. (ed.), The Philosophy of Language, Oxford University Press, 1996. [salt75] G. Salton, A. Wong and C.S. Yang. "A Vector Space Model for Automatic Indexing" reprinted in Readings in Information Retrieval, K. Sparck Jones and P. Willet, eds., Morgan Kaufmann Publishers, Inc., San Francisco, 1997. [salt83] G. Salton and M.J. McGill. "The SMART and SIRE Experimental Retrieval Systems" reprinted in Readings in Information Retrieval, K. Sparck Jones and P. Willet, eds., Morgan Kaufmann Publishers, Inc., San Francisco, 1997. [scha72] R. Schank. "Conceptual Dependency: A Theory of Natural Language Understanding". Cognitive Psychology 3, pp. 552-631, 1972. [scha73] R. Schank. "Identification of Conceptualizations Underlying Natural Language", reprinted in Computer Models of Thought and Language, R. Schank and K. Colby, eds., Freeman, San Francisco, CA, 1973. [spar97] K. Sparck Jones and P. Willet, eds. Readings in Information Retrieval. Morgan Kaufmann Publishers, Inc., San Francisco, California, 1997. [strz96] T. Strzalkowski. Document indexing and retrieval using natural language processing. Circulated Monograph, 1996. [tars44] A. Tarski. "The Semantic Conception of Truth" reprinted in Martinich, A.P. (ed.), The Philosophy of Language, Oxford University Press, 1996. [visi99] G. Vision. Personal communication, 1999. [wexl73] K. Wexler and H. Hamburger. "Insufficiency of Surface Data for the Learning of Transformational Languages" in Hintikka, Moravcsik, and Suppes, pp. 167-179, 1973. [wilk75] Y. Wilks.
"An Intelligent Analyzer and Understander of English", CACM 18(5), pp. 264-274, 1975. [wood78] W. Woods. "Semantics and Quantification in Natural Language Question Answering". Advances in Computers, Vol. 17, pp. 2-64, 1978.

