UBC Theses and Dissertations


A portable natural language database query system White, Steven John 1985


A PORTABLE NATURAL LANGUAGE DATABASE QUERY SYSTEM

by

STEVEN JOHN WHITE
B.Ed. (Sec.), University of British Columbia, 1980

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE
in
THE FACULTY OF GRADUATE STUDIES
DEPARTMENT OF COMPUTER SCIENCE

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
October 1985
© Steven John White, 1985

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Computer Science
The University of British Columbia
1956 Main Mall
Vancouver, Canada V6T 1Y3

Abstract

With the increased use of computerized databases, the ability to allow users to access information using natural language is becoming more desirable. There are many natural language systems in existence today. The main problem with these systems is the amount of expertise and effort required to adapt them to a new domain. The design of a portable natural language front-end to a relational database is described in this thesis. It is designed so that a typical Database Administrator can implement a new domain in a reasonable amount of time.
Database portability has been achieved by separating the domain dependent natural language definitions from the domain independent definitions. These domain dependent definitions are specified in the database schema, which is structured to extract the semantics contained in the structure of the actual database. A rich supply of standard definitions is available both to aid in the development of the database schema and to help force consistency amongst database domains.

Table of Contents

Abstract
List of Figures
Chapter I INTRODUCTION
Chapter II CURRENT NATURAL LANGUAGE QUERY SYSTEMS
2.1 Early Natural Language Systems
2.2 Syntactic/Semantic Based Systems
2.3 Semantic Based Systems
2.4 Commercial Systems
2.5 Enhancing Syntax With Semantics
2.6 Summary
Chapter III CONCEPTS IN NATURAL LANGUAGE SYSTEMS
3.1 Syntactic Analysis Concepts
3.1.1 ATN Networks
3.1.2 Vocabulary
3.1.2.1 Morphing
3.1.2.2 Nouns
3.1.2.3 Proper Nouns
3.1.2.4 Adjectives
3.1.2.5 Verbs
3.1.2.6 Adverbs
3.1.2.7 Prepositions
3.1.3 Handling Unknown Words / User Interface
3.1.3.1 Spelling Correction
3.1.3.2 Abbreviations
3.1.3.3 Assumed Database Elements
3.1.3.4 Ignoring Words
3.1.3.5 Knowledge Acquisition
3.2 Concepts In Semantic Analysis
3.2.1 Definition Of Semantic Information
3.2.2 Some Problems In Semantic Interpretation
3.3 Database Interface
3.4 Summary
Chapter IV SYSTEM DESIGN I: SYNTACTIC ISSUES
4.1 Overview Of The System
4.2 The Syntactic Processor
4.2.1 ATN Parser
4.2.2 ATN Grammar
4.2.3 ATN Registers
4.2.4 The Morpher
4.2.5 Parsing A Query
4.2.5.1 Noun Phrases
4.2.5.2 Conjunction
4.2.5.3 Compound Proper Nouns
4.2.5.4 Prepositional Phrases
4.2.5.5 Relative Clauses
4.2.5.6 Noise Words
4.2.5.7 Relaxation Of Grammatical Rules
4.3 Handling Unknown Words
4.3.1 Abbreviations And Synonyms
4.3.2 The Spelling Corrector
4.3.3 The User Interface
4.3.3.1 Spelling Errors
4.3.3.2 Abbreviations
4.3.3.3 Synonyms
4.3.3.4 Unknown Database Elements
4.3.3.5 Ignoring Words
4.3.3.6 Cancelling The Query
4.3.4 Knowledge Acquisition
4.4 Summary
Chapter V SYSTEM DESIGN II: KNOWLEDGE BASE
5.1 Relational Database Approach
5.2 The Test Databases
5.3 Relational Database Terminology
5.4 Specification Of Domain Dependent Information
5.4.1 The Database Schema
5.4.1.1 The Database Definition
5.4.1.1.1 Relation Specification
5.4.1.1.2 Field Specification
5.4.1.2 Verb Specification
5.4.1.3 Default Joins
5.4.2 Database Schema Library
5.4.3 Inverted Index
5.5 Specification Of Domain Independent Information
5.5.1 General Dictionary
5.5.2 Global Dictionary
5.6 Compiling The Information: Active Domain Dictionary
Chapter VI SYSTEM DESIGN III - SEMANTIC INTERPRETATION
6.1 Interpreting The Deep Structure
6.1.1 Associating Phrases With Relations And Fields
6.1.2 Verb Frames
6.1.2.1 Verb Frame Structure
6.1.2.2 Matching The Deep Structure To A Verb Frame
6.1.3 Resolving Multiple Matched Slots
6.1.4 Ignoring Phrases
6.1.5 Processing Prepositional Phrases
6.1.6 Relative Clauses
6.2 Building The Standard Query Representation
6.2.1 SQR Structure
6.2.1.1 The Select List
6.2.1.2 The Restriction List
6.2.1.3 The Relation List
6.2.1.4 The Join List
6.2.1.5 The Distinct Flag
6.2.2 Determining The Question Type
6.2.3 Analyzing The Verb Frame
6.2.3.1 Determining The Question Element
6.2.3.2 Analyzing Noun And Prepositional Phrases
6.2.3.2.3 Head Noun
6.2.3.2.4 Modifiers
6.2.3.2.5 Conjunction
6.2.3.2.6 Relative Clauses
6.2.3.2.7 Determiners
6.2.3.3 Analyzing Adverbs And Predicates
6.3 Summary Of Semantic Analysis
Chapter VII SYSTEM DESIGN IV - DATABASE INTERFACE AND RESPONSE GENERATION
7.1 The Sequel Query Language
7.2 Assembling The Sequel Query
7.3 Response Generation
Chapter VIII PORTABILITY ISSUES
8.1 Domain Portability
8.2 The SUPPLIER/PART/JOB Database
8.3 Database Portability
Chapter IX CONCLUSION
9.1 Future Work
9.1.1 Further Explore Domain Portability
9.1.2 Adaptation To Other DBMS
9.1.3 Extensions To The ATN
9.1.4 Pronoun Reference
9.1.5 Ellipses
9.1.6 Quantifiers
9.1.7 Complex Conjunction
9.1.8 Response Generation
9.2 Summary
BIBLIOGRAPHY
APPENDIX A - AUGMENTED TRANSITION NETWORK DIAGRAMS
APPENDIX B - SAMPLE PARSE
APPENDIX C - DATABASE SCHEMA FOR THE PROGRAM DATABASE
APPENDIX D - PARTIAL DATABASE SCHEMA LIBRARY
APPENDIX E - GENERAL DICTIONARY
APPENDIX F - GLOBAL DICTIONARY
APPENDIX G - PARTIAL ACTIVE DOMAIN DICTIONARY FOR THE PROGRAM DATABASE
APPENDIX H - SAMPLE PROGRAM DATABASE SESSION
APPENDIX I - DATABASE SCHEMA FOR THE SUPPLIER DATABASE
APPENDIX J - SAMPLE "SUPPLIER/PART/JOB" DATABASE SESSION

List of Figures

4.1. Diagram of the System
4.2. Sample Queries
4.3. Partial Suffix Table
5.1. Design of the Program Database
5.2. Semantic CATEGORIES
5.3. TYPES
5.4. L-TYPES
5.5. Partial Active Domain Dictionary
6.1. Verb Frame Slots
6.2. Semantic Classifications for Verb Frame Slots
6.3. Semantic Association Factors
8.1. Sample Queries to Supplier/Part/Job database

Acknowledgement

I would like to thank Dr. Richard Rosenberg for his support and input into this thesis. I would also like to thank Dr. Micheal McRae for his ideas and for reading the manuscript. I also thank Anne Chou for her help in writing the routines for the Morpher. I would especially like to thank my father and my mother for their support.

I.
INTRODUCTION

The use of computerized databases is growing at a rapid rate. There are now thousands of databases available to the public in a wide range of areas including business, education, entertainment, and medicine. Even though these databases would benefit many people, very few people actually use them. One of the main obstacles inhibiting their use is the level of skill needed. Before they can become part of our everyday life, there will have to be a suitable user interface which would allow even casual users to access them.

There are currently three popular ways in which users access information from these databases: menu-driven systems, query languages, and customized programs. Menu-driven systems present users with a menu of the information available and allow them to make a selection. This method is useful if there is little information in the database. However, for larger databases containing information in many areas, the user will have to go through many levels of menus in order to obtain the desired information. Query languages are another popular method for extracting information from databases. The problem with this method is that it requires a more sophisticated computer user who has some general knowledge of both the query language and the structure of the database being used.
Customized programs are appropriate if the user only wants to access information from a specific database, but they are not appropriate if one wants to explore many different databases.

Another method now being used to access information is natural language (NL). Natural language processing is an area of Artificial Intelligence (AI) which requires the computer to be able to process natural language. In addition to being used for accessing information from databases, NL is also being used for Machine Translation, Expert Systems, and Intelligent Computer Assisted Instruction. Although NL is an extremely complex area with many unresolved problems, research has advanced enough so that techniques can be applied to the successful processing of a high percentage of database queries.

There are numerous Natural Language Systems in existence today. Many of them have arisen from research projects (Woods et al. 1972, Waltz 1978) and have been designed for one specific database only. In the last few years some systems have been introduced as commercial products (Intellect, English, Themis). The major problem with most of these systems is that they require a considerable amount of time and effort to adapt them to a new domain or database. This task must be done by a person with considerable knowledge of NL systems.
Even after the initial implementation there is a need for such a person to maintain the database. However, most database administrators (DBAs) do not have this knowledge. In order for such systems to become widely used, they must be designed so that an average DBA can easily maintain them with minimal training.

The major concern in the design of a NL system is the ease with which a new domain can be implemented (domain portability) and whether the system can be adapted to other database systems (database portability). Many NL systems have achieved some degree of domain portability by using procedures which syntactically analyse queries without referencing a specific database. By extending these ideas to semantic interpretation, it is possible to achieve a higher degree of domain portability. In order to achieve database portability, the system should be designed such that the features of a specific database management system (DBMS) are not deeply rooted in the syntactic or semantic parts of the system.

In order to achieve domain portability, the domain independent components have been logically separated from the domain dependent information. The domain independent components form the "linguistic core" (Rosenberg 1980), which includes procedures for parsing the query, analyzing each word, prompting users for any additional information, and forming a response.
The domain dependent information, called the "database schema", includes a structured representation of the database, as well as semantic definitions of words which describe the database and the contents of the actual database. This information is then compiled together with the global dictionary, which contains syntactic information about individual words. The result of this compilation is the "active domain dictionary", which is a core component of the system.

In Chapter 2, a number of other Natural Language Systems are examined, with particular emphasis placed on portability issues. Chapter 3 examines which components of a system are domain independent and shows how the domain dependent information can be supplied to the system in the form of data (not programs). Chapters 4, 5, 6, and 7 describe a prototype system designed to demonstrate the feasibility of such a system. In Chapter 8, the issue of domain portability is tested by adapting the system to another domain. The conclusion and areas for future work are presented in Chapter 9.

II. CURRENT NATURAL LANGUAGE QUERY SYSTEMS

The first natural language question answering systems were developed in the early 1960's. These systems were based on simple techniques such as keyword analysis and simple pattern matching of sentences.
Although these systems were very limited in the types of queries they answered, they did demonstrate that computers could be used to answer questions in specific domains.

Another approach developed in the late sixties followed a linear paradigm (Rosenberg 1980), where the syntactic and semantic processing were separated into two distinct stages. In these systems the query is initially analysed for its syntactic structure and then passed on for semantic interpretation. The problem with this method is that a purely syntactic parse often produced a large number of ambiguous interpretations, which made the semantic interpretation very complex.

Other researchers in the early 1970's developed systems which were entirely based on semantics. These systems examined queries at the semantic level without initially parsing them at the syntactic level. The most popular type of these systems are semantic grammars (Brown et al 1974; Waltz 1977; Hendrix et al 1978). The main problem with these systems is that they are very domain dependent. Implementing a new domain would usually require considerable work.

Recently, researchers have designed systems where both syntactic and semantic processing coexist (Booth 1983; Kaplan 1979). By enhancing the syntactic parser with semantic information, systems can avoid many unlikely interpretations for ambiguous queries. These systems also tend to have a higher degree of domain portability than most of the earlier systems.
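To make the earliest of these approaches concrete, the following is a minimal sketch of sentence-pattern matching, written in Python for illustration only. The pattern table, the query-type name `count_possessions`, and the function names are assumptions introduced here; they are not taken from any of the systems cited. Each `*` in a pattern stands for one or more words, and every query type needs its own pattern.

```python
import re

def compile_pattern(pattern: str) -> re.Pattern:
    """Turn a wildcard sentence pattern into a regular expression.

    Each '*' matches one or more words; every other token must match literally.
    """
    parts = []
    for token in pattern.split():
        if token == "*":
            parts.append(r"(.+?)")
        else:
            parts.append(re.escape(token))
    return re.compile(r"^\s*" + r"\s+".join(parts) + r"\s*$", re.IGNORECASE)

# Hypothetical pattern table: one entry per query type the system understands.
PATTERNS = {
    "count_possessions": compile_pattern("HOW MANY * DOES * HAVE ?"),
}

def match_query(query: str):
    """Return (query_type, captured_phrases) for the first matching pattern, or None."""
    # Separate the question mark so it is treated as its own token.
    normalized = query.replace("?", " ?")
    for name, regex in PATTERNS.items():
        m = regex.match(normalized)
        if m:
            return name, [g.strip() for g in m.groups()]
    return None
```

With this table, `match_query("How many fingers does John have?")` captures the two wildcard phrases, while a rewording such as "Does John have fingers?" is not recognised until a further pattern is added, which is the limitation of this style of system.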
2.1 Early Natural Language Systems

Most of the early natural language question answering systems were based upon ad hoc techniques and answered questions in very limited domains. The most popular of these programs is ELIZA (Weizenbaum 1966), which imitates a dialogue between a patient and a psychologist. The program is based upon scanning the user's input for keywords and matching it with a number of sentence patterns. Other systems such as SIR (...) and Student (...) matched queries to a list of sentence patterns. For example, to parse a typical "How many" question, the system would need a sentence pattern as shown below.

HOW MANY * DOES * HAVE ?

The problem with these systems is that a separate sentence pattern is needed for every type of query the system can answer. The semantic knowledge needed to answer each query is specific to each sentence pattern and is implicit to the given domain. Hence the task of implementing such a system would involve writing sentence patterns with related semantic information for each type of query accepted by the system.

2.2 Syntactic/Semantic Based Systems

Syntactic/Semantic based systems are two pass systems which first analyse the query syntactically, and then, upon a successful analysis, continue with the semantic interpretation. The syntactic component of such systems is commonly based upon Transformational Generative Grammars (Chomsky 1965), where there is a mapping between the "surface structure" and the corresponding "deep structure" of a query.
The main benefit of this is that the deep structure captures many similarities and differences between similar queries. For example, the following two queries map into the same deep structure.

What programs are run by the Apple?
The Apple runs what programs?

Another benefit of this approach is that the deep structure represents the syntactic category of each word in the query. For instance, in the above query the word "programs" is interpreted as a noun and not a verb.

The most common method for producing the deep structure of a query is by using an Augmented Transition Network (ATN) parser (Woods 1970). The ATN was developed from a finite-state transition diagram, which is a set of nodes connected by directed, labeled arcs. This system has been adapted to Natural Language Processing by having the arcs refer to lexical or syntactic categories and by associating conditions and actions with each arc.

The semantic component is responsible for taking the deep structure (parse) produced by the parser and translating it into a database query to some database. This resulting query, when executed, would produce the answer to the original NL query. Since most of the intelligence of a NL system is at the semantic level, this component is very complex. One of the best known methods for performing semantic analysis is "procedural semantics" used in the LUNAR system (Woods 1967, Woods 1968).
This method involves "pattern-matching" the query to predefined templates. Specific templates or rules would match certain parts of the query, such as noun phrases (N-Rules) or the entire sentence (S-Rules). These rules would then specify the appropriate retrieval components called "semantic primitives", which are essentially programmed subroutines. The major problem with this method is that the database designer would literally have to predict every possible query structure that the system could encounter. For example, each of the following queries requires a separate S-Rule and associated semantic primitive:

AA-57 departs from Boston for Rome.
AA-57 departs from Boston.
AA-57 departs from Boston at 8:00 am.
AA-57 departs from Boston on Monday.
AA-57 departs from Boston for Rome at 8:00 am on Monday.

The main advantage of this Syntactic/Semantic based model is that the syntactic component is essentially domain independent. The only domain specific information needed is the syntactic information, such as the syntactic category, of each word associated with the domain. This can be easily obtained from a dictionary or even deduced from its position in the query. Hence when a new domain is implemented, the syntactic component of the system requires little change.
Another advantage to this approach is that the semantic component does not need to be aware of the various ways of wording the same query since they are often captured by the syntactic component.

Although there are some benefits of separating the syntax from the semantics, there are some serious disadvantages. One problem is that one query may produce more than one parse. The semantic component would then have to select which parse was intended by the user. Many of these ambiguous parses can be avoided by using semantic information during the parse. For instance, the query:

Does the Rose and Crown pub sell cider?

can be interpreted in a number of ways depending on whether the parser treats "and" as a conjunction or as part of the name "Rose and Crown". This type of ambiguity can be avoided if the parser is supplied with semantic information about multi-word proper nouns.

To get around this problem, methods were developed to select the most likely parse. One method is to have a semantic component analyse each parse until an appropriate one is found (Harris 1977a). Once a specific parse is chosen, it is necessary to inform the user which interpretation was picked. One method used is to generate a paraphrase of the query from the chosen parse (Harris 1977a). Other systems required users to determine the interpretation by examining the internal representation of the query (Woods et al 1972).
This method is only appropriate for users who have some knowledge of the given representation.

The major disadvantage of this syntactic/semantic method lies in the work needed to adapt the semantic component to another domain. Although the syntactic component of these systems is basically domain independent, the larger and more complex semantic component is domain dependent. For example, to adapt a system based on procedural semantics to a new domain would require almost as much work as that expended in explicating the initial domain. It would also require a person with considerable experience with the system to implement and test the new domain.

2.3 Semantic Based Systems

Semantic based systems interpret natural language at the semantic level without completely parsing it at the syntactic level. The most notable works in this area are "semantic grammars". Semantic grammars are based on semantic categories such as "Plane-Type" or "City", rather than syntactic categories such as Nouns or Prepositional Phrases. These grammars are often implemented by an ATN, where the arcs refer to semantic categories or specific words.

There are three well-known NL systems based on semantic grammars. SOPHIE (Brown et al 1974) is an Intelligent Computer Assisted Instruction (ICAI) program which answers questions and evaluates students' hypotheses about faults in an electronic circuit. The NL component of the program recognizes word groups for categories such as "Measurement" or "Voltages".
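The idea of a semantic grammar, in which rule elements name semantic categories or specific words rather than syntactic categories, can be illustrated with a small Python sketch. Everything in it is hypothetical: the lexicon entries, the two rules, and the rule names are invented for illustration and are not taken from SOPHIE, PLANES, or LIFER, and a flat word-by-word match stands in for the ATN such systems would normally use.

```python
# Hypothetical lexicon mapping words to semantic categories (not syntactic ones).
LEXICON = {
    "boston": "CITY",
    "rome": "CITY",
    "f4": "PLANE_TYPE",
    "a7": "PLANE_TYPE",
}

# Hypothetical semantic grammar: each rule is a sequence of literal words
# (lowercase) or semantic categories (uppercase).
RULES = {
    "flights_between": ["which", "PLANE_TYPE", "flew", "from", "CITY", "to", "CITY"],
    "planes_at": ["list", "planes", "at", "CITY"],
}

def parse(query: str):
    """Return (rule_name, bindings) for the first rule matching the whole query."""
    words = query.lower().rstrip("?.").split()
    for name, rule in RULES.items():
        if len(rule) != len(words):
            continue
        bindings = []
        for element, word in zip(rule, words):
            if element.isupper():
                # Category element: the word must belong to that semantic category.
                if LEXICON.get(word) != element:
                    break
                bindings.append((element, word))
            elif element != word:
                # Literal element: the word must match exactly.
                break
        else:
            # Every element matched, so this rule accepts the query.
            return name, bindings
    return None
```

Because both the lexicon and the rules mention domain concepts directly, moving such a grammar to a different database means rewriting both, which is the portability cost discussed in this section.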
PLANES (Waltz 1977) and LIFER (Hendrix et al. 1978) are both natural language front-ends to databases. Both systems translate the query into a database query which will produce the answer.

The PLANES system (Waltz 1977) is a natural language front-end to a database containing information about airplanes and related flight information. The query is parsed by an ATN which employs semantic categories for each "semantic object" (Plane-type, flight-hours) relevant to the domain. The system also parses some parts of queries (e.g. quantifiers) based on the syntactic categories of the words. Unlike most semantic grammar systems, PLANES can interpret passive and other variations of a query without separate rules. Once the query has been successfully parsed, the query structure is matched to "Concept Case Frames", which consist of an "act" (usually a verb) and a number of semantic objects. Once a suitable match is found, the query structure is translated into an interim query, which in turn is translated into a formal query expression.

Semantic grammar based systems have a few advantages over syntactic based systems. One advantage is that they don't require the DBA to specify any lexical or syntactic information about the words used in the domain. These systems can also be more efficient than syntactic based systems since a parse at the syntactic level is not required.

There are two big disadvantages of these systems.
First, they require the DBA to specify all possible types of queries that the system will process. Some systems (LIFER & SOPHIE) even require the DBA to specify the passive form of a query in a separate pattern. Besides taking a lot of time, this could cause different databases to be inconsistent in the forms of queries they process. Hence some queries could be stated in a number of variations (e.g. passive) while others may not. This obviously will cause difficulty amongst users.

Another major disadvantage is the amount of work needed to adapt the system to another domain. The DBA would have to write a new semantic grammar for each domain. Although some tools have been developed to aid in this process (Hendrix et al, 1978), the task is still time consuming. The DBA also has to supply the system with additional semantic information for each query type. This may involve specifying a concept frame for each query type as in PLANES, or a number of expressions (query templates) as in LIFER.

2.4 Commercial Systems

In recent years a few natural language systems have become commercially available. These include Intellect and English. Unfortunately, due to commercial interests, there is little published information about the algorithms used in these programs.

The most successful commercial natural language system is Intellect (Harris 1977a, 1977b, 1977c).
Intellect, formerly called ROBOT, acts as a NL front-end to a number of commercial database systems, including IMS, TOTAL, and ADABAS. The system uses an ATN grammar to generate all possible parses of a query based on syntactic and some database information. All words contained in the database are stored in an "inverted index". This allows the parser to recognize database elements as well as determine the database field(s) to which they belong. The system has features to handle pronoun references and spelling correction. It also has a separate ATN to parse sentence fragments, which is invoked if the initial ATN fails to generate a parse.

Intellect's parser will produce more than one parse when lexical or structural ambiguities occur (Harris 1977a). Many of these extra parses occur when a word belongs to more than one field in the database. For instance, the word "GREEN" in the query "Are there any green cars?" could be either a colour or a surname. Intellect will try to determine the intended meaning by executing both of these interpretations on the database. If only one of these returns information, then that interpretation is chosen on the assumption that users only ask questions about information that is actually in the database. The system informs the users of the chosen interpretation by displaying a paraphrase of the query on the screen.
If both queries return information, the user will be queried about the intended meaning. If neither of the queries returns information, the system appropriately answers "NO" to the query.

This method for choosing the intended meaning of a query can be very inefficient for databases with many relations. By supplying the system with information about the structure of the database, it is possible for the system to determine the most likely interpretation. If the system knew that colour is an attribute of a car and that surname is an attribute of an employee, it could make the logical decision (a green coloured car) for the intended interpretation.

Intellect appears to be the most robust system available. It is currently installed at over 200 sites. It is obviously fairly portable since it takes approximately one week to implement a new domain (Johnson 1984). It is also transportable with respect to the database since it is available for many database management systems (DBMS).

Another commercial system, named English, is produced by Mathematica Products Group for their RAMIS II database system (Mathematica Products Group Inc. 1983; Johnson 1984). English answers "yes/no" questions, "how many" questions, and questions which are answered by a report. It uses paraphrases to inform the user of its interpretation when ambiguities occur.
The system uses three dictionaries: a general English dictionary contains the common vocabulary; the RAMASTER dictionary contains information about the active files and their fields; and an optional dictionary contains information unique to a specific user or application.

The major problem with English is that it does not associate any semantic information with verbs other than command verbs such as "print" or "show". Although it is able to parse different types of verbs, they are ignored in the semantic interpretation. For example, the system cannot distinguish between the following two queries:

    Who is teaching Math 100?
    Who is taking Math 100?

Although the user would be informed of the system's interpretation through the paraphrasing abilities, such a pitfall is not satisfactory. Too much semantic information is contained in verbs to simply ignore them.

2.5 Enhancing Syntax With Semantics

The main problem in using only lexical and syntactic information in parsing a query is the possibility of obtaining more than one interpretation. Many of these interpretations can be avoided if the parser is enhanced with some semantic information. This will help reduce the complexity of the semantic component by eliminating these extra parses at the syntactic stage.

One system that uses semantics effectively to avoid multiple parses is CO-OP (Kaplan 1979, 1984).
The system is basically a two-pass (syntax/semantic) system. The syntactic component uses an ATN grammar enhanced with some semantic information. The semantic component is based on a simple case system, where a list of possible subjects and objects is given for each verb and preposition.

CO-OP reduces the number of parses the ATN produces by using semantic information to determine which head noun a modifier will be attached to. This process is called modifier attachment. For example, in the query:

    Which users work on projects in area 3 that are in division 200?

the phrase "that are in division 200" can modify either "users" or "projects". The system uses two heuristics to decide on the most likely interpretation. The first heuristic is the distance between the modifier and each head noun in the query's surface structure. The other heuristic is the "semantic relatedness" of the terms with respect to the semantics contained in the structure of the database. This is derived by examining how close together these terms are in the database. Although this technique is fairly crude, it is very effective if the structure of the database adequately models the semantics of the domain.

In order to be adapted to new domains, CO-OP requires three sources of information: a database schema, a lexicon, and the database itself. The database schema contains semantic information describing the structure of the database.
This includes information on how objects are related to each other (i.e. one-to-one, one-to-many relationships) and how closely items are related to each other. The lexicon includes morphological, syntactic, and semantic information about all of the words which are known by the system.

To adapt CO-OP to a new domain, the DBA must construct a new database schema and a new lexicon. The major problem here is that both the domain independent and the domain dependent information are stored together in the lexicon. Hence the DBA must go through the entire lexicon, removing all old domain dependent information and modifying the semantics associated with some of the domain independent information. It would make better sense to physically separate the domain dependent from the domain independent information. This could be achieved by simply having two dictionaries, one for the domain dependent information and the other for the domain independent information.

The DBA is also expected to supply morphological and syntactic information for each new word in the domain. This process is time consuming as well as being susceptible to errors.

The lexicon also contains the nouns which identify the fields in the database. It would be more logical to specify these in the database schema, since these words are strictly associated with the fields in the database.
Rather than using an inverted index to determine the database values, CO-OP tries to infer them by examining the context of the query. In the query:

    Which projects in oceanography does NASA Headquarters sponsor?

both "oceanography" and "NASA Headquarters" are unknown terms and are assumed to be contents of the database. The system correctly predicts that "oceanography" is an "AREA OF INTEREST" by noting that it modifies a project and is preceded by the preposition "in". "NASA Headquarters" is determined to be a "SPONSOR NAME" since it acts as the subject of the verb "sponsor".

In order for this algorithm to work, the lexicon must contain adequate semantic information about the known words in the query (i.e. "projects", "in", and "sponsor"). If there is not enough information, the system is incapable of making any sense of the query. For example, in the query:

    Does John Smith have red bolts?

the system would only be able to use the words "does" and "have" to interpret the meaning of the query. Since "does" and "have" can appear in numerous different contexts, it is very difficult to predict in which database fields the remaining words are contained. The system also would not be able to determine whether "red bolts" corresponds to a multi-word noun or to values of two specific fields (say PART and PART-COLOUR). It would have to infer whether "bolts" is stored as "bolt" or "bolts" in the database.
Similarly it would have to infer that John Smith is a multi-word proper noun and is stored in the database in the given form and not as "John Smith Limited", or even as values of two distinct fields (say SURNAME and FIRSTNAME).

The system also has a very poor user interface. In the query:

    Which F5s carry strut curve radar?

the system must determine whether F5 corresponds to a ship or an airplane since they are both defined as a possible subject for the verb "carry". In this case CO-OP chooses one at random and answers the query, relaying the interpreted meaning to the user by paraphrasing the query. If the user intended the other interpretation, say airplanes, he/she would have to restate the query as follows:

    Which F5 airplanes carry strut curve radar?

This type of assumption is totally unacceptable for any usable system. This ambiguity can be avoided by either examining the database to determine if an "F5" is a ship or an airplane, or prompting the user to choose the intended interpretation.

Another problem caused by not having the contents of the database available to the parser is that all unknown words will be assumed to be unknown database elements. However, there may be many other possible sources of unknown words. For example, a word which identifies a field may have been erroneously left out of the lexicon.
This method also makes it difficult for the system to do any sort of spelling correction or to be able to process abbreviated terms.

Although CO-OP has some serious faults, it does demonstrate that a system can be made portable by specifying all domain dependent information as data to the system. It also shows how the system can exploit the structure of the database as a rich source of semantic information.

A system that achieves a high degree of portability by accessing semantic information during the syntactic phase was implemented by Booth (Booth 1983). This system initially parses queries for noun phrases, verb phrases and prepositional phrases using an ATN parser (Woods 1970). The semantic interpretation part of the system is based on a case grammar (Fillmore 1968). In a case grammar there are a number of "cases" associated with each main verb defined in the system. Each case refers to a phrase in the sentence which fills a specific semantic role associated with the verb. An example of a case frame and the queries it interprets is shown below:

    OPEN ACTION (AG NAME PA MEALS)

    Is the White Spot open for lunch?
    What restaurants are open for dinner?

In the above example the verb "open" has two cases associated with it. They are the "agent" (AG), which will match a phrase in the query corresponding to the name of a restaurant, and "patient" (PA), which will match the phrase corresponding to the meals which the restaurant serves.
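To make the idea of a case frame concrete, the following Python sketch represents the OPEN frame as data and fills its slots from already-parsed phrases. It is an illustration only, not Booth's implementation: the `CaseFrame` class, the `fill_cases` function, and the assumption that each phrase arrives tagged with a semantic category (for example from an inverted index) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CaseFrame:
    """A verb together with the semantic category expected in each case slot."""
    verb: str
    kind: str
    cases: dict  # case label (e.g. "AG") -> expected semantic category


# The frame from the example above: OPEN ACTION (AG NAME PA MEALS)
OPEN = CaseFrame(verb="open", kind="ACTION", cases={"AG": "NAME", "PA": "MEALS"})


def fill_cases(frame, phrases):
    """Assign each (text, category) phrase to the first free matching case slot.

    Returns a dict of case label -> phrase text, or None if some phrase
    cannot be matched to any remaining slot (hypothetical failure policy).
    """
    filled = {}
    for text, category in phrases:
        slot = next(
            (label for label, wanted in frame.cases.items()
             if wanted == category and label not in filled),
            None,
        )
        if slot is None:
            return None
        filled[slot] = text
    return filled


# "Is the White Spot open for lunch?" with phrases assumed to be pre-tagged.
print(fill_cases(OPEN, [("White Spot", "NAME"), ("lunch", "MEALS")]))
```

Because the frame is plain data, a new verb could in principle be added without programming, which is the portability argument made for this design.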
Once the system parses a phrase within the query, it tries to determine its case. When a verb phrase is parsed, the system will try to match all noun and prepositional phrases with the case slots designated by the verb. Once all the phrases in the query have been successfully matched with a case slot, the resulting structure is translated into the "standard sentence representation" (SSR), which specifies the fields which are to be retrieved and the restrictions to be placed on the query. This in turn is translated into a query in a database query language.

The system achieves a high degree of portability by clearly separating the domain dependent information from the domain independent. The domain independent information is called the "linguistic core", which includes the syntactic and semantic components, and the definitions of all words which are normally used in most domains. The domain dependent information, or the "domain definition", consists of words and related syntactic and semantic information which refer to the particular database, a list of additional cases needed in the domain, and an inverted index. Hence when a new domain is implemented, the DBA only needs to supply the system with a new domain definition.

The main disadvantage of this system is that in order to implement a new domain, the DBA needs to be able to determine the cases associated with each new verb.
This requires considerable knowledge about linguistics, which most DBAs don't have. It is also possible that the predefined cases in the system may not be satisfactory for all verbs. In this situation additional programming will be required to add new cases to the system.

The system also does not have an adequate algorithm for matching the parsed phrases of a query to the cases specified within a verb frame. For instance, if there are prepositional phrases in the query which are not matched to a case, there are no options to try to attach the PP to a suitable NP (modifier attachment).

Another disadvantage of this system is that it is designed to work with databases consisting of one relation or table. Hence it does not address the problems introduced by larger databases with more than one relation.

2.6 Summary

There are currently no natural language query systems available that are both domain independent and simple enough for an average DBA to use. The systems best designed for portability are Intellect (Harris 1977) and the system implemented by Booth (Booth 1983). Both of these systems use an inverted index and other database information as a source of semantic knowledge. They also permit a new domain to be implemented with little or no additional programming. However, both of these systems require the DBA to have considerable expertise in linguistics.
This problem can be resolved by designing a data-driven system which uses a domain independent parser with an extensive dictionary containing lexical, syntactic, and semantic information. Additional knowledge can also be gained by exploiting the structure of the database for semantic information.

III. CONCEPTS IN NATURAL LANGUAGE SYSTEMS

The goal in building a natural language query system is to make it both portable and easy to use without sacrificing power. In the past, little attention has been paid to these goals due to the research nature of the projects. Now that NL is being commercialized, more emphasis must be placed on these issues.

The success of previous attempts in designing portable NL systems has been measured by the ease with which an expert can implement a new domain. This criterion is not sufficient if NL systems are to become widely used. Most companies don't want to rely on outside experts or pay their high consulting fees each time a new database is added. If the level of expertise needed to maintain a NL system does not match that of a typical DBA, these systems will have the same fate as many other "difficult-to-use" software packages.

In order to achieve domain portability, it is necessary to clearly separate the domain dependent component from the domain independent part of the system.
To minimize the skill needed to implement a new domain, the domain dependent specification should require a minimal amount of linguistic information. All of the domain dependent information should be given in data form. The DBA should not be required to supply any programmed routines.

In the previous chapter we examined the advantages and disadvantages of other NL systems. In this chapter we will closely examine each component of a NL system to determine whether it is domain dependent or independent. I will also argue for the importance of user friendly components, such as a spelling corrector, and the ability to determine the meaning of unknown words.

For the remainder of this thesis I will define a "user" to be a person who enters queries into the system and who has some knowledge of the contents of the database, but has no knowledge of its structure. A DBA (database administrator) is defined as a person who has considerable knowledge of the structure and the contents of the database, but has little knowledge of linguistics and no knowledge of AI programming. A "system designer" is defined as a person who has considerable knowledge of linguistics, AI programming, and database management systems.

3.1 Syntactic Analysis Concepts

Syntactic analysis involves determining the syntactic category of each word and building a syntactic structure of each query using the rules of a grammar residing in the system.
Routines such as a spelling corrector are normally invoked at this stage. The major benefit of initially analyzing a query at the lexical and syntactic level is that it simplifies processing at the semantic level. This is achieved by mapping many similar queries into one syntactic representation called the deep structure.

3.1.1 ATN Networks

ATNs have been used in many NL systems with varying degrees of success (Woods 1970; Harris 1977). One of the main benefits of using an ATN is that the grammar is domain independent. The domain dependent component consists of the vocabulary enhanced with lexical and syntactic information. No additional programming is required when a new domain is implemented since this information is in data form.

There are two disadvantages in using ATNs. The first is that they can produce more than one interpretation of a query. As shown in the previous chapter, many of these can be avoided by enhancing the ATN with some semantic information. The other disadvantage is that the ATN may parse a phrase in the query more than once. This occurs when more than one arc in the ATN can be taken at a given point. The parse will then continue building a syntactic representation up to a point where no applicable arcs can be taken. At this point the system will back up to the last decision point, discarding all processing done along the incorrect arc.
This problem will not be addressed in this thesis, although there are a few features described in chapter 4 which will lessen it.

3.1.2 Vocabulary

The first step in designing a portable NL system is to distinguish between the domain independent and the domain dependent vocabulary. Words such as "print", "the", "and", "of", and "information" are obviously domain independent. Domain dependent vocabulary comes from two sources: words which describe the information and relationships in the database, and the actual contents of the database.

Although words such as "program" and "run" are considered domain dependent, the lexical properties associated with these words are always the same. The past tense of the verb "run" is always "ran" regardless of the semantic meaning within any database. Hence it makes sense to treat the syntactic definition of all common English words as domain independent information. Initially this may seem to be an unrealistic proposition for any NL system, but many microcomputer word processors now have dictionaries which often contain over 100,000 words.

Providing a dictionary with syntactic information would produce many benefits for a NL system. It would free the DBA from providing this information when defining a database. It would also enhance the system when dealing with words which are not specified in either the domain dependent or domain independent components.
Another advantage is that the syntactic definition of words contained in this dictionary will be consistent from domain to domain, and the chance of any errors caused by incorrect definitions would be reduced.

3.1.2.1 Morphing

One area of morphology (the study of the internal structure of words) which is often used in NL systems is analyzing a word for prefixes and suffixes. This process is commonly called morphing. The benefit of having the system recognize suffixes is that it can recognize various forms of words, such as plural forms of nouns and past forms of verbs, without having each form of the word defined separately. The information needed to perform this function on English words with regular endings is domain independent. Therefore DBAs should not need to specify any morphological information when defining a database, unless an obscure word is used.

It is possible for the system to encounter words (often nouns) where the morphological information is not available. This may happen when a noun is added to the database and is not contained in the dictionary. In this case the system can try to infer the root form of plural nouns in the query by simply attempting to remove possible endings ("s", "es", etc.) and checking to see if the "computed" root exists.

3.1.2.2 Nouns

Although there are a few nouns, such as "report", "information", and "number", which are domain independent, most nouns are domain dependent.
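The root-inference heuristic described under Morphing applies mainly to such domain dependent nouns, so a brief sketch may be useful at this point. The Python below is a minimal illustration under stated assumptions, not the routine used by any system discussed here: the function name `infer_root`, the set of known roots (standing in for the dictionary or inverted index), and the fixed list of endings are all hypothetical.

```python
def infer_root(word, known_roots, endings=("es", "s")):
    """Guess the root of a possibly plural noun by stripping regular endings.

    A candidate root is accepted only if it already exists in `known_roots`,
    mirroring the check that the "computed" root exists. Returns None when
    no candidate is found.
    """
    w = word.lower()
    if w in known_roots:
        return w
    for ending in endings:
        # Never strip the whole word away.
        if w.endswith(ending) and len(w) > len(ending):
            candidate = w[:-len(ending)]
            if candidate in known_roots:
                return candidate
    return None


# Hypothetical database vocabulary for a Supplier/Parts domain.
roots = {"bolt", "box", "supplier"}
print(infer_root("bolts", roots))
print(infer_root("boxes", roots))
```

Irregular plurals are deliberately out of scope for this sketch; they would need explicit entries in the dictionary.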
Many nouns represent values stored in the database. In the query:

    How much is a bolt?

the noun "bolt" could be stored in a "Supplier/Parts" database. Nouns are also used to identify types of information in a database. In the query:

    What programs have good ratings?

the noun "programs" refers to the relation (table) of computer programs, while "ratings" refers to a field (attribute) specifying the rating of the computer program.

3.1.2.3 Proper Nouns

Proper nouns are usually values stored in the database. The query:

    Does John Wilson produce Visicalc?

contains the proper nouns "John Wilson" and "Visicalc", which refer to a software producer and program name respectively. It is also possible that these names may not be contained in the database at all. In this case it is important for the system to recognize such proper nouns and respond appropriately.

3.1.2.4 Adjectives

Adjectives are also domain dependent, although they may appear in many different domains. Some adjectives, such as the set of colours, are likely to be defined similarly in each domain, while other adjectives, such as "good", "terrible", "cheap", or "expensive", may be defined differently each time they are used. For instance, a cheap computer program may be defined to cost less than $100.00, while a cheap house would obviously be defined differently.
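A domain-specific adjective definition of this kind could be supplied as data rather than as a programmed routine. The Python sketch below is illustrative only: the table layout, the `satisfies` helper, the field name `cost`, and the housing threshold are assumptions introduced here; only the $100.00 figure comes from the example above.

```python
import operator

# Each domain maps an adjective to (field, comparison, threshold).
SOFTWARE_ADJECTIVES = {
    "cheap": ("cost", operator.lt, 100.00),       # from the example in the text
}
HOUSING_ADJECTIVES = {
    "cheap": ("cost", operator.lt, 150000.00),    # hypothetical threshold
}


def satisfies(record, adjective, definitions):
    """Return True if `record` meets the domain's definition of `adjective`."""
    field, compare, threshold = definitions[adjective]
    return compare(record[field], threshold)


program = {"name": "LOGO", "cost": 49.95}
house = {"address": "12 Elm St", "cost": 120000.00}
print(satisfies(program, "cheap", SOFTWARE_ADJECTIVES))
print(satisfies(house, "cheap", SOFTWARE_ADJECTIVES))
print(satisfies(house, "cheap", HOUSING_ADJECTIVES))
```

The same word is resolved against a different table in each domain, so the parser itself never needs to change.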
Since attributes such as colour, rating, and cost are likely to be used in many different databases, it would be useful to have the needed information defined in such a way that the DBA can access these definitions easily and consistently.

3.1.2.5 Verbs

Verbs play many roles in NL systems. Some of these roles are domain dependent, while others are domain independent. One category of verbs which is domain independent is "command verbs". The following two queries:

    Print the good programs.
    Count the good programs.

contain verbs which could be used in any domain. These verbs instruct the system to perform some task (i.e. print a report; count something) and should be defined consistently in every database.

A category of verbs which is largely domain dependent is "relation verbs". In the following queries the verb "run" specifies a semantic relationship between computers and programs, or between an athlete and a race.

    What computers run good programs?
    Does the Apple II run LOGO?
    Which athletes run the 100 metre dash?

Since the required definitions of such verbs are determined by the given database, these verbs should be defined for each database in which they are used.

There are a number of verbs which are similar to relation verbs but can be used in any domain. The verbs "be" (is, are, etc.) and "have" are two such verbs when they are used as the main verb of a query. In the queries:

    Is Logo a good program?
    Does LOGO have a good rating?
both "is" and "have" are used to place a constraint (rating = good) on the program Logo.

Another set of verbs which are domain independent are auxiliary verbs. These verbs have many different functions in English. For example, they can be used to specify a "yes-no" question as shown by the verb "does":

    Does the Apple have 64 KB?

They can also be used to specify tense as shown by the auxiliary verb "have" in the following query:

    What companies have supplied project J3?

3.1.2.6 Adverbs

Adverbs can be either domain dependent or domain independent. Adverbs such as "please" and "immediately", as shown in the following queries, are domain independent and offer little semantic information.

    Print the good programs immediately!
    Please print the good programs!

Other adverbs could be used to designate the value of a field. The adverbs "quickly" and "slowly" could specify the relative speed of a computer program. In this case they are considered to be domain dependent.

3.1.2.7 Prepositions

Prepositions are very useful in the semantic interpretation of a query. In NL systems with small domains, it is often possible to predict the meaning of the noun following a preposition by analyzing the preposition alone. For example, in the query:

    Print the suppliers from Spuzzum.

it is reasonable to predict the name "Spuzzum" is a location (city, town, etc.), especially if each supplier has a location associated with it in the database.
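The prediction described above can be pictured as a table lookup. The following Python fragment is only an illustrative sketch and not the method used by the system described here: the table PREPOSITION_FIELDS, the function predict_field, and the relation and field names are all hypothetical.

```python
# Hypothetical domain-dependent table: for a given relation and preposition,
# the database field that the following noun is expected to name.
PREPOSITION_FIELDS = {
    ("supplier", "from"): "city",
    ("supplier", "in"): "city",
    ("program", "from"): "publisher",
}


def predict_field(relation, preposition):
    """Return the field predicted for the noun after `preposition`, or None."""
    return PREPOSITION_FIELDS.get((relation.lower(), preposition.lower()))


# "Print the suppliers from Spuzzum." -> "Spuzzum" is predicted to be a city.
predicted = predict_field("supplier", "from")
```

Keeping the table as data means that a different domain could supply different entries without any change to the lookup code.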
In other situations prepositions are used to determine the role a noun phrase plays with respect to the main verb. In the query:

    Who flies from Vancouver to Toronto?

the location following "from" signals the origin while "to" signals the destination. Although prepositions are syntactically domain independent, most semantic information associated with them is domain dependent.

3.1.3 Handling Unknown Words / User Interface

In designing a system to be used by casual users, it is important for the system to be able to deal with virtually anything a user may type. One shortcoming of many previous NL systems is that they accepted only a limited vocabulary. However, there are numerous reasons why a user may specify a word that is unknown to the system. An unknown word could be caused by a spelling mistake, an abbreviation, a word assumed to be contained in or associated with a database, or even a word which the DBA mistakenly never included in the specification.

3.1.3.1 Spelling Correction

A reasonable spelling correction facility is important in any software system where users are expected to type in natural language. This need is multiplied if the user has little experience with the computer's keyboard. Methods for spelling correction range from simply looking up the word in a dictionary and prompting the user if the word is not found, to complex algorithms that compare the word against a list of known words.
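The two ends of this range can be combined in a small sketch: look the word up first, and only if it is missing compare it against the known words. The text above does not say which comparison algorithm is intended, so Levenshtein edit distance is used here purely as one common choice. The names edit_distance, suggest, and VOCABULARY, and the threshold of two edits, are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def suggest(word, vocabulary, max_distance=2):
    """Return [word] if it is known, else known words within max_distance edits."""
    word = word.lower()
    if word in vocabulary:
        return [word]
    scored = sorted((edit_distance(word, known), known) for known in vocabulary)
    return [known for distance, known in scored if distance <= max_distance]


# Hypothetical vocabulary; a real system would draw on its dictionaries.
VOCABULARY = {"program", "print", "computer", "rating"}
candidates = suggest("progam", VOCABULARY)  # closest known words, nearest first
```

An empty result would be the point at which the user is prompted, since no known word is close enough to offer as a correction.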
3.1.3.2 Abbreviations

The system must be able to identify both common abbreviations and the ones invented by users to save keyboarding time. Abbreviations such as "LTD." and "AVE." should obviously be built into the system. It should also allow either the DBA or the user to define abbreviations such as "Kb" or "K" (for Kilobyte), or "J. W. Software" (for John Wilson Software Limited). Once such an abbreviation is accepted, the system should remember it for future use.

3.1.3.3 Assumed Database Elements

One common source of unknown words arises when users try to inquire about elements which are not contained in the database. Harris (1977b) assumes that users will only ask questions about elements which are in the database. This may be a reasonable assumption to make if the database is being used in a corporation where the employees are familiar with the data stored. This assumption is invalid if users are unfamiliar with the database or if they are inquiring whether something is actually stored in the database. For example, assume in the following query that "paper" is not a part (or anything else) contained in a database.

    Does Smith supply paper?

One would hope that the system would be able to parse "paper" and appropriately answer:

    I don't know. "PAPER" is not a "PART" contained in the database.

3.1.3.4 Ignoring Words

It is impossible to predict every possible word that a user may type.
Hence it is important for the system to be able to process queries which contain such words in a reasonable way. One useful technique which can be used is to ignore the word. For example, in the query:

    Print the colloid Chemistry programs.

the system may not understand the word "colloid". In this case the user should have the option of having the query answered with "colloid" removed from the query.

3.1.3.5 Knowledge Acquisition

Knowledge acquisition can range from the system storing new abbreviations or synonyms to learning about new sentence structures and definitions of new verbs. Although the more complex definitions would need to be supplied by a system designer or the DBA, considerable knowledge can be given by the user. One simple method for learning new words is to allow the user to specify a synonym for a word which is not known to the system. For example, in the following query the system may not have any semantic (or even syntactic) information on the verb "execute".

    What computers execute LOGO?

In this case the system may prompt the user for a synonym for the given word. If at this point the user enters a word which has meaning such as "run", the system could correctly answer the query.

When the user supplies the system with information such as an abbreviation or a synonym for an unknown word, it is important that it be stored for future reference.
This will prevent the system from asking the user the same question twice, as well as making the system more robust over time.

3.2 Concepts In Semantic Analysis

There has been little work done in constructing a portable semantic processor as opposed to syntactic processors. Methods such as Procedural Semantics (Woods 1967, 1968) require the DBA to do extensive repetitive work for each database domain. Other methods (Booth 1983) lessened this work, but still required the DBA to have considerable knowledge of linguistics.

A portable semantic processor should be designed in a similar fashion to the syntactic processor. The domain dependent information should be separate from the domain independent component. The domain dependent component should be in data form and require the DBA to specify as little linguistic information as practically possible.

3.2.1 Definition Of Semantic Information

There are various sources of semantic information, some of which are domain dependent while others are domain independent. The semantic meaning of command verbs such as "PRINT" or "COUNT" is obviously domain independent. Other words such as "A", "THE", "OR", "DOES" and "PLEASE" are also domain independent and should be defined accordingly.

There are a number of types of domain dependent semantic information which must be supplied by the DBA. Obviously the system will need to know which words identify each relation and field in the database.
For example, the nouns "program" and "software" could be used to identify the relation program. It is also necessary to specify the relationships between the relations and fields in the database. For example, the verb "RUN" specifies a relationship between the programs and computers.

The structure of the database contains considerable semantic information which can be exploited. For example, in the query:

    What has 64 KB of memory?

the system should realize that memory is an attribute (field) of a computer and not of a program. This would allow the system to predict the intended meaning of "What".

Some words appear to be domain independent, but their semantic meaning is actually closely tied to the database. The word "Who" could refer to any human (employee, customer) or a group of humans (computer manufacturer, program publisher) while the word "What" or "thing" could refer to any physical or abstract object (computer, program, job). This type of semantic information must be supplied for each domain, hopefully in a simple and precise way.

3.2.2 Some Problems In Semantic Interpretation

There are numerous problems in semantically interpreting queries which need to be resolved in such a way that domain portability is maintained.

One common problem is that there may be words or phrases in a query which cannot be interpreted. Often it is possible to ignore these words or phrases at the semantic level and still answer the query appropriately.
For example, the words "quick" and "quickly" could be ignored in the following query without having a major effect on the main objective of the query.

    Print the quick French programs quickly.

The decision on the words which can be ignored is obviously domain dependent since it is possible to have a database where both "quick" and "quickly" are semantically defined.

One classical problem in semantic interpretation is attaching modifiers to the correct noun phrase (modifier attachment). In the query:

    What computers run LOGO with 128 KB?

the prepositional phrase "with 128 KB" should modify "computers" and not "LOGO" since the amount of memory is treated as an attribute of a computer and not a computer program.

Conjunction also presents many problems in semantic interpretation. The query:

    Print the Apple and IBM computers.

is ambiguous since the user's intended interpretation may be either:

    (1) Print the IBM computers and print the Apple computers.
        (Print the Apple or IBM computers.)
or
    (2) Print the computers which are both Apple and IBM.

Using the heuristic that a user will not ask an impossible question and the fact that the database forces each computer to have at most one computer manufacturer, the intended interpretation is obviously (1).

The semantic information needed for modifier attachment and interpreting some types of ambiguous queries is contained in the structure of the database.
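As a rough illustration of how a structural fact can settle the conjunction example, the sketch below reads "and" as OR whenever the field involved can hold only one value per row. It is a hypothetical fragment, not the system's semantic interpreter: SINGLE_VALUED_FIELDS, interpret_conjunction, and the relation and field names are assumptions.

```python
# Hypothetical schema facts: the (relation, field) pairs for which the
# database allows each row at most one value.
SINGLE_VALUED_FIELDS = {("computer", "manufacturer")}


def interpret_conjunction(relation, field, values):
    """Decide how "X and Y" over one field should be read.

    Returns ("OR", values) when the field is single-valued, because no row
    could hold both values at once; otherwise returns ("AMBIGUOUS", values).
    """
    if (relation, field) in SINGLE_VALUED_FIELDS and len(values) > 1:
        return ("OR", values)
    return ("AMBIGUOUS", values)


# "Print the Apple and IBM computers." -> interpretation (1)
reading = interpret_conjunction("computer", "manufacturer", ["Apple", "IBM"])
```

The sketch deliberately leaves multi-valued fields unresolved, since the heuristic only rules out readings that the structure of the database makes impossible.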
Hence it is important for the semantic interpreter to be provided with this information.

3.3 Database Interface

Many of the earlier NL query systems did not attempt to resolve the problem of transporting the system to other DBMS. Some systems simply implemented the database component in LISP (Woods 1967). This method is suitable for research oriented systems, but is not appropriate for larger commercial databases. Other systems tightly bound the database component with the semantic interpreter (Hendrix et al 1978; Sagalowicz 1977). This made it very difficult to adapt the system to a new DBMS.

The most successful approach towards designing a NL system is to design it to act as a front-end to existing database systems. Intellect has taken this approach and now can be interfaced to numerous database systems.

In order to make the job of adapting the system to a new DBMS as easy as possible, the database interface component should be separated from the semantic and syntactic components. This can be achieved by having the semantic component pass a structured representation of the query to the database interface routine, which in turn translates it into a query in the database query language of the given DBMS.

3.4 Summary

All of the programmed components, including both the syntactic and semantic processors, discussed in this chapter are domain independent. To implement a new domain, these components only require information in data form.
By supplying the system with sufficient information about the structure of the database, it is possible to resolve even some of the more complex problems in NL such as modifier attachment and ambiguous conjunctions. The system should also be given as much syntactic and semantic information as possible in order to make it more powerful and to make the job of implementing a new domain as easy as possible. Although features such as an inverted index of the database and a dictionary of the syntactic definitions of common English words will obviously take up considerable storage (primary storage), they are required for a truly intelligent system.

In order for the system to be portable across various database systems, the database interface component should be separated from the semantic and syntactic components. By designing the system as a front end processor, it is possible to adapt it to other applications such as statistical, spreadsheet, or even graphics packages.

IV. SYSTEM DESIGN I: SYNTACTIC ISSUES

A prototype natural language system has been developed to test the feasibility of a system based on the concepts presented in Chapter 3. The system acts as a natural language front-end to an existing relational database management system.
This chapter contains an overview of the system and a detailed description of the syntactic components of the system, including the parser, grammar, and user interface facilities.

4.1 Overview Of The System

The system has been designed so that the domain independent component is separate from the domain dependent information. The domain dependent information is in data format and has been structured so that it is easily maintained by an average DBA.

The system is logically divided into four components: the syntactic processor, the semantic processor, the database interface, and the response generator. Each of these components is domain independent. The database interface and the response generator are the only components which need to be modified if the system is adapted to another DBMS. A diagram of the system is shown in figure 4.1.

The syntactic processor's job is to analyse each word and produce a syntactic representation of the query called the deep structure. The semantic processor takes the deep structure and transforms it into a "Standard Query Representation" (SQR), which contains the information needed to construct a database query. The database interface then transforms the SQR into a query in a specific database query language. This query is then executed by the DBMS and the results are passed on to the response generator.
The response generator's job is to take the information returned from the DBMS and present it to the user in an appropriate format.

The Active Domain Dictionary is the main source of syntactic and semantic knowledge used by the system. This dictionary is compiled from the information contained in both the domain dependent and domain independent components.

The domain independent information is stored in the General Dictionary and the Global Dictionary. The General Dictionary contains syntactic and semantic information about each word commonly used in most domains. The Global Dictionary contains syntactic and some semantic information for most English words. This dictionary serves two main purposes. Its primary use is to supply the syntactic definition of every word specified in the domain dependent component. This relieves the DBA from specifying this information when a word is used. It is also used when a word is not found in the Active Domain Dictionary. This gives the system the ability to process words which are not directly associated with the domain.

The domain dependent information is specified in the "database schema", the "database schema library", and an inverted index.
The database schema contains a structured description of the database, which includes the definition of all the relations and their associated fields in the database, the nouns which identify each relation and field, verbs which specify relationships between relations and/or fields, and information on how relations in the database can be joined together.

The database schema may access the database schema library. This library contains definitions of fields and relationships which are commonly used in many different databases. For instance, a standard definition of "CITY" is stored here since cities are contained in many different databases.

The inverted index is a table of all the words contained in the given database. For each of these words, the index contains the database field(s) which the word is associated with, and some syntactic information. The syntactic information for each of these words is supplied by the database schema and the Global Dictionary. The database schema specifies a default syntactic category for each entry in a field. Additional syntactic information needed to determine various forms of a word (e.g. plural) is available for words contained in the Global Dictionary.
For example, if the word "phone" is stored in a database field containing parts, the database schema would specify that the word is to be treated as a noun and not as a verb, and the Global Dictionary would supply the information needed to determine the plural and possessive forms.

[Figure 4.1 - Diagram of the System. Processing flow: Query, Syntactic Processor (with the ATN Grammar), Deep Structure, Semantic Processor, SQR (Standard Query Repn.), Database Interface, Database Query (Sequel), Database (ORACLE), Database Output, Response Generator, Response. The syntactic and semantic processors can also return errors/responses. Also shown: the Active Domain Dictionary and a COMPILE step, together with the Global Dictionary, General Dictionary, Database Schema Library, Database Schema, Inverted Index, and Database. The key distinguishes program components from data structures and I/O.]

In order for a new domain to be implemented, the DBA needs only to develop a new database schema. If there are any words in the database schema which are not contained in the Global Dictionary, the DBA may optionally specify the syntactic information. The inverted index is automatically constructed by the system.

The system has been tested using two databases. One database contains information about computer programs and the computers which they run on, and the other contains information about the parts that suppliers supply to various jobs. The system can answer a wide range of questions, including "yes/no" questions, "how-many" questions, and questions which are answered in a report format. Listed below is a sample of the types of questions that the system can answer.
    Print the good programs from MECC.
    Is LOGO a good computer program?
    Tell me if LOGO is an excellent program.
    What runs what?
    How many chemistry programs run on the Apple II?
    How much memory does the IBM PC have?
    What programs run on IBM computers with 128 KB?
    Print the computers which run software from Microsoft with 128 KB?
    Print the good and excellent computer programs.
    Does Apple sell food?
    Does John Doe sell good programs?
    Print the numbers.
    Who supplies red, green, or blue bolts?
    How much does part P4 weigh?
    Does the collator project use bolts from Smith?

Figure 4.2 - Sample Queries

The system is written in LISP (UBC-LISP) and runs on an Amdahl 470-V8. The database management system used is ORACLE 4.0, which uses the query language SQL.

4.2 The Syntactic Processor

The main component of the syntactic processor is the ATN parser. The parser invokes routines such as the morpher, the spelling corrector, and various user interface routines. If the query is successfully parsed, control is passed on to the semantic processor, otherwise an appropriate error message is printed out.

4.2.1 ATN Parser

The job of the ATN parser is to syntactically analyse the query using a specified grammar and produce a structured representation, or deep structure, of it. This deep structure consists of a set of registers, which are passed to the semantic processor. The ATN parser used in this system is written in LISP and was developed by Dr. R. Reiter (Reiter 1978).
This parser is based on the specifications stated by Woods (Woods 1970). A slight modification has been made to this parser to correctly parse relative clauses.

4.2.2 ATN Grammar

A grammar is a set of rules which describe the sentences allowed in a language. The grammar used by the ATN parser is essentially a syntactic grammar enhanced with some semantic routines. A syntactic grammar analyses a sentence in terms of syntactic structures such as verbs, nouns, noun phrases, and prepositional phrases. Semantic routines have been included to help prevent syntactic ambiguity.

The grammar used in the system is based on the ATN grammar for English developed by Winograd (Winograd 1983). It is capable of parsing many forms of sentences, including passive constructions, relative clauses, some forms of conjunction, and numerous forms of questions.

The ATN is divided into five separate components or networks. The main network analyses the complete sentence (S), and invokes the other networks to parse other syntactic components. The other four networks are used to analyse noun phrases (NP), prepositional phrases (PP), proper nouns (NAME), and "noise words" (NOISE). Noise words are groups of words which frequently appear at the start of a question but have no semantic effect on the meaning. For example, the phrase "Can you please" in the query

    Can you please print the good wordprocessing programs?
can be ignored without changing the meaning of the query. Diagrams of the ATNs are contained in Appendix A.

The parser has been designed to limit any syntactic ambiguity which may occur. To avoid the syntactic ambiguity that occurs with proper nouns such as "Wiley and Sons of Canada Limited", a semantic routine has been added to the parser. The system also avoids ambiguous syntactic parses associated with conjunction and modifier attachment by parsing these phrases in their surface structure. The actual interpretation is determined in the semantic stage for these queries.

4.2.3 ATN Registers

The ATN uses a set of registers to represent the syntactic structure of the query. These registers are also used as the primary data structure during the semantic interpretation. The registers used are similar to the set used by Winograd (Winograd 1983). A sample parse of a query is contained in Appendix B.

In addition to the registers described by Winograd, the registers PRED and VFRAME are used in the sentence (S) network. PRED is used to store the complement (or PREDicate) of copula verbs ("is", "seems", etc.). The register VFRAME is used by the semantic processor and will be explained in Chapter 6.

Each noun phrase in the query has a separate set of registers. These are also described by Winograd with the exception of ANP, POSN, ANPR, and SEMREG. These registers have been added to aid in the semantic interpretation.
The surface form of a noun phrase is stored in the register "ANP" (actual noun phrase). This is used to form a prompt to the user when the intention of a noun phrase is ambiguous. The position of the noun phrase within the original query is stored in "POSN". This is used by the semantic interpreter in order to attach modifiers to the correct NP (see section 6.1.5). The register "ANPR" is used to store the full form of a compound proper noun. This is required when users use a shortened or abbreviated form of a proper noun. SEMREG is used to store the semantic meaning of the NP with respect to the database and will be explained in Chapter 6.

4.2.4 The Morpher

The morpher's function is to determine the root form of each word in the query. The advantage of computing the root form of a word is that for words with regular endings, the dictionary only needs to contain the root forms. To accomplish this the dictionary must contain the syntactic category and a "morphological code" for the root of every regular word defined. The morphological code informs the morpher of the allowable suffixes a word may have. For example, the code "IES" in the definition below informs the system how the plural and possessive forms of the word are constructed.

    (COMPANY N IES)

In order to find the root of a word, the morpher is supplied with a table containing common suffixes with associated information as shown in figure 4.3.
The morpher will then systematically check the input word against each ending in the suffix table. If the input word has the same ending, the morpher will remove that ending and add the given replacement ending. For example, for the input word "companies", the ending "ies" will be removed and the ending "y" will be added. At this point the morpher will check for the computed word in the dictionary. If this word exists and the morphological code of the root word matches the ending removed from the input word, the input word is treated as a valid form of the root word and is assigned the given syntactic category and features. Hence the word "companies" would be defined as follows:

    (COMPANIES N (COMPANY (PLURAL)))

Once the root form of the input word is found, the definition is added to the dictionary for the remainder of the session. This saves the root form of a word from being computed more than once in any session.

It is possible for the dictionary to contain words where the morphological code of the word is not known. This could happen if a word in the database is not contained in the global dictionary. In this case the system will not force the morphological code of the root word to match that of the input word. For example, for the word "collator", the morpher would accept "collators" or "collatores" for the plural form. Although it is possible for the system to make an incorrect assumption, it is highly unlikely.
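The root-finding procedure described above can be sketched as follows. This is an illustrative Python sketch only, not the code of the system itself: the function name find_root, the layout of the dictionary, the order in which the endings are tried, and the COMPUTER and COLLATOR entries are assumptions made for the example. The suffix rows are taken from the partial suffix table of figure 4.3, with an empty string standing for "no ending to add".

```python
# Rows: (ending to remove, ending to add, syntactic category,
#        morphological code, syntactic feature of the new word).
# The order (longest endings first) is an assumption of this sketch.
SUFFIX_TABLE = [
    ("IES", "Y", "N", "IES", "PLURAL"),
    ("ES", "", "N", "ES", "PLURAL"),
    ("'S", "", "N", "S", "POSSESSIVE"),
    ("'S", "", "N", "IES", "POSSESSIVE"),
    ("S", "", "N", "S", "PLURAL"),
]

# Hypothetical dictionary: root word -> (category, morphological code).
# A code of None models a word whose morphological code is not known.
DICTIONARY = {
    "COMPANY": ("N", "IES"),
    "COMPUTER": ("N", "S"),
    "COLLATOR": ("N", None),
}


def find_root(word, dictionary=DICTIONARY, session_cache=None):
    """Return (WORD, category, (ROOT, (feature,))) or None if no root is found."""
    word = word.upper()
    if session_cache is not None and word in session_cache:
        return session_cache[word]  # already computed during this session
    for remove, add, category, code, feature in SUFFIX_TABLE:
        if not word.endswith(remove):
            continue
        root = word[: len(word) - len(remove)] + add
        entry = dictionary.get(root)
        if entry is None:
            continue
        root_category, root_code = entry
        if root_category != category:
            continue
        # An unknown code (None) is not forced to match the removed ending.
        if root_code is not None and root_code != code:
            continue
        definition = (word, category, (root, (feature,)))
        if session_cache is not None:
            session_cache[word] = definition
        return definition
    return None
```

With these assumed entries, find_root("companies") yields the structure corresponding to (COMPANIES N (COMPANY (PLURAL))), while both "collators" and "collatores" are accepted for COLLATOR because its morphological code is unknown.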
If the morpher fails to find a root word for the given input word, the morpher will invoke various routines, such as the spelling corrector. These routines will be described later.

    Ending to   Ending   Syntactic   Morphological   Syntactic features
    Remove      to Add   Category    Code            of New Word

    S           -        N           S               PLURAL
    ES          -        N           ES              PLURAL
    IES         Y        N           IES             PLURAL
    'S          -        N           S               POSSESSIVE
    'S          -        N           IES             POSSESSIVE

    Figure 4.3 - Partial Suffix Table

4.2.5 Parsing A Query

The sentence (S) ATN is responsible for parsing the main components of the query. These components may include auxiliary verbs (AUX), the main verb (V), noun phrases (subject (SUBJ), direct object (DOBJ), indirect object (IOBJ)) and prepositional phrases (PPOBJS). The S ATN invokes the noun phrase (NP), prepositional phrase (PP), and noise word (NOISE) ATNs to parse specific components of the query.

If the query is successfully parsed, the deep structure is produced from the various registers, and the semantic processor is called. If the query cannot be parsed, an error message is printed out. In this case the user is expected to rephrase the query.

The following sections explain some of the techniques and features used in the system. These topics include the handling of conjunction, compound proper nouns, prepositional phrases, relative clauses, and noise words. For a more detailed explanation of the basic ATN grammar, the reader is referred to Winograd (Winograd 1983).
4.2.5.1 Noun Phrases

The system uses a separate ATN to parse noun phrases (NP). It is capable of parsing various forms of NPs, including relative clauses and some limited forms of conjunction. Some examples are shown below:

    computers
    the good computer programs
    what
    how many suppliers
    a red part from Smith
    the MECC programs which teach chemistry
    the Apple and IBM programs

4.2.5.2 Conjunction

Conjunction is a very complex area of natural language. The NP network is capable of parsing some limited forms of conjunction, as shown below.

    What Apple and IBM computers
    the red, green, or blue parts
    the good or excellent computer programs

When dealing with conjunction there is a possibility of ambiguity. The parser avoids any possible ambiguity by parsing the query at its surface level. The resolution of any ambiguity is dealt with by the semantic processor (see section 6.2.3.2.3).

4.2.5.3 Compound Proper Nouns

One of the main features of the system is its ability to deal with compound proper nouns. The system is capable of recognizing shortened forms of proper nouns. For example, the software supplier:

    John Wilson and Sons Limited

could be derived from any of the following names:

    J. Wilson Ltd.
    Wilson and sons
    John

The system is also capable of determining proper nouns which are not contained in the database. For example, in the query:

    Does Steve Smith publish good programs?
the system would parse "Steve Smith" as a proper noun, and the semantic processor would later inform the user that "Steve Smith" is not a valid software publisher.

The system uses a separate ATN network augmented with some semantic routines to parse proper nouns. The system will first try to parse as many subsequent words as possible which may be part of a proper noun. These words include proper nouns, abbreviations, initials, and nouns which are commonly parts of names (Mister, company). Words such as "of" and "and" are also treated as proper nouns if they appear in any compound proper nouns. It will then try to match this string of words with proper nouns contained in the database. In the case of an exact match, or a unique match where either a word, an abbreviation, or an initial corresponds to each word in the compound proper noun, the system will use the given NPR without prompting the user. If a possible match is found, the system will prompt the user for verification. For example, the phrase:

    J W Ltd

could match either:

    John Wilson and Sons Limited

or

    John W Software of Canada Limited.

In this case the user would select the intended meaning. Once the system determines the intended proper noun, the full version of the name is stored in the ANPR (Actual Noun-Proper) register to aid in the semantic interpretation.

If a suitable match is not found, the system will try to match a smaller portion of the input string.
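The matching of shortened forms, together with the retreat to a smaller portion of the input, can be sketched roughly as below. This Python sketch is an assumption-laden illustration rather than the system's actual routine: the names normalize, token_matches, is_short_form, and match_proper_noun, the small ABBREVIATIONS table, and the rule that input words must appear in the same order as in the full name are all hypothetical. Portions ending in a connective such as "and" are skipped, since such an abbreviation is not considered acceptable.

```python
# Hypothetical abbreviation table; the real system draws on its dictionaries.
ABBREVIATIONS = {"LTD": "LIMITED", "CO": "COMPANY"}
CONNECTIVES = {"AND", "OF"}


def normalize(phrase):
    """Upper-case the phrase and drop periods, e.g. "J. Wilson Ltd." -> J WILSON LTD."""
    return [word.strip(".").upper() for word in phrase.split()]


def token_matches(token, name_word):
    """A token may be the word itself, a known abbreviation, or an initial."""
    if token == name_word:
        return True
    if ABBREVIATIONS.get(token) == name_word:
        return True
    return len(token) == 1 and name_word.startswith(token)


def is_short_form(tokens, name_words):
    """True if the tokens match, in order, a subsequence of the name's words."""
    position = 0
    for token in tokens:
        while position < len(name_words) and not token_matches(token, name_words[position]):
            position += 1
        if position == len(name_words):
            return False
        position += 1
    return True


def match_proper_noun(phrase, database_names):
    """Return (number of input words used, candidate full names).

    The longest portion of the input is tried first; shorter portions are
    tried only when no database name matches.
    """
    tokens = normalize(phrase)
    for length in range(len(tokens), 0, -1):
        portion = tokens[:length]
        if portion[-1] in CONNECTIVES:
            continue  # e.g. "John Wilson and" is not tried
        candidates = [name for name in database_names
                      if is_short_form(portion, normalize(name))]
        if candidates:
            return length, candidates
    return 0, []
```

In this sketch a result with more than one candidate, as for "J W Ltd", corresponds to the case in which the user is prompted to select the intended name.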
For the phrase:

    John Wilson and John Brown

the system would first fail on matching the entire phrase with an appropriate proper noun, then it would try the following:

    (1) John Wilson and John
    (2) John Wilson

At point (2) a suitable match would be found and the user would be prompted for verification. The system will not try to match "John Wilson and", since such an abbreviation is considered unacceptable.

The system uses two methods to parse compound proper nouns which are not contained in the database. The first method is to check whether the word(s) is stored in the Global Dictionary as a proper noun. If this is the case, the system will infer that the given proper noun(s) is an unknown (compound) proper noun and will then continue to parse the query. In the query:

    Does Steve Smith publish good programs?

both "STEVE" and "SMITH" are stored in the global dictionary as proper nouns.

If a word is not contained in the global dictionary, the system presently prompts the user for additional information. If the user specifies that the word is an "unknown database element", the system will continue the parse and attempt to answer the question appropriately. For example, let us assume that the system does not know about the word "Vavrik" in the following query:

    Does John Vavrik sell good programs?

In this case the system will parse "John" as a proper noun and will be informed by the user that "Vavrik" is an unknown database element.
In this case the system will assume that "John Vavrik" is a compound proper noun and will produce the following answer:

    (JOHN VAVRIK) is not (THE NAME OF THE PUBLISHING COMPANY OF A PROGRAM) contained in this database!

This algorithm can be improved to parse a greater percentage of compound proper nouns by using a number of heuristics not presently implemented. One simple heuristic would be to attach a semantic marker to each proper noun in the Global Dictionary to specify whether it is commonly used as a first and/or last name. The system could then infer that an unknown word following a known first name is most likely a last name. Similarly, if an unknown word precedes a known last name and possibly follows a title (Mr., Dr., etc.), the system could infer that it is an unknown first name. Similar techniques could be used to parse addresses. For instance, an unknown word between a number and a word such as "street" is normally considered to be a street name.

Another heuristic we use in written English to identify proper nouns is to check for capitalization. This has not been implemented, since one of the goals was to explore the identification of proper nouns using semantic and syntactic information.

Once a proper noun is successfully parsed and matched to some database element, the ATN will store this result so that the given string of proper nouns will not have to be parsed again if backtracking occurs.
By treating words such as "of" and "and" as possible proper nouns, the ATN can also avoid unnecessary parses. For example, in the query:

    Print the John Wilson and Son's computer programs.

the system will not try to parse "and" as a conjunction, since it is obviously part of a proper noun. Both of these techniques improve the performance of the ATN parser.

This algorithm for parsing compound proper nouns has worked well for recognizing program publishers in the test database. Although it requires the use of an inverted dictionary and a global dictionary, it allows the system to answer a greater percentage of queries intelligently and correctly.

4.2.5.4 Prepositional Phrases

One of the major sources of syntactic ambiguity is the fact that prepositional phrases (PP) may be physically separated from the head nouns which they modify. For example, in the following query the phrase "from Apple" may modify either "computers" or "Logo".

    What computers run Logo from Apple?

The parser avoids the decision of attaching the PP to the correct head noun by parsing prepositional phrases at their surface level. Since this decision requires semantic information, this process is done by the semantic processor.

The only time the parser will attach a PP to a head noun is when there is no possibility of ambiguity. For example, the phrase "from Apple" can only modify "computers" in the following query.
    What computers from Apple run Logo?

4.2.5.5 Relative Clauses

The system is capable of parsing relative clauses which begin with a relative pronoun (e.g. which, that). These relative clauses act as modifiers of the head noun in the given noun phrase. Some examples are given below:

    the programs which teach chemistry
    a computer that runs LOGO
    any good program which runs on the IBM PC

Relative clauses are initially identified by watching for a relative pronoun in the NP ATN. If one is found, the S ATN is called using a special entry point for relative clauses (S/REL). A copy of the head noun in the noun phrase is passed to the S ATN by the HOLD register. The head noun will then act as the subject of the embedded sentence.

The system does not currently perform modifier attachment for relative clauses. For example, it will not be able to parse the following query correctly:

    What programs run on computers which teach French?

It does have the capability of attaching prepositional phrases to noun phrases contained in the relative clause. For example, the system will treat "from Terrapin" as a modifier of "LOGO" and not of "computers".

    Print the computers which run LOGO from Terrapin!

4.2.5.6 Noise Words

Often users will place on the front of a query a string of words which supplies little semantic meaning. These words, often called noise words, can be ignored. Below are some examples of these phrases:

    [Can you inform me if] LOGO is a good program?
    [Please let us know whether] LOGO is a good program?
    [Tell me] what computers run LOGO?

These phrases are parsed by a separate ATN which is invoked at the start of the parse. A diagram of this ATN (NAME) is contained in Appendix A. This technique has been used successfully in other NL systems (Waltz 1978; Strzalkowski 1983).

4.2.5.7 Relaxation Of Grammatical Rules

The goal of any NL system is to answer as many queries as possible. Since users may type in queries which are ungrammatical, it is important for the system to parse them without enforcing strict grammatical rules. For example, even though the following query contains many grammatical errors, it is still possible to answer it.

    What computer run a Apple good programs?

4.3 Handling Unknown Words

There are many categories of words which may be unknown to the system. These include misspelled words, abbreviations, words which the user thinks may be contained in the database (but are not), and words which are totally unrelated to the application. For a user-friendly system, it is important to try to answer a query even if it contains words not defined in the system. In some cases, it is possible for the system to process the word without additional information from the user. In other cases, the user is required to specify some additional information about the unknown word.
4.3.1 Abbreviations And Synonyms

Both the Active Domain Dictionary and the Global Dictionary contain lists of previously defined abbreviations and synonyms. When such a word is found in a query, the system will substitute the appropriate word into the query. In the following query, the system automatically substitutes "KILOBYTE" for "KB" without prompting the user for verification.

    Print the computers with 128 KB.

In order to inform the user of this substitution, the following message is printed:

    (*** Abbreviation/Synonym *** KB --> KILOBYTE)

4.3.2 The Spelling Corrector

If a word cannot be found in the Active Domain Dictionary or the Global Dictionary, the system tries to determine if the word is misspelled. This is accomplished by comparing the word to a list of frequently misspelled words. If a word from the list substantially matches the unknown word, the system assumes the user misspelled the word and prints out an appropriate message. If the word is not sufficiently similar to any word in the list, the system will try to compare the word with every word in the Active Domain Dictionary. This feature may be turned off if this process takes up too much time.

The Spelling Corrector uses a simple algorithm for determining misspelled words.
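One way such an algorithm might look is sketched below in Python. It is a sketch under stated assumptions, not the thesis implementation: candidates are restricted to known words with the same first letter, the error count is taken to be the optimal string alignment distance (one error for each missing, extra, incorrect, or transposed letter), and the acceptance threshold max_ratio of 0.25, measured against the length of the known word, is a hypothetical value. The names edit_errors and correct_spelling are likewise invented for the example.

```python
def edit_errors(a, b):
    """Count missing, extra, incorrect, and transposed letters between a and b
    (optimal string alignment distance)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # extra letter in a
                          d[i][j - 1] + 1,         # missing letter in a
                          d[i - 1][j - 1] + cost)  # incorrect letter
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposed letters
    return d[-1][-1]


def correct_spelling(unknown, known_words, max_ratio=0.25):
    """Return the closest acceptable known word, or None.

    max_ratio is an assumed threshold on errors per letter of the known word.
    """
    unknown = unknown.upper()
    if not unknown:
        return None
    best = None
    for known in known_words:
        known = known.upper()
        if not known or known[0] != unknown[0]:
            continue  # only words beginning with the same letter are compared
        errors = edit_errors(unknown, known)
        if errors / len(known) <= max_ratio and (best is None or errors < best[0]):
            best = (errors, known)
    return best[1] if best else None
```

For instance, with the assumed word list COMPUTER, COMPANY, PROGRAM, the input "computre" differs from COMPUTER by a single transposition and would be accepted as a misspelling of it.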
The corrector compares known words which begin with the same letter to the unknown word and then checks for possible errors such as missing letters, extra letters, incorrect letters, and transposed letters. If the unknown word can be matched to a known word given a reasonable ratio between the number of errors and the number of letters in the word, the known word is taken to be the intended spelling. Although this algorithm is fairly ad hoc, it has proven to be adequate for finding a large number of misspelled words in numerous test sessions.

4.3.3 The User Interface

If the meaning of the word cannot be found in the Active Domain Dictionary or by the spelling corrector, the system prompts the user for additional information. One of two menus is presented to the user, depending on whether or not the syntactic definition of the word is contained in the Global Dictionary. If the word is not contained in the Global Dictionary, the user is presented with the following menu.

    Unknown Word - "actual word"
    What would you like to do?
    1. Spelling Error! Enter the correct spelling.
    2. Abbreviation! Enter the full word.
    3. Synonym! Enter the replacement word.
    4. Unknown data element! Continue processing.
    5. Ignore the word! Continue processing.
    6. Cancel the query.

In this case the user is expected to choose one of the stated options.
If the word is contained in the Global Dictionary (but not in the Active Domain Dictionary), the user is presented with a menu containing options 3, 4, 5, and 6.

4.3.3.1 Spelling Errors

The user is given the option of entering the correct spelling when the unknown "word" is not found in the Active Domain Dictionary or the Global Dictionary. This option is necessary since the system's spelling corrector is not capable of finding gross spelling errors or spelling errors which occur in words not contained in the Active Domain Dictionary.

4.3.3.2 Abbreviations

The user is also given the option of specifying the lengthened form of an abbreviated word. This is required since the dictionaries may not contain all commonly used abbreviations. This option also allows users to define their own abbreviations for the sake of brevity.

4.3.3.3 Synonyms

In many situations a user may be able to supply a suitable synonym for the unknown word. For example, the following query contains the word "machines", which may not be defined in the Active Domain Dictionary.

    What machines run LOGO?

In this case the user could obtain the desired answer by substituting a word like "computer" for "machines". This feature is very useful if the database schema does not contain a comprehensive list of words which describe the database and its relationships.
4.3.3.4 Unknown Database Elements

One of the most common sources of unknown words is words which users incorrectly assume to be contained in the database. In this case the system should try to predict the intended semantic meaning of the word and form an appropriate response. In the following query:

    Does IBM sell food?

the word "food" is not defined in the Active Domain Dictionary. Since the syntactic definition is contained in the Global Dictionary, the user is expected to select one of the following options.

    1. Synonym! Enter the replacement word.
    2. Unknown data element! Continue processing.
    3. Ignore the word! Continue processing.
    4. Cancel the query.

If the user chooses option #2, the semantic processor will display the following response:

    (Food) is not (a brand/type of a computer) contained in the database.

For unknown database elements which are not contained in the Global Dictionary, the system will also ask the user whether the word is a noun, proper noun, adjective, or adverb. Although this information is not always mandatory for the system to answer the query, it is occasionally useful for parsing the query correctly. In the following query, the user will have to specify that "LOTUS" is both an unknown database element and a proper noun.

    Does LOTUS run on the IBM PC?

Given this information, the system will correctly parse the query and print the following response:

    (LOTUS) is not (a computer program) contained in the database.
4.3.3.5 Ignoring Words

It is possible to adequately answer some queries by ignoring specific unknown words. For instance, if the word "colloid" is ignored in the following query, the system can correctly parse the query and form a response for the remainder of the query.

    Print the colloid chemistry program!

In this case the user will learn that "colloid" is not a valid modifier for "program" (or "chemistry"). Additionally he/she will be given a list of all the chemistry programs.

The system also has the ability to ignore words at the semantic level. This will be discussed in chapter 6.

4.3.3.6 Cancelling The Query

Whenever an unknown word is found in a query, the user is given the option of cancelling the query. In many situations, the user's query may be unintentionally answered when a word cannot be parsed. An example of this is shown in the following query:

    Does LOGO run on the Osborne?

If the system is unable to parse "Osborne", the user is essentially informed that "Osborne" is not a computer (or anything else) contained in the database.

4.3.4 Knowledge Acquisition

The system is capable of acquiring knowledge by storing previous spelling (or typing) mistakes, unknown database elements, and newly defined abbreviations and synonyms. This information is presently only stored for the duration of the current session.
By saving this information for each user, it is possible to build up a dictionary of the user's individual vocabulary which may be reactivated for following sessions.

This information is also very useful for improving the robustness of the system. By examining this information, the DBA can easily enhance the system as well as find omissions in the database schema.

4.4 Summary

The Syntactic Processor has been designed to be domain independent. It is completely data driven. No additional programming is required to adapt it to a new database domain.

The system avoids many types of syntactic ambiguity by using semantic information. For example, the Syntactic Processor is able to intelligently interpret ambiguous and abbreviated forms of compound proper nouns by having the parser call some semantic routines. Ambiguity resulting from certain types of conjunction and modifier attachment is resolved by parsing these phrases at their surface level and resolving any ambiguity during the semantic interpretation.

The system has been enhanced with various user interface routines to aid in the interpretation of a query. For example, if an unknown word is found or if an abbreviated proper noun matches more than one database entry, the system will prompt the user for the required information.
It makes more sense to have the user resolve such problems than to have the system make incorrect assumptions or force the user to restate the query in a different form.

V. SYSTEM DESIGN II: KNOWLEDGE BASE

The main objective in the design of this system is to achieve a high degree of portability. This is achieved by clearly separating all domain dependent information from the domain independent information. The domain dependent information consists of the Database Schema, selected definitions from the Database Schema Library, and the inverted index for the given database. The domain independent information is contained in the General Dictionary and the Global Dictionary. These sources of information are compiled together to form the Active Domain Dictionary, which is the primary source of syntactic and semantic knowledge used by the system.

In this chapter we will examine both the domain independent and domain dependent sources of knowledge used by the system. We will also look at the main benefits of using the relational database model as well as the structure of the test databases used by the system.

5.1 Relational Database Approach

The system is designed as a natural language front-end to a relational database system. There are many advantages in using the relational database model as opposed to the other two popular database models: the hierarchical model and the network model. The main advantage is that the relational database provides a high degree of data independence.
This means that the application program does not need to know how the actual data is stored in secondary (or primary) storage in order to access it (Date 1977). Another major advantage is that it provides a simple view of the data. This allows casual users to easily visualize the structure of the database. Both the network and hierarchical models can become very complex since they often use pointers to model certain types of relationships.

Another reason for choosing this model is that it has a sound theoretical foundation based on mathematical set theory (Date 1977). This theoretical foundation has proven to be useful for both the underlying DBMS and this natural language front-end.

5.2 The Test Databases

The system has been tested using two demonstration databases. One database contains information about computers and programs, and the other database contains information about suppliers, parts, and jobs. Most of the sample queries are taken from the program database. This database consists of the relations PROGRAM, COMPUTER, DESCRIPTOR, and PROGDESC. A diagram of this database is shown in figure 5.1. A complete description of the Supplier/Part/Job database is contained in chapter 8.
    COMPUTER   [CNO MAKE MODEL RAM]
    PROGRAM    [PNO PUBLISHER NAME VERSION CNO RATING COST]
    PROGDESC   [PNO DNO]
    DESCRIPTOR [DNO DSC]

Figure 5.1 - Design of the Program Database

5.3 Relational Database Terminology

In order to proceed with this chapter, it is necessary to understand some concepts and terminology in relational database systems. To this point it has been assumed that the reader understands the terms relation and field. A relation is basically a table of information. Each relation contains a number of attributes called fields. For example, the COMPUTER relation contains the fields CNO, MAKE, MODEL, and RAM. The set of all possible values which a field may contain is called a domain. The domain for the field MAKE is the set of all computer makes (or manufacturers). Each relation must also have a field which uniquely identifies an entry in the database. This field is called the primary key. The field CNO is the primary key of the COMPUTER relation.

Another important concept in the relational database model is the "join". A join represents a semantic relationship between two relations. For example, the PROGRAM and COMPUTER relations contain a relationship which determines what computer runs each program. In order for a join to be possible, the two relations must have a domain in common. The join between the COMPUTER and PROGRAM relations is performed through the computer number (CNO) field, which is contained in both relations.
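As a rough illustration of the join just described, the following Python sketch pairs each PROGRAM row with its COMPUTER row through the shared CNO field. The field names follow Figure 5.1 (only a subset of the PROGRAM fields is used); the sample rows and the `join_on` helper are invented for this sketch and are not part of the system described in this thesis.

```python
# Hypothetical sample rows: field names follow Figure 5.1, values are invented.
COMPUTER = [
    {"CNO": 1, "MAKE": "IBM", "MODEL": "PC", "RAM": 64},
    {"CNO": 2, "MAKE": "APPLE", "MODEL": "IIE", "RAM": 64},
]
PROGRAM = [
    {"PNO": 10, "NAME": "ALPHA", "CNO": 2},
    {"PNO": 11, "NAME": "BETA", "CNO": 1},
    {"PNO": 12, "NAME": "GAMMA", "CNO": 2},
]


def join_on(left, right, field):
    """Equi-join two lists of rows on a field whose domain they share."""
    # Index the right-hand rows by the join field for direct lookup.
    by_value = {}
    for row in right:
        by_value.setdefault(row[field], []).append(row)
    joined = []
    for lrow in left:
        for rrow in by_value.get(lrow[field], []):
            merged = dict(rrow)
            merged.update(lrow)
            joined.append(merged)
    return joined


# Each program is paired with the one computer it runs on.
rows = join_on(PROGRAM, COMPUTER, "CNO")
names_on_apple = [r["NAME"] for r in rows if r["MAKE"] == "APPLE"]
```

Because CNO is the primary key of COMPUTER, every PROGRAM row matches at most one COMPUTER row, so the joined result has one row per program.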
It is also useful to examine the restrictions placed on the relationships between relations. In the program database, a program is restricted to run on one computer, while a computer may run many programs. This is called a "one-to-many" relationship. The other main relationship in the program database is between the programs and the descriptors. This is termed a "many-to-many" relationship since a program may have many descriptors, and a descriptor may be associated with many programs. In order to represent a many-to-many relationship in the database, a secondary relation is required. The PROGDESC relation is used to represent this relationship. Note that this relation contains a domain in common with both the PROGRAM relation (PNO) and the DESCRIPTOR relation (DNO). This information is useful for resolving ambiguities arising from conjunction and will be dealt with in Chapter 6.

5.4 Specification Of Domain Dependent Information

The domain dependent information consists of the database schema, the database schema library, and the inverted index. Each of these will be discussed in detail in the following sections.

5.4.1 The Database Schema

The database schema contains a structured definition of the database.
It consists of three components: the database definition, a list of relation verbs which represent relationships in the database, and information on how different relations can be connected, or joined together.

5.4.1.1 The Database Definition

The database definition is designed to model the structure of the database. It defines considerable lexical and semantic information about the domain. The database definition is syntactically similar to the specification of the actual database. Both consist of a number of attributes which contain various types of information about each relation and field in the database. This format makes the job of implementing and maintaining a database easy and concise.

The database definition consists of various attributes which describe each relation and field. Information describing the database itself is contained in the FIELDS, IDENTIFIERS, KEY, and COMMENT attributes. Information about the contents of the database is stored in the CONTENTS and MEM-OF attributes. Linguistic information is contained in the L-NOUN, L-DESC, L-PREP, and L-ABR attributes. Semantic and syntactic information is defined in the CATEG, TYPE, and LTYPE attributes. Each of these attributes will be described in the following sections. The database schema for the program database is in Appendix C.
5.4.1.1.1 Relation Specification

A definition for each relation in the database must be included in the database definition. Although most of the following attributes are optional, the FIELDS and COMMENT attributes must be stated for every relation.

FIELDS contains a list of the fields contained in the given relation. This is used to inform the system which fields are in a relation.

The KEY attribute specifies the primary key of the relation. The key is needed in the generation of specific types of database queries.

The IDENTIFIERS attribute contains a list of database fields which are commonly used to distinguish one entry from another. This is helpful in answering questions which do not explicitly specify the fields to be printed. The IDENTIFIERS for the program relation are:

    (IDENTIFIERS (PNO PUBLISHER NAME))

These three fields will be used in the response for queries like:

    What programs run on the IBM PC?

The L-NOUN attribute lists nouns which are used to identify the relation. The following definition tells the system that any of the listed nouns may be used to identify the PROGRAM relation.

    (L-NOUN (PROGRAM PACKAGE COURSEWARE SOFTWARE))

L-DESC specifies nouns and adjectives which can be used as classifiers (or describers) for the nouns in L-NOUN. For example, the following definition specifies that either "computer" or "educational" may precede "program" in a query.
    (L-DESC (COMPUTER EDUCATIONAL))

These words supply no semantic meaning except when the head noun (program) has more than one possible meaning. In this case the appropriate meaning can be determined by the given describer.

The L-PREP attribute lists the prepositions which may be used in a prepositional phrase where the noun phrase represents the given relation. For example, the following phrases show how the prepositions "on" and "in" are normally used in prepositional phrases containing descriptors.

    programs in French
    software on chemistry

This knowledge is useful for helping to determine the meaning of unknown words which follow prepositions.

The L-ABR attribute contains abbreviations which are associated with the relation. For example, if one wishes to define "pgm" as an abbreviation for "program", the following statement can be included:

    (L-ABR (PGM PROGRAM))

The COMMENT attribute contains a phrase which describes what can be contained in the relation. This is termed the "intension" of the relation in database terminology. This comment is used for many purposes, including forming responses, prompting in ambiguous queries, and even documenting the database schema itself. For example, the COMMENT for the program relation is:

    (COMMENT (a computer program))

If the system needed to form a response to inform the user that "LOTUS" is not a computer program, it would use this phrase as follows:

    (LOTUS) is not (a computer program) contained in the database.
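The use of the relation attributes above can be sketched in Python as follows. The attribute names and values are the ones quoted in the text; the dictionary layout and the two helper functions are assumptions made for illustration and do not reproduce the thesis's LISP implementation.

```python
# Hypothetical in-memory form of part of the PROGRAM relation definition.
# Attribute names and values are quoted from the text; the dict layout
# itself is an assumption made for this sketch.
PROGRAM_DEF = {
    "IDENTIFIERS": ["PNO", "PUBLISHER", "NAME"],
    "L-NOUN": ["PROGRAM", "PACKAGE", "COURSEWARE", "SOFTWARE"],
    "L-DESC": ["COMPUTER", "EDUCATIONAL"],
    "COMMENT": "a computer program",
}


def names_relation(noun, relation_def):
    """True if the noun is listed under L-NOUN for the relation."""
    return noun.upper() in relation_def["L-NOUN"]


def not_found_response(word, relation_def):
    """Build a 'not contained in the database' message from COMMENT."""
    return f"{word} is not {relation_def['COMMENT']} contained in the database."


message = not_found_response("LOTUS", PROGRAM_DEF)
```

Keeping the response wording in COMMENT means the same response routine can serve any relation without domain-specific text in the program itself.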
The CATEG attribute specifies the "semantic category" of the entity represented by the relation. A list of the semantic categories is shown in figure 5.2.

    CATEGORY         DESCRIPTION                  EXAMPLES
    HUMAN            any human                    student, employee
    GROUP            a group of people            organization, company
    PHYSICAL OBJECT  any concrete object          computers, parts
    ABSTRACT OBJECT  any abstract object          descriptors, ideas
    EVENT            something that takes place   sale, meeting

Figure 5.2 - Semantic CATEGORIES

These semantic categories are derived from McLeod's Semantic Database Model (McLeod 1980). He uses them to model the semantic structure of the data within the database. A natural extension is to apply them to the semantics of natural language. For example, the question noun "who" or the noun "anyone" can represent either a human or a group. CATEG is also used to define relationships between the relations and fields of the database. This will be fully explained in the Database Schema Library section.

5.4.1.1.2 Field Specification

The database definition also contains a list of attributes for every field in each relation. Most of the attributes used to define a field are similar to those which define a relation. In addition to these attributes, the TYPE, L-TYPE, CONTENTS, and MEM-OF attributes are used in the definition of fields.

Each field will normally have an L-NOUN attribute to define the nouns which can be used to identify the field in a query.
The L-DESC, L-PREP and L-ABR attributes can also be used for defining fields.

The CONTENTS attribute describes the actual information contained in the database. It also specifies whether the field needs to be inverted or not. If the field contains integers or real numbers, CONTENTS is set to either INTEGER, REAL, or MONEY. These fields are not inverted since their domain is known. If the field contains characters, the field will normally be inverted. For example, the following statement will cause the given field to be inverted.

    (CONTENTS (INVERT NPR))

The "NPR" states that the values of the field will be treated as proper nouns. This will be discussed further in Section ***.

Some fields contain a key to another relation. For example, the PROGRAM relation contains the field CNO which specifies which computer the given program runs on. In this case the CONTENTS attribute will be set to the following:

    (CONTENTS (KEY PROGRAM))

In some cases the domain of a field may be restricted to a specific set of values. For example, the field PART COLOUR may be restricted to contain either red, blue, green, or yellow. When the domain of the field is known, the set of possible values may be defined in the database schema. This is achieved by using the MEM-OF (member of) attribute as shown below:

    MEM-OF ((RED) (BLUE) (GREEN) (YELLOW))

The MEM-OF attribute removes the need for inverting the given field.
It also allows the system to identify each possible value even if it is not contained in the actual database.

Coding of database elements is often used in databases to save file space and make the job of entering data easier. For example, the PROGRAM RATING field is coded by a number from 1 to 5, where 1 represents "TERRIBLE" and 5 represents "EXCELLENT". The CODE-AS attribute can be used in conjunction with the MEM-OF attribute to define coded information. A partial definition of the PROGRAM RATING field is shown below:

    MEM-OF ((EXCELLENT CODE-AS (* = 5))
            (GOOD CODE-AS (* > 3))

Note that in this definition "EXCELLENT" is coded as 5 while "GOOD" is coded as a number greater than 3 (e.g. 4 or 5).

One of the most important attributes in the definition of a field is the TYPE attribute. This attribute is used to retrieve previously stored definitions contained in the database schema library. For example, many databases may have a field which contains money. By simply stating a field to be of "TYPE MONEY", the associated semantic definitions for words such as "how much", "money", and "amount" will automatically be defined. The TYPE attribute is not as rigidly defined as the CATEG attribute. Although it is an optional attribute, most fields will normally be given a particular type. Figure 5.3 contains a list of the presently defined TYPES. These TYPES have been developed for use in the two test databases and are not meant to be complete.
A commercial system would contain many more definitions.

    TYPE      DESCRIPTION            EXAMPLES
    INTEGER   an integer             SIN, age
    REAL      a real number          batting average
    MONEY     an amount of money     salary, cost
    GNAME     a general name         program name, part name
    CNAME     a company name         publisher, manufacturer
    Q-MASS    a mass quantity        weight of a part
    Q-QTY     a countable quantity   number of parts
    LOCATION  a location             city, office location
    RATING    any concrete object    rating of a program
    COLOUR    a common colour        colour of a part

Figure 5.3 - TYPES

The L-TYPE, or "Linguistic Type", attribute is also used to extract linguistic definitions of words from the database schema library. This attribute is intended to give a more specific definition to fields than the TYPE attribute. For example, a field containing money (TYPE MONEY) may be set to salary (LTYPE SALARY), cost (LTYPE COST), or some other defined L-TYPE representing money. This attribute is optional and is usually used in conjunction with the TYPE attribute. A partial list of L-TYPES is shown in figure 5.4.

    L-TYPE   DESCRIPTION             EXAMPLE
    RNUMBER  a key or index          program number
    CITY     a city, town, etc.      supplier's city
    COST     the cost of something   program cost
    SALARY   an employee's salary    salary

Figure 5.4 - L-TYPES

Some fields may contain a CATEG attribute. This occurs when a field represents a different entity than the one represented by the relation.
For example, the PUBLISHER field in the PROGRAM relation will be defined as a GROUP, even though the PROGRAM relation is defined as a PHYSICAL OBJECT.

5.4.1.2 Verb Specification

The database schema also contains the definition of domain dependent verbs. These verbs usually represent semantic relationships between the relations and fields in the database. For example, the verb "run" represents a relationship between a computer and a program in the program database, and the verb "publish" represents a relationship between a publisher and a program. Each verb definition consists of a "verb frame" (VFRAME) and possibly some information on how the relations are joined together (JOININFO and JOINTYPE). A verb frame consists of the verb and a number of slots. Each slot represents a syntactic component of the query, such as the subject (SUBJ) or the direct object (DOBJ). The slots are associated with a particular relation or field in the database. The following is a definition for the verb "run".

    (RUN VFRAME   (((SUBJ (RELATION COMPUTER))
                    (DOBJ (RELATION PROGRAM))))
         JOININFO ("COMPUTER.CNO = PROGRAM.CNO")
         JOINTYPE ((COMPUTER 1) (PROGRAM MANY)))

When the solution of a query requires that more than one relation be accessed, it is necessary to join the given relations. This is achieved by specifying constraints on fields common to each relation.
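To make the role of the verb definition more concrete, the sketch below renders the "run" definition as a Python dictionary and looks up the join constraint when the subject and direct object of a query refer to the relations named in a frame. The dictionary layout and the `match_frame` function are assumptions for illustration only; the thesis's own frame-matching procedure is described in chapter 6 and may differ considerably.

```python
# Hypothetical Python rendering of the RUN definition quoted above.
RUN = {
    "VFRAME": [
        {"SUBJ": ("RELATION", "COMPUTER"), "DOBJ": ("RELATION", "PROGRAM")},
    ],
    "JOININFO": ["COMPUTER.CNO = PROGRAM.CNO"],
    "JOINTYPE": {"COMPUTER": "1", "PROGRAM": "MANY"},
}


def match_frame(verb_def, subj_relation, dobj_relation):
    """Return the verb's join constraints if some frame accepts the slots.

    Simplified assumption: a frame matches when its SUBJ and DOBJ slots
    name exactly the relations found for the query's subject and object.
    """
    for frame in verb_def["VFRAME"]:
        if (frame["SUBJ"] == ("RELATION", subj_relation)
                and frame["DOBJ"] == ("RELATION", dobj_relation)):
            return verb_def["JOININFO"]
    return None


# "What computers run ...?" : subject COMPUTER, direct object PROGRAM.
constraints = match_frame(RUN, "COMPUTER", "PROGRAM")
```

A query whose slots do not fit any frame yields `None` here, which a caller could treat as a failed interpretation.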
Since it is possible that two relations may be joined together in more than one way, it is necessary to specify the join which is used by a specific verb. This information is specified by the JOININFO attribute. The definition of "run" given earlier shows how the PROGRAM and COMPUTER relations are joined when that verb is used.

It is useful in the semantic interpretation of a query to know about any restrictions on how two relations are related. For instance, in the program database, each program is restricted to run on one computer, while one computer may run many programs. In database terminology this is known as a one-to-many relationship. This information is stored in the JOINTYPE attribute.

5.4.1.3 Default Joins

In some situations two relations will need to be joined together where the information on how to perform the join is not explicitly stated. This is common in noun phrases when a modifier and the head noun refer to different relations. This is shown in the following query, where the modifier "business" represents the DESCRIPTOR relation and "programs" represents the PROGRAM relation.

    Print the business programs.

In this case the system uses the default join, which specifies the most likely way two relations are joined together. These joins are stated by the DBA for every likely combination of relations in the database.
If a default join is not given, it is assumed that one relation cannot modify the other. In this case the query will not be answered.

Each default join also has a "semantic association" factor associated with it. This factor, the semantic closeness of the two relations, represents how closely associated the entities represented by the relations are to each other. For example, the relations COMPUTER and PROGRAM are more closely related to each other than the COMPUTER and DESCRIPTOR relations. This factor is used in attaching prepositional phrases and resolving semantic ambiguities. Both of these topics are discussed in Chapter 6.

There has been considerable work done in developing algorithms which automatically determine a complex join between two relations (Carlson and Kaplan 1976; Sagalowicz 1977). This would allow the DBA to specify only the direct joins between two relations instead of specifying all likely joins with their associated semantic closeness factors. The problem with these algorithms is that they rely heavily on how well the particular database structure models the actual data. They also assume that the semantic closeness of two relations can be determined by the number of links performed to join the relations together. Although these algorithms work for some cases, they can also fail. For example, the phrase:

    Dr. Lee's students

may be interpreted as:

    (1) the students in the same department as Dr. Lee, or
    (2) the students enrolled in courses taught by Dr. Lee.

If the given database required the system to join the relations STUDENT and PROFESSOR through the DEPARTMENT relation for interpretation (1), and through the STUDENT-COURSE and COURSE relations for interpretation (2), the system would infer that (1) is the intended interpretation, since it requires fewer links. However, the most appropriate interpretation is (2). This example shows the need for the DBA to supply rich information on how various relations can be joined together.

5.4.2 Database Schema Library

The Database Schema Library contains definitions of fields and relationships which are commonly used in many different databases. This information is useful since many databases contain similar types of data. For instance, an entity such as a person may be defined in a database containing employees, students, instructors, customers, criminals, consultants, or athletes. Most entities also have many common properties or attributes associated with them. For example, physical objects (parts, computers, programs, etc.) may have properties such as weight, colour, and cost associated with them.
Since it is likely that such definitions will be used in many different databases by many users, it is worthwhile to include a rich library of syntactic and semantic definitions. This is useful for two reasons. First, it makes the task of implementing standard databases such as inventories and personnel records quick and easy. Second, it enforces standard and more complete definitions across different databases. Hence the definition of an employee would be the same no matter who implemented the database.

The definitions in the Database Schema Library are retrieved by the CATEG, TYPE, and L-TYPE attributes in the Database Schema. For example, the PROGRAM relation contains the field COST. By specifying that the L-TYPE of this field is COST, the nouns "cost", "price", and "value" are automatically defined as identifiers (L-NOUN) for this field. The verb "cost" is also defined as a relation verb where the subject refers to PROGRAM and the direct object is the COST. A sample portion of the Database Schema Library is contained in Appendix D.

5.4.3 Inverted Index

The Inverted Index is a table of most of the words contained in the database. This index enables the system to determine which words belong to each field in the database. The system does not have to invert every field in order to determine the values of each field.
For example, fields containing numbers or words from a defined set (MEM-OF) do not need to be inverted since the possible values of the field can be predetermined. The INVERT option of the CONTENTS attribute in the database schema is used to identify database fields which are to be inverted. For example, the CONTENTS attribute for the PROGRAM PUBLISHER field is:

    (CONTENTS (INVERT NPR))

In this case each value in the PROGRAM PUBLISHER field will be treated as a proper noun. For example, the following information will be stored in the Active Domain Dictionary:

    (TERRAPIN NPR ? ELM-OF ((PROGRAM PUBLISHER)))

Since the word "TERRAPIN" is not contained in the Global Dictionary, the morphological code is not known. In this case the code is set to a question mark (?). In addition to this information, a list of the field names is stored under the attribute "ELM-OF" (element of).

One of the main arguments against the use of an inverted index is the problem of keeping it up-to-date as new data is added to the database. One possible solution is to recreate the inverted index periodically. This can cause some problems since the system will not be able to identify some new entries until a new inverted index is generated. The other alternative is to update the inverted index as the new data is added. Although there is some additional overhead with this method, it will ensure that the inverted index is always up-to-date.
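The two maintenance strategies can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the index maps each word to the (relation, field) pairs it occurs in, values are split into words on whitespace, and the function names and all sample rows other than the TERRAPIN publisher value are invented. It is not the thesis's implementation.

```python
from collections import defaultdict


def add_row(index, row, relation, fields):
    """Incremental update: index the words of one newly added row."""
    for field in fields:
        for word in str(row[field]).upper().split():
            index[word].add((relation, field))


def build_inverted_index(rows, relation, fields):
    """Periodic rebuild: index every row of a relation from scratch."""
    index = defaultdict(set)
    for row in rows:
        add_row(index, row, relation, fields)
    return index


# Only fields marked INVERT in the schema would be passed in `fields`.
programs = [{"PNO": 1, "PUBLISHER": "TERRAPIN", "NAME": "ALPHA"}]
index = build_inverted_index(programs, "PROGRAM", ["PUBLISHER", "NAME"])

# A new entry becomes recognizable as soon as it is added.
new_row = {"PNO": 2, "PUBLISHER": "ACME", "NAME": "BETA"}
add_row(index, new_row, "PROGRAM", ["PUBLISHER", "NAME"])
```

The rebuild simply applies the incremental step to every row, so the two strategies differ only in when the work is done.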
5.5 Specification Of Domain Independent Information

Apart from the program components and the grammar, there are two sources of domain independent information. They are the General Dictionary and the Global Dictionary. These will be discussed in the following sections.

5.5.1 General Dictionary

The General Dictionary contains definitions of words commonly used in queries regardless of the particular database domain. These words come from every syntactic category including nouns (information, description), determiners (the, a), question pronouns (who, what), personal pronouns (me, you), relative pronouns (that, which), prepositions (at, on), conjunctions (or, and), verbs (show, is), and adverbs (please, immediately). The dictionary contains a syntactic description of these words. It also contains semantic information for some nouns, question pronouns and verbs. For example, the semantic information associated with the question pronoun "who" is:

    SEM ((CATEG HUMAN) (CATEG GROUP))

This informs the system that "who" can refer to any human or group of humans (company, etc.). The definition of the noun "information" is:

    SEM ((GENERAL *ANY*))

This tells the system that "information" can refer to any field(s) or relation(s) contained in the database.

The semantic definition of verbs is more complex.
Domain independent verbs fall into three categories: command verbs (show, count), relation verbs (be, have), and auxiliary verbs (do, is). The General Dictionary contains semantic information for both command verbs and relation verbs. The meaning of auxiliary verbs is defined within the semantic processor and will be explained in the next chapter.

Two types of command verbs are presently dealt with by the system. These are verbs which request information in a report format (show, print), and verbs which request the system to perform a function (count). These verbs are identified by the semantic markers COM-VERB and COUNT-VERB respectively. These verbs also require verb frames. The verb frame for the verb "show" is shown below.

    (SHOW VFRAME ((SUBJ OPT (GENERAL *YOU*))
                  (DOBJ (GENERAL *ANY*))
                  (IOBJ OPT (GENERAL *ME*))))

This definition allows the system to answer queries like:

    You show me the computers.
    Show the good programs to me.

A complete description of verb frames and their function is given in chapter 6. A portion of the General Dictionary is listed in Appendix E.

5.5.2 Global Dictionary

The Global Dictionary is a large dictionary which contains the syntactic category and other syntactic features of a large number of words.
This dictionary supplies the syntactic definitions for words contained in the Database Schema, the Database Schema Library, the inverted index, and the parser (for unknown words). The main benefit of including such a dictionary in the system is to free the DBA from specifying numerous syntactic definitions of words for each domain. It also ensures these definitions are correct and available for every domain.

In the current system this dictionary only contains the words which are used in the two test domains. A more reasonable dictionary for commercial applications may contain over 100,000 words. Since the majority of these words will not be used in a given domain, the dictionary could be stored on either a disk or even a tape. Only the words contained in the domain definition are extracted and stored in fast access memory (disk or main). A sample portion of the Global Dictionary is listed in Appendix F.

5.6 Compiling The Information: Active Domain Dictionary

The Active Domain Dictionary is the primary source of syntactic and semantic knowledge used by the system. It is compiled from the domain independent information (General Dictionary and selected words from the Global Dictionary) and the domain dependent information (Database Schema, Database Schema Library, and the inverted index). This syntactic and semantic information is stored on LISP's property lists for each word.
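
As a rough illustration of this compilation step, the following Python sketch merges per-word property entries from several sources into a single table, in the spirit of LISP property lists. The function name `compile_active_dictionary`, the sample entries, and the merge rule (values for a repeated property are collected without duplicates) are assumptions made for illustration; the system described here was written in LISP and its actual compilation procedure is not shown in this section.

```python
from collections import defaultdict

def compile_active_dictionary(*sources):
    """Merge word -> {property: value} tables into one table.

    Each word ends up with a single property table, loosely mirroring a
    LISP property list. Values for a repeated property are kept in
    insertion order without duplicates (an assumed merge rule).
    """
    active = defaultdict(dict)
    for source in sources:
        for word, props in source.items():
            for prop, value in props.items():
                values = active[word].setdefault(prop, [])
                if value not in values:
                    values.append(value)
    return dict(active)

# Hypothetical sample entries, modelled on the examples in the text.
general_dictionary = {
    "WHO": {"SEM": (("CATEG", "HUMAN"), ("CATEG", "GROUP"))},
}
global_dictionary = {
    "COMPUTER": {"N": "S"},
    "GOOD": {"ADJ": "*", "ADV": "*"},
}
database_schema = {
    "COMPUTER": {"L-NOUN": ("RELATION", "COMPUTER")},
    "GOOD": {"MEM-OF": ("TYPE", "RATING")},
}

active = compile_active_dictionary(
    general_dictionary, global_dictionary, database_schema
)
```

In this sketch every source is merged in full. Selecting only the Global Dictionary words that a domain actually uses would be an additional filtering step before the merge.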

Figure 5.5 contains a sample of some of the entries in the Active Domain Dictionary used for the Program database. A larger sample is contained in Appendix G.

    WORD      PROPERTY   VALUES
    COMPUTER  N          S
              L-NOUN     ((RELATION COMPUTER))
              L-DESC     ((RELATION PROGRAM))
              NPR        ?
              PART-OF    '((COLOUR COMPUTER))
    FRENCH    N          S
              ELM-OF     ((FIELD DESCRIPTOR DSC))
              PART-OF    ((FRENCH TEACHER))
    GOOD      ADJ        *
              ADV        *
              MEM-OF     (TYPE RATING)
              CODE-AS    (* > 3)
    RUN       V          ES
              FEATURES   (REL-VERB TRANS PASSIVE INTRANS)
              VFRAME     (((SUBJ ...

    Figure 5.5 - Partial Active Domain Dictionary

Although the Active Domain Dictionary is well suited for use by the system, it is not suitable for use by the DBA. In order to make any changes to the specification of a database, the DBA can make changes to the Database Schema and then recompile the Active Domain Dictionary. Hence the DBA only needs to deal with the structured representation of the database contained in the Database Schema.

VI. SYSTEM DESIGN III - SEMANTIC INTERPRETATION

The job of the semantic processor is to take the deep structure produced by the syntactic processor and transform it into a Standard Query Representation (SQR). The SQR will then be transformed into a database query in the desired query language by the database interface routine.

6.1 Interpreting The Deep Structure

The initial task in interpreting the deep structure is to determine the database relations or fields represented by each phrase in the query. Once this process is done, the system will try to match the deep structure to a verb frame.
During this process, various types of semantic ambiguity may be resolved.

6.1.1 Associating Phrases With Relations And Fields

The first step in the semantic interpretation of the deep structure is to determine every possible field and relation that each noun phrase, prepositional phrase, adverb, and predicate may represent with respect to the database. Once these relations and fields are determined, the result is stored in the SEMREG register for later use.

Adverbs and adjectives which act as predicates (complements) are the simplest cases since they consist of only one word. These words are normally defined as database elements of specific fields. In the query:

    What programs are good?

the adjective "good" is defined as a member of (MEM-OF) the field PROGRAM RATING. This field will then be stored in the SEMREG register.

Prepositional phrases and noun phrases are more complicated since they usually contain more than one word. In this case the semantic interpreter checks the syntactic category of the head noun of the given phrase. Using this information, it will determine what relations and fields the word identifies by examining the semantic roles (L-NOUN, ELM-OF, MEM-OF, CATEG, TYPE, LTYPE) of the word. For example, in the phrase:

    What programs...

the noun "program" represents the relation PROGRAM since it is defined as a noun which identifies the relation (L-NOUN).
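
The lookup described above can be pictured with a small Python sketch. The dictionary layout and the helper `candidate_meanings` are hypothetical; only the role names and the two example words ("program", "good") come from the text.

```python
# Semantic roles that tie a word to a relation or field, as named in the text.
SEMANTIC_ROLES = ("L-NOUN", "ELM-OF", "MEM-OF", "CATEG", "TYPE", "LTYPE")

# Hypothetical dictionary fragment for the program database.
ACTIVE_DICTIONARY = {
    "PROGRAM": {"L-NOUN": [("RELATION", "PROGRAM")]},
    "GOOD": {"MEM-OF": [("FIELD", "PROGRAM", "RATING")]},
}

def candidate_meanings(word, dictionary=ACTIVE_DICTIONARY):
    """Return every relation or field the word may stand for.

    The result plays the part of the SEMREG register contents; an
    undefined word yields an empty list.
    """
    entry = dictionary.get(word.upper(), {})
    candidates = []
    for role in SEMANTIC_ROLES:
        for meaning in entry.get(role, []):
            if meaning not in candidates:
                candidates.append(meaning)
    return candidates

semreg_program = candidate_meanings("program")
semreg_good = candidate_meanings("good")
```

A word with several entries under these roles would simply return several candidates, which is the kind of ambiguity the later matching steps have to resolve.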

The proper noun "Apple" in the phrase:

    the Apple

could refer to either the field COMPUTER MAKE or PROGRAM PUBLISHER since it is a database element (ELM-OF) of each of these fields.

Question pronouns, such as "What", are slightly more complex when they appear as head nouns. In this case the system notes that "WHAT" may represent the semantic categories (CATEG) PHYSICAL OBJECT, ABSTRACT OBJECT, and EVENTS. In the program database, "WHAT" may refer to the physical objects PROGRAM or COMPUTER, or the abstract object DESCRIPTOR. Hence the SEMREG register will then be set to these three relations.

The semantic interpreter also examines the describers of the head noun in order to help resolve any semantic ambiguity. For example, in the phrase:

    What IBM things

the word "thing" may represent any physical object (COMPUTER or PROGRAM) in the database. The system will then find that "IBM" is an element of the field "COMPUTER MAKE", and conclude that the given phrase refers to computers and not programs.

As shown above, it is possible for a phrase to refer to more than one field and/or relation. This ambiguity can be resolved by either analysing the phrase's role with respect to the main verb, or by prompting the user for the intended interpretation. Both of these cases will be dealt with in the following sections.

6.1.2 Verb Frames

One of the main processes in the semantic interpretation of a query is determining whether a verb frame suitably matches the deep structure.
Every verb which can act as a main verb must have at least one verb frame associated with it or be defined as a synonym for a verb which does.

6.1.2.1 Verb Frame Structure

A verb frame consists of a verb followed by a number of slots. Each slot refers to a syntactic component in the query which plays a semantic role with respect to the verb. Figure 6.1 lists the syntactic components which can act as slots.

    Syntactic
    Category   Description
    SUBJ       Subject of a sentence
    PRED       Predicate (or complement)
    DOBJ       Direct Object
    IOBJ       Indirect Object
    MOD        Modifiers of the main verb (Adverbs)
    POBJS      Prepositional phrases required by the co-occurrence
               restrictions of certain verbs (eg. double object
               verbs: sell, send)

    Figure 6.1 - Verb Frame Slots

Each slot has a semantic classification associated with it. This semantic classification determines what each syntactic phrase can represent with respect to the database. Figure 6.2 contains a list of the semantic classifications defined in the system.

    Semantic
    Classification   Example Definition      Example Phrase
    RELATION         (RELATION PROGRAM)      What programs run...
    FIELD            (FIELD COMPUTER RAM)    How much memory...
    CATEG            (CATEG HUMAN)           Who is...
    TYPE             (TYPE LOCATION)         Where is...
    L-TYPE           (L-TYPE CITY)           What cities are...
    GENERAL          (GENERAL *YOU*)         You print...
                     (GENERAL *ME*)          Show me the...
                     (GENERAL *ANY*)         Print anything.

    Figure 6.2 - Semantic Classifications for Verb Frame Slots

An example of a verb frame for the verb "publish" is shown below:

    (PUBLISH VFRAME ((SUBJ (FIELD PROGRAM PUBLISHER))
                     (DOBJ (RELATION PROGRAM))))

This definition will allow the system to process queries such as:

    Who publishes LOGO?
    Does TERRAPIN publish good programs?
    Is LOGO published by TERRAPIN?

It is also possible to define verb frames which contain optional slots. This allows a specified phrase to be omitted from a query without having to specify a separate verb frame. For example, both the subject and indirect object are defined as optional (OPT) for the verb "show".

    (SHOW VFRAME ((SUBJ (GENERAL *YOU*) OPT)
                  (DOBJ (GENERAL *ANY*))
                  (IOBJ (GENERAL *ME*) OPT)))

Below are some example queries which will match this verb frame.

    Show the good programs to me.
    You show me the good programs.
    Show the good programs.

When prepositional phrases appear in a verb frame, it is possible to state the prepositions which are acceptable in a query. For example, the following verb frame states that either the preposition "on" or "with" may appear in the stated phrase.

    (RUN VFRAME ((SUBJ (RELATION PROGRAM))
                 (PP (ON WITH) (RELATION COMPUTER))))

This definition will match the following queries:

    What runs on the IBM PC?
    Does LOGO run with the Macintosh computer?

It is also possible to define a prepositional phrase verb frame slot which will not place any restrictions on which prepositions can be matched.
This is done by replacing the preposition definition by an asterisk (*).

This structure for verb frames is similar to Fillmore's Case Grammar, except that the slots correspond to syntactic components (Subject, Direct object, etc.) rather than the traditional cases (Recipient, Actor, etc.). The main motivation for applying a case grammar is that it is often possible to determine the intended role of a phrase without knowing the full meaning of the words in the phrase. This ability is useful for processing words which are not defined in the system. It will be shown in this chapter that such processing can also be done by using verb frames. The big advantage of using verb frames over the traditional case system is that they do not require a person with a strong linguistic background to design them.

6.1.2.2 Matching The Deep Structure To A Verb Frame

The process of determining whether a verb frame matches the deep structure is accomplished by systematically comparing the possible contents of each slot in the verb frame with the corresponding phrase in the deep structure. The possible contents for each phrase in the deep structure are stored in the SEMREG register, while the possible contents of each slot in the verb frame are defined by the semantic classification.
This matching process is normally done by taking the intersection of the fields defined by SEMREG and the semantic classification. For example, in the query:

    Who publishes LOGO?

"who" can refer to either a PROGRAM PUBLISHER or a COMPUTER MAKE. However the verb frame for the verb "publish" states that the subject must be a PROGRAM PUBLISHER. Hence the subject is taken to be a PROGRAM PUBLISHER. "LOGO" is then matched as a PROGRAM since it is contained in the field PROGRAM NAME, which is defined as an IDENTIFIER for the relation PROGRAM.

Once a verb frame has been successfully matched to the deep structure, the given verb frame is stored in the VFRAME register in the deep structure. If any ambiguity resulting from a phrase referring to more than one relation or field can be resolved by the match, the SEMREG register is updated. If the query cannot be matched to a verb frame, the system is unable to answer the query and an appropriate error message is printed out.

It is possible to have more than one verb frame defined for a given verb. This allows a verb to have more than one meaning. In the program database, the verb "run" has the following three definitions:

    COMPUTERS RUN PROGRAMS
    PROGRAMS RUN ON COMPUTERS
    PROGRAMS RUN RATING (eg. good)

Although the meanings of these three definitions are similar, it is possible to have completely different meanings such as:

    ATHLETES RUN RACES, or
    TRAIN RUNS FROM LOCATION TO LOCATION.
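
The intersection step, and the search over several frames for one verb, can be sketched in Python as follows. The data layout, the functions `match_frame` and `find_frame`, and the use of sets for SEMREG contents are assumptions made for illustration. Optional slots and preposition lists are not modelled, and "LOGO" is assumed to have already been mapped from the PROGRAM NAME identifier to the relation PROGRAM.

```python
# A verb frame maps each slot name to the set of meanings the slot accepts.
# This layout is hypothetical; the slot contents follow the frames in the text.
FRAMES = {
    "PUBLISH": [
        {"SUBJ": {("FIELD", "PROGRAM", "PUBLISHER")},
         "DOBJ": {("RELATION", "PROGRAM")}},
    ],
    "RUN": [
        # COMPUTERS RUN PROGRAMS
        {"SUBJ": {("RELATION", "COMPUTER")},
         "DOBJ": {("RELATION", "PROGRAM")}},
        # PROGRAMS RUN ON COMPUTERS
        {"SUBJ": {("RELATION", "PROGRAM")},
         "PP": {("RELATION", "COMPUTER")}},
    ],
}

def match_frame(frame, semreg):
    """Intersect each slot's accepted meanings with the phrase's candidates.

    `semreg` maps slot names to the candidate meanings of the phrase in
    that position. Returns the narrowed candidates per slot, or None if
    the phrases and slots do not line up or some slot has no overlap.
    """
    if set(frame) != set(semreg):
        return None
    resolved = {}
    for slot, accepted in frame.items():
        overlap = accepted & semreg[slot]
        if not overlap:
            return None
        resolved[slot] = overlap
    return resolved

def find_frame(verb, semreg, frames=FRAMES):
    """Return the narrowed candidates for the first frame of `verb` that matches."""
    for frame in frames.get(verb, []):
        resolved = match_frame(frame, semreg)
        if resolved is not None:
            return resolved
    return None

# "Who publishes LOGO?": "who" is ambiguous until the frame is applied.
who_publishes = find_frame("PUBLISH", {
    "SUBJ": {("FIELD", "PROGRAM", "PUBLISHER"), ("FIELD", "COMPUTER", "MAKE")},
    "DOBJ": {("RELATION", "PROGRAM")},
})

# A program subject with a computer in a prepositional phrase selects
# the second RUN frame.
programs_run = find_frame("RUN", {
    "SUBJ": {("RELATION", "PROGRAM")},
    "PP": {("RELATION", "COMPUTER")},
})
```

Returning the narrowed sets corresponds to updating SEMREG after a successful match; a result of `None` corresponds to the case where no verb frame fits and an error is reported.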

When more than one verb frame is defined for a verb, there is a possibility that more than one can be applied to a given query. In order to match the query to the most appropriate verb frame, two separate attempts, or passes, are made.

The first pass requires the deep structure to exactly match the verb frame. In this case every verb frame slot must match with the corresponding register in the deep structure. Phrases which do not correspond to a verb frame slot are not permitted in this pass.

If a verb frame does not exactly match the query, a second pass is made. This pass allows phrases referring to optional (OPT) slots to be omitted, and allows slots with more than one possible interpretation to be matched. It also allows some phrases in the deep structure to be left unmatched (Taylor and Rosenberg 1975). In some cases these phrases may be ignored. If a prepositional phrase is not matched to a verb frame slot, it can be treated as a modifier of a noun phrase. Each of these topics will be discussed in the following sections.

6.1.3 Resolving Multiple Matched Slots

It is possible that a phrase may have more than one possible interpretation even after a verb frame has been matched. In this case the user is prompted for the desired interpretation. For example, in the query:

    Print the Apples.

"Apples" can refer to either a PROGRAM PUBLISHER or a COMPUTER MAKE.
In this case the user is presented with the following prompt:

    Does 'APPLE' best refer to
      1. a computer manufacturer
      2. a computer program publisher

The query will then be answered with the chosen interpretation.

6.1.4 Ignoring Phrases

In some cases there may be a phrase or a word in a query which prevents it from being matched to a verb frame. These words or phrases can often be ignored without affecting the meaning of the query. For example, in the query:

    Print the programs quickly.

the adverb "quickly" will prevent the query from being matched to the verb frame for "print". In this case the system will display the following prompt to the user:

    OK to ignore "quickly"?

If the user chooses to ignore the phrase, the query can be successfully answered. Otherwise an error message will be printed out.

6.1.5 Processing Prepositional Phrases

Prepositional phrases can play two different roles in English. They can either play a role with respect to the main verb, or they can act as a modifier of a head noun. The parser places each prepositional phrase which appears at the end of a query in the PPOBJS register (with the exception of indirect objects and passive constructions). Hence it is the job of the semantic processor to determine the role that the prepositional phrase plays in the query.

During the matching process, the system will try to match prepositional phrases with any verb frame slots which specify them.
For example, the query:

    What programs run on the IBM?

will match the following verb frame:

    (RUN VFRAME ((SUBJ (RELATION PROGRAM))
                 (PP (ON WITH) (RELATION COMPUTER))))

If a prepositional phrase (PP) contained in the PPOBJS register is not matched to a verb frame slot, it will be treated as a modifier to a head noun (HN). As stated in chapter 4, there may be more than one potential HN which a given PP may modify. In order to determine which HN a given PP modifies, the system uses the following three heuristics:

    (1) The PP must appear to the right of the HN
    (2) PP's tend to immediately follow the HN which they modify
    (3) PP's are usually semantically related to the HN
        (eg. PP may refer to a property of the given HN)

From these heuristics the system derives two factors. They are: (1) the number of head nouns between the given head noun and the prepositional phrase (distance), and (2) a defined "semantic association factor" which measures the semantic relationship between the database entities represented by the phrases.

The semantic association factor consists of a number between 1 and 5, and is defined in figure 6.3. The semantic association factors between relations are defined in the database schema, while the association factors between the fields within relations are implicitly defined.

    Association
    Factor   Description and Example
    1        Fields within the same relation
             (computer model and computer make)
    2        Relations which refer to the same entity
             (computer and computer-subparts)
    3        Relations which are commonly associated together and
             are commonly linked together by verbs.
             (computers and programs) (parts and suppliers)
    4        Relations which are not commonly associated together
             but have some possible semantic ties.
             (descriptors and computers)
    5        Relations which are not related together at all.
             (computer programs and hockey players)

    Figure 6.3 - Semantic Association Factors

These two factors (distance and semantic association) are combined in the following equation to determine which HN to attach the given PP to. The HN with the lowest result is chosen as the most appropriate choice.

    DISTANCE + factorial(ASSOCIATION FACTOR)

Note that the factorial expression in this equation gives more weight to the association factor than the distance factor when the association factors between the given PP and the various HN vary substantially. For example, if the association factor between a PP and one HN is 1, while the association factor between the same PP and another HN is 5, the PP will be attached to the HN where the association factor is 1 regardless of the distance factor (e.g. (D1 + 1!) < (D2 + 5!) unless D1 > 25).
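
The attachment calculation is small enough to state directly in code. The Python sketch below implements the equation as given; the function names, the tuple layout of the candidates, and the tie-breaking rule are assumptions, since the text does not say how equal scores are handled.

```python
from math import factorial

def attachment_score(distance, association_factor):
    """Score from the text: DISTANCE + factorial(ASSOCIATION FACTOR)."""
    if not 1 <= association_factor <= 5:
        raise ValueError("association factor must be between 1 and 5")
    return distance + factorial(association_factor)

def choose_head_noun(candidates):
    """Pick the head noun with the lowest score.

    `candidates` is a list of (head_noun, distance, association_factor)
    tuples. Ties go to the earlier entry, which is an assumption.
    """
    return min(candidates, key=lambda c: attachment_score(c[1], c[2]))[0]

# A more distant but closely associated head noun (score 1 + 1! = 2)
# against an adjacent, less closely associated one (score 0 + 3! = 6).
best = choose_head_noun([("computers", 1, 1), ("programs", 0, 3)])
```

Because the factorial grows quickly (1, 2, 6, 24, 120), a difference of even one or two in the association factor usually outweighs a difference of a few head nouns in distance.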
The following query will demonstrate how this calculation works.

    What computers run good programs with 128 KB?

In the above query, the phrase "with 128 KB" could modify either "computers" or "programs". In this case the association factor between "128 KB" and "computers" is 1 and the distance is 1 (1 + 1! = 2), while between "128 KB" and "programs" the association factor is defined as 3 and the distance is 0 (0 + 3! = 6). By applying the above equation, the system will appropriately conclude that "128 KB" modifies "computers".

Although this equation is ad hoc, it has proven to be adequate in the two sample databases. For a commercial system this equation would probably have to be refined.

6.1.6 Relative Clauses

In order to process queries containing relative clauses, the system must match verb frames to both the outer query and the embedded query. This is implemented by matching the outer query first, and then matching the embedded query, where the subject of the embedded query is forced to match the interpretation of the phrase which it modifies. This process is shown by the following example:

    What computers run programs which teach chemistry?

First the outer query is matched to each verb frame for the verb "run". Once a successful match is found, the verb frame for "teach" is then matched to the embedded query. In this case the direct object of the verb "run" is treated as the subject for the verb "teach".

6.2 Building The Standard Query Representation

After each phrase in the query has been either matched to a verb frame slot, ignored, or attached to a head noun (PP's only), the system will attempt to build the Standard Query Representation (SQR) from the resulting deep structure. This is done by analyzing each phrase in the deep structure and determining the semantic function of each word in it. Once the SQR has been constructed, the system will invoke the database interface routine, which will transform the SQR into an actual database query.

6.2.1 SQR Structure

The SQR consists of a number of registers, called lists, which contain the information needed to construct a database query in most common database query languages. This information includes the actual database fields which are to be displayed in the response, the restrictions to be placed on the data to be selected, the relations which need to be accessed, information on how the relations are to be joined together, and some additional information on how the question is to be answered.

6.2.1.1 The Select List

The Select List (SLIST) contains a list of fields which are to be printed in the response. For the query:

    Print the program publishers.

the SLIST will contain the field PROGRAM PUBLISHER.

6.2.1.2 The Restriction List

The Restriction List (RLIST) contains a list of restrictions which will be placed on the items to be selected from the database. The RLIST for the query:

    Print the Macintosh programs.

will be set to:

    (COMPUTER.MODEL = "MACINTOSH")

For more complex queries, the RLIST may include many restrictions.

6.2.1.3 The Relation List

The Relation List, or the "from-what-relation" list (FLIST), contains each relation name which needs to be accessed in order for the query to be answered. The FLIST for the query:

    What Apple computers run LOGO?

will contain the relations COMPUTER and PROGRAM, since both of these relations will need to be accessed in order for the query to be answered.

6.2.1.4 The Join List

The Join List (JLIST) consists of any joins needed to join the relations together. For example, the JLIST for the query,

    What computers run LOGO?

will be set to:

    (COMPUTER.CNO = PROGRAM.CNO)

If the solution of the query only requires one relation to be accessed, the JLIST will be empty.

6.2.1.5 The Distinct Flag

The Distinct Flag determines whether the solution of a query will contain duplicate database elements. For example, the query:

    Print the distinct program publishers.

will cause each publisher to be printed once, rather than having the publisher printed separately for each program that they publish. The words "unique", "different", and "discrete" are also used to state this option.
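
To make the role of these registers concrete, the following Python sketch holds them in one record and renders an SQL-like query from it. The class `SQR`, the function `to_sql`, the choice of SQL as the target language, and the sample field names COMPUTER.MODEL and PROGRAM.NAME in the usage are assumptions for illustration; only the register names and the join condition come from the text.

```python
from dataclasses import dataclass, field

@dataclass
class SQR:
    """Standard Query Representation registers named in the text."""
    slist: list = field(default_factory=list)   # fields to print
    rlist: list = field(default_factory=list)   # restrictions on the data
    flist: list = field(default_factory=list)   # relations to access
    jlist: list = field(default_factory=list)   # join conditions
    distinct: bool = False                      # Distinct Flag

def to_sql(sqr):
    """Render an SQR as an SQL-like string (one possible target language)."""
    select = "SELECT DISTINCT" if sqr.distinct else "SELECT"
    query = f"{select} {', '.join(sqr.slist)} FROM {', '.join(sqr.flist)}"
    conditions = sqr.jlist + sqr.rlist
    if conditions:
        query += " WHERE " + " AND ".join(conditions)
    return query

# "What computers run LOGO?" with an assumed select field and restriction.
sqr = SQR(
    slist=["COMPUTER.MODEL"],
    rlist=["PROGRAM.NAME = 'LOGO'"],
    flist=["COMPUTER", "PROGRAM"],
    jlist=["COMPUTER.CNO = PROGRAM.CNO"],
)
sql = to_sql(sqr)
```

Keeping the registers separate from any one query syntax is what lets the same representation be handed to a different database interface routine for another query language.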

6.2.2 Determining The Question Type

The first step in building the SQR is to determine the type of the query. The system can answer four basic types of queries. Each of these query types has some unique features pertaining to how the question is to be answered.

The most common type of query is one which contains a question determiner (or question noun) such as "who", "what", "where", or "when". These are referred to as "WH" queries and are answered in a report format.

Another popular type of query is termed a PRINT query. This type of query has a main verb such as "print", "show", "display", or "list". These verbs are identified by having the semantic marker "COM-VERB" associated with them in the Active Domain Dictionary. These queries are also answered in a report format. The main feature of these queries is that the direct object contains the information requested by the user. Some examples are listed below:

    Print the good programs.
    Can you display the IBM computer types for me?

The system also answers questions where the answer is either "yes" or "no". These queries, termed "YES/NO" queries, have many different constructions. Sentences in declarative form, such as:

    The IBM PC runs LOGO.

are taken to be YES/NO queries. They are also identified by being in the interrogative mood and either having a copula verb as a main verb, as in:

    Is LOGO a good program?

or having either "do", "be", "have" or a modal (e.g. can, will) as an auxiliary verb, as shown by the following queries:

    Does LOGO run on the APPLE?
    Is Terrapin publishing LOGO?
    Can LOGO run on the APPLE?

The other two types of queries answered by the system are ones which request the system to count some selected database items. These are identified by either having "count" as a main verb (COUNT), or having the question determiner "how many" (HOWMANY). Examples of these are shown below:

    Count the program publishers.
    How many programs run on the IBM PC?

In some cases the question determiner "how many" can be used to simply print out the contents of a field. For example, if the TYPE of the field is set to Q-CNT (a countable quantity), a query would be constructed to request the number contained in this field. This is shown by the following query:

    How many bolts were supplied to job J2 by Smith?

6.2.3 Analyzing The Verb Frame

In addition to determining which slots must be matched in a query, the verb frame may also specify how the relations associated with the verb are to be joined together. For example, the verb "run" requires the program and computer relations to be joined together. This information is extracted from the verb definition in the database schema (JOININFO) and added to the JLIST (join list) for use in the actual database query. The verb frame is also used to select the phrases in the deep structure which are to be processed in the formation of the SQR.
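
The query types of section 6.2.2 can be told apart with a few checks on the parsed query, as the Python sketch below suggests. The function `question_type`, its two arguments, and the order of the checks are assumptions; the sketch also leaves out the Q-CNT case, in which "how many" simply prints a field.

```python
COMMAND_VERBS = {"print", "show", "display", "list"}   # COM-VERB examples
WH_WORDS = {"who", "what", "where", "when"}

def question_type(main_verb, question_determiner=None):
    """Classify a parsed query into one of the types named in the text.

    `main_verb` is the base form of the main verb; `question_determiner`
    is the question word found by the parser, if any. The order of the
    checks is an assumption.
    """
    if question_determiner == "how many":
        return "HOWMANY"
    if main_verb == "count":
        return "COUNT"
    if question_determiner in WH_WORDS:
        return "WH"
    if main_verb in COMMAND_VERBS:
        return "PRINT"
    # Declaratives and auxiliary or copula questions fall through here.
    return "YES/NO"

wh_query = question_type("run", "what")        # What programs run ...
count_query = question_type("run", "how many") # How many programs run ...
yes_no_query = question_type("run")            # Does LOGO run on the APPLE?
```

Treating YES/NO as the fallback mirrors the text's remark that declarative sentences are also taken to be YES/NO queries.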
6.2.3.1 Determining The Question Element

In order to determine what information the user wishes to be displayed in queries which are answered in a report format, the system must identify the phrase which is the question element. The question element is usually identified by the parser and is stored in the QE register. This is the case for questions containing words like "who" or "how many". If the query type is set to PRINT (command verbs), the question element is taken to be the direct object of the query.

6.2.3.2 Analyzing Noun And Prepositional Phrases

The major task in building the SQR is determining the semantic role of each word in each noun phrase. Since a prepositional phrase consists of a preposition and a noun phrase, prepositional phrases and noun phrases are handled essentially by the same process.

6.2.3.2.1 Head Noun

The head noun of the noun phrase is analyzed first using the semantic interpretation contained in the SEMREG register. Depending on the syntactic category of the head noun, a number of things can happen.

If the head noun is either a number, noun, or proper noun, and is defined as a database element (ELM-OF or MEM-OF), then the word and its associated database field will be formed into a restriction and will be added to the RLIST. Also the relation in which the field is contained will be added to the FLIST.
For example, the phrase:

    "the Macintosh" (SEMREG (FIELD COMPUTER MODEL))

will have the following effects:

    RLIST <- (COMPUTER.MODEL = "MACINTOSH")
    FLIST <- (COMPUTER)

If the head noun is a compound proper noun contained in the database, the entire phrase is used in the restriction. The following example contains an abbreviated compound proper noun:

    John Wilson Ltd (SEMREG (FIELD PROGRAM PUBLISHER))
                    (ANPR (JOHN WILSON AND SONS LIMITED))

In this case the matched database element stored in the ANPR register (actual proper noun) is used to form the following restriction:

    (PROGRAM PUBLISHER = JOHN WILSON AND SONS LIMITED)

Another case is when the head noun is defined as a word which identifies a relation or a field (L-NOUN). In this case the relation represented by the field or relation will be added to the FLIST. The following examples will both cause the given effect.

    "the program" (SEMREG (RELATION PROGRAM))
    "a publisher" (SEMREG (FIELD PROGRAM PUBLISHER))

    FLIST <- (PROGRAM)

Note that if the phrase happens to be the question element of the query, the fields identified by the head noun will be added to the SLIST. For example, the phrase:

    "which publishers" (SEMREG (FIELD PROGRAM PUBLISHER))

will cause the following effect:

    SLIST <- (PROGRAM PUBLISHER)

It is also possible that the head noun is not semantically defined with respect to the database. In this case the SEMREG register is set to "*UNKNOWN". For example, the word "LOTUS" in the following query is not defined.
    What computers run LOTUS (SEMREG (*UNKNOWN))

Since the verb frame for "run" specifies that the direct object should represent a computer program, the system will print the following response:

    (LOTUS) is not (a computer program) contained in this database.

If the head noun is a question pronoun, the fields represented in SEMREG are added to the SLIST and the relation name is added to the FLIST. Note that if SEMREG is set to a relation, the fields to be selected are determined by the IDENTIFIER attribute in the database schema. The following phrase will cause the given lists to be updated.

    "What" (SEMREG (RELATION COMPUTER))

    SLIST <- (COMPUTER.CNO COMPUTER.MAKE COMPUTER.MODEL)
    FLIST <- (COMPUTER)

The system handles personal pronouns in a limited way. It can only deal with pronouns such as "you", "me", and "us" when they are matched with a verb frame containing the semantic classification "GENERAL *YOU*" or "GENERAL *ME*". For example, the verb frame for "show" will match the following query:

    You show me the good programs.

Since these pronouns have no effect on the actual database query, they are simply ignored.

6.2.3.2.2 Modifiers

The next task is to analyze the adjectives and the prepositional phrases which act as modifiers to head nouns. Often words which are contained in the database (ELM-OF, MEM-OF) act as modifiers. Such words will be used to put restrictions on the data to be selected. Either of the modifiers in the following queries will cause the given restrictions to be added to the RLIST.
    the good Terrapin programs
    the good programs from Terrapin

    RLIST <- ((PROGRAM.RATING > 3) (PROGRAM.PUBLISHER = "TERRAPIN"))

A more complex case arises when the modifier refers to a database field in another relation. In the following phrases, the descriptor "chemistry" modifies "programs".

    the chemistry programs
    the programs on chemistry

In this case the default join between the relations PROGRAM and DESCRIPTOR will be added to the JLIST. The restriction (DESCRIPTOR.DSC = "CHEMISTRY") will be added to the RLIST.

Words which are used to identify relations (L-NOUN) can act as modifiers to the head noun. For example, the word "program" in the following phrases is used to identify the program relation.

    the program LOGO (SEMREG (PROGRAM NAME))
    the program publisher (SEMREG (PROGRAM PUBLISHER))

The meaning of such words was used to determine the contents of SEMREG and has no other effect in the semantic interpretation.

Words which are defined as describers (L-DESC) also have no effect in the construction of the SQR. For example, the word "computer" in the following phrase acts as a describer (classifier) for the relation program.

    the computer program (SEMREG (PROGRAM))

Note that the word "computer" is also defined as an identifier (L-NOUN) for the relation computer. In this case the ambiguity is resolved by the context in which the word is used.

As previously stated, words such as "distinct" and "unique" are used to prevent the system from printing out duplicate database items.
If these words appear as modifiers, they will simply cause the DISTINCT flag to be set.

A phrase may contain a modifier which does not make any semantic sense, with respect to the database, even though it is syntactically correct. In the query:

    Print the German programs.

the word "German" is not semantically defined. The system will then respond with the prompt:

    The word (GERMAN) in the phrase (the German programs) has no meaning in this query. Is it OK to ignore it? (Y/N)

If the user answers "YES", the remainder of the query will be answered. Otherwise the following message is printed:

    There are no such database items which satisfy the above restriction.

It is possible for prepositional phrases which modify head nouns to contain modifiers as well. The semantic interpretation of these phrases is done in the same way as described above. For example, the PP in the phrase:

    the computers with 64 KB memory

will cause the following restriction:

    RLIST <- (COMPUTER.RAM = 64)

6.2.3.2.3 Conjunction

As stated in chapter 4, the system is capable of handling some limited forms of conjunction. It allows a number of describers (adjectives, nouns, etc.) to be connected together by "and", "or", or a comma. If the words are database elements, they will be interpreted as a complex restriction.
An example of this type of conjunction is shown below:

    the red, green, or blue parts

The system will construct the following restriction for this phrase:

    RLIST <- ((P.COLOUR = "RED" OR P.COLOUR = "GREEN" OR P.COLOUR = "BLUE"))

In many cases where conjunction is used there may be more than one possible interpretation. For example, the phrase:

    the Apple and IBM computers

could be interpreted as either:

    (1) the Apple computers and the IBM computers (the Apple or IBM computers)

or

    (2) the Apple and IBM computers (computers which are both Apple and IBM)

In order to determine the interpretation of this query, the system uses the heuristic that users will not ask queries which contain impossible restrictions. Since the database schema defines that any computer can only have one computer make associated with it (one to one), the second interpretation is impossible. Hence the system will use the first interpretation.

The above heuristic cannot resolve all ambiguity of this type. For example, the following phrase can also be interpreted in the same two ways:

    the French and Chemistry programs

Since a program can have many descriptors, it is not obvious which interpretation is intended by the user. In this case the user should be asked which interpretation is intended, and the system should answer the query accordingly. At present, the system will only print out a message saying that it cannot deal with the given conjunction.
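The conjunction heuristic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SINGLE_VALUED` table is a hypothetical stand-in for the cardinality information held in the database schema, and the function name and error message are invented for the sketch.

```python
from typing import List

# Hypothetical schema facts: True when a record can hold only one value
# for the field (e.g. a computer has exactly one make).
SINGLE_VALUED = {
    "P.COLOUR": True,
    "COMPUTER.MAKE": True,
    "DESCRIPTOR.DSC": False,  # a program can have many descriptors
}


def conjoined_restriction(field: str, values: List[str], connective: str) -> str:
    """Build one restriction from describers joined by "and", "or" or commas."""
    terms = [f'{field} = "{value.upper()}"' for value in values]
    if connective == "or" or SINGLE_VALUED[field]:
        # On a single-valued field, "A and B" cannot mean "both A and B",
        # so the only possible reading is the OR interpretation.
        return "(" + " OR ".join(terms) + ")"
    # Both readings are possible; like the system at present, give up.
    raise ValueError("cannot deal with the given conjunction")
```

With these assumptions, `conjoined_restriction("COMPUTER.MAKE", ["Apple", "IBM"], "and")` produces an OR restriction, while the same call on `DESCRIPTOR.DSC` with "French" and "Chemistry" raises an error, mirroring the message the system prints for ambiguous conjunctions.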
6.2.3.2.4 Relative Clauses

Queries containing relative clauses are dealt with by having the Semantic Processor build a separate SQR structure for the embedded query. This SQR structure will essentially act as a restriction to the head noun in the outer query which it modifies. An example of a database query resulting from a query containing a relative clause is given in section 7.2.

6.2.3.2.5 Determiners

The system does not presently extract any semantic knowledge from determiners. For example, for the query:

    Print a good spreadsheet program.

the system will print out information on every "good spreadsheet program" contained in the database.

6.2.3.3 Analyzing Adverbs And Predicates

The task of analyzing verb frame slots which correspond to adverbs and predicates consisting of an adjective is fairly simple since they usually function as a restriction. For example, the word "excellent" in the following query acts as a modifier of "programs".

    Which programs are excellent?

Since the SEMREG register for this phrase is set to "FIELD PROGRAM RATING", the following restriction will be added to the RLIST:

    RLIST <- (PROGRAM.RATING = 5)

Note that "excellent" is defined as a coded database element (CODE-AS (* = 5)).

6.3 Summary Of Semantic Analysis

The Semantic Interpreter has been designed to be domain independent. It is completely data driven. No additional programming is required to adapt it to a new domain.

The system allows words, including nouns and verbs, to have more than one semantic definition. The chosen meaning will be determined by the context in which it appears. If any ambiguity cannot be resolved by the system, the user is prompted for the intended interpretation.

The semantic definitions of most words, including all the domain dependent ones, are contained in the Active Domain Dictionary. The definition of some domain independent words, such as some auxiliary verbs, is built into the semantic interpreter.

VII. SYSTEM DESIGN IV - DATABASE INTERFACE AND RESPONSE GENERATION

After the Standard Query Representation (SQR) has been produced by the Semantic Processor, it is passed to the Database Interface to be transformed into a database query in a specified query language. This query is then executed by the DBMS. The results obtained by this query are then passed to the Response Generator, which in turn displays the result(s) to the user.

7.1 The Sequel Query Language

The Database Interface currently constructs database queries in the Sequel (SQL) database query language.
This language, which was developed at IBM in the early 1970s, is used by many relational DBMS, including Oracle (Chamberlin 1977). The fundamental retrieval command in Sequel is the "SELECT-FROM-WHERE" command. For example, the query:

    How much memory does the PC have?

can be expressed in Sequel as:

    SELECT COMPUTER.RAM
    FROM COMPUTER
    WHERE COMPUTER.MODEL = 'PC'

This will cause the amount of memory to be printed for every entry in the computer relation where the computer model value is set to "PC".

If the query states that duplicate database values are not to be printed, the DISTINCT operator is used. The query:

    Print the unique program publishers.

can be coded as follows:

    SELECT DISTINCT PROGRAM.PUBLISHER
    FROM PROGRAM

If the values selected by the query are to be counted, the COUNT function is used. For example, the query:

    Count the Microsoft programs.

is translated to the following Sequel query:

    SELECT COUNT (PROGRAM.PNO)
    FROM PROGRAM
    WHERE PROGRAM.PUBLISHER = 'MICROSOFT'

A complete definition of the Sequel query language can be found in Date (Date 1977).

7.2 Assembling The Sequel Query

The task of assembling the Sequel query is fairly straightforward. The Database Interface is passed the select list (SLIST), relation list (FLIST), restriction list (RLIST), join list (JLIST), DISTINCT flag, and the query type (QTYPE) by the semantic processor. For example, the semantic interpreter will pass the following results for the given query:

    What Apple or IBM computers run LOGO?
    QTYPE    - WH
    SLIST    - (COMPUTER.CNO COMPUTER.MAKE COMPUTER.MODEL)
    FLIST    - (COMPUTER PROGRAM)
    JLIST    - ((COMPUTER.CNO = PROGRAM.CNO))
    RLIST    - ((COMPUTER.MAKE = "APPLE" OR COMPUTER.MAKE = "IBM")
                (PROGRAM.NAME = "LOGO"))
    DISTINCT - NIL (FALSE)

In this case the Database Interface will determine from the SLIST which database values are to be printed. These fields are then used with the SELECT clause. The relations contained in FLIST are used with the FROM clause. The joins (JLIST) and restrictions (RLIST) are included with the WHERE clause. Note that these restrictions are connected together by the "AND" operator since they all will apply to the end result. The Sequel query for the previous example is shown below:

    SELECT COMPUTER.CNO, COMPUTER.MAKE, COMPUTER.MODEL
    FROM COMPUTER, PROGRAM
    WHERE (COMPUTER.CNO = PROGRAM.CNO)
    AND (COMPUTER.MAKE = 'APPLE' OR COMPUTER.MAKE = 'IBM')
    AND (PROGRAM.NAME = 'LOGO')

The formation of database queries for queries which contain relative clauses is slightly more complex. Sequel has the feature of allowing an embedded query within a query. The embedded query will simply act as a restriction for the data selected by the outer query. Since this construction closely models the semantic construction of a relative clause, it makes sense to use it. The following example shows how a query containing a relative clause is expressed in Sequel.

    What Apple computers run programs which teach chemistry?
    SELECT COMPUTER.CNO, COMPUTER.MAKE, COMPUTER.MODEL
    FROM   COMPUTER, PROGRAM
    WHERE  (COMPUTER.CNO = PROGRAM.CNO)
    AND    (COMPUTER.MAKE = 'APPLE')
    AND    (PROGRAM.PNO IN
             (SELECT PROGRAM.PNO
              FROM PROGRAM, DESCRIPTOR, PROGDESC
              WHERE (PROGRAM.PNO = PROGDESC.PNO
                     AND PROGDESC.DNO = DESCRIPTOR.DNO)
              AND (DESCRIPTOR.DSC = 'CHEMISTRY')))

In this case the embedded query will extract all of the program keys (PROGRAM.PNO) for programs on chemistry. Each program selected in the outer query will have to be contained in this set.

7.3 Response Generation

The Response Generator receives the data extracted by the DBMS and forms a response. Most of the responses are displayed in report format. In this case the system simply displays the data in the format produced by the DBMS. Examples of this format are shown in Appendices H and I.

The only true response generation performed by the Response Generator is for YES/NO queries. In this case the system checks to see if any data has been extracted by the DBMS and replies accordingly. If the query:

    Does LOGO run on the Apple II?

does not produce any results, the following response is printed:

    I don't know. There are no records in the database which satisfy the given criteria.

Otherwise a response like the following is printed:

    YES. There is 1 record in the database which satisfies the given criteria. Would you like to see it? (Y/N)

Note that the user is given the option of seeing the extracted information.

Most of the effort in forming responses has been put into error messages and messages for queries which cannot be answered.
These responses have been dealt with by the syntactic and semantic processors.

VIII. PORTABILITY ISSUES

One of the main focuses of the design of the system is portability. In this chapter we will look at both domain portability and database portability.

8.1 Domain Portability

In order to implement a new domain, a Database Schema must first be developed. This is then compiled together with the inverted index, selected information from the Database Schema Library and the Global Dictionary, and the domain independent information to form an Active Domain Dictionary for the given domain.

The easiest way to develop a new domain is to first examine a number of possible queries which may apply to the given database. After the DBA has an idea of the vocabulary which will have to be defined, the Database Schema is developed. As previously mentioned, the Database Schema consists of the database definition, verb frames, and the default joins.

The process of developing the database definition is very similar to developing the database itself. This process requires the DBA to analyze each relation and field, and to supply various types of semantic information to the attributes associated with them (L-NOUN, CONTENTS, etc.).

The process of defining new verb frames for domain dependent verbs is the most difficult task since it requires some elementary knowledge of linguistics.
To construct a new verb frame, the DBA must analyze numerous queries for each verb and determine what syntactic phrases (subject, object, etc.) the verb requires, and what each of these syntactic phrases represents with respect to the database. For verbs which access more than one relation, information on how the relations are joined together must also be specified (JOININFO and JOINTYPE).

The DBA must also specify the default joins for every two relations that can be joined together. This should be trivial for DBAs who are familiar with relational database systems. They must also define the semantic closeness factor, which is simple if the DBA is familiar with the domain of the database.

In some cases the DBA may want to add new definitions to the Database Schema Library. This is useful if the definitions are to be used in other databases in the organization. It is also possible that the DBA may want to modify a definition. For example, he/she may want to change the definition of rating from a 5 point scale to a 3 point scale.

The DBA may also want to add new words to the Global Dictionary. Obviously the Global Dictionary will not contain every possible word which a user may need.

8.2 The SUPPLIER/PART/JOB Database

In order to test out the feasibility of implementing a new domain, a supplier/part/job database was implemented (Date 1977).
This database consists of the following four relations:

    SUPPLIER [SNO SNAME STATUS CITY]
    PART     [PNO PNAME COLOUR WEIGHT]
    JOB      [JNO JNAME CITY]
    SPJ      [SNO PNO JNO QTY]

This database was similar enough to the Program database so that many of the same definitions could be used (Physical Object, General Name, etc.). This was useful since it demonstrates that the semantic information associated with these definitions is portable. Since the Database Schema Library only contained definitions which were used in the Program database, the following definitions were added:

    L-TYPE  Q-MASS (WEIGHT)
    TYPE    Q-CNT (QUANTITY)
    TYPE    COLOUR
    TYPE    LOCATION
    TYPE    CITY

Verb frames for the verbs "locate" and "weigh" were also added to the Database Schema Library. Each of these is a popular definition and would be included in a commercially oriented system.

Once the new definitions were added to the Database Schema Library, the construction of the Database Schema was fairly easy. Most of the words which are defined in it were extracted from the sample queries contained in Date (Date 1977). Some sample queries which the system handles are listed in figure 8.1. The actual Database Schema for this database is contained in Appendix I, while a sample session is in Appendix J.

    Print the status of the suppliers in Paris.
    What suppliers supply job J1 with red nuts in Paris?
    Who supplies P3s to jobs in Athens?
    Does the collator project use bolts from Smith?
    Where is Smith?
    How much does part P4 weigh?
Figure 8.1 - Sample Queries to Supplier/Part/Job database

Although the entire database took approximately 3 hours to implement and test, the portion which would normally be done by a DBA in a commercialized system took 1.5 hours. Considering the complexity of the queries the system can answer, this time is minimal. This time could be further reduced by having a specialized editor to create the Database Schema. This would avoid the problem of matching parentheses in the current Database Schema data structure. The amount of time spent on each task during the implementation of this database is listed in figure 8.2.

    Activity                                            Time (min.)
    Implement Database in Oracle                        20
    Load test data into database                        20
    Add extensions to the Database Schema Library       30
    Design and Implement Database Schema                60*
    Add linguistic information to Global Dictionary     20
    Test queries and make a few corrections             30*

    * - Required activities for commercialized system.

Figure 8.2 - Implementation Time for Supplier/Part/Job Database

The only implementation problem encountered with this database was the selection of duplicate items. Queries to the supplier/part/job database tended to produce many duplicate database values, which seemed to suggest that the DISTINCT attribute should act as a default. Since the root of this problem lies in the design and semantics of the given database, this problem was not dealt with.
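The duplicate problem can be reproduced with a small Python sketch that assembles a query from the SQR lists in the manner of section 7.2 and runs it with and without the DISTINCT flag. Everything here is illustrative: the `SQR` class and `assemble_query` function are simplified stand-ins rather than the system's actual data structures, the table rows are made up, the sample question "What parts are supplied to job J1?" is hypothetical, and SQLite is used in place of Oracle only because it ships with Python.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import List


@dataclass
class SQR:
    """Simplified stand-in for the Standard Query Representation."""
    slist: List[str]                                  # fields to select
    flist: List[str]                                  # relations to read
    jlist: List[str] = field(default_factory=list)    # join conditions
    rlist: List[str] = field(default_factory=list)    # restrictions
    distinct: bool = False                            # DISTINCT flag


def assemble_query(sqr: SQR) -> str:
    """Map SLIST, FLIST, JLIST and RLIST onto SELECT, FROM and WHERE."""
    select = "SELECT DISTINCT " if sqr.distinct else "SELECT "
    lines = [select + ", ".join(sqr.slist), "FROM " + ", ".join(sqr.flist)]
    # Joins and restrictions all apply to the result, so they are ANDed.
    conditions = [f"({c})" for c in sqr.jlist + sqr.rlist]
    if conditions:
        lines.append("WHERE " + " AND ".join(conditions))
    return "\n".join(lines)


def run(sql: str) -> List[tuple]:
    """Run a query against a tiny in-memory PART/SPJ database (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE PART (PNO TEXT, PNAME TEXT, COLOUR TEXT, WEIGHT INTEGER);
        CREATE TABLE SPJ (SNO TEXT, PNO TEXT, JNO TEXT, QTY INTEGER);
        INSERT INTO PART VALUES ('P1', 'NUT', 'RED', 12);
        INSERT INTO SPJ VALUES ('S1', 'P1', 'J1', 200);
        INSERT INTO SPJ VALUES ('S2', 'P1', 'J1', 100);
    """)
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows


# Hypothetical SQR for "What parts are supplied to job J1?"
sqr = SQR(
    slist=["PART.PNAME"],
    flist=["PART", "SPJ"],
    jlist=["PART.PNO = SPJ.PNO"],
    rlist=["SPJ.JNO = 'J1'"],
)
plain_rows = run(assemble_query(sqr))      # one row per supplying shipment
sqr.distinct = True
distinct_rows = run(assemble_query(sqr))   # each part name once
```

Because the same part reaches the job through two suppliers in this toy data, the plain query returns the part name twice, while the DISTINCT version returns it once. Making DISTINCT the default would trade this noise for the loss of legitimately repeated values, which is a database design question rather than a query assembly one.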
The implementation of this domain has reinforced the initial expectations of the ease with which a new domain can be implemented. It also shows that the DBA does not need to have any knowledge of AI programming or have a strong linguistic background. The DBA is required to have an understanding of relational database design and a limited knowledge of the structure of English.

8.3 Database Portability

Database Portability involves adapting the system to a new relational DBMS. In order to achieve this, the Database Interface routine must be modified. Some changes to the Response Generator may be required, depending on the report generation capabilities of the given DBMS. No changes in either the syntactic or semantic processors are required.

Database portability is not as big an issue as domain portability since the system designer, and not the DBA, is expected to do this task. Due to the lack of another relational DBMS on MTS, this capability has not been tested out. However, the concept of designing the Standard Query Representation (SQR), which contains all of the needed information to construct a query in a variety of query languages, will in theory achieve a high degree of database portability.

IX. CONCLUSION

The main achievement of this system is its high degree of domain portability.
Domain portability has been approached from the view of having a DBA implement a new domain, instead of the more traditional approach where an AI system designer would implement it. A new domain can be implemented using the skills of a typical DBA in a reasonable amount of time.

Considerable effort has also gone into the design of a friendly user interface. If a problem cannot be adequately resolved by the system, it is better to present the user with an understandable prompt than to make an imprecise assumption. The system handles some types of syntactic and semantic ambiguity, as well as certain unknown words and phrases, in this manner.

The system can answer a wide range of the queries commonly put to a database. However, many areas of natural language discourse have been untouched or unresolved by this system. By no means are the algorithms and data structures suitable (applicable) for general discourse. However, they are suitable for developing complex natural language interfaces to large databases.

9.1 Future Work

There are numerous areas which would require further work before this system could be used in a commercial setting. These topics are briefly presented in the following sections.

9.1.1 Further Explore Domain Portability

In order to fully test out the concept of domain portability, many more domains would have to be implemented.
This would require that both the Database Schema Library and the Global Dictionary be expanded to incorporate these domains. It would also be necessary to evaluate the ease with which a DBA without NL experience could implement a new domain. Obviously, proper documentation would have to be developed in order to test this.

9.1.2 Adaptation To Other DBMS

In order to test the soundness of the Database Interface concept for database portability, the system could be adapted to other relational DBMS using different query languages. It would also be interesting to look into the possibility of adapting the system to work with the network and hierarchical database models.

9.1.3 Extensions To The ATN

The ATN could be extended to answer a wider range of questions. For example, it does not presently parse queries containing relative clauses of the following form:

    What jobs using red parts are located in London?

It could also be modified to determine all possible parses. Presently it will only produce the first valid parse found.

9.1.4 Pronoun Reference

Pronouns are very useful in NL systems since they can greatly reduce the number of words needed to state a query. It would be interesting to explore the use of verb frames and semantic definitions of pronouns in order to determine the correct pronoun referent. For example, in the following queries:

    What suppliers are located in London?
    Do any of them supply bolts?
the pronoun "them" will refer to the suppliers located in London. The system could determine this by noting that the subject of the verb "supply" is defined as a supplier, both "them" and "suppliers" can be defined as groups of individuals (CATEG GROUP), and the question element of the previous query was a group of suppliers.

9.1.5 Ellipses

The ability to process ellipses (sentence fragments) is important for any NL system which is to be used in a commercial setting. This allows users to enter similar queries without having to retype the entire query. For example, after the query:

    What spreadsheet programs run on the Apple II?

the user may type:

    on the IBM PC?

In this case the system should substitute "IBM PC" for "APPLE II" to answer the query. This feature has been successfully implemented in PLANES (Waltz 1978). It appears that the verb frame concept would allow this feature to be added to the system fairly easily.

9.1.6 Quantifiers

The system does not currently process any complicated type of quantifier, including negation. This is an important area for NL database systems since database queries often contain complex quantification. This topic has previously been researched in numerous NL systems (Woods 1968; Waltz 1978; Booth 1983).

9.1.7 Complex Conjunction

Conjunction is an extremely complex area in linguistics with many unresolved problems.
The current system has successfully dealt with a limited form of conjunction by using the structure of the database to resolve ambiguity. It would be interesting to extend the approaches used to more complex conjunction.

9.1.8 Response Generation

The present system only forms responses (other than reports) for "yes/no" questions and for queries which cannot be answered due to problems such as ambiguities or unknown database elements. It would be useful to explore how queries can be answered in complete sentences. In some cases, the generation of a complete response can avoid a misinterpretation. For example, in the following query:

    Does the Apple run LOGO?

the user may not be aware that there is more than one "Apple" (or LOGO) in the system. Hence an answer like:

    Yes. The Apple II runs Terrapin LOGO.

would prevent any possible misinterpretation which may result from an incomplete response.

Another area in response generation which is not dealt with in this thesis is supplying the user with additional relevant information when the answer to a question may not be satisfactory. For example, the answer "NO" to the following query will not inform the user whether Joe Smith has actually taken Math 100 (assume Joe Smith is a student and Math 100 is a course):

    Did Joe Smith pass Math 100?

Either of the following responses would be more informative:

    No. Joe Smith did not take Math 100.
    No. Joe Smith failed Math 100.
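As a rough illustration of this second kind of response, the following minimal Python sketch answers a "Did X pass Y?" query with a complete sentence rather than a bare "NO". It is not part of the system described in this thesis, which does not implement this feature. The table GRADES, the threshold PASS_MARK and the function did_pass are hypothetical names introduced only for this example, and the data are invented.

```python
from typing import Dict, Tuple

# Hypothetical enrolment data: (student, course) -> final grade in percent.
# A missing key means the student never took the course.
GRADES: Dict[Tuple[str, str], int] = {
    ("Joe Smith", "Math 100"): 42,
    ("Ann Lee", "Math 100"): 81,
}

PASS_MARK = 50  # assumed pass threshold, chosen only for illustration


def did_pass(student: str, course: str,
             grades: Dict[Tuple[str, str], int] = GRADES) -> str:
    """Answer 'Did <student> pass <course>?' in a complete sentence."""
    key = (student, course)
    if key not in grades:
        # A bare "No" would hide the fact that the question's
        # presupposition (the student took the course) is false.
        return f"No. {student} did not take {course}."
    if grades[key] >= PASS_MARK:
        return f"Yes. {student} passed {course}."
    return f"No. {student} failed {course}."


if __name__ == "__main__":
    print(did_pass("Joe Smith", "Math 100"))
    print(did_pass("Joe Smith", "Physics 110"))
```

The design point is simply that the two "No" cases are distinguished before the response is worded, so the user can tell a failed course from one that was never taken.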
9.2 Summary

A portable natural language front-end to a relational database has been described in this thesis. It consists of 4 components: the syntactic processor, the semantic processor, the database interface, and the response generator. All of these components are domain independent.

Domain portability is achieved by separating the domain dependent from the domain independent information. The domain dependent information is contained in the domain definition. It has been designed so that a new domain definition can be built using the skills of a typical DBA in a reasonable amount of time. A rich supply of standard definitions is available both to aid in the development process and to help enforce consistency amongst databases.

Database portability is achieved by placing all database dependent components into the database interface. The semantic processor is completely database independent. The system can be adapted to a new DBMS by simply modifying the database interface routine.

BIBLIOGRAPHY

Bates, M. (1978), The Theory and Practice of Augmented Transition Network Grammars, Natural Language Based Computer Systems, Leonard Bolc (ed.), Springer Verlag, pp. 191-260.

Barr, A. and Feigenbaum, E. A. (1981), The Handbook of Artificial Intelligence, Vol. 1, W. Kaufmann, Inc., Los Altos, CA.

Bobrow, D. G. (1968), Natural Language Input for a Computer Problem-Solving System, (in Minsky, pp. 135-215).

Booth, D. A.
(1983), Designing a Portable Natural Language Database Interface, M.Sc. Thesis, Department of Computer Science, University of British Columbia.

Brown, J. S., Burton, R. R. and Bell, A. G. (1974), SOPHIE: A Sophisticated Instructional Environment for Teaching Electronic Troubleshooting, Bolt Beranek and Newman, Report No. 2790, Cambridge, Mass.

Carlson, R. C. and Kaplan, R. K. (1976), A Generalized Access Path Model and its Application to a Relational Data Base System, Sigmod Proceedings, Washington, D.C., pp. 143-154.

Chamberlin, D. D., et al. (1977), Sequel II: A Unified Approach to Data Definition, Manipulation, and Control, IBM Journal of Research and Development, 10(6), IBM, New York.

Charniak, Eugene (1981), Six Topics in Search of a Parser: An Overview of AI Language Research, Proceedings of the Seventh Int. Conf. on Artificial Intelligence, Vancouver, pp. 1079-1087.

Cercone, N., Hadley, R., and Strzalkowski, T. (1983), The Automated Academic Advisor: An Introduction and Initial Experiences, Technical Report TR 83-11, Laboratory for Computer and Communication Research, Simon Fraser University.

Chomsky, Noam (1965), Aspects of the Theory of Syntax, MIT Press, Cambridge, Mass.

Davis, D. B. (1984), English: The Newest Computer Language, High Technology, February, pp. 59-64.

Furukawa, Koichi (1977), A Deductive Question Answering System on Relational Data Bases, Proceedings of the Fifth Int. Jt. Conf. on Artificial Intelligence, MIT, pp. 59-66.
Griffith, Robert L. (1982), Three Principles of Representation for Semantic Networks, ACM Transactions on Database Systems, 7(3), pp. 417-442.

Harris, Larry R. (1977a), ROBOT, A High Performance Language Interface for Database Query, Technical Report TR77-1, Mathematics Department, N.H.

Harris, Larry R. (1977b), Natural Language Database Query: Using the Database Itself as the Definition of World Knowledge and as an Extension of the Dictionary, Technical Report TR77-2, Mathematics Department, N.H.

Harris, Larry R. (1977c), User Oriented Data Base Query with the ROBOT Natural Language Query System, Proc. 3rd Int. Conf. on Very Large Data Bases, Tokyo, Japan.

Harris, Larry R. (1984), Experience with INTELLECT: Artificial Intelligence Technology Transfer, The AI Magazine, 5(2), pp. 105-147.

Hendrix, G. G. (1977), Human Engineering for Applied Natural Language Processing, Proceedings of the Fifth Int. Jt. Conf. on Artificial Intelligence, MIT, pp. 183-191.

Hendrix, G. G., Sacerdoti, E. D., Sagalowicz, D., and Slocum, J. (1978), Developing a Natural Language Interface to Complex Data, ACM Transactions on Database Systems, 3(2), pp. 105-147.

Jackendoff, Ray S. (1972), Semantic Interpretation in Generative Grammar, MIT Press, Cambridge, Mass.

Johnson, Jan (1981), Intellect on Demand, Datamation, November, pp. 73-78.

Johnson, Jan (1984), Easy Does It: Natural Language Query Systems have gained new attention in recent months, Datamation, June 15, pp. 48-60.

Kaplan, S. J.
(1979), Cooperative Responses from a Portable Natural Language Database Query System, Ph.D. Dissertation, University of Pennsylvania.

Kaplan, S. J. (1984), Designing a Portable Natural Language Database Query System, ACM Transactions on Database Systems, 9(1), pp. 1-19.

McLeod, D. (1972), The Semantic Database Model, Ph.D. Thesis, MIT Press, Cambridge, Mass.

Minsky, M. (1968), Semantic Information Processing, MIT Press, Cambridge, Mass.

Raphael, B. (1968), SIR: A Computer Program for Semantic Information Retrieval, (in Minsky, pp. 33-134).

Rosenberg, Richard S. (1980), Approaching Discourse Computationally: A Review, Representing and Processing of Natural Language, Carl Hanser Verlag, pp. 10-83.

Reiter, Ray (1978), The Woods Augmented Transition Network Parser, Technical Note 78-3, Department of Computer Science, University of British Columbia.

Sacerdoti, E. D. (1977), Language Access to Distributed Data with Error Recovery, Proceedings of the Fifth Int. Jt. Conf. on Artificial Intelligence, MIT, pp. 196-202.

Sagalowicz, Daniel (1977), IDA: An Intelligent Data Access Program, Proc. 3rd Int. Conf. on Very Large Data Bases, Tokyo, Japan.

Stockwell, R. P. (1977), Foundations of Syntactic Theory, Prentice-Hall, Englewood Cliffs, New Jersey.

Strzalkowski, T. (1983), ENGRA - Yet Another Parser for English, Technical Report TR 83-10, Laboratory for Computer and Communication Research, Simon Fraser University.

Taylor, Brock H.
and Rosenberg, Richard S. (1975), A Case-Driven Parser for Natural Language, American Journal for Computational Linguistics, AJCL Microfiche 31.

Waltz, David L. (1978), An English Language Question Answering System for a Large Relational Database, Communications of the ACM, 21(2), pp. 526-539.

Winograd, Terry (1983), Language as a Cognitive Process, Vol. 1, Syntax, Addison-Wesley.

Woods, W. A. (1967), Semantics for a Question Answering System, Ph.D. thesis, Report NSF-19, Aiken Computational Lab., Harvard University, Cambridge, Mass.

Woods, W. A. (1970), Transition Network Grammars for Natural Language Analysis, Comm. ACM, 13(10), pp. 591-606.

Woods, W. A., Kaplan, R. M. and Nash-Webber, B. (1972), The Lunar Sciences Natural Language Information System: Final Report, Bolt Beranek and Newman, Report No. 2378, Cambridge, Mass.

Woods, W. A. (1980), Cascaded ATN Grammars, American Journal for Computational Linguistics, 6(1), pp. 1-12.

APPENDIX A - AUGMENTED TRANSITION NETWORK DIAGRAMS

This section contains diagrams of the ATNs used by the Syntactic Processor.

[Diagram: The NOUN PHRASE Network; recovered arc labels: PUSH PP (Subj Only), PUSH S/REL]
[Diagram: The SENTENCE Network]
[Diagram: The PREPOSITIONAL PHRASE Network]
[Diagram: The NAME (Proper Noun) Network; recovered arc label: POP (Good NPR?)]
[Diagram: The NOISE WORD Network]

APPENDIX B - SAMPLE PARSE

The following is an example of a deep structure produced by the ATN. Note that the SEMREG and VFRAME registers have been set by the Semantic Processor. The Sequel query produced for this query is listed after the deep structure.

*WHAT SHACK COMPUTERS RUN GOOD OR EXCELLENT PROGRMS ON CHEMISTRY?
> (Assume (RADIO SHACK) for (SHACK) ?)
*OK
> (*** Spelling Mistake *** PROGRMS --> PROGRAM)
>
> S
> MOOD INT                     ;Interrogative
> QE SUBJ                      ;Question Element
> SUBJ
> NP
> DET WHAT
> DESC
> NPR
> NPR SHACK
> ANPR                         ;contains full version of
> RADIO SHACK                  ;the proper noun
> HEAD
> N COMPUTER
> QUAL NIL
> QUES YES
> NUM PL
> PER 3
> CASE SUBJ
> POSN 3
> ANP
> WHAT SHACK COMPUTERS
> SEMREG                       ;set by semantic processor
> RELATION COMPUTER
> AUX NIL
> V RUN
> VFRAME
> SUBJ                         ;set by semantic processor
> RELATION COMPUTER
> DOBJ
> RELATION PROGRAM
> MOD NIL
> PRED NIL
> DOBJ
> NP
> DET NIL
> DESC
> ADJ GOOD
> CON OR
> ADJ EXCELLENT
> HEAD                         ;contains correct spelling
> N PROGRAM
> QUAL                         ;this PP has been attached
> PP                           ;to the head noun "program"
> PREP ON                      ;by the semantic processor.
> POBJ                         ;Initially this phrase was
> NP                           ;in the PPOBJS register.
> DET NIL
> DESC NIL
> HEAD
> N CHEMISTRY
> QUAL NIL
> QUES NO
> NUM SG
> PER 3
> CASE PPOBJ
> POSN 10
> ANP
> CHEMISTRY
> SEMREG                       ;also set by Sem. Processor
> FIELD DESCRIPTOR DSC
> PPPOSN NIL
> QUES NO
> NUM SG
> PER 3
> CASE OBJ
> POSN 8
> ANP
> GOOD OR EXCELLENT PROGRAM
> SEMREG
> RELATION PROGRAM
> IOBJ NIL
> PPOBJS NIL
>
>
> ;Resulting Sequel Query
>
> SELECT COMPUTER.CNO , COMPUTER.MAKE , COMPUTER.MODEL
> FROM COMPUTER , PROGRAM , DESCRIPTOR , PROGDESC
> WHERE
> (COMPUTER.CNO = PROGRAM.CNO)
> AND ((PROGRAM.PNO = PROGDESC.PNO) AND
> (PROGDESC.DNO = DESCRIPTOR.DNO))
> AND (COMPUTER.MAKE = 'RADIO SHACK')
> AND (PROGRAM.RATING > 3 OR PROGRAM.RATING = 5)
> AND (DESCRIPTOR.DSC = 'CHEMISTRY')

APPENDIX C - DATABASE SCHEMA FOR THE PROGRAM DATABASE

The Database Schema is defined by the DBA for each domain implemented in the system.
It consists of 3 components: the database definition, the verb definition, and the default join information.

Database Definition

(COMPUTER                              ; ** Computer relation **
  ((FIELDS       (CNO MAKE MODEL RAM))
   (KEY          (CNO))
   (IDENTIFIERS  (CNO MAKE MODEL))
   (CATEG        (PHYSICAL-OBJECT))
   (L-TYPE       (PRODUCT))
   (L-NOUN       (COMPUTER MICROCOMPUTER MICRO))
   (L-PREP       (ON WITH))
   (L-DESC       (MICRO PERSONAL))
   (COMMENT      (a type/brand of a computer))
  )
  (CNO
   (TYPE         (INTEGER))
   (CONTENTS     (INTEGER))
   (L-TYPE       (RNUMBER))
   (COMMENT      (a unique identifier for each computer))
  )
  (MAKE
   (TYPE         (CNAME))
   (CONTENTS     (INVERT NPR ?))
   (CATEG        (GROUP))
   (L-NOUN       (MAKE TYPE BRAND))
   (COMMENT      (the make/manufacturer of the computer))
  )
  (MODEL
   (TYPE         (CHAR 20))
   (CONTENTS     (INVERT NPR ?))
   (L-NOUN       (MODEL VERSION))
   (COMMENT      (the model of the computer))
  )
  (RAM
   (TYPE         (INTEGER))
   (CONTENTS     (INTEGER))
   (L-NOUN       (MEMORY KILOBYTE RAM))
   (L-ABR        ((KB KILOBYTE)))
   (COMMENT      (the standard amount of memory on the computer in kilobytes))
  )
)

(PROGRAM                               ; ** Program relation **
  ((FIELDS       (PNO PUBLISHER NAME VERSION CNO RATING COST))
   (KEY          (PNO))
   (IDENTIFIERS  (PNO PUBLISHER NAME))
   (CATEG        (PHYSICAL-OBJECT))
   (L-TYPE       (PRODUCT))
   (L-NOUN       (PROGRAM PACKAGE COURSEWARE SOFTWARE))
   (L-PREP       (WITH BY))
   (L-DESC       (COMPUTER EDUCATION (EDUCATIONAL ADJ *)))
   (COMMENT      (a computer program))
  )
  (PNO
   (TYPE         (INTEGER))
   (CONTENTS     (INTEGER))
   (L-TYPE       (RNUMBER))
   (L-DESC       (LIBRARY REFERENCE))
   (COMMENT      (a unique identifier for each program))
  )
  (PUBLISHER
   (TYPE         (CNAME))
   (CONTENTS     (INVERT NPR ?))
   (CATEG        (GROUP))
   (L-NOUN       (PUBLISHER PRODUCER))
   (COMMENT      (the name of the publishing company of the program))
  )
  (NAME
   (TYPE         (GNAME))
   (CONTENTS     (INVERT NPR ?))
   (COMMENT      (the name of the computer program))
  )
  (VERSION
   (TYPE         (REAL))
   (CONTENTS     (REAL))
   (L-NOUN       (EDITION VERSION))
   (L-DESC       (EDITION VERSION))
   (COMMENT      (the version of the computer program, if any))
  )
  (CNO
   (TYPE         (INTEGER))
   (CONTENTS     (KEY '"COMPUTER.CNO"))
   (COMMENT      (the computer number which the program runs on))
  )
  (RATING
   (TYPE         (RATING))
   (CONTENTS     (INTEGER (1 - 5)))
   (COMMENT      (the rating of a program))
  )
  (COST
   (TYPE         (MONEY))
   (CONTENTS     (MONEY))
   (L-TYPE       (COST))
   (L-TYPE       (RETAIL))
   (COMMENT      (the original cost of the program))
  )
)

(DESCRIPTOR                            ; ** Descriptor relation **
  ((FIELDS       (DNO DSC))
   (KEY          (DNO))
   (IDENTIFIERS  (DNO DSC))
   (CATEG        (ABSTRACT-OBJECT))
   (L-NOUN       (DESCRIPTOR))
   (L-PREP       (ON FOR))
   (COMMENT      (Descriptors which describe the program))
  )
  (DNO
   (TYPE         (INTEGER))
   (CONTENTS     (INTEGER))
   (COMMENT      (a unique identifier for each descriptor))
  )
  (DSC
   (TYPE         (GNAME))
   (CONTENTS     (INVERT N ?))
   (L-NOUN       (DESCRIPTOR))
   (L-ABR        ((MATH MATHEMATICS) (ED EDUCATION)))
   (COMMENT      (a set of words which describe computer programs))
  )
)

(PROGDESC                              ; ** Program-Descriptor relation (for join) **
  ((FIELDS       (PNO DNO))
   (CATEG        (JOIN))
   (COMMENT      (Joins PROGRAM & DESCRIPTOR relations))
  )
  (PNO
   (CONTENTS     ((KEY PROGRAM PNO)))
   (COMMENT      (a unique identifier for each program))
  )
  (DNO
   (CONTENTS     ((KEY DESCRIPTOR DNO)))
   (COMMENT      (a unique identifier for each descriptor))
  )
)

Verb Definition

(RUN                                   ; Note the 3 verb frames for this verb
  VFRAME (((SUBJ (RELATION COMPUTER))
           (DOBJ (RELATION PROGRAM)))
          ((SUBJ (RELATION PROGRAM))
           (MOD (FIELD PROGRAM RATING)))
          ((SUBJ (RELATION PROGRAM))
           (PP (*) (RELATION COMPUTER))))
  JOININFO ("COMPUTER.CNO = PROGRAM.CNO")
  JOINTYPE ((COMPUTER 1) (PROGRAM MANY))
)
(EXECUTE
  SYNONYM RUN
)
(TEACH
  VFRAME (((SUBJ (RELATION PROGRAM))
           (DOBJ (RELATION DESCRIPTOR))))
  JOINREL (PROGDESC)
  JOININFO ("PROGRAM.PNO = PROGDESC.PNO AND
             PROGDESC.DNO = DESCRIPTOR.DNO")
  JOINTYPE ((PROGRAM MANY) (DESCRIPTOR MANY))
)
(PRODUCE
  VFRAME (((SUBJ (FIELD COMPUTER MAKE))
           (DOBJ (RELATION COMPUTER)))
          ((SUBJ (FIELD PROGRAM PUBLISHER))
           (DOBJ (RELATION PROGRAM))))
  JOININFO NIL
  JOINTYPE NIL
)
(SELL
  VFRAME (((SUBJ (FIELD COMPUTER MAKE))
           (DOBJ (RELATION COMPUTER)))
          ((SUBJ (FIELD PROGRAM PUBLISHER))
           (DOBJ (RELATION PROGRAM))))
  JOININFO NIL
  JOINTYPE NIL
)

Default Joins and Association Factors

(DEFAULTJOIN 'COMPUTER 'PROGRAM                 ;Relation Names
             '3                                 ;Association Factor
             '((COMPUTER 1) (PROGRAM MANY))     ;Join Type
             '"(COMPUTER.CNO = PROGRAM.CNO)"    ;Join
             NIL)                               ;Secondary Relation (if any)

(DEFAULTJOIN 'DESCRIPTOR 'PROGRAM
             '2
             '((DESCRIPTOR MANY) (PROGRAM MANY))
             '"((PROGRAM.PNO = PROGDESC.PNO) AND
                (PROGDESC.DNO = DESCRIPTOR.DNO))"
             '(PROGDESC))

(DEFAULTJOIN 'DESCRIPTOR 'COMPUTER
             '4
             '((DESCRIPTOR MANY) (COMPUTER MANY))
             '"((COMPUTER.CNO = PROGRAM.CNO) AND
                (PROGRAM.PNO = PROGDESC.PNO) AND
                (PROGDESC.DNO = DESCRIPTOR.DNO))"
             '(PROGRAM PROGDESC))

APPENDIX D - PARTIAL DATABASE SCHEMA LIBRARY

The Database Schema Library contains linguistic
and semantic information that may be used in various domains. For instance, if a domain contains a relation which is in the category of physical objects, (CATEG PHYSICAL-OBJECT), then the needed information is read into the Active Domain Dictionary (the dictionary in core for a particular domain).

************ CATEG ****************

'((CATEG PHYSICAL-OBJECT)
  (L-NOUN (OBJECT THING ANYTHING))
  (L-DESC (PHYSICAL))
 )
((CATEG HUMAN)
 (L-NOUN (SOMEONE ANYONE PERSON HUMAN))
)

************ TYPE ****************

'((TYPE CNAME)
  (L-NOUN (COMPANY ORGANIZATION))
 )
((TYPE COLOUR)
 (L-NOUN (COLOUR SHADE))
 (MEM-OF ((BLUE) (GREEN) (RED) (YELLOW)))
)
((TYPE MONEY)
 (L-NOUN (VALUE DOLLAR MONEY))
)
((TYPE Q-MASS)
 (L-NOUN (WEIGHT MASS))
 (VERB WEIGH
  FEATURES (RELATION-VERB TRANS PASSIVE)
  VFRAME (((SUBJ (GENERAL *ANY*))
           (DOBJ (TYPE Q-MASS)))))
)
((TYPE RATING)
 (L-NOUN (RATING QUALITY))
 (MEM-OF ((EXCELLENT CODE-AS (* = 5))
          (GOOD      CODE-AS (* ">" 3))
          (AVERAGE   CODE-AS (* = 3))
          (FAIR      CODE-AS (* = 3))
          (POOR      CODE-AS (* [illegible] 3))
          (BAD       CODE-AS (* [illegible] 3))
          (TERRIBLE  CODE-AS (* [illegible] 1))))
)

************ L-TYPE ****************

((L-TYPE RNUMBER)
 (L-NOUN (NUMBER KEY IDENTIFIER))
 (L-DESC (REFERENCE INDEX))
)
((L-TYPE COST)
 (L-NOUN (VALUE PRICE))                ;ADD COST LATER
 (VERB COST
  FEATURES (RELATION-VERB TRANS)
  VFRAME (((SUBJ (GENERAL *ANY*))
           (DOBJ (L-TYPE COST)))))
 (MEM-OF ((EXPENSIVE CODE-AS (* ">" 100))
          (CHEAP     CODE-AS (* "<" 50))
          .
         ))
)

************ CROSS REFERENCE *********************

'(((CATEG PHYSICAL-OBJECT) (L-TYPE COST))
  (VERB COST
   VFRAME (((SUBJ (CATEG PHYSICAL-OBJECT))
            (DOBJ (L-TYPE COST))))
   JOININFO NIL
   JOINTYPE NIL))
'(((CATEG HUMAN) (L-TYPE SALARY))
  (VERB EARN
   VFRAME (((SUBJ (CATEG HUMAN))
            (DOBJ (L-TYPE SALARY))))))

APPENDIX E - GENERAL DICTIONARY

The General Dictionary contains words that are common to all database domains.
All of these words will be loaded into the Active Domain Dictionary.

; *** ADJECTIVES ***
(UNIQUE ADJ * SEM ((UNIQUE)))
(DISTINCT ADJ * SEM ((UNIQUE)))

; *** ABBREVIATIONS ***
(A INT *)
(B INT *)
(LTD ABR LIMITED DICT NIL)

; *** ADVERBS ***
(PLEASE ADV *)
(QUICKLY ADV *)

; *** DETERMINERS ***
(THE DET *)

; *** QUANTIFIERS ***
(ONE QTF * PRO (ONE (OBJ) (SUBJ) (NUMBER SG)) N S)
(ALL QTF *)
.

; *** CONJUNCTIONS ***
(AND CON *)
(OR CON *)
.

; *** VERBS ***
(BE V (BE (UNTENSED))
    FEATURES (REL-VERB COPULA TRANS INTRANS AUX)
    VFRAME (((SUBJ (GENERAL *ANY*))
             (PRED (GENERAL *ANY*)))))
(ARE V (BE (TNS PRESENT) (PNCODE X13SG)))
(IS V (BE (TNS PRESENT) (PNCODE 3SG)))
(BEEN V (BE (PASTPART)))
(DO V ES FEATURES (TRANS AUX))
(DID V (DO (TNS PAST)))
(DOES V (DO (TNS PRESENT) (PNCODE 3SG)))
(DOING V (DO (PRESPART)))
(DONE V (DO (PASTPART)))
(PRINT V S-ED
    FEATURES (COM-VERB TRANS INDOBJ PASSIVE)
    VFRAME (((SUBJ (GENERAL *YOU*))
             (DOBJ (GENERAL *ANY*))
             (IOBJ (GENERAL *ME*) OPT))))
(LIST V S-ED
    FEATURES (TRANS INDOBJ PASSIVE)
    SYNONYM PRINT)

; *** NOUNS ***
(INFORMATION N MASS SEM ((GENERAL *ANY*)))
(DESCRIPTION N S)

; *** QUESTION NOUNS ***
(WHO QWORD (WHO (NUMBER SG)) SEM ((CATEG GROUP) (CATEG HUMAN)))
(WHERE QWORD * SEM ((TYPE LOCATION)))

; *** QUESTION DETERMINERS ***
(WHAT QDET *)
(WHO QDET *)

; *** PREPOSITIONS ***
(AT PREP *)
(ON PREP *)

; *** RELATIVE PRONOUNS ***
(WHICH RELPRO *)
(THAT RELPRO *)

; *** PRONOUNS ***   DEFAULTS : PERSON 3
(YOU PRO (YOU (SUBJ) (OBJ) (NUMBER SG-PL) (PERSON 2)))
(ME PRO (I (OBJ) (NUMBER SG)))
(I PRO (I (NUMBER SG) (PERSON 1)))

; *** IGNORE WORDS ***   ; FOR NOISE ATN
(TELL GRB *)
(INFORM GRB *)
(WHETHER GRB *)

APPENDIX F - GLOBAL DICTIONARY

The Global Dictionary contains words which may be used in any domain or database (i.e. a complete dictionary). Only part of this information will be in main memory at one time.
However, the MORPH routine will search through it when an unknown word is found. Information from this file is used during the compile stage by the following sources of information:

- Database Schema
- Database Schema Library
- Inverted Index

(BLUE ADJ *)
(COLOUR N S)
(COMPANY N IES)
(COMPUTER N S)
(COST N S V S-ED
    FEATURES (REL-VERB [illegible]))
(EDUCATION N MASS)
(EDUCATIONAL ADJ *)
(GOOD ADJ * ADV *)
(PROGRAM N S)
(RUN V S-IRR
    FEATURES (REL-VERB TRANS PASSIVE INTRANS)
    IRR ((RAN V (RUN (TNS PAST)))
         (RUN V (RUN (PASTPART)))
         (RUNS V (RUN (PNCODE 3SG)))))
(SELL V S-IRR
    FEATURES (REL-VERB TRANS INDOBJ PASSIVE)
    IRR ((SOLD V (SELL (TNS PAST))))
    IRR ((SELLS V (SELL (TNS PRESENT) (PNCODE 3SG)))))
(SOFTWARE N MASS)
(STEVE NPR * SEM ((FNAME)))
(WEIGHT N MASS)
(WEIGH V S-ED
    FEATURES (REL-VERB TRANS PASSIVE))
(RUN V S-IRR
    FEATURES (REL-VERB TRANS PASSIVE INTRANS)
    IRR ((RAN V (RUN (TNS PAST)))
         (RUN V (RUN (PASTPART)))
         (RUNS V (RUN (PNCODE 3SG)))))

APPENDIX G - PARTIAL ACTIVE DOMAIN DICTIONARY FOR THE PROGRAM DATABASE

Listed below are some sample definitions of words which are used in the Program database. Note that these words are compiled from the General Dictionary, the Database Schema, the Database Schema Library, the Inverted Index, and the Global Dictionary.

*A
> DET *                  ;Determiner
> INT *                  ;Initial
>
*AND
> NPR ?                  ;part of a compound NPR
> PART-OF ((JOHN WILSON AND SONS LIMITED))
> CON *                  ;Conjunction
>
*CATEG                   ;Cross references for Categories
> SEM ((PHYSICAL-OBJECT ((RELATION COMPUTER)
>                        (RELATION PROGRAM)))
>      (GROUP ((FIELD COMPUTER MAKE)
>              (FIELD PROGRAM PUBLISHER)))
>      (ABSTRACT-OBJECT ((RELATION DESCRIPTOR)))
>      (JOIN ((RELATION PROGDESC))))
>
*COMPUTER                ;note the 3 semantic definitions
> N S
> L-DESC ((RELATION PROGRAM))
> NPR ?
> PART-OF ((COLOR COMPUTER))
> L-NOUN ((RELATION COMPUTER))
>
*DOES
> V (DO (TNS PRESENT) (PNCODE 3SG))
>
*EXPENSIVE               ;Coded element of the field
> ADJ *                  ; PROGRAM COST
> MEM-OF ((L-TYPE COST))
> CODE-AS (* > 100)
>
*GOOD                    ;Coded element of the field
> ADV *                  ; PROGRAM RATING
> ADJ *
> MEM-OF ((TYPE RATING))
> CODE-AS (* > 3)
>
*LOGO                    ;Database Element
> NPR [illegible]
> ELM-OF ((FIELD PROGRAM NAME))
>
*ME
> PRO (I (OBJ) (NUMBER SG))
>
*NUMBER                  ;Cross reference for Numbers
> L-NOUN ((L-TYPE RNUMBER) (TYPE REAL) (TYPE INTEGER))
> FEATURE (TRANS PASSIVE)
> V S-ED
> N S
>
*PLEASE
> ADV *
>
*PRINT
> VFRAME (((SUBJ (GENERAL *YOU*))
>          (DOBJ (GENERAL *ANY*))
>          (IOBJ (GENERAL *ME*) OPT)))
> FEATURES (COM-VERB TRANS INDOBJ PASSIVE)
> V S-ED
>
*PROGRAM
> N S
> L-NOUN ((RELATION PROGRAM))
>
*RUN
> FEATURES (REL-VERB TRANS PASSIVE INTRANS)
> V (RUN (PASTPART))
> JOINTYPE ((COMPUTER 1) (PROGRAM MANY))
> JOININFO (COMPUTER.CNO = PROGRAM.CNO)
> VFRAME                 ;note 3 verb frames for this verb
>   (((SUBJ (RELATION COMPUTER))
>     (DOBJ (RELATION PROGRAM)))
>    ((SUBJ (RELATION PROGRAM))
>     (MOD (FIELD PROGRAM RATING)))
>    ((SUBJ (RELATION PROGRAM))
>     (PP (*) (RELATION COMPUTER))))
>
*WHAT
> QDET *                 ;see CATEG for possible
> SEM ((CATEG PHYSICAL-OBJECT))   ;interpretations in this domain
> QWORD (WHAT (NUMBER SG))
>
*WILSON                  ;part of a compound proper noun
> NPR [illegible]
> PART-OF '((JOHN WILSON AND SONS LIMITED))
>
*WITH                    ;commonly used with the following:
> L-PREP ((RELATION PROGRAM) (RELATION COMPUTER))
> PREP [illegible]

APPENDIX H - SAMPLE PROGRAM DATABASE SESSION

This is a sample session of the system using the Program database. The queries demonstrate some of the features of the system. Lines which start with an asterisk (*) represent the user's input, while lines which start with a greater-than sign (>) represent output generated by the system.
Comments added in the session are identified by a semi-colon (;). The data in the Program database has been made up in order to demonstrate the ability of the system and is by no means accurate. For some queries, the data retrieved by the system has been deleted to save space.

>
> Which database would you like to use?
> 1. The PROGRAM database
> 2. The SUPPLIER/PART/JOB database
> Enter a number between 1 and 2 and press Return
*1
> Welcome to the PROGRAM database.
> To stop type 'STOP!' or 'QUIT!'.
>
> Please enter your question.
*<Q1> PRINT THE COLLOID CHEMISTRY PROGRAMS!
>
> Unknown word - COLLOID
> What would you like to do?
>
> 1. Spelling Error! Enter the correct spelling.
> 2. Abbreviation! Enter the full word.
> 3. Synonym! Enter the replacement word.
> 4. Unknown data element! Continue processing.
> 5. Ignore the word! Continue processing.
> 6. Cancel this query.
>
> Enter a number between 1 and 6 and press Return
*5
> SELECT PROGRAM.PNO , PROGRAM.PUBLISHER , PROGRAM.NAME
> FROM PROGRAM , DESCRIPTOR , PROGDESC
> WHERE
> ((PROGRAM.PNO = PROGDESC.PNO) AND
> (PROGDESC.DNO = DESCRIPTOR.DNO))
> AND (DESCRIPTOR.DSC = 'CHEMISTRY')
>
> PNO  PUBLISHER                     NAME
>
> 11   JOHN BROWNS SOFTWARE LIMITED  THE ATOM
> 13   EDUCOACH                      ATOMIC PARTICLES
>
>
> Please enter your question.
*<Q2> DOES LOTUS RUN ON THE IBM PC?
>
> Unknown word - LOTUS
> What would you like to do?
>
> 1. Spelling Error! Enter the correct spelling.
> 2. Abbreviation! Enter the full word.
> 3. Synonym! Enter the replacement word.
> 4. Unknown data element! Continue processing.
> 5. Ignore the word! Continue processing.
> 6. Cancel this query.
>
> Enter a number between 1 and 6 and press Return
*4
> Which category does LOTUS belong to?
>
>    1. a Common Noun (car, computer, pen)
>    2. a Proper Noun (Joe, Smith, Canada)
>    3. an Adjective (purple, good, big)
>
> Enter a number between 1 and 3 and press Return
*2
> NOTE - (LOTUS) is not (a computer program)
> contained in this database!
> The above query cannot be answered.
>
>
> Please enter your question.
*<Q3> WHAT COMPUTERS RUN LOTUS?
>
> NOTE - (LOTUS) is not (a computer program)
> contained in this database!
> The above query cannot be answered.
>
> ; Note that LOTUS is now defined in the system.
>
>
> Please enter your question.
*<Q4> WHAT CHEMISTRY PROGRAMS ARE RUN BY THE APPLE II?
> SELECT PROGRAM.PNO , PROGRAM.PUBLISHER , PROGRAM.NAME
> FROM COMPUTER , PROGRAM , DESCRIPTOR , PROGDESC
> WHERE
>    (COMPUTER.CNO = PROGRAM.CNO)
>    AND ((PROGRAM.PNO = PROGDESC.PNO) AND
>    (PROGDESC.DNO = DESCRIPTOR.DNO))
>    AND (COMPUTER.MODEL = 'II')
>    AND (COMPUTER.MAKE = 'APPLE')
>    AND (DESCRIPTOR.DSC = 'CHEMISTRY')
>
> PNO PUBLISHER   NAME
>
>  13 EDUCOACH    ATOMIC PARTICLES
>
>
> Please enter your question.
*<Q5> WHAT APPLE COMPUTERS RUN PROGRAMS WHICH TEACH CHEMISTRY?
> SELECT COMPUTER.CNO , COMPUTER.MAKE , COMPUTER.MODEL
> FROM COMPUTER , PROGRAM
> WHERE
>    (COMPUTER.CNO = PROGRAM.CNO)
>    AND (COMPUTER.MAKE = 'APPLE')
>    AND (PROGRAM.PNO IN
>       (SELECT PROGRAM.PNO
>        FROM PROGRAM , DESCRIPTOR , PROGDESC
>        WHERE
>           (PROGRAM.PNO = PROGDESC.PNO AND
>           PROGDESC.DNO = DESCRIPTOR.DNO)
>           AND (DESCRIPTOR.DSC = 'CHEMISTRY')))
>
> CNO MAKE    MODEL
>
>   1 APPLE   II
>
>
> Please enter your question.
*<Q6> WHAT COMPUTERS RUN LOGO WITH 64 KB?
> (*** Abbreviation/Synonym *** KB --> KILOBYTE)
> SELECT COMPUTER.CNO , COMPUTER.MAKE , COMPUTER.MODEL
> FROM COMPUTER , PROGRAM
> WHERE
>    (COMPUTER.CNO = PROGRAM.CNO)
>    AND (COMPUTER.RAM = 64)
>    AND (PROGRAM.NAME = 'LOGO')
>
> CNO MAKE    MODEL
>
>   1 APPLE   II
>
>
> Please enter your question.
*<Q7> WHAT COMPUTERS WITH 64 KB RUN LOGO?
> (*** Abbreviation/Synonym *** KB --> KILOBYTE)
> SELECT COMPUTER.CNO , COMPUTER.MAKE , COMPUTER.MODEL
> FROM COMPUTER , PROGRAM
> WHERE
>    (COMPUTER.CNO = PROGRAM.CNO)
>    AND (COMPUTER.RAM = 64)
>    AND (PROGRAM.NAME = 'LOGO')
>
> CNO MAKE    MODEL
>
>   1 APPLE   II
>
>
> Please enter your question.
*<Q8> SHOW ME THE GOOD CHEMISTRY PROGRAMS!
> SELECT PROGRAM.PNO , PROGRAM.PUBLISHER , PROGRAM.NAME
> FROM PROGRAM , DESCRIPTOR , PROGDESC
> WHERE
>    ((PROGRAM.PNO = PROGDESC.PNO) AND
>    (PROGDESC.DNO = DESCRIPTOR.DNO))
>    AND (PROGRAM.RATING > 3)
>    AND (DESCRIPTOR.DSC = 'CHEMISTRY')
>
> no records selected
>
> ; This means that no records in the database
> ; satisfy the above condition.
>
>
> Please enter your question.
*<Q9> HOW MUCH MEMORY DOES THE APPLE HAVE?
> Is it OK to assume that (THE APPLE) refers to (the
> make of the computer)?
*OK
>
> ; Note that APPLE is also a program publisher
>
> SELECT COMPUTER.RAM
> FROM COMPUTER
> WHERE
>    (COMPUTER.MAKE = 'APPLE')
>
> RAM
>
>  64
> 128
>
>
> Please enter your question.
*<Q10> WHAT PROGRAMS RUN ON THE IBM PC?
> SELECT PROGRAM.PNO , PROGRAM.PUBLISHER , PROGRAM.NAME
> FROM PROGRAM , COMPUTER
> WHERE
>    (COMPUTER.CNO = PROGRAM.CNO)
>    AND (COMPUTER.MODEL = 'PC')
>    AND (COMPUTER.MAKE = 'IBM')
>
> PNO PUBLISHER   NAME
>
>   2 VISICORP    VISICALC
>   6 MICROSOFT   WORD
>
>
> Please enter your question.
*<Q11> WHAT PROGRAMS RUN ON THE IBM PC FROM MICROSOFT?
>
> ; Note that "MICROSOFT"
> ; modifies "PROGRAMS"
>
> SELECT PROGRAM.PNO , PROGRAM.PUBLISHER , PROGRAM.NAME
> FROM PROGRAM , COMPUTER
> WHERE
>    (COMPUTER.CNO = PROGRAM.CNO)
>    AND (PROGRAM.PUBLISHER = 'MICROSOFT')
>    AND (COMPUTER.MODEL = 'PC')
>    AND (COMPUTER.MAKE = 'IBM')
>
> PNO PUBLISHER   NAME
>
>   6 MICROSOFT   WORD
>
>
> Please enter your question.
*<Q12> WHAT COMPUTERS RUN GOOD PROGRAMS ON CHEMISTRY?
> SELECT COMPUTER.CNO , COMPUTER.MAKE , COMPUTER.MODEL
> FROM COMPUTER , PROGRAM , DESCRIPTOR , PROGDESC
> WHERE
>    (COMPUTER.CNO = PROGRAM.CNO)
>    AND ((PROGRAM.PNO = PROGDESC.PNO) AND
>    (PROGDESC.DNO = DESCRIPTOR.DNO))
>    AND (PROGRAM.RATING > 3)
>    AND (DESCRIPTOR.DSC = 'CHEMISTRY')
>
> no records selected
>
>
> Please enter your question.
*<Q13> PRINT THE PROGRAMS QUICKLY!
> Is it OK to ignore the word/phrase (QUICKLY)?
*OK
> SELECT PROGRAM.PNO , PROGRAM.PUBLISHER , PROGRAM.NAME
> FROM PROGRAM
>
> PNO PUBLISHER                           NAME
>
>   1 VISICORP                            VISICALC
>   2 VISICORP                            VISICALC
>   3 APPLE                               APPLEWORKS
>   4 MICROSOFT                           TYPING TUTOR
>   5 MICROSOFT                           WORD
>   6 MICROSOFT                           WORD
>   7 TERRAPIN                            LOGO
>   8 MILLIKEN PUBLISHING COMPANY         ADDITION SEQUENCES
>   9 EDUCOACH                            CALCULUS TUTOR
>  10 JOHN WILSON AND SONS LIMITED        GEOMETRY TUTOR
>  11 JOHN BROWNS SOFTWARE LIMITED        THE ATOM
>  12 JOHN W SOFTWARE OF CANADA LIMITED   CALCULUS COACH
>  13 EDUCOACH                            ATOMIC PARTICLES
>  14 EDUCOACH                            FRENCH TEACHER
> 14 records selected.
>
>
> Please enter your question.
*<Q14> PRINT THE EFFICIENT IBM SPREADSHEET PROGRAMS!
>
> Unknown word - EFFICIENT
> What would you like to do?
>
>    1. Spelling Error! Enter the correct spelling.
>    2. Abbreviation! Enter the full word.
>    3. Synonym! Enter the replacement word.
>    4. Unknown data element! Continue processing.
>    5. Ignore the word! Continue processing.
>    6. Cancel this query.
>
> Enter a number between 1 and 6 and press Return
*5
> SELECT PROGRAM.PNO , PROGRAM.PUBLISHER , PROGRAM.NAME
> FROM PROGRAM , COMPUTER , DESCRIPTOR , PROGDESC
> WHERE
>    (COMPUTER.CNO = PROGRAM.CNO)
>    AND ((PROGRAM.PNO = PROGDESC.PNO) AND
>    (PROGDESC.DNO = DESCRIPTOR.DNO))
>    AND (COMPUTER.MAKE = 'IBM')
>    AND (DESCRIPTOR.DSC = 'SPREADSHEET')
>
> PNO PUBLISHER   NAME
>
>   2 VISICORP    VISICALC
>
>
> Please enter your question.
*<Q15> PRINT THE GOOD AND EXCELLENT PROGRAMS!
> SELECT PROGRAM.PNO , PROGRAM.PUBLISHER , PROGRAM.NAME
> FROM PROGRAM
> WHERE
>    (PROGRAM.RATING > 3 OR PROGRAM.RATING = 5)
>
> ; Note that the definitions of "good" and
> ; "excellent" overlap. This is caused by the
> ; way they are defined in the Database Schema.
>
> PNO PUBLISHER                     NAME
>
>   1 VISICORP                      VISICALC
>   2 VISICORP                      VISICALC
>   3 APPLE                         APPLEWORKS
>   5 MICROSOFT                     WORD
>   6 MICROSOFT                     WORD
>   7 TERRAPIN                      LOGO
>   8 MILLIKEN PUBLISHING COMPANY   ADDITION SEQUENCES
>   9 EDUCOACH                      CALCULUS TUTOR
> 8 records selected.
>
>
> Please enter your question.
*<Q16> WHAT APPLE OR IBM COMPUTERS RUN LOGO?
> SELECT COMPUTER.CNO , COMPUTER.MAKE , COMPUTER.MODEL
> FROM COMPUTER , PROGRAM
> WHERE
>    (COMPUTER.CNO = PROGRAM.CNO)
>    AND (COMPUTER.MAKE = 'APPLE' OR COMPUTER.MAKE = 'IBM')
>    AND (PROGRAM.NAME = 'LOGO')
>
> CNO MAKE    MODEL
>
>   1 APPLE   II
>
>
> Please enter your question.
*<Q17> WHAT IBM AND APPLE COMPUTERS RUN LOGO?
>
> ; Note that the system interprets the "AND" as
> ; an "OR". This is because one computer can only
> ; have one make associated with it (one-to-one).
>
> SELECT COMPUTER.CNO , COMPUTER.MAKE , COMPUTER.MODEL
> FROM COMPUTER , PROGRAM
> WHERE
>    (COMPUTER.CNO = PROGRAM.CNO)
>    AND (COMPUTER.MAKE = 'IBM' OR COMPUTER.MAKE = 'APPLE')
>    AND (PROGRAM.NAME = 'LOGO')
>
> CNO MAKE    MODEL
>
>   1 APPLE   II
>
>
> Please enter your question.
*<Q18> WHAT FRENCH AND CHEMISTRY PROGRAMS RUN ON THE APPLE?
> ERROR - This type of conjunction is not implemented!
> (see Multiple Query)
> Error occurred in: (WHAT FRENCH AND CHEMISTRY PROGRAMS)
> The above query cannot be answered.
>
> ; This is caused by the "MANY-MANY" relationship
> ; between PROGRAMS and DESCRIPTORS (see section 6.2.3.2.3).
>
>
> Please enter your question.
*<Q19> WHAT PHYSICS, MATH OR BUSINESS PROGRAMS RUN ON THE
* APPLE II?
> (*** Abbreviation/Synonym *** MATH --> MATHEMATICS)
> SELECT PROGRAM.PNO , PROGRAM.PUBLISHER , PROGRAM.NAME
> FROM PROGRAM , DESCRIPTOR , PROGDESC , COMPUTER
> WHERE
>    (COMPUTER.CNO = PROGRAM.CNO)
>    AND ((PROGRAM.PNO = PROGDESC.PNO) AND
>    (PROGDESC.DNO = DESCRIPTOR.DNO))
>    AND (DESCRIPTOR.DSC = 'PHYSICS' OR
>    DESCRIPTOR.DSC = 'MATHEMATIC' OR
>    DESCRIPTOR.DSC = 'BUSINESS')
>    AND (COMPUTER.MODEL = 'II')
>    AND (COMPUTER.MAKE = 'APPLE')
>
> PNO PUBLISHER                           NAME
>
>   7 TERRAPIN                            LOGO
>   8 MILLIKEN PUBLISHING COMPANY         ADDITION SEQUENCES
>   9 EDUCOACH                            CALCULUS TUTOR
>  12 JOHN W SOFTWARE OF CANADA LIMITED   CALCULUS COACH
>  13 EDUCOACH                            ATOMIC PARTICLES
>
>
> Please enter your question.
*<Q20> WHO MAKES LOGO?
> Is it OK to assume that (WHO) refers to (the name of the
> publishing company of the program)?
*OK
> SELECT PROGRAM.PUBLISHER
> FROM PROGRAM
> WHERE
>    (PROGRAM.NAME = 'LOGO')
>
> PUBLISHER
>
> TERRAPIN
>
>
> Please enter your question.
*<Q21> WHO MAKES THE PC?
> Is it OK to assume that (WHO) refers to (the make of the
> computer)?
*OK
> SELECT COMPUTER.MAKE
> FROM COMPUTER
> WHERE
>    (COMPUTER.MODEL = 'PC')
>
> MAKE
>
> IBM
>
>
> Please enter your question.
*<Q22> HOW MUCH IS LOGO FOR THE APPLE II?
> SELECT PROGRAM.COST
> FROM PROGRAM , COMPUTER
> WHERE
>    (COMPUTER.CNO = PROGRAM.CNO)
>    AND (PROGRAM.NAME = 'LOGO')
>    AND (COMPUTER.MODEL = 'II')
>    AND (COMPUTER.MAKE = 'APPLE')
>
> COST
>
>  149
>
>
> Please enter your question.
*<Q23> HOW MUCH DOES LOGO COST?
> SELECT PROGRAM.COST
> FROM PROGRAM
> WHERE
>    (PROGRAM.NAME = 'LOGO')
>
> COST
>
>  149
>
>
> Please enter your question.
*<Q24> IS LOGO A GOOD PROGRAM?
> SELECT PROGRAM.NAME
> FROM PROGRAM
> WHERE
>    (PROGRAM.NAME = 'LOGO')
>    AND (PROGRAM.RATING > 3)
>
> YES. There is 1 record which satisfies the given criteria.
> Would you like to see it?
*N
>
> ; The user is always given the option of seeing the
> ; data selected by the system for "YES/NO" queries.
> ; This is useful if more than 1 database element
> ; satisfies the user's query.
>
>
> Please enter your question.
*<Q25> TELL ME IF LOGO IS A GOOD PROGRAM!
> SELECT PROGRAM.NAME
> FROM PROGRAM
> WHERE
>    (PROGRAM.NAME = 'LOGO')
>    AND (PROGRAM.RATING > 3)
>
> YES. There is 1 record which satisfies the given criteria.
> Would you like to see it?
*N
>
>
> Please enter your question.
*<Q26> IS LOGO GOOD?
> SELECT PROGRAM.NAME
> FROM PROGRAM
> WHERE
>    (PROGRAM.NAME = 'LOGO')
>    AND (PROGRAM.RATING > 3)
>
> YES. There is 1 record which satisfies the given criteria.
> Would you like to see it?
*N
>
>
> Please enter your question.
*<Q27> HOW MUCH MEMORY DOES THE PC HAVE?
> SELECT COMPUTER.RAM
> FROM COMPUTER
> WHERE
>    (COMPUTER.MODEL = 'PC')
>
> RAM
>
> 256
>
>
> Please enter your question.
*<Q28> WHAT PROGRAMS DO WE HAVE FOR THE PC?
> SELECT PROGRAM.PNO , PROGRAM.PUBLISHER , PROGRAM.NAME
> FROM PROGRAM , COMPUTER
> WHERE
>    (COMPUTER.CNO = PROGRAM.CNO)
>    AND (COMPUTER.MODEL = 'PC')
>
> PNO PUBLISHER   NAME
>
>   2 VISICORP    VISICALC
>   6 MICROSOFT   WORD
>
>
> Please enter your question.
*<Q29> WHAT PROGRAMS HAVE GOOD RATINGS?
> SELECT PROGRAM.PNO , PROGRAM.PUBLISHER , PROGRAM.NAME
> FROM PROGRAM
> WHERE
>    (PROGRAM.RATING > 3)
>
> PNO PUBLISHER                     NAME
>
>   1 VISICORP                      VISICALC
>   2 VISICORP                      VISICALC
>   3 APPLE                         APPLEWORKS
>   5 MICROSOFT                     WORD
>   6 MICROSOFT                     WORD
>   7 TERRAPIN                      LOGO
>   8 MILLIKEN PUBLISHING COMPANY   ADDITION SEQUENCES
>   9 EDUCOACH                      CALCULUS TUTOR
> 8 records selected.
>
>
> Please enter your question.
*<Q30> WHAT TEACHES FRENCH?
> SELECT PROGRAM.PNO , PROGRAM.PUBLISHER , PROGRAM.NAME
> FROM PROGRAM , DESCRIPTOR , PROGDESC
> WHERE
>    ((PROGRAM.PNO = PROGDESC.PNO) AND
>    (PROGDESC.DNO = DESCRIPTOR.DNO))
>    AND (DESCRIPTOR.DSC = 'FRENCH')
>
> PNO PUBLISHER   NAME
>
>  14 EDUCOACH    FRENCH TEACHER
>
>
> Please enter your question.
*<Q31> WHO TEACHES FRENCH?
> Unable to match the query to a Verb Frame!
> Your query cannot be answered.
>
> ; This query cannot be answered since "WHO" will not match
> ; the "RELATION PROGRAM" (it is defined as a PHYSICAL
> ; OBJECT and not a HUMAN or GROUP). This query could be
> ; answered by enhancing the verb matching routine to
> ; match verb frames to phrases which are not semantically
> ; correct. In this case the system could form a prompt
> ; like:
> ;    Is it OK to assume that "WHO" refers to
> ;    "a computer program"?
>
>
> Please enter your question.
*<Q32> COUNT THE MICROSOFT PROGRAMS!
> SELECT COUNT (PROGRAM.PNO)
> FROM PROGRAM
> WHERE
>    (PROGRAM.PUBLISHER = 'MICROSOFT')
>
> COUNT(PROGRAM.PNO)
>
>
> Please enter your question.
*<Q33> HOW MANY MICROSOFT PROGRAMS RUN ON APPLE COMPUTERS?
> SELECT COUNT (PROGRAM.PNO)
> FROM PROGRAM , COMPUTER
> WHERE
>    (COMPUTER.CNO = PROGRAM.CNO)
>    AND (PROGRAM.PUBLISHER = 'MICROSOFT')
>    AND (COMPUTER.MAKE = 'APPLE')
>
> COUNT(PROGRAM.PNO)
>
>
> Please enter your question.
*<Q34> PRINT THE J WILSON AND SONS AND J W SOFTWARE OF CANADA LTD
* COMPUTER PROGRAMS!
> (*** Abbreviation/Synonym *** LTD --> LIMITED)
> Assume (JOHN WILSON AND SONS LIMITED) for (J WILSON
> AND SONS)?
*OK
> SELECT PROGRAM.PNO , PROGRAM.PUBLISHER , PROGRAM.NAME
> FROM PROGRAM
> WHERE
>    (PROGRAM.PUBLISHER = 'JOHN WILSON AND SONS LIMITED' OR
>    PROGRAM.PUBLISHER = 'JOHN W SOFTWARE OF CANADA LIMITED')
>
> PNO PUBLISHER                           NAME
>
>  10 JOHN WILSON AND SONS LIMITED        GEOMETRY TUTOR
>  12 JOHN W SOFTWARE OF CANADA LIMITED   CALCULUS COACH
>
>
> Please enter your question.
*<Q35> WHAT APPLE COMPUTERS RUN APPLE COMPUTER PRORAMS?
>
> ; Note that both "APPLE" and "COMPUTER" are used twice
> ; in this query. Both words have two different semantic
> ; meanings here.
>
> (*** Spelling Mistake *** PRORAMS --> PROGRAM)
> SELECT COMPUTER.CNO , COMPUTER.MAKE , COMPUTER.MODEL
> FROM COMPUTER , PROGRAM
> WHERE
>    (COMPUTER.CNO = PROGRAM.CNO)
>    AND (COMPUTER.MAKE = 'APPLE')
>    AND (PROGRAM.PUBLISHER = 'APPLE')
>
> CNO MAKE    MODEL
>
>   1 APPLE   II
>
>
> Please enter your question.
*<Q36> COULD YOU PLEASE PRINT THE UNIQUE PROGRAM PUBLISHERS!
> SELECT DISTINCT PROGRAM.PUBLISHER
> FROM PROGRAM
>
> PUBLISHER
>
> VISICORP
> APPLE
> MICROSOFT
> TERRAPIN
> MILLIKEN PUBLISHING COMPANY
> EDUCOACH
> JOHN WILSON AND SONS LIMITED
> JOHN BROWNS SOFTWARE LIMITED
> JOHN W SOFTWARE OF CANADA LIMITED
> 9 records selected.
>
>
> Please enter your question.
*<Q37> WHAT SHACK OR IBM COMPUTER PROGRAMS TEACH MATH?
> Assume (RADIO SHACK) for (SHACK)?
*Y
> (*** Abbreviation/Synonym *** MATH --> MATHEMATICS)
> SELECT PROGRAM.PNO , PROGRAM.PUBLISHER , PROGRAM.NAME
> FROM PROGRAM , COMPUTER , DESCRIPTOR , PROGDESC
> WHERE
>    (PROGRAM.PNO = PROGDESC.PNO AND
>    PROGDESC.DNO = DESCRIPTOR.DNO)
>    AND (COMPUTER.CNO = PROGRAM.CNO)
>    AND (COMPUTER.MAKE = 'RADIO SHACK' OR
>    COMPUTER.MAKE = 'IBM')
>    AND (DESCRIPTOR.DSC = 'MATHEMATIC')
>
> PNO PUBLISHER                      NAME
>
>  10 JOHN WILSON AND SONS LIMITED   GEOMETRY TUTOR
>
>
> Please enter your question.
*<Q38> PRINT THE COMPUTERS WHICH RUN SOFTWARE FROM
* MICROSOFT WITH 128 KB!
> (*** Abbreviation/Synonym *** KB --> KILOBYTE)
> SELECT COMPUTER.CNO , COMPUTER.MAKE , COMPUTER.MODEL
> FROM COMPUTER
> WHERE
>    (COMPUTER.RAM = 128)
>    AND (COMPUTER.CNO IN
>       (SELECT COMPUTER.CNO
>        FROM COMPUTER , PROGRAM
>        WHERE
>           (COMPUTER.CNO = PROGRAM.CNO)
>           AND (PROGRAM.PUBLISHER = 'MICROSOFT')))
>
> CNO MAKE    MODEL
>
>   2 APPLE   MACINTOSH
>
>
> Please enter your question.
*<Q39> WHAT PROGRAMS DOES JOHN PRODUCE?
> Assume (JOHN W SOFTWARE OF CANADA LIMITED) for (JOHN)?
*N
> Assume (JOHN BROWNS SOFTWARE LIMITED) for (JOHN)?
*OK
> SELECT PROGRAM.PNO , PROGRAM.PUBLISHER , PROGRAM.NAME
> FROM PROGRAM
> WHERE
>    (PROGRAM.PUBLISHER = 'JOHN BROWNS SOFTWARE LIMITED')
>
> PNO PUBLISHER                      NAME
>
>  11 JOHN BROWNS SOFTWARE LIMITED   THE ATOM
>
>
> Please enter your question.
*<Q40> WHAT COMPANIES PRODUCE EXCELLENT SOFTWARE?
> SELECT PROGRAM.PUBLISHER
> FROM PROGRAM
> WHERE
>    (PROGRAM.RATING = 5)
>
> PUBLISHER
>
> APPLE
> TERRAPIN
>
>
> Please enter your question.
*<Q41> DOES JOHN LIMITED SELL GOOD SOFTWARE?
> Assume (JOHN W SOFTWARE OF CANADA LIMITED) for
> (JOHN LIMITED)?
*N
> Assume (JOHN BROWNS SOFTWARE LIMITED) for
> (JOHN LIMITED)?
*N
> Assume (JOHN WILSON AND SONS LIMITED) for
> (JOHN LIMITED)?
*OK
> SELECT PROGRAM.PUBLISHER
> FROM PROGRAM
> WHERE
>    (PROGRAM.PUBLISHER = 'JOHN WILSON AND SONS LIMITED')
>    AND (PROGRAM.RATING > 3)
>
> I don't know. There are no records in the database which
> satisfy the given criteria.
>
>
> Please enter your question.
*<Q42> DOES APPLE SELL FOOD?
> Unknown word - FOOD
> What would you like to do?
>
>    1. Synonym! Enter the replacement word.
>    2. Unknown data element! Continue processing.
>    3. Ignore the word! Continue processing.
>    4. Cancel this query.
>
> Enter a number between 1 and 4 and press Return
*2
> NOTE - (FOOD) is not (a type/brand of a computer)
> contained in this database!
> The above query cannot be answered.
>
>
> Please enter your question.
*<Q43> DOES STEVE SMITH SELL GOOD PROGRAMS?
> NOTE - (STEVE SMITH) is not (the name of the
> publishing company of the program) contained in this
> database!
> The above query cannot be answered.
>
>
> Please enter your question.
*<Q44> DOES JOHN VAVRIK SELL GOOD PROGRAMS?
> Unknown word - VAVRIK
> What would you like to do?
>
>    1. Spelling Error! Enter the correct spelling.
>    2. Abbreviation! Enter the full word.
>    3. Synonym! Enter the replacement word.
>    4. Unknown data element! Continue processing.
>    5. Ignore the word! Continue processing.
>    6. Cancel this query.
>
> Enter a number between 1 and 6 and press Return
*4
> Which category does VAVRIK belong to?
>
>    1. a Common Noun (car, computer, pen)
>    2. a Proper Noun (Joe, Smith, Canada)
>    3. an Adjective (purple, good, big)
>
> Enter a number between 1 and 3 and press Return
*2
> NOTE - (JOHN VAVRIK) is not (the name of the
> publishing company of the program) contained in this
> database!
> The above query cannot be answered.
>
>
> Please enter your question.
*<Q45> PRINT THE COMPUTER NUMBERS!
> What does (THE COMPUTER NUMBERS) best refer to?
>
>    1. (a unique identifier for each computer)
>    2. (the standard amount of memory on the computer
>       in kilobytes)
>    3. None of the above
>
> Enter a number between 1 and 3 and press Return
*1
> SELECT COMPUTER.CNO
> FROM COMPUTER
>
> ; Note that the retrieved data has been omitted for
> ; this example.
>
> Enter a number between 1 and 4 and press Return
>
> Please enter your question.
*<Q46> PRINT THE VERSION NUMBERS!
> SELECT PROGRAM.VERSION
> FROM PROGRAM
>
> ; Note that the retrieved data has been omitted for
> ; this example.
>
>
> Please enter your question.
*<Q47> PRINT THE NUMBERS!
> What does (THE NUMBERS) best refer to?
>
>    1. (a unique identifier for each computer)
>    2. (a unique identifier for each program)
>    3. (the version of the computer program if any)
>    4. (the standard amount of memory on the computer
>       in kilobytes)
>    5. (the computer number which the program runs on)
>    6. (a unique identifier for each descriptor)
>    7. None of the above
>
> Enter a number between 1 and 7 and press Return
*1
> SELECT COMPUTER.CNO
> FROM COMPUTER
>
> ; Note that the retrieved data has been omitted for
> ; this example.

APPENDIX I - DATABASE SCHEMA FOR THE SUPPLIER DATABASE

The following Database Schema is defined for the Supplier/Part/Job database.

Database Definition

((S                            ; ** Supplier Relation **
   (
    (FIELDS       (SNO SNAME STATUS CITY))
    (KEY          (SNO))
    (IDENTIFIERS  (SNO SNAME CITY))
    (CATEG        (HUMAN))
    (L-NOUN       (SUPPLIER))
    (L-PREP       (FROM))
    (L-DESC       (PART))
    (COMMENT      (a supplier of a part used in a job))
   )
   (SNO
    (TYPE         (GNAME))
    (CONTENTS     (INVERT NPR ?))
    (L-TYPE       (RNUMBER))
    (COMMENT      (A unique identifier for each supplier))
   )
   (SNAME
    (TYPE         (GNAME))
    (CONTENTS     (INVERT NPR ?))
    (COMMENT      (the name of a supplier))
   )
   (STATUS
    (TYPE         (INTEGER))
    (CONTENTS     (INTEGER))
    (L-NOUN       (STATUS))
    (COMMENT      (the status of the supplier))
   )
   (CITY
    (TYPE         (LOCATION))
    (CONTENTS     (INVERT NPR ?))
    (L-TYPE       (CITY))
    (COMMENT      (the city where the supplier is located))
   )
 )
 (P                            ; ** The Part relation **
   (
    (FIELDS       (PNO PNAME COLOUR WEIGHT))
    (KEY          (PNO))
    (IDENTIFIERS  (PNO PNAME))
    (CATEG        (PHYSICAL-OBJECT))
    (L-TYPE       (PRODUCT))
    (L-NOUN       (PART PERIPHERAL))
    (L-PREP       (WITH))
    (L-DESC       (COMPUTER))
    (COMMENT      (a part used in a job))
   )
   (PNO
    (TYPE         (GNAME))
    (CONTENTS     (INVERT NPR ?))
    (L-TYPE       (RNUMBER))
    (COMMENT      (a unique identifier for each part))
   )
   (PNAME
    (TYPE         (GNAME))
    (CONTENTS     (INVERT N ?))
    (COMMENT      (the name of the part))
   )
   (COLOUR
    (TYPE         (COLOUR))
    (CONTENTS     (CHAR))
    (COMMENT      (the colour of a part))
   )
   (WEIGHT
    (TYPE         (Q-MASS))
    (CONTENTS     (INTEGER))
    (COMMENT      (the weight of the part))
   )
 )
 (J                            ; ** The Job relation **
   (
    (FIELDS       (JNO JNAME CITY))
    (KEY          (JNO))
    (IDENTIFIERS  (JNO JNAME))
    (CATEG        (EVENT))
    (L-NOUN       (JOB PROJECT))
    (L-PREP       (ON FOR TO))
    (COMMENT      (the jobs that the company is involved in))
   )
   (JNO
    (TYPE         (GNAME))
    (CONTENTS     (INVERT NPR ?))
    (L-NOUN       (NUMBER))
    (COMMENT      (a unique identifier for each job))
   )
   (JNAME
    (TYPE         (GNAME))
    (CONTENTS     (INVERT N ?))
    (COMMENT      (the name of a job/project))
   )
   (CITY
    (TYPE         (LOCATION))
    (CONTENTS     (INVERT NPR ?))
    (L-TYPE       (CITY))
    (COMMENT      (the city where the job is based in))
   )
 )
 (SPJ                          ; ** The Supplier/Part/Job relation **
   (
    (FIELDS       (SNO PNO JNO QTY))
    (CATEG        (EVENT))
    (COMMENT      (Specifies the quantity of parts for a job
                   from a supplier))
   )
   (SNO
    (TYPE         (GNAME))
    (CONTENTS     ((KEY S SNO)))
    (COMMENT      (a unique identifier for each supplier))
   )
   (PNO
    (TYPE         (GNAME))
    (CONTENTS     ((KEY P PNO)))
    (COMMENT      (a unique identifier for each part))
   )
   (JNO
    (TYPE         (GNAME))
    (CONTENTS     ((KEY J JNO)))
    (COMMENT      (a unique identifier for each job))
   )
   (QTY
    (TYPE         (INTEGER))
    (CONTENTS     (INTEGER))
    (TYPE         (Q-CNT))
    (COMMENT      (the number of a specified part supplied
                   by a supplier for a job))
   )
 ))

Verb Definition

; Note the 2 definitions of this verb
(SUPPLY
   VFRAME    (((SUBJ (RELATION S))
               (DOBJ (RELATION P))
               (IOBJ (RELATION J) OPT))
              ((SUBJ (RELATION S))
               (DOBJ (RELATION J))
               (PP (WITH) (RELATION P) OPT)))
   JOINREL   (SPJ)
   JOININFO  ("S.SNO = SPJ.SNO AND P.PNO = SPJ.PNO AND
               J.JNO = SPJ.JNO")
   JOINTYPE  ((S MANY)(P MANY))
)
(USE
   VFRAME    (((SUBJ (RELATION J))
               (DOBJ (RELATION P))
               (PP (FROM) (RELATION S) OPT)))
   JOINREL   (SPJ)
   JOININFO  ("S.SNO = SPJ.SNO AND P.PNO = SPJ.PNO AND
               J.JNO = SPJ.JNO")
   JOINTYPE  ((J MANY)(P MANY))
)
(OBTAIN
   SYNONYM   USE
)

Default Joins and Association Factors

(DEFAULTJOIN                    ; Default join between Jobs and Parts
   'J 'P                        ; Relation names
   '3                           ; Association factor
   '((J MANY) (P MANY))         ; JOIN TYPE
   '"(J.JNO = SPJ.JNO AND P.PNO = SPJ.PNO)"   ; Join
   '(SPJ))                      ; Secondary Relation

(DEFAULTJOIN 'J 'S '3
   '((J MANY) (S MANY))
   '"(J.JNO = SPJ.JNO AND S.SNO = SPJ.SNO)"
   '(SPJ))

(DEFAULTJOIN 'S 'P '3
   '((S MANY) (P MANY))
   '"(S.SNO = SPJ.SNO AND P.PNO = SPJ.PNO)"
   '(SPJ))

(DEFAULTJOIN 'J 'SPJ '3
   '((J MANY) (SPJ MANY))
   '"(J.JNO = SPJ.JNO)"
   '(SPJ))
(DEFAULTJOIN 'P 'SPJ '3
   '((P MANY) (SPJ MANY))
   '"(P.PNO = SPJ.PNO)"
   '(SPJ))

(DEFAULTJOIN 'S 'SPJ '3
   '((S MANY) (SPJ MANY))
   '"(S.SNO = SPJ.SNO)"
   '(SPJ))

APPENDIX J - SAMPLE "SUPPLIER/PART/JOB" DATABASE SESSION

>
> Which database would you like to use?
>
>    1. The PROGRAM database
>    2. The Supplier/Part/Job database
>
> Enter a number between 1 and 2 and press Return
*2
> THE DICTIONARY HAS BEEN REFERENCED.
> Welcome to the SUPPLIER/PART/JOB database.
> To stop type 'STOP!' or 'QUIT!'.
>
> Please enter your question.
*PRINT THE STATUS OF SUPPLIERS IN PARIS?
>
> SELECT S.STATUS
> FROM S
> WHERE
>    (S.CITY = 'PARIS')
>
> STATUS
>
>     10
>     30
>
> Please enter your question.
*WHO SUPPLIES RED, GREEN OR BLUE PARTS?
>
> SELECT UNIQUE S.SNO , S.SNAME , S.CITY
> FROM S , P , J , SPJ
> WHERE
>    (S.SNO = SPJ.SNO AND P.PNO = SPJ.PNO AND
>    J.JNO = SPJ.JNO)
>    AND (P.COLOUR = 'RED' OR P.COLOUR = 'GREEN' OR
>    P.COLOUR = 'BLUE')
>
> SN SNAME   CITY
>
> S1 SMITH   LONDON
> S2 JONES   PARIS
> S3 BLAKE   PARIS
> S4 CLARK   LONDON
> S5 ADAMS   ATHENS
>
> Please enter your question.
*WHAT SUPPLIERS SUPPLY JOB J1 WITH RED NUTS IN PARIS?
>
> SELECT UNIQUE S.SNO , S.SNAME , S.CITY
> FROM S , J , P , SPJ
> WHERE
>    (S.SNO = SPJ.SNO AND P.PNO = SPJ.PNO AND
>    J.JNO = SPJ.JNO)
>    AND (J.JNO = 'J1')
>    AND (J.CITY = 'PARIS')
>    AND (P.PNAME = 'NUT')
>    AND (P.COLOUR = 'RED')
>
> SN SNAME   CITY
>
> S1 SMITH   LONDON
>
> Please enter your question.
*WHO SUPPLIES P3S TO JOBS IN ATHENS?
>
> SELECT S.SNO , S.SNAME , S.CITY
> FROM S , P , J , SPJ
> WHERE
>    (S.SNO = SPJ.SNO AND P.PNO = SPJ.PNO AND
>    J.JNO = SPJ.JNO)
>    AND (P.PNO = 'P3')
>    AND (J.CITY = 'ATHENS')
>
> SN SNAME   CITY
>
> S2 JONES   PARIS
> S5 ADAMS   ATHENS
>
> Please enter your question.
*DOES THE COLATOR PROJECT USE BOLTS FROM SMITH?
> (*** Spelling Mistake *** COLATOR --> COLLATOR)
>
> SELECT J.JNO , J.JNAME
> FROM J , P , S , SPJ
> WHERE
>    (J.JNO = SPJ.JNO AND P.PNO = SPJ.PNO AND
>    S.SNO = SPJ.SNO)
>    AND (J.JNAME = 'COLLATOR')
>    AND (P.PNAME = 'BOLT')
>    AND (S.SNAME = 'SMITH')
>
> I don't know. There are no records in the database which
> satisfy the given criteria.
>
> Please enter your question.
*WHERE IS SMITH?
> Is it OK to assume that (WHERE) refers to (the city
> where the supplier is located)?
*OK
>
> SELECT S.CITY
> FROM S
> WHERE
>    (S.SNAME = 'SMITH')
>
> CITY
>
> LONDON
>
> Please enter your question.
*HOW MUCH DOES PART P4 WEIGH?
>
> SELECT P.WEIGHT
> FROM P
> WHERE
>    (P.PNO = 'P4')
>
> WEIGHT
>
>     14
>
> Please enter your question.
*STOP!
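
The generated queries in both sample sessions share one layout: a field list, the relations involved, a join condition (taken from a verb's JOININFO or from a DEFAULTJOIN entry), and one parenthesized condition per recognized data element. The Python sketch below illustrates only that final assembly step. It is not the thesis implementation: the function `build_select` and its parameters are hypothetical names chosen for this illustration, and the real system derives these pieces from the parse rather than receiving them as arguments.

```python
def build_select(fields, relations, join_info=None, conditions=(), unique=False):
    """Assemble a SELECT statement in the layout shown in the sample sessions.

    Every name here is an illustrative assumption, not part of the thesis system.
    """
    keyword = "SELECT UNIQUE" if unique else "SELECT"
    lines = [
        f"{keyword} " + " , ".join(fields),
        "FROM " + " , ".join(relations),
    ]
    # The join condition (if any) comes first, then one clause per condition.
    clauses = ([f"({join_info})"] if join_info else []) + [f"({c})" for c in conditions]
    if clauses:
        lines.append("WHERE")
        lines.append("   " + clauses[0])
        lines.extend("   AND " + clause for clause in clauses[1:])
    return "\n".join(lines)


# Join string copied from the SUPPLY verb definition in Appendix I.
SPJ_JOIN = "S.SNO = SPJ.SNO AND P.PNO = SPJ.PNO AND J.JNO = SPJ.JNO"

# Roughly the query produced for "WHO SUPPLIES P3S TO JOBS IN ATHENS?"
query = build_select(
    fields=["S.SNO", "S.SNAME", "S.CITY"],
    relations=["S", "P", "J", "SPJ"],
    join_info=SPJ_JOIN,
    conditions=["P.PNO = 'P3'", "J.CITY = 'ATHENS'"],
)
print(query)
```

Keeping the join string separate from the field conditions mirrors the schema, where JOININFO is stored once per verb or default join and reused by every query that relates the same relations.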
