"Business, Sauder School of"@en . "DSpace"@en . "UBCV"@en . "Wagner, Christian"@en . "2010-10-18T18:07:40Z"@en . "1989"@en . "Doctor of Philosophy - PhD"@en . "University of British Columbia"@en . "The purpose of this research is the formalization of a method of bottom up database design known as view integration.\r\nView integration is one of the main steps of an acknowledged database design procedure, the New Orleans Database Design Workshop procedure. This procedure develops a global database (global schema) for an organization from small partial databases (user views). Individual user views are representations of the data relevant to the users' organizational tasks. Views will overlap since users will share data to some extent. View integration has to merge views without duplicating the information presented in multiple views. The task of merging views without duplication is complicated by the fact that users have different perceptions of the world which lead them to represent the same data differently, the most simple form of different perceptions being naming conflicts such as the occurrence of synonyms.\r\nWithin the last 13 years a variety of approaches to solve the integration task has been reported. Many of the approaches have neglected the problem of conflicting views altogether, leaving its solution to the database designer. Integration methods that performed conflict resolution did it in an unsystematic and incomplete fashion. Often these methods dealt with conflict situations only if information for their resolution was conveniently available.\r\nThis research fills that gap. A conflict analysis procedure is outlined which considers all possible conflict conditions and transforms them into conditions that can be merged by means of previously developed techniques. The research proceeds in two steps. First, a conflict analysis procedure is developed that ignores the information requirements problem by assuming complete information. 
This simplification allows the concentration on completeness of the procedure, since one does not have to be concerned with the difficulties involved in gathering the required information. The second step relaxes the assumption of complete information. Difficult information requirements are identified and replaced by more easily satisfied ones.\r\nMain contributions to knowledge are (1) a complete understanding of the factors causing conflicts between views, (2) detection of substitutes for difficult information requirements. Other contributions are (3) suggestions for the development of a semantic data dictionary, (4) an alternative method for the design of knowledge based systems, and (5) suggestions for efficient bottom up systems design strategies."@en . "https://circle.library.ubc.ca/rest/handle/2429/29312?expand=metadata"@en . "VIEW INTEGRATION IN DATABASE DESIGN by Christian Wagner Diplom-Ingenieur, Technical University Berlin, 1984 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES Faculty of Commerce and Business Administration We accept this thesis as conforming to the required standard THE UNIVERSITY OF BRITISH COLUMBIA April 1989 (c) Christian Wagner, 1989 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. 
Faculty of Commerce and Business Administration, The University of British Columbia, Vancouver, Canada. Date: April 24, 1989
ABSTRACT
The purpose of this research is the formalization of a method of bottom up database design known as view integration. View integration is one of the main steps of an acknowledged database design procedure, the New Orleans Database Design Workshop procedure. This procedure develops a global database (global schema) for an organization from small partial databases (user views). Individual user views are representations of the data relevant to the users' organizational tasks. Views will overlap since users will share data to some extent. View integration has to merge views without duplicating the information presented in multiple views. The task of merging views without duplication is complicated by the fact that users have different perceptions of the world which lead them to represent the same data differently, the most simple form of different perceptions being naming conflicts such as the occurrence of synonyms.
Within the last 13 years a variety of approaches to solve the integration task has been reported. Many of the approaches have neglected the problem of conflicting views altogether, leaving its solution to the database designer. Integration methods that performed conflict resolution did it in an unsystematic and incomplete fashion. Often these methods dealt with conflict situations only if information for their resolution was conveniently available.
This research fills that gap. 
A conflict analysis procedure is outlined which considers all possible conflict conditions and transforms them into conditions that can be merged by means of previously developed techniques. The research proceeds in two steps. First, a conflict analysis procedure is developed that ignores the information requirements problem by assuming complete information. This simplification allows the concentration on completeness of the procedure, since one does not have to be concerned with the difficulties involved in gathering the required information. The second step relaxes the assumption of complete information. Difficult information requirements are identified and replaced by more easily satisfied ones.
Main contributions to knowledge are (1) a complete understanding of the factors causing conflicts between views, (2) detection of substitutes for difficult information requirements. Other contributions are (3) suggestions for the development of a semantic data dictionary, (4) an alternative method for the design of knowledge based systems, and (5) suggestions for efficient bottom up systems design strategies.
TABLE OF CONTENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
ACKNOWLEDGMENT
1. OVERVIEW
2. VIEW INTEGRATION
2.1. Database Design Philosophies - Top Down vs. Bottom Up
2.2. Database Design based on the New Orleans Database Design Workshop Procedure
2.2.1. Syntactic Approaches
2.2.1.1. Bernstein's Relation Synthesis
2.2.1.2. Martin's Canonical Synthesis
2.2.1.3. Casanova's and Vidal's Method
2.2.1.4. Functional Data Model Based Integration
2.2.2. Semantic View Integration Approaches Based on the E-R Model
2.2.2.1. Navathe's and Elmasri's Approach
2.2.2.2. Batini's Approach
2.3. View Integration Cases
2.4. Conclusion
3. 
SYSTEM FOR VIEW INTEGRATION
3.1. Research Question and Contribution to Knowledge
3.2. Approach to the Problem
3.2.1. Overview
3.2.2. Outline of the Problem with Available Information
3.2.3. Changes in the Integration Method when Necessary Information is not Directly Available
3.2.4. View Integration Conflict Cases
3.3. Expert System Methodology
4. RESULTS
4.1. Rules Guiding View Integration
4.2. Diagnosis Procedure
4.3. Conflict Therapy
4.4. The Impact of Heuristics
4.5. Generalization Hierarchy for Database Objects
4.6. Assessment of the Method
5. IMPLEMENTATION - THE AVIS PROGRAM
5.1. Overview
5.2. Function and Structure of the AVIS Program
5.3. Knowledge Representation
5.3.1. Representation of Views
5.3.2. Representation of View Integration Knowledge
5.4. The Impact of Domain Knowledge
6. SUMMARY AND EXTENSIONS
7. 
REFERENCES
APPENDIX
Appendix 1: Conflict Cases
Appendix 2: Conflict Solutions
Appendix 3: View Integration Session with AVIS
LIST OF FIGURES
Figure Title
1 Object Comparison Matrix
2 Case Transformations during View Integration
3 Ordering of View Integration Steps
4 Conflict Recognition Procedure (abbreviated)
5 Decision Table Illustration
6 Test for Object Identity, Procedure without Heuristics
7 Test for Identity with Heuristic
8 Test for Relatedness of Objects
9 Relationship Becomes an Entity
10 Relationship Attribute Becomes an Entity
11 Entity Attribute Becomes an Entity-Relationship Construct
12 Association of an Entity to a Relationship
13 Relationship Relocation
14 Representation of Containment
15 Representation of Common Role
16 Representation of Common Superset without Common Subset
17 Representation of Common Superset and Common Subset
18 Sources of Evidence for Meaning Identity
19 Construct Mismatches Shown as Graph Contraction
20 Identical Meaning Query in Prolog Graph Notation
21 AVIS Program Structure
22 Representation of Views in AVIS
23 AVIS Hypotheses
24 AVIS \"make agenda\" Rule
25 Filtering Rule in AVIS
26 AVIS Object Assertion Rule
27 AVIS Meaning Identity Indicators
28 View Integration Sample Problem
ACKNOWLEDGMENT
I thank my supervisor, Professor Robert C. Goldstein, for his guidance as well as for his ongoing encouragement. My thanks go to Professor Yair Wand for his often very critical and always very stimulating comments. To Professor Wolfgang Bibel I am grateful for providing many new perspectives on the nature of this research. 
I also wish to acknowledge the funding given for this research by the World University Service of Canada, the Deutscher Akademischer Austauschdienst, and the University of British Columbia. Finally, I thank my parents, Helmuth and Iris Wagner, for their love and support.
1. OVERVIEW
The database designer's task, converting users' casual data descriptions into a database design, is time consuming, error prone, and requires substantial expertise. This argument is still valid, even though the separation of logical and physical design considerations has simplified the design effort (Curtice and Jones, 1982). Consequently, there exists strong interest in the development of techniques to improve the database design process, particularly the hardware independent logical database design process.
One approach that has been taken is the further decomposition of the design process. Frequently, database designers begin with a graphical representation of the database to be built, i.e. an entity-relationship model or Brown diagrams (Brown, 1982), before they design the actual database relations or record and set types. As DeMarco (1979) mentions in the context of structured analysis, graphical representations are a tool that provides a concise representation, allows easy consistency checking and is very maintainable.
Another form of design decomposition focuses on the development of individual user views for small task domains and subsequent integration of user views into a complete schema. The rationale for this approach is simplification due to a more narrow focus, as well as improved validity of the views. 
If every user describes only the data of her task domain (the data she is most familiar with), the resulting representation promises to be more correct than one that is done by a person only remotely familiar with the domain. However, since each view describes data structures as perceived by the individual users, differences in user perceptions (conflicts between user views) are to be expected. These conflicts have to be settled before views can be aggregated to form a global database structure.
The purpose of this research is the formalization and solution of the conflict resolution problem. Even though a variety of integration methods are presently available, existing view integration methods are incomplete, frequently neglecting the conflict resolution problem (Batini et al., 1986, p. 348). Conflicts arise when different users model the same real world concepts differently, or different real world objects identically. This research bridges the gap by developing a conflict classification and resolution scheme, and based on this scheme a computer program that integrates user views, grounded in rules and heuristics of database design.
2. VIEW INTEGRATION
2.1. Database Design Philosophies - Top Down vs. Bottom Up
Independent of any particular database design approach there exists the question whether database design, like any other form of systems design, should proceed top down or bottom up. Bottom up and top down represent the extreme points in a spectrum of design alternatives. 
In general, top down design has the advantage over bottom up design that it is oriented towards overall goals and that it allows stepwise refinement of those general goals. Bottom up design requires integration of the elements of the overall system and will almost certainly result in conflicts between the elements and in the necessity for the redefinition of system elements.
Despite this disadvantage, bottom up approaches are frequently used (Martin, 1984; McFadden and Hoffer, 1988). Their major advantage is that they do not demand the existence of an overall design before the design of particular elements can take place. Thus, no overall understanding of the system is required, or at least not to the extent necessary for the top down approach. In addition, bottom up design facilitates the use of existing information from previous designs and thus is a better approach for incremental development.
Given that both approaches have advantages and disadvantages, designers will typically apply both design approaches, namely using a top down focus for the initial design, to partition the system into manageable subsystems which are conflict-free. Thereafter, they will apply a bottom up approach in the detailed design of these subsystems, taking into consideration the necessity for conflict resolution and trading it for ease of design.
The major database design techniques described in this paper, those using view integration, will appear to be bottom up approaches, since the integration process is based on individual user views. 
However, the procedure laid out at the New Orleans Database Design Workshop (New Orleans, 1979), which presents a framework for view integration approaches, recommends a database design procedure that introduces organizational goals and high level information requirements by means of Enterprise Modelling in the step preceding view integration. In other words, this widely accepted design strategy also applies a mixed top down and bottom up approach.
2.2. Database Design based on the New Orleans Database Design Workshop Procedure
In this section the focus will be on the common elements of all view integration procedures as well as on their differentiating characteristics. In short, all integration approaches can be perceived as procedures for view aggregation and schema optimization. One feature of all (comprehensive) approaches will be the ability to resolve differences between views. To permit this, the methods' data models will have to be able to represent objects and object associations. Dissimilarities among view integration procedures will arise primarily from the differences in procedure, the differences in abilities to deal with conflicting information, variations in information requirements, and the restrictions placed on the initial schema.
View integration is an element of any bottom up database design strategy. This strategy, whose initial inputs are user requirements and whose final outcome is the implemented (physical) database, has been segmented by various authors (New Orleans, 1979; Teorey and Fry, 1982) into the following steps:
1. 
Requirements Analysis to obtain information from users on information and processing requirements, and to analyze this information in order to resolve conflicts and inconsistencies with the enterprise view. The analysis and incorporation of (global) business constraints adds a top down focus to this otherwise bottom up oriented technique.
2. View Modelling and Modification to generate application views and information access requirements.
3. View Integration to merge individual views into a global schema.
4. Implementation Design to handle issues of integrity, consistency, recovery, security and efficiency.
5. Physical Design to ensure functioning and efficiency of the database with a particular database/file system.
In other words, view integration takes as its inputs individual user views (and possibly processing/query requirements) and produces as its output a global database schema. The most trivial form of view integration is an aggregation of all individual views without alteration of any of them. However, instead of generating a system of interconnected database objects, this method creates merely a lump of individual views. View integration has to go beyond aggregation; it has to include the reorganization (optimization) of the global schema. The task is to eliminate redundancies and inconsistencies that result from combining overlapping views of users who all may have different conceptual models.
Reorganization of the global schema is intended to increase the descriptive adequacy of the global schema [1]. 
In addition, it may include the consideration of query requirements, which has been a concern in some earlier studies, especially in non-relational database environments (for example Batini et al. (1984a) or Yao et al. (1982, 1985)). For network or hierarchical databases, consideration of processing requirements might result in a trade-off that introduces duplication of database objects to improve processing efficiency.
Even though a variety of researchers choose the same approach to database design, namely view integration, differences exist in the data modelling language used to carry out the integration process. Tightly connected to the data model is the \"integration philosophy\", alternatives of which have been pointed out by Yao et al. (1982) as (1) view integration based on item level synthesis using frequency information, (2) synthesis using functional dependencies among items and (3) merging of object level structures.
Footnote 1: Descriptive adequacy is understood as the precision with which the data model describes the world it attempts to model.
The first category is a form of \"statistical\" view integration, in which frequency information serves as a substitute for cohesion or functional dependency of data items (Dyba, 1977; Sheppard, 1977).
The second category builds database objects, i.e. relational data structures, based on information on functional dependencies. Proponents of this category can be found for instance in Bernstein (1976), Raver and Hubbard (1977), Yao et al. (1982), Casanova and Vidal (1983), and Biskup and Convent (1986, 1985). 
Most of these approaches attempt to build databases purely based on functional dependencies (and possibly other forms of dependencies) and try to avoid the consideration of the meaning of data objects as much as possible during integration. Later, these approaches will be referred to as syntactic approaches.
The third group of approaches is probably best represented by Batini et al. (for instance Batini and Lenzerini, 1984) and Navathe et al. (for instance Navathe and Elmasri, 1986). Both techniques are based on the E-R model, enhanced by some additional information (generalization/specialization). The fact that these techniques operate on an object level does not imply that functional relationships are not relevant for them. However, in E-R models, dependencies are represented in the association of attributes to entities or relations.
Since the late seventies, the literature has moved away from statistical approaches to view integration. The main problem of statistical approaches is that they attempt to capture dependency information between data items by means of relative frequency of common use in applications or coexistence in the same file structure. This substitute may often be correct, since experienced file designers will have a good understanding of which data items should belong together (see for instance Weber, 1986 on \"intuitive\" normalization), but the technique is inferior to ones that concentrate on the actual data dependencies. Thus, within this research, the focus will be on the latter two groups of integration methods only. For these two groups, prototypical integration methods (together with their data models) are presented in the following list. 
SYNTACTIC (attribute-level) INTEGRATION
Based on Functional Dependencies only
* Martin (1983) - \"Bubble Charting\"
* Bernstein (1976) - Relational Model
* Yao et al. (1982) - Functional Data Model
* Raver and Hubbard (1977) - \"Bubble Charting\"
* Al-Fedaghi and Scheuermann (1981) - Relational Model
Based on FDs and other Dependencies
* Casanova and Vidal (1983) - Relational Model
* Biskup and Convent (1986) - Relational Model
SEMANTIC (object-level) INTEGRATION
* Batini et al. (1983) - Entity-Relationship Model
* Navathe et al. (1986) [1] - Entity-Category-Relationship Model
* Mannino and Effelsberg (1984) - Generalization Assertions
* Teorey and Fry (1982) - Semantic Hierarchical Data M.
Footnote 1: The method put forward by Navathe and others has gone through various stages and has involved various researchers. An earlier method is described by Navathe and Gadgil (1978) or Navathe and Schkolnick (1978). Other versions include Navathe, Elmasri, and Larson (1986). The method referenced here is the latest published form. It has been extended into database integration by Elmasri et al. (1986).
Not all of these techniques shall be discussed in detail since there exists considerable overlap among them. The following techniques will be discussed: Martin, Bernstein, Yao et al., Casanova and Vidal, Navathe et al., Batini et al. Martin contributes a not particularly detailed, yet popular integration method. Bernstein presents the first algorithmic and purely syntactical view synthesis method. Casanova and Vidal introduce the first syntactic integration method that includes a richer set of dependencies. Navathe et al. put forward a semantic integration method with a large set of integration cases. Finally, Batini et al. 
present the (semantic) integration method that best deals with conflicting views.
2.2.1. Syntactic Approaches
Syntactic approaches are design methods in which the view integration procedure does not rely on a designer's understanding of the data during the integration process (nor on \"understanding\" by the algorithm) [1]. Instead, the algorithms reorganize the initial schema in a purely structural manner independent of the meaning of objects or attributes involved, once certain information requirements about functional dependencies are satisfied. These information requirements are assumed to be satisfied at the outset of the integration procedure. They are not part of the technique.
Footnote 1: Ideally the techniques do not rely at all on the designer's understanding. However, at least one method (Biskup and Convent) consults the designer when the integration algorithm is in a deadlock. Other methods (e.g., Yao et al.) require designer understanding for more complex integration cases, such as removal of redundant functions.
The syntactic approaches introduced below give a complete algorithm for view integration and show the \"optimality\" (author's terminology) of the resulting design. Optimality (Casanova and Vidal) is not a particularly well chosen term, since the design is not optimal in all criteria a database designer might think of. \"Optimal\" is meant as \"achieving the goals set for the design at the start of the integration process\", which in particular means the generation of a valid database, i.e. one that satisfies all previously established integrity constraints and is free of undesirable data dependencies. 
We will call the resulting designs from now on \"feasible\" rather than \"optimal\".
Three main proponents of different syntactic approaches are Bernstein (1976), Casanova and Vidal (1983), and Biskup and Convent (1986). Two additional syntactic approaches shall also be mentioned in this context, although they differ from the above three in not being as purely synthetic, not providing a complete algorithm, and in using other data models (\"bubble charts\" (Martin) and the Functional Data Model (Yao et al.)). All approaches, other than Biskup's and Convent's, will be discussed. Biskup and Convent's technique is rather similar to Casanova's and Vidal's. Hence, a separate discussion will not be necessary.
Bernstein's approach does not particularly address the view integration problem, but instead the problem of synthesizing a minimal number of 3NF relations from a set of functional dependencies. Nevertheless, the approach is applicable to view integration, since the algorithm does not mind whether the schema descriptions used for relation synthesis stem from one view or from many views. However, the procedure has obviously no means to unify conflicting perceptions of the same data. Contrary to more recent integration approaches such as Casanova's and Vidal's, Bernstein's method relies only on functional dependencies to carry out the relation synthesis procedure.
Martin's approach, Canonical Synthesis, attempts to develop a \"canonical data representation\" [1]. This method, like Bernstein's, has no formal means for dealing with conflicts between views, not even for naming conflicts. In addition it is much less detailed and much less algorithmic than Bernstein's. 
Footnote 1: The notion of a canonical representation in data models has been put forward by Raver and Hubbard (1977) and is used to describe schemata which are redundancy-free (no nonessential associations), complete, and correct. Thus a canonical synthesis technique not only integrates user views, but can also extend them to add necessary further details.
Casanova's and Vidal's technique assumes the existence of user views and complete knowledge of dependencies (integrity constraints) for the collection of user views. It can be summarized by the following integration plan. Given a set of user views and a set of integrity constraints, define as a valid (\"proper\") database scheme (= global schema) one that satisfies all desirable integrity constraints. Then apply an algorithm that reorganizes the collection of user views into a valid schema by removing all undesirable data dependencies through changes in relation schemes.
Yao et al. require for their approach complete information on entities (\"entity nodes\"), functional relationships between entity nodes, plus assertions describing true facts about the data model which are not represented in form of entity nodes or relationships. All views are combined in one representation which is thereafter subject to removal of redundant functions and redundant nodes. A proof of correctness of the integration result is not given for this approach.
One major limitation of the syntactic strategies, especially of Casanova's and Vidal's, is their extensive information requirements. They assume at least the availability of information on functional, if not also on union functional dependencies, inclusion and exclusion dependencies. 
It has to be questioned whether it is feasible to generate this information during the view integration process, and how reliable the information will be. With respect to the amount of information, one has to keep in mind that not only intra-view but also inter-view constraints have to be defined. This requirement can increase the number of constraints substantially; it also demands from the designer the comparison of each relation scheme from each view against all other relation schemes, to detect those dependencies. Any incorrect assessment by the designer will potentially result in an incorrect global schema.
A second limitation of these approaches is the restrictions they place on the initial views to make the integration a computationally solvable problem (i.e. only functional dependencies on the key for the initial collection of views).
A third limitation is caused by the purely syntactic treatment of data dependencies. The procedures cannot differentiate between dependencies that are of the same type and involve the same attributes, even if their meanings are different. For example, the functional dependency Employee# -> Department# might in fact represent two different relationships: first, every employee works for one particular department, and second, every employee is located in one particular department. Thus, while for example employee 6750 works for the information systems department, he resides in the offices of the accounting department. This difference in roles (here, roles of department) has to be incorporated into the attribute names, to allow the syntactic approaches to differentiate between the two relationships. I.e.
, t h e r e has t o e x i s t a L o c a t e d _ i n _ D e p t and a Employed_by_Dept. 16 2.2.1.1. Bernstein's Relation Synthesis This method i s described i n Bernstein (1976). An implementation of Bernstein's algorithm can be found i n Ceri and Gottlob (1986). The goal of Bernstein's method i s the creation of a schema containing the smallest number of 3NF r e l a t i o n s for a given set of functional dependencies. Since the procedure does not concern i t s e l f with the o r i g i n of the functional dependencies, i t does not object to the fact that the set of dependencies i s taken from more than one schema. Therefore the method can be considered a view integration procedure. The method not only provides a s y n t h e s i s algorithm, but a l s o demonstrates that the set of r e s u l t i n g r e l a t i o n s i s minimal and probably i n 3NF. The creation of 3NF r e l a t i o n s i s t y p i c a l l y the goal and f i n a l outcome of a decomposition process i n which larger tables are s p l i t into smaller redundance-free components (for example, Ullman, 1980 or Date, 1981). Bernstein, i n contrast, generates 3NF r e l a t i o n s by means of composition. This makes Bernstein's approach a view integration technique. The goal of Bernstein's integration procedure i s to f i n d the 17 smallest set of 3NF r e l a t i o n s that incorporates a l l pre-defined functional dependencies that have been defined. The a l g o r i t h m developed by Bernstein consists of three main parts. The f i r s t part (involving steps 1 and 2 of \"Algorithm 2\", see below), has the purpose to generate a new set of functional dependencies (FDs) from an a r b i t r a r y set of functional dependencies characterizing the data r e l a t i o n s h i p s . These new dependencies form the input to the synthesis part. 
Synthesis (steps 3 and 4 in Algorithm 2) first partitions the set of FDs into groups with identical left sides¹ and then merges the FDs in these groups. The last part of the procedure (steps 5 and 6 in Algorithm 2) constructs relations which are free of transitive dependencies, based on the FDs synthesized in the previous steps.

Footnote 1: \"Left side\" means the set of determining attributes. In contrast, the \"right side\" consists of the determined attributes.

Algorithm 2:
(1) Elimination of extraneous attributes to produce a set F' of functional dependencies.
(2) Finding of a non-redundant covering C for the set F' of functional dependencies.
(3) Partitioning of the covering C into groups of functional dependencies with identical left sides.
(4) Merging of equivalent keys.
(5) Elimination of transitive dependencies.
(6) Construction of relations.

Bernstein's approach does not differentiate among different cases of integration based on different dependencies within the data at hand. All functional dependencies are treated by the same integration procedure. This is a positive feature of Bernstein's approach, since it simplifies the procedure. In addition, this approach has fewer information requirements than the two following ones, which also require information on other forms of dependencies.

One major problem of the technique, pointed out by Bernstein himself, is the purely syntactic character of the approach, which is the source of the \"uniqueness assumption\". The uniqueness assumption says that only one functional dependency can exist between any two identical sets of attributes.
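The core of such a synthesis can be illustrated with a short sketch. It is not Bernstein's published algorithm: it covers only extraneous-attribute elimination, the non-redundant cover, grouping by left side, and relation construction, and it omits the merging of equivalent keys and the final transitive-dependency check. The function names and the sample attributes (Emp#, Dept#, Manager) are hypothetical and chosen for illustration only.

```python
# Hypothetical sketch of FD-based relation synthesis (steps 1-3 and 6 only).
# An FD is a pair (left, right) of attribute-name tuples.

def closure(attrs, fds):
    # Return all attributes functionally determined by attrs under fds.
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for left, right in fds:
            if left <= result and not right <= result:
                result |= right
                changed = True
    return result


def nonredundant_cover(fds):
    # Normalise: one determined attribute per FD, duplicates dropped.
    unit = list(dict.fromkeys(
        (frozenset(left), frozenset([a])) for left, right in fds for a in right))
    # Step 1: drop extraneous attributes from each left side.
    reduced = []
    for left, right in unit:
        left = set(left)
        for attr in sorted(left):
            smaller = left - {attr}
            if smaller and right <= closure(smaller, unit):
                left = smaller
        reduced.append((frozenset(left), right))
    reduced = list(dict.fromkeys(reduced))
    # Step 2: drop every FD that the remaining FDs already imply.
    cover = list(reduced)
    for fd in reduced:
        rest = [g for g in cover if g != fd]
        if fd[1] <= closure(fd[0], rest):
            cover = rest
    return cover


def synthesize(fds):
    # Step 3: group FDs by identical left side; step 6: one relation per group.
    groups = {}
    for left, right in nonredundant_cover(fds):
        groups.setdefault(left, set()).update(right)
    # Each relation is returned as (key attributes, all attributes).
    return [(sorted(left), sorted(left | attrs)) for left, attrs in groups.items()]


fds = [(('Emp#',), ('Dept#',)),
       (('Dept#',), ('Manager',)),
       (('Emp#',), ('Manager',))]   # implied by the first two FDs
relations = synthesize(fds)
print(relations)
```

Because the sketch works on attribute names alone, two FDs Emp# -> Dept# with different meanings collapse into a single dependency, which is the uniqueness assumption at work. Distinct role names such as Employed_by_Dept and Located_in_Dept keep them apart.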
Put differently, if two FDs existed between the same sets of attributes because of a difference in the roles of either set, the technique would not be able to pick up the difference. In order to allow the technique to differentiate among different roles, role names have to be introduced as attribute names.

This point leads to another shortcoming of the technique, namely the significance of names. The purely syntactic technique operates on attribute names and is therefore subject to all problems caused by attribute name synonymy and homonymy. However, the technique was not conceived as a view integration technique, which justifies this weakness to some extent. Furthermore, Bernstein's approach does not rule out the development of a pre-integration procedure which could take care of such conflicts and then supply the integration procedure with conflict-free views.

2.2.1.2. Martin's Canonical Synthesis

See for example Martin (1983).

Canonical Synthesis is Martin's approach to view integration. Martin integrates views by first depicting all functional dependencies between data elements (attributes) and then overlaying any two views to generate a third, new one. The main focus of his approach is on the elimination of transitive dependencies generated by the integration process. Martin stresses the use of bubble charts, showing data items (attributes) and their functional dependencies.

The procedure integrates views pairwise and consists of seven integration steps for the logical database design.

1. The designer is asked to eliminate any duplicate functional dependencies between any two data items.
2. The designer has to identify candidate keys.
3. All transitive dependencies have to be removed.
The purpose of this step is to find and to remove any hidden primary keys, and finally to achieve a 3NF data structure.
4. Introduce so-called \"concatenated keys\". The purpose of this step is to extend the data model to allow the representation of data items that are dependent on the key of more than one already existing data structure, e.g. Price is dependent on Supplier# and Part#.
5. Allocate intersection data to data items. This step deals with relationships that have attributes. If relationships have attributes, they are transformed into record structures¹.
6. Remove M:N relationships².
7. The technique transforms structures in which one attribute is owned by two or more primary keys. If such an \"intersecting\" attribute exists, the data structure is changed to give the attribute a simple owner.

Footnote 1: These record structures are similar to entities. Yet Martin does not use the terms entity or relationship to describe data constructs.

Footnote 2: Martin suggests that M:N relationships in databases, aside from being supported by only a few DBMSs, are an unstable data construct, one that is typically replaced by two 1:M structures as part of the design or implementation process. His technique therefore breaks up any M:N structure into 1:M structures.

Martin's method has three major limitations. First, the method is not concerned with the removal of conflicts between views, and second, it uses attributes as the atomic building blocks of the global schema. Third, the \"algorithm\" presented is not precise and thus does not, contrary to Martin's statement, allow immediate automation of the process.

Conflict resolution is mentioned only briefly (Martin, 1983, p. 265), with reference to the problem of homonyms. All other view conflict possibilities are ignored. For example, Martin is not concerned about relationships or entities modelled incorrectly as attributes. A consequence of neglecting conflict resolution is that Martin's approach cannot be automated, given that conflicts have to be expected in real world applications. Martin has to assume that all conflicts were eliminated by the database designer prior to the integration process. Thus, like Bernstein's method, this one is a view merging procedure, but not a conflict resolution procedure.

The use of attributes as the atomic building blocks generates at least two problems. First, the modeling process based on attributes operates at a very high level of detail. In fact, it might be viewed strictly as a bottom-up approach to database design.
The detail in view descriptions creates large amounts of information the designer has to process. Even with a small number of views, an evaluation of the resulting schema in terms of redundancies (transitive dependencies) becomes very complex and very difficult. The entity-relationship approach, in comparison, allows part of this information to be hidden, namely the associations between an entity and its attributes. In the E-R model, only entities or relationships are able to form relationships to other entities or relationships. In Martin's model, every attribute can be related to any other attribute prior to redundancy elimination.

Secondly, the synthesis of attributes into higher level objects is not based on the user's semantic objects (objects meaningful to the user), but instead on functional dependency. The resulting higher level data structures (records, segments, or relations) are therefore expected to have less meaning for the user than data structures based on objects the user chooses to describe his data world (e.g. entity MANAGER). In other words, the results of canonical synthesis may lose some of their descriptive adequacy with respect to real world objects and associations. This comment is not meant to imply that database design based on functional dependencies is wrong. Yet the aggregates should represent the real world view as faithfully as possible. There exists more than one possible way to describe a real world object in the data model, and canonical synthesis might not allow a representation of this object in the form the user would prefer (i.e., semantic relativism, Brodie, 1984).
Finally, due to its lack of precision, this technique should only be viewed as a guideline to integration. It will still require substantial designer interaction and designer insight.

2.2.1.3. Casanova's and Vidal's Method

See Casanova and Vidal (1983) for a description of the method, as well as Biskup and Convent (1986, 1985) for extensions.

Casanova's view integration method is a formal approach to view integration based on four types of dependencies existent in a global database schema. The goal of the integration process is the generation of an \"optimised\" (feasible) schema, optimised with respect to elimination of redundant information and reduction in size, as measured by the number of relations¹ in the global schema.

Footnote 1: In Casanova's language, which is based on Ullman (1980, p. 75), a \"relation scheme\" refers to the structure of a relational database object, while a relation is an instance of that structure, that is, the actual data. Ullman defines a relation scheme as the list of attributes for a relation.

The four types of dependencies (also referred to as integrity constraints) in this approach are: functional dependencies (FDs), inclusion dependencies (INDs), exclusion dependencies (EXDs), and union functional dependencies (UFDs).

A functional dependency fd, expressed as R:X->Y, is valid iff

for any t, u ∈ r, if t[X] = u[X] then t[Y] = u[Y].

For example, in a relation scheme STUDENT[Stud#,Name], if t[X] and u[X] are identical student numbers, they both have to identify the exact same student name.

An inclusion dependency ind is expressed as R1[X] ⊆ R2[Y], with X and Y being sequences of attributes of equal length. This dependency is valid iff r1[X] is a subset of r2[Y].
For example, UNDERGRAD[Stud#] ⊆ STUDENT[Stud#] means that the set of undergrad students is a subset of the set of all students.

An exclusion dependency exd is expressed as R1[X] | R2[Y], X and Y again being sequences of attributes of the same length. This dependency is valid iff r1[X] and r2[Y] are disjoint. For example, the set of graduate students and the set of undergrad students would be such disjoint sets of students.

A union functional dependency is a functional dependency stretching over the boundaries of one relation. It is expressed in the form <Ri1:X1->Y1, ..., Rim:Xm->Ym>, as a set of functional dependencies over relation schemes Ri, where all X and Y are sequences of attributes of the same length. A UFD is valid iff an FD that holds in one relation holds in all relations included in the UFD. For example, a UFD <STUDENT:Stud#->Name, UNDERGRAD:Stud#->Uname> means that a student number '83959818' occurring in STUDENT will identify the same student name 'Jones' as the student number '83959818' in UNDERGRAD.

Footnote 1: R refers to a relation scheme, r is an instance of that relation scheme, X and Y are sets of one or more attributes, and t and u are tuples.

The last example gives some indication of the purpose of the above dependencies. They will be used to identify and eliminate sources of redundancies. Given complete information on the above dependencies, a procedure is defined that will transform the combination of all views into an integrated global schema. Complete information on dependencies necessitates complete information on all attributes in all relations of all views, plus complete information on the domains of attributes.
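The four dependency types can be made concrete with a small checker over relation instances. The following is a minimal sketch and not part of Casanova's and Vidal's method. It assumes a relation instance is a list of dictionaries, and it reads the UFD condition as requiring that the same left-side value never maps to two different right-side values anywhere in the participating relations. The GRAD relation, its attribute Gname, the student number '70000001', and the name 'Smith' are hypothetical additions.

```python
# Hypothetical helpers; a relation instance is a list of dicts (one per tuple).

def project(rows, attrs):
    # Projection r[X] as a set of value tuples.
    return {tuple(row[a] for a in attrs) for row in rows}


def ufd_holds(parts):
    # parts is a list of (rows, X, Y); the X -> Y mapping must be
    # single-valued across all listed relations taken together.
    seen = {}
    for rows, x, y in parts:
        for row in rows:
            key = tuple(row[a] for a in x)
            value = tuple(row[a] for a in y)
            if seen.setdefault(key, value) != value:
                return False
    return True


def fd_holds(rows, x, y):
    # An FD is the special case of a UFD over a single relation.
    return ufd_holds([(rows, x, y)])


def ind_holds(r1, x, r2, y):
    # Inclusion dependency: r1[X] is a subset of r2[Y].
    return project(r1, x) <= project(r2, y)


def exd_holds(r1, x, r2, y):
    # Exclusion dependency: r1[X] and r2[Y] are disjoint.
    return project(r1, x).isdisjoint(project(r2, y))


student = [{'Stud#': '83959818', 'Name': 'Jones'},
           {'Stud#': '70000001', 'Name': 'Smith'}]
undergrad = [{'Stud#': '83959818', 'Uname': 'Jones'}]
grad = [{'Stud#': '70000001', 'Gname': 'Smith'}]

print(fd_holds(student, ['Stud#'], ['Name']))
print(ind_holds(undergrad, ['Stud#'], student, ['Stud#']))
print(exd_holds(grad, ['Stud#'], undergrad, ['Stud#']))
print(ufd_holds([(student, ['Stud#'], ['Name']),
                 (undergrad, ['Stud#'], ['Uname'])]))
```

Such checks only test a dependency against one particular instance. The method itself assumes that the dependencies are declared by the designer and hold for every instance.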
Given complete dependency and domain information, the problem of homonymy or synonymy does not arise, because the names of relations or attributes are almost irrelevant. All the above information is assumed to be unambiguous. In other words, there will be, for instance, no disputes between different views concerning dependencies or domains of attributes. Hence, conflicts are ruled out by definition.

A view integration based on Casanova's and Vidal's method involves the following steps. First, for every view, define the dependencies described above. Second, combine the views by lumping them together and by defining additional constraints of the above types to describe the relationships between the elements (relations) of different views. Third, integrate (\"optimize\") this schema by removing redundancies in the combination of views.

The first major problem of this integration method, as stated by the authors, is that it is computationally hard. The problem is PSPACE-complete (it can be solved with an amount of memory that is polynomial in the input size, although the running time may grow exponentially). Casanova points out that the optimization problem may not be decidable, even if nothing but FDs and INDs are considered (see also Casanova and Fagin, 1982).

Another major problem concerns the information requirements of this technique. The approach requires large amounts of ambiguity-free information. Since it cannot deal with partially incorrect user views (wrong perceptions of data), it cannot be used to resolve conflicts caused by inconsistencies in user views.

A further limitation of Casanova's and Vidal's approach results from its applicability to only so-called \"restricted\" schemas.
The following restrictions apply to the input of the view integration procedure.

(1) All functional dependencies apply only to the (single) key. Thus, no transitive dependencies exist.
(2) Any inclusion dependency applies only to the key attributes of the relations involved.
(3) Any union functional dependency must apply to the key attributes (as the left argument of the dependency) for all relations involved and can only describe a dependency of a single attribute on the key (\" ... if <Ri1:X1->Y1, ..., Rim:Xm->Ym> is in C, then X1 = ... = Xm = Ki1 = ... = Kim and |Yj| = 1, j ∈ [1,m]\").
(4) Any attribute of any relation can appear in at most one union functional dependency (\" ... for any Ri ∈ S and any attribute A of Ri, A occurs in at most one UFD in C\"). Note that this restriction is violated in Casanova's example. The restriction may only refer to dependent attributes, not to key attributes.
(5) All exclusion dependencies apply only to key attributes.

Restriction (4) in particular seems like a significant limitation on the integration problem. Real world databases will have to serve as an indicator of how strong this limitation is.

2.2.1.4. Functional Data Model Based Integration

For references to the method, see Yao, Waddle and Housel (1985, 1982).

In contrast to many other syntactic integration methods, Yao et al. present a view integration approach based on Shipman's (1979) Functional Data Model. Within the Functional Data Model (FDM), data can be described in the form of two constructs, nodes (to represent entities and value sets) and functional relationships.
Nodes can be either simple nodes (value sets), or tuple nodes (cartesian product of n>1 value sets). Functions, mappings from a domain into a range, can be functional (n:1), one-to-one, or identity (1:1 mapping into identical value) and can be partial (lower degree 0), or total (lower degree 1). Assertions are added as a further means for describing data, to increase the descriptive power of the model. Assertions describe true facts about data, e.g. that one set of data is the subset of another.

Views are depicted in form of nodes and relationship constructs (in a graphical representation). Therefore, complete information on entities and attributes, their domains and their relationships has to be available. Aside from this information, the approach also compiles information on the queries to be issued on the database. Database transactions, represented by means of a Transaction Specification Language (TASL), are kept together with the views and are updated whenever view updates require query modifications. One further piece of information is collected, namely information describing the physical data in terms of quantities of members of a set, e.g. the number of students, professors, courses in a university database. Quantity information is later used in heuristics to identify non-redundant functions in the model. The treatment of transaction and quantity information will not be subject of the following discussion.

The technique incorporates two integration operations: the removal of redundant nodes, and the removal of redundant functions. According to Batini et al. (1986, p. 343), Yao's technique performs view integration on all views in parallel (\"one-shot n-ary\").
However, this is true for the integration of redundant functions only. Integration of nodes is performed on a single pair of nodes at any point in time.

A node is redundant if it represents the \"same set of values\" as some other node. Note that the \"same set of values\" (Yao et al., 1985, p. 338) does not mean the two sets are in fact identical. It is sufficient that one is a subset of the other or that they are overlapping. If two nodes represent the same set of values, they will be merged. The integration can only be performed if any existing functions between the two nodes are identity functions.

Nodes A and B are merged by creating a new node C which is the union of A and B. All functions that had A or B as domain or range will be redefined to have C as domain or range. In addition, if A and B are not identical, a separation node SEP will be created that stores information to differentiate between the two original nodes, given the new node C. If a separation node has to be created, a new functional dependency will also be created with C as its domain and SEP as its range. A separation node can be viewed as a set of indices that indicates, by means of pointers to the new combined set C, the origin of each value in the new set.

The second integration operation removes redundant functions. The goal is to remove a functional relationship A->C if it can be replaced by other functions, e.g. the two functions A->B and B->C. The authors point out that the redundancy of a function can only be decided upon analysis of data semantics. In other words, the meaning of a functional relationship has to be known to decide on its redundancy. This is one of the criteria which differentiate the technique of Yao et al. from the previously discussed completely syntactic approaches.
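The two operations can be sketched in a few lines. This is an illustrative model and not the notation of Yao et al.: nodes are assumed to be plain sets of values, functions are (domain, range) pairs of node names, and function kinds (identity, partial, total) are not modelled. The names merge_nodes, redundancy_candidates, ALL_STUDENTS, and the sample values are hypothetical. Since redundancy can only be decided from data semantics, the second function merely proposes candidates for the designer to confirm.

```python
# Hypothetical sketch: nodes map a name to a set of values,
# functions are (domain, range) pairs of node names.

def merge_nodes(nodes, functions, a, b, merged_name):
    # Assumes the designer has confirmed that a and b represent the
    # 'same set of values' and that any function between them is an identity.
    union = nodes[a] | nodes[b]
    new_nodes = {n: v for n, v in nodes.items() if n not in (a, b)}
    new_nodes[merged_name] = union

    def rename(n):
        return merged_name if n in (a, b) else n

    # Redirect every function that had a or b as domain or range.
    new_functions = {(rename(d), rename(r)) for d, r in functions}
    new_functions.discard((merged_name, merged_name))  # former a/b identity
    origin = None
    if nodes[a] != nodes[b]:
        # Separation node: records for each merged value where it came from.
        origin = {v: tuple(n for n in (a, b) if v in nodes[n]) for v in union}
        sep_name = 'SEP_' + merged_name
        new_nodes[sep_name] = set(origin.values())
        new_functions.add((merged_name, sep_name))
    return new_nodes, new_functions, origin


def redundancy_candidates(functions):
    # A function d -> r is a candidate if r is reachable from d
    # through the other functions (e.g. A->B and B->C for A->C).
    candidates = []
    for d, r in sorted(functions):
        others = functions - {(d, r)}
        stack, seen = [d], {d}
        while stack:
            node = stack.pop()
            for x, y in others:
                if x == node and y not in seen:
                    seen.add(y)
                    stack.append(y)
        if r in seen:
            candidates.append((d, r))
    return candidates


nodes = {'STUDENT': {'s1', 's2', 's3'},
         'UNDERGRAD': {'s2', 's3'},
         'NAME': {'Jones', 'Smith', 'Lee'}}
functions = {('STUDENT', 'NAME'), ('UNDERGRAD', 'NAME'), ('UNDERGRAD', 'STUDENT')}
merged_nodes, merged_functions, origin = merge_nodes(
    nodes, functions, 'STUDENT', 'UNDERGRAD', 'ALL_STUDENTS')
print(sorted(merged_functions))
print(redundancy_candidates({('A', 'B'), ('B', 'C'), ('A', 'C')}))
```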
The method proposed by Yao et al. has a number of limitations. First, the method is incomplete. View integration is restricted to only three cases of node integration and one case of function integration. Hence, the technique will not be able to adequately represent all possible types of set relationships between view objects (for example, two nodes that are not overlapping but have a common superset).

A second weakness concerns the integration procedure. The procedure is not defined exactly. For example, does function removal always precede node removal? Does the procedure always perform node merges on single pairs of nodes, or on an arbitrary number of nodes at the same time?

Thirdly, the technique does not show the transformation from FDM into database objects, i.e. relations, or more likely, network constructs.

Fourthly, the use of physical database information (e.g., record quantities) in logical database design is not particularly useful.

Finally, the method has no means for dealing with conflicting information, i.e. with naming conflicts or with type conflicts.

2.2.2. Semantic View Integration Approaches Based on the E-R Model

Semantic approaches use data objects that are meaningful to the user. Since they require a higher level of understanding of the meaning of objects, these approaches are typically interactive, that is, they demand designer intervention during the integration process. Designer intervention is, for instance, necessary to settle certain naming or type conflicts and, even more importantly, to interpret the meaning of data objects or object relationships.
Since semantic integration approaches focus more on the meaning of the data objects than on structural information alone, the data models used to represent views have to be able to capture data semantics. In this section, integration techniques based on the Entity-Relationship (E-R) model will be introduced. The E-R model itself is not particularly rich in its ability to represent data semantics. Therefore, the methods discussed below (both Navathe et al. and Batini et al.) use an extended E-R model which, for instance, provides the capability to model categories which are generalizations of entities¹.

Footnote 1: Not all integration methods representing data semantics have to be based on the E-R model. For example, Teorey and Fry (1982) developed a method based on a semantic hierarchical model.

Interactive approaches take advantage of having access to the database designer during the integration process for conflict settlement or information clarification. In consequence, they permit the integration of less restricted data models and can perform a larger portion of the integration process, i.e. include conflict analysis. On the other hand, the reported interactive approaches typically do not include a complete algorithm for the integration process and do not exactly specify the restrictions placed on the data model (such as consistency).

2.2.2.1. Navathe's and Elmasri's Approach

Descriptions of various aspects of this method can be found in Navathe, Elmasri, Larson (IEEE 1986), Navathe and Elmasri (IEEE 1984), Elmasri and Navathe (1986), Elmasri et al. (1987).

Navathe's and Elmasri's approach concentrates on the idea of object class integration.
The entity-relationship model is extended to an entity-category-relationship model where a category refers to a class or an object type (common role or subclass). The atomic elements of this approach are entities, categories, relationships, and attributes.

Two types of categories are used, common role categories and subclass categories. A common role category is one that represents a common property of two or more otherwise different sets, e.g. the category OWNER represents a common role for both PERSON and COMPANY, who may both be owners of a vehicle. A subclass is a specialization of an entity set, e.g. the VEHICLE entity set has subclasses CAR and TRUCK. Common role and specialization will have an impact on inheritance of attributes.

The procedure consists of three steps: pre-integration, object integration and relationship integration.

Within pre-integration three tasks are performed. First, naming correspondences are established, resolving the problem of inter-view homonymy and synonymy. Synonymy and homonymy refer to the problem of different names designating the same real world object or identical names designating different real world objects (concepts¹). The second task is the identification of candidate keys for object classes. The third task is the definition of domains for object classes. Domains play an important role in Navathe's technique. The purpose of defining them within the pre-integration step is to gather information for the recognition of identical or related real world objects. For example, if two objects have the same domain, it may be suspected that these objects are identical.
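How domain information might be turned into integration hints can be sketched as follows. This is an illustration only, not Navathe's procedure: domains are assumed to be small finite sets that can be compared directly, and each pair of objects from two views is classified as identical, contained, overlapping, or disjoint. The function names, the object name PROPRIETOR, and all domain values are hypothetical.

```python
# Hypothetical sketch: a view maps object names to domains (finite sets).

def compare_domains(d1, d2):
    # Classify the set relationship between two domains.
    d1, d2 = set(d1), set(d2)
    if d1 == d2:
        return 'identical'
    if d1 <= d2 or d2 <= d1:
        return 'contained'
    if d1 & d2:
        return 'overlapping'
    return 'disjoint'


def classify_pairs(view1, view2):
    # Compare every object of view1 with every object of view2.
    return {(n1, n2): compare_domains(dom1, dom2)
            for n1, dom1 in view1.items()
            for n2, dom2 in view2.items()}


view1 = {'CAR': {'sedan', 'coupe'},
         'OWNER': {'person', 'company'}}
view2 = {'VEHICLE': {'sedan', 'coupe', 'truck'},
         'PROPRIETOR': {'person', 'company'}}

for pair, relation in sorted(classify_pairs(view1, view2).items()):
    print(pair, relation)
```

In this toy input, OWNER and PROPRIETOR have identical domains and are therefore candidates for being the same real world object, while CAR is contained in VEHICLE. The final decision would remain with the designer.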
Integration of objects ( e n t i t i e s or categories) i s the second phase of Navathe's scheme. In t h i s phase, information on 1 Navathe uses the term \"concept\" to r e f e r to a r e a l world object, while B a t i n i uses the term \"concept\" for a data model element such as an e n t i t y , a t t r i b u t e or r e l a t i o n s h i p . 37 domains i s used to determine s i m i l a r i t i e s or d i s s i m i l a r i t i e s among view o b j e c t s . Navathe analyses the following cases: i d e n t i c a l domains, contained domains, overlapping domains,and d i s j o i n t domains. INTEGRATION OF OBJECTS The integration of relationships follows the object integration step. Navathe points out that f o r r e l a t i o n s h i p integration both s t r u c t u r a l and semantic considerations are important. R e l a t i o n s h i p s are c l a s s i f i e d a c c o r d i n g t o thre e c r i t e r i a : degree (which i s not the mapping r a t i o but the number of o b j e c t s i n v o l v e d i n the view (construct)), r o l e s of object classes involved i n the relati o n s h i p , and s t r u c t u r a l constraints, such as mapping r a t i o s . The r e l a t i o n s h i p i n t e g r a t i o n process evaluates the above information i n the following sequence of importance: degree i n f o r m a t i o n ( s a m e / d i f f e r e n t d e g r e e ) , r o l e i n f o r m a t i o n (same/different r o l e s ) , and s t r u c t u r a l vs. domain constraints r e s u l t i n g i n 8 integration cases. The main points to be learnt from Navathe \u00E2\u0080\u00A2 s approach are the st r e n g t h of domain information and category information for view i n t e g r a t i o n , the p o s s i b i l i t y of simultaneous n-object integration (in some instances), and the relevance of p a r t i c u l a r 38 p i e c e s of i n f o r m a t i o n during the r e l a t i o n s h i p i n t e g r a t i o n phase. 2.2.2.2. 
Ba t i n i ' s Approach For references see for instance B a t i n i et a l . (1984a, 1983), or B a t i n i and Lenzerini (1983). B a t i n i ' s approach performs integration on the atomic elements of the e n t i t y - r e l a t i o n s h i p model, e n t i t i e s , r e l a t i o n s h i p s , and a t t r i b u t e s . View i n t e g r a t i o n i s presented as an i t e r a t i v e process which aggregates views pairwise. Whenever c o n f l i c t s a r i s e between the two views, a c o n f l i c t r e s o l u t i o n process i s invoked and c a r r i e d out i n t e r a c t i v e l y with a database designer. The t e c h n i q u e s t a r t s out with a name c o n f l i c t a n a l y s i s , i d e n t i f y i n g i n t r a - v i e w homonyms and synonyms and removing them. These can be naming c o n f l i c t s for the same concepts (e.g. e n t i t y ) or f o r d i f f e r e n t concepts (e.g. e n t i t y vs. r e l a t i o n s h i p ) . T h i s step i s followed by a type c o n f l i c t a n a l y s i s which r e s u l t s i n the same r e a l world object being represented by the same concept i n d i f f e r e n t views (e.g. MARRIAGE always an entity) and i n an adjustment of c a r d i n a l i t i e s (mapping r a t i o s ) and o p t i o n a l i t i e s of a t t r i b u t e s and r e l a t i o n s h i p s i n d i f f e r e n t views to make them i d e n t i c a l . 39 F i n a l l y , merging and redundancy a n a l y s i s superimposes the adjusted views and removes redundancies such as redundant c y c l e s 1 . B a t i n i ' s method builds a global schema i t e r a t i v e l y , integrating two views into a temporary global schema and adding additional views to t h i s schema u n t i l a l l views have been consolidated. The two main elements of the technique are C o n f l i c t Analysis (together with merging) and Redundancy Analysis, with the main focus on C o n f l i c t A n a l y s i s . U n l i k e other authors such as Martin, B a t i n i et a l . 
address the problem of inconsistencies between different users' perceptions of the world and different naming conventions systematically (but not completely).

The goal of Conflict Analysis is to detect and solve all existing conflicts between two representations (views) of the same classes of objects. Two types of conflicts are tackled, naming conflicts and type conflicts. Naming conflicts arise if the same data model concept (entity, attribute or relationship) is labelled differently (synonyms), or if different concepts are labelled with the same name (homonyms). Type conflict analysis determines whether objects have compatible concepts (types) and adjusts them if necessary.

[1] The technique also includes quantitative and procedural aspects to arrive at a procedurally more adequate schema where frequent database operations can be carried out more efficiently.

To define homonymy and synonymy, Batini et al. refer to the view representation of real world objects. If a view S1 represents two different real world objects with the same concept (name), this is called an intra-view homonym [1]. Accordingly, synonymy refers to the same real world object being represented by two different concepts within one view. Given these view inconsistencies, Batini identifies a number of possible scenarios and solution alternatives. Interesting in Batini's procedure is the focus on only intra-view inconsistencies. Inter-view inconsistencies are, at least in this step, ignored.

A second step in the naming conflict analysis is the so-called analysis of concept likeness or unlikeness.
The attempt in this step is to find out whether a concept that has the same name in two different views possesses different "neighbor properties" (concept unlikeness), or whether concepts have different names but some common neighbor properties (concept likeness).

The next step in Batini's approach is the Type Conflicts Analysis. Its purpose is to assign the same concepts to identical real world objects in different views. That is, if MARRIAGE were a relationship in one view, but an entity in the other one, at least one of these representations would be changed to let MARRIAGE be represented by only one concept. The conversion of concepts is restricted to only atomic concepts (entity, attribute, relationship) and results in two views using the same names and the same concepts to describe real world objects.

[1] Usually one would expect inter-view homonymy to be the more important issue, two views supplying the same name to two different real world objects.

The second part of type conflict analysis is compatibility checking, a process which analyzes, among the now quite similar views, whether cardinalities (mapping ratios) are identical. Compatibility checking also discovers differences in the optionality of attributes and relationships. According to Batini et al., differences in cardinalities point to errors in one of the views, or alternatively to a containment relationship.

Once all conflicts have been resolved, Merging and Redundancy Analysis follow. In merging, the conflict-free views are superimposed. Redundancy analysis removes redundant alternate paths between objects.
Redundancies can occur because multiple paths are semantically equivalent.

Batini's technique concludes with an update of the individual views to make them consistent with the newly generated global schema and with an alteration of the global schema to include procedural and quantitative aspects.

Batini's approach provides a procedure for the integration process together with some exact conflict resolution algorithms, yet, based on its description in the literature, it cannot be automated. The method does not clarify when a particular integration rule has to be applied, or which information has to be available (Navathe is more exact in this matter, basing his resolution scheme on information on class membership).

2.3. View Integration Cases

The investigation of the above view integration techniques found considerable overlap among techniques with respect to their integration capabilities. When techniques differ, they typically deviate in their conflict resolution capabilities and in aspects of the integration method related to their individual data models. The more recent techniques typically provide a richer set of cases for conflict resolution. Consensus exists with respect to the integration cases for sets (of entities or relationships) whose connection to each other is known, as represented in the following eight cases.
Object Class Integration:
(1) Identical object classes
(2) Contained object class
(3) Overlapping object classes with a common superset
(4) Disjoint object classes with a common superset

Relationship Integration:
(5) Relationship identity
(6) Relationship containment
(7) Relationship overlap with a common superset relationship
(8) Disjoint relationships with a common superset relationship

The table below depicts which of the above cases are supported by the techniques presented in the chapter ('y' indicates the technique's ability to deal with the case, a blank indicates that no reference has been made to how this case would be solved).

Technique: Cases 1 2 3 4 5 6 7 8
Martin: cases do not apply
Bernstein: cases do not apply
Casanova and Vidal: y y y y y y y y
Yao et al.: y y y
Navathe et al.: y y y y y y y y
Batini et al.: y y y

2.4. Conclusion

This section shall point out the comparative strengths and weaknesses of syntactic and semantic integration approaches.

Syntactic approaches

Restricted Data Models

Syntactic approaches place considerable restrictions on the data model with which views are represented. For example, Biskup and Convent's model is restricted to only proper database schemes which impose restrictions on the fields to which constraints can apply. Typically, all dependencies have to involve the key or a key attribute. Bernstein refers in his technique to the uniqueness assumption which dictates that only one functional dependency may exist between any pair of fields. He also points out that this restriction may lead to the necessity to bury semantics in data item names [1].
[1] For instance, two fields Emp# and Dept# may be related by the functional dependency "employee is located in department" or by another dependency "employee is employed by department". Syntactic models require a renaming of at least one of the Dept# fields in this case.

No Conflict Analysis

The syntactic approaches operate under the assumption that the data required for integration is complete and correct. Therefore, conflict analysis is not part of the techniques. The techniques can deal with simple conflicts, for instance with synonymy, if identity is established by means of constraints.

No Ability to Deal with Incomplete or Inconsistent Data

Again, the ability to deal with incomplete or inconsistent data is outside the scope of syntactic integration techniques. At least one technique, Biskup and Convent's, will, when an unresolvable problem is encountered, interact with the designer to resolve the problem in order to allow a continuation of the integration process. However, this form of exception handling is not a planned form of conflict analysis, but a measure to let the technique continue when none of the integration cases is considered performable by the technique.

Extensive Information Requirements

The major information requirement of syntactic approaches is knowledge of dependencies between data items. Since all dependencies are defined on the attribute level, this information requirement exceeds that of semantic approaches which represent dependencies on the entity level only. Furthermore, the requirement to also define inter-view constraints can lead to an exponential explosion of constraint definitions.

Computationally Hard

Casanova and Vidal and Biskup and Convent point out the computational requirements of their techniques.
Provide Integration Algorithm

One major advantage of syntactic approaches is the completeness of procedures. The approaches, instead of outlining only particular integration cases, typically present a procedure that upon termination has produced an integrated database schema.

Show Optimality (Feasibility) of Design

Another major advantage of syntactic approaches is their ex-ante specification of design objectives and their proof of achievement of these design objectives.

Semantic approaches

Require Designer Interaction

Based on the fact that semantic approaches operate on objects meaningful to users but often not meaningful to the integration mechanism, these approaches require designer interaction for interpretation of objects and for conflict analysis.

Cover Larger Portion of the Integration Process

In addition to the operations contained in syntactic approaches, semantic approaches also include conflict analysis procedures and pre-integration procedures (see Batini et al., 1986) which are concerned, among other factors, with data gathering.

State/Solve More Integration Cases

Semantic techniques identify and solve more integration cases since they include not only the simple eight cases based on set inter-relationships as explained above, but also cases involving conflicts.

Allow Less Restricted Data Models (i.e., non-similar keys)

Semantic methods perform integration based on the meaning of objects, not (exclusively) based on structural similarities. Therefore, a semantic approach can possibly integrate two object classes in which one is a subset of the other, even when the object classes have different keys.

Less Complex

Semantic approaches simplify the integration process in two ways.
First, the amount of detail is much less than that of syntactic approaches, since the focus is on entity-level items. Second, semantic data items are more meaningful to humans than arbitrary collections of fields held together only by dependencies.

Deal with Database Objects Meaningful to Designers and Users

The outcome of the design process is also more profound for the database user, since the database objects are meaningful to database users. A syntactic integration, based purely on dependencies, may derive database objects that are not suggestive to the user. One of Batini et al.'s (1986) criteria for goodness of a design is understandability.

Do not Provide Complete Procedures

One of the major weaknesses of the semantic approaches is the limited description of complete procedures for integration. Even though a variety of integration cases is outlined, the description of sequences of integration steps and possible re-iterations is, if not missing, at least very terse. In addition, when dealing with conflict analysis, semantic approaches are not complete in their analysis, nor do they show the missing elements of the analysis.

Do not Present Proof of Optimality of the Design

A consequence of the incompleteness of semantic integration procedures is their inability to demonstrate the optimality of the final design. No semantic procedure states a point at which the procedure terminates and has achieved a final design. Also, the objectives of semantic approaches involve the criterion of understandability which cannot be measured as easily as, for instance, adherence to normal forms. Yet, even for the criteria that can be shown more easily, semantic approaches typically do not provide any proof of optimality or feasibility.
Overall, conflict analysis and resolution is the common weak point in all integration techniques. Three causes of this deficiency are:

(1) Syntactic techniques cannot deal with conflict analysis at all. They ignore conflicts in general.
(2) If conflict analysis is done, it is often done unsystematically. Batini et al. (1983) perform the most thorough analysis by separating naming conflicts from type conflicts and then analyzing them separately. This analysis is still not sufficient to identify, let alone solve, all possible causes of conflicts.
(3) Conflict analysis is biased by information requirements considerations. Only cases are considered for which information is easily available (e.g. mapping ratios), which are most prominent (e.g. synonyms), or which are of particular concern due to the data model chosen (e.g. semantic relativism, or mapping ratios).

In contrast, a more systematic procedure should be aware of all possible conflict cases and then should determine the information requirements to solve them. Thus, even if the technique is not able to resolve all conflicts due to lack of information, it is at least aware of the possibility of existence of a certain conflict, and thus of its own limitations!

Batini et al. (1986) summarize the lack of research in the area of conflict analysis as follows:

... Simple renaming operations are used for solving naming conflicts by most methodologies.
With regard to other types of conflicts, the methodologies do not spell out formally how the resolution process is carried out; however, an indication is given in several of them as to how one should proceed. ... (p. 348)

And further:

... It is interesting to note that among the methodologies surveyed, none provide an analysis or proof of the completeness of the schema transformation operations from the standpoint of being able to resolve any type of conflict that can arise. ... (ibid.)

The solution to these problems will therefore form the core of this research project.

3. SYSTEM FOR VIEW INTEGRATION

3.1. Research Question and Contribution to Knowledge

Research question 1:
1.1 Can a view integration process be formalized which transforms any collection of conflicting views into a complete and consistent global schema?
1.2 Which conflict cases have to be solved in the process?

The purpose of this research question is to solve the conflict analysis problem, initially neglecting information requirements. Assuming sufficient information, a mechanism is to be developed that allows the detection and solution of all view conflicts. The view integration mechanism shall be able to convert a collection of views into a complete and consistent global schema, using the previously introduced group of 8 simple integration cases for set-subset relationships, as well as others to be defined later.

Based on the sufficient information assumption, conflict cases can be described and solved without concern for the difficulty of data gathering. Instead of mixing the conflict problem with the information requirements problem, question 1 deals only with the former one.
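The eight simple integration cases referred to above rest on four possible relationships between two sets: identity, containment, overlap, and disjointness. The following minimal Python sketch illustrates that distinction, assuming that an object class (or a relationship) can be represented by the set of its instances. The names `SetRelation` and `classify_sets` are illustrative only and do not come from the techniques surveyed.

```python
from enum import Enum


class SetRelation(Enum):
    """Four set relationships underlying the simple integration cases."""
    IDENTICAL = "identical"
    CONTAINED = "contained"
    OVERLAPPING = "overlapping"
    DISJOINT = "disjoint"


def classify_sets(a: set, b: set) -> SetRelation:
    """Classify two instance sets (hypothetical representation of object classes)."""
    if a == b:
        return SetRelation.IDENTICAL
    if a < b or b < a:  # proper subset in either direction
        return SetRelation.CONTAINED
    if a & b:  # some, but not all, instances are shared
        return SetRelation.OVERLAPPING
    return SetRelation.DISJOINT


# Example instance sets, invented for illustration.
suppliers = {"acme", "bolt_co", "cogworks"}
manufacturers = {"acme", "bolt_co"}
print(classify_sets(manufacturers, suppliers).value)
```

The sketch presupposes the sufficient information assumption: the instance sets are simply given. Obtaining that information in practice is the subject of the second research question.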
The first step in answering this research question will be the identification and solution of a complete set of conflict cases. The second step will focus on the development of a procedure to carry out the integration, based on the set of cases.

Research question 2:
2.1 What information can be used for the integration of user views into a global database schema when the necessary information is not explicitly available?
2.2 How can this information be gathered in a process that limits designer interrogation to a feasible level?

The basis for the second question is the assumption that in all practical situations the necessary information about views is either unavailable, or too difficult or too costly to gather. Therefore, even though the answer to question 1 reveals which information is necessary to perform view integration, not all of this information can be expected to be present. Hence, substitutes have to be found for the missing information; substitutes that can be either known by the program (program's knowledge base) or which can be easily gathered through a minimum of interaction with the database designer.

The term "substitutes" may be better phrased as "operationalizations" of information on some database concept. For example, given sufficient information, the system will know that two relationships have identical meaning, even if their names differ. A system with insufficient information has to rely on operationalizations of the "meaning" concept to assess the identity of such relationships. Domain identity and identity of neighbour entities may be such operationalizations.
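The idea of an operationalization can be illustrated with a small Python sketch. It assumes, purely for illustration, that each relationship records a domain and the names of its neighbour entities; the class `Relationship` and the function `possibly_same_meaning` are hypothetical names, not part of any technique described here. The check is a surrogate for "meaning" and may fail in both directions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    name: str
    domain: frozenset      # assumed: some comparable description of the domain
    neighbours: frozenset  # names of the entities the relationship links


def possibly_same_meaning(r1: Relationship, r2: Relationship) -> bool:
    """Surrogate test for identical meaning; names are deliberately ignored.

    Uses the two operationalizations named in the text: domain identity
    and identity of neighbour entities.
    """
    return r1.domain == r2.domain and r1.neighbours == r2.neighbours


# Invented example: two differently named relationships from two views.
buys = Relationship("BUYS", frozenset({"customer-product pairs"}),
                    frozenset({"CUSTOMER", "PRODUCT"}))
purchases = Relationship("PURCHASES", frozenset({"customer-product pairs"}),
                         frozenset({"CUSTOMER", "PRODUCT"}))
print(possibly_same_meaning(buys, purchases))
```

A positive result would only suggest a synonym candidate; whether a designer should confirm it is a design choice outside this sketch.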
The intention behind the second question is not to find "tricks" to solve the limited information problem, but to identify substitute information; information items that allow the assessment of concepts such as "meaning", which are difficult to grasp by a computer. The knowledge of these substitutes will also teach us about alternative information requirements of data modelling techniques.

Even though availability of integration information is an important concern, the apparent lack of substitute information should not limit the comprehensiveness of the integration mechanism. Conflict analysis, at least in principle, should not be based on the convenience with which relevant information items can be produced. On the contrary, question 2 should ideally attempt to find information sources for all requirements raised in question 1. In other words, question 1 aims at stating and solving the integration problem in a sufficient information environment, question 2 aims at solving that integration problem in a limited information environment.

In order to decide on the best information substitute in the limited information environment, questions have to be raised on the suitability of certain pieces of information. The following list gives suggestions the selection should be based on. The term "concept" refers to the information concept to be used as a substitute:

1. How well does the concept represent the underlying information that is necessary for database design?
2. When does the concept fail as a surrogate for the underlying information?
3. Can the user/database designer provide the information, or can it be gathered from some other source?
4. How easily can the information be gathered during the integration process?
The last point brings up the issue of developing a process for view integration which requires the least amount of interaction by using as much inferred information as possible. Given sufficient information, designer interaction is ideally not necessary [1]. Given limited information, designer interaction will be necessary. Therefore, a process developed to answer research question 1 may require redesign to increase its usefulness. For example, a useful design change would be a modification that enabled the technique to apply previously gathered information to later stages of the integration process. One has to keep in mind that a program will quickly lose its appeal as a productivity tool if it repeatedly asks the designer trivial questions. Such redesign does not change the integration cases, but the sequence of the analysis, as will be demonstrated later in the context of heuristics.

So, while the primary interest within this research is the discovery of an exhaustive set of conflict cases and resolution principles, the secondary interest is the development of an efficient integration procedure through choice of surrogates for certain pieces of information and through choice of a sequence that allows inferences to be made from the data already gathered.

[1] The integration mechanism which assumes information availability is implemented in the form of a programmed procedure that directs all questions concerning information requirements back to the designer (user of the mechanism).

Contribution to Knowledge:

A main result of the study is prescriptive knowledge, knowledge on how view integration should be carried out.
The starting point for this knowledge is the set of integration cases identified by the consensus of previous integration approaches. This research develops a systematic framework which encompasses the available integration knowledge (see chapter 2) as well as a set of additional cases for conflicting views. The research also demonstrates the framework's completeness.

Another result of the study is a set of heuristics for efficient execution of the integration process with limited information. The assumptions underlying these heuristics will be clearly stated. For example, suppose the following heuristic is implemented: "IF object A is identical to object B and object A will have the same construct (i.e., be both entities)." Heuristics are accompanied by explanations concerning their generalizability and effects of their failure.

Prescriptive knowledge encompasses knowledge on integration laws and integration process rules while descriptive knowledge encompasses process and information substitution rules. At the end, this research presents a set of information requirements and a set of integration rules which together are sufficient to perform the integration process including conflict resolution as well as an efficient integration process.

Another contribution to knowledge can be derived from this research. It is an extension of the relational data model regarding data semantics. It is well known that the relational data model in its current form is not well suited for capturing data semantics.
One step towards capturing data semantics is the data dictionary which keeps information on database items, either in computer or human interpretable form, e.g. on data types, or the meaning of the data in the relation tuples. A large amount of the dictionary information can be generated, virtually effort-free, as part of the design process. Thus, the outcome of the design process may not only be a set of relations, but also a data dictionary.

The view integration approach suggests information that should be captured in data dictionaries but has not been captured yet. This information may include data concerning the meaning of database objects. Future database management systems could have facilities to interpret this data in order to support the users and the system itself, for instance to improve the integrity of the database (fully integrated semantic dictionary) or at least to improve user understanding of database data. For example, the database could explain to the user that MANUFACTURER is a subclass of SUPPLIER which supplies parts and also manufactures these parts, or that SUPPLIER is a person or organization that in the present is supplying parts or in the past has been supplying parts.

3.2. Approach to the Problem

3.2.1. Overview

The problem solving approach chosen for this research is determined by the ill-structured nature of the view integration process and the previous research in the area. Previous research has identified several conflict cases and their solutions without assuring us that the problem has been solved in its entirety.

With the first research question, the attempt is made to develop a complete conflict resolution method. This task is simplified by the information availability assumption.
To answer this research question, an analytical problem solving approach was chosen. This approach identifies all possible conflict cases for any pair of objects [1] from different views and shows that the list of conflict cases is complete. The list contains 17 general conflict cases with various subcases.

Completeness has to be shown for this list. The demonstration of completeness rests on the assumption that all criteria which differentiate any two views or parts thereof (e.g. different names for the same object type, different meaning of two object types) have been identified here. Once all criteria are known by which objects can be distinguished, all possible combinations of criteria can be easily generated. The latter part of the argument has to justify why some of the possible combinations are irrelevant or why they are similar to other, already identified ones.

3.2.2. Outline of the Problem with Available Information

Even though some of the previous integration approaches have dealt with the conflict analysis (conflict recognition) problem in a systematic manner, their conflict classification schemes were not suitable to identify all possible combinations of object differences. Consequently, they have failed to identify some conflict cases. In this section, a categorization is presented which overcomes this weakness.

[1] Pairwise integration has been the procedural choice for most previous integration methods (see Batini et al., 1986). Only recently, some researchers (e.g., Navathe) have demonstrated parallel integration techniques for more than two views, applicable in certain conflict situations.
The cases discussed below represent an exhaustive list of possible conflicts between any two objects from different views. It will be argued that any possible conflict case is covered by the technique and that after resolution of conflicts, views are in a form in which they can be merged. It will also be argued that there exists a "causal ordering" (compare Simon and Ando, 1963) of conflict resolution cases which determines the sequence of steps within the integration process. Hence, an integration procedure following this ordering will be outlined.

Object comparison

Object comparison focuses on the detection of any identity or difference between two objects from different views. Objects may be of type entity, relationship, or attribute. For example, a designer arbitrarily picks one object from each of two views and wants to determine their identity or difference. To do this, he should choose four general criteria by which to evaluate objects:

(1) Name - are the objects' names identical?
(2) Construct - are both objects represented by the same construct?
(3) Meaning - do the objects have the same meaning?
(4) Context - are the objects associated with the same neighbor objects in both views?

The name criterion is a straightforward one and well described within the literature. In short, identical objects should have the same name, different objects should have different names. Otherwise, the object pairs are synonyms or homonyms.

Construct refers to the object type, e.g., entity. Identical objects should have the same construct, to allow their merging. Previous research has given many examples of construct mismatches and their resolution.

Meaning is the most difficult criterion.
Instead of a lengthy discussion about the interpretation of "meaning", at this point the following working definition will be used: two objects have the same meaning if they both represent the same real world object. Database design is a form of modelling. Real world objects are represented by database items. If two database items both model the same real world object, they have the same meaning. In previous research, meaning has not been explicitly discussed as a discriminating criterion, possibly because the meaning criterion is very difficult to assess. For instance, Navathe and Elmasri (for example, 1986) have frequently used domains or mapping ratios as discriminating criteria instead. We may think of domain information and mapping ratios as operationalizations capturing part of the meaning concept. Explicit representation of the meaning dimension will result in a simple and clear distinction of conflict cases [1].

Context refers to the objects that are immediate neighbors of an object. By definition, an entity will always have an empty context [2]. A relationship's context consists of all entities it is associated with. An attribute's context is the entity or relationship it belongs to. Context also has not been explicitly represented in previous research, even though previous researchers were aware of the importance of context, as their conflict recognition and resolution examples show.
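The definition of context given above can be written down directly. The sketch below is only an illustration under assumed data structures: the class `ModelObject`, its field names, and the function `context` are hypothetical and chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ModelObject:
    name: str
    construct: str                        # "entity", "relationship", or "attribute"
    linked_entities: Tuple[str, ...] = () # used by relationships only
    owner: Optional[str] = None           # used by attributes only


def context(obj: ModelObject) -> frozenset:
    """Return the immediate neighbors of an object, following the definition in the text."""
    if obj.construct == "entity":
        return frozenset()                     # empty by definition
    if obj.construct == "relationship":
        return frozenset(obj.linked_entities)  # all associated entities
    if obj.construct == "attribute":
        return frozenset({obj.owner})          # owning entity or relationship
    raise ValueError(f"unknown construct: {obj.construct}")


def same_context(a: ModelObject, b: ModelObject) -> bool:
    """Context criterion: are both objects associated with the same neighbors?"""
    return context(a) == context(b)


# Invented example objects.
orders = ModelObject("ORDERS", "relationship", ("CUSTOMER", "PRODUCT"))
places = ModelObject("PLACES", "relationship", ("PRODUCT", "CUSTOMER"))
print(same_context(orders, places))
```

Comparing neighbors by name, as done here, silently assumes that the neighbor names have already been reconciled between the two views.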
Based on the four criteria and two states of each criterion (identity or difference), a 2 x 2 x 2 x 2 matrix with 16 general cases of identity and difference of object pairs can be constructed. To exemplify the principles of conflict recognition and resolution, only the first three criteria, name, construct, and meaning, will be discussed in more detail and represented graphically in this section (see Figure 1). For now, the conflict problem can be simplified by assuming that whenever two objects have identical meaning, their contexts will be identical. Whenever their meanings are different, their contexts may be different or identical. The subsequent sections will deal with the full integration problem, allowing variations in context, even if meaning is identical.

¹ The main difficulties of meaning representation are completeness of the representation and differences in user perspective. For example, when asked about the meaning of "lion", most people may reply "dangerous animal", while a lion tamer would probably reply "livelihood". These are two different, incomplete interpretations which are both acceptable. For a discussion of the meaning concept consult Russell (1946).

² Even though entities have no context by definition, it may be useful later to think of an entity's context as the relationships it is involved in.
[Figure 1: Object Comparison Matrix (Cases 1 to 8, arranged by identity or difference of name, meaning, and construct)]

Each of the cases depicted in Figure 1 will be briefly presented below. The focus of this discussion shall be on the cases, not on their detailed solution. Unless solutions are simple or necessary for the discussion, they will be postponed to subsequent chapters. Note that not all cases below describe conflicts. For instance, if two objects are identical (Case 1), they can be merged without modifications. Other cases, such as synonymy (Case 2), require an object change.

Case 1: [Name:same; Meaning:same; Construct:same]

This condition corresponds to cases 1 and 5 from previous research (see chapter 2). Two objects are identical in all factors.

Example:
V1: CUSTOMER (entity)
V2: CUSTOMER (entity)
both describing the same customer object type.

The notion of identity is not only meaningful for entities, as exemplified, but also for identical relationships linking identical entities, and for identical attributes of identical entities (identical context).

Case 2: [Name:different; Meaning:same; Construct:same]

This is the case of a synonym. Both objects are identical but carry different names. Note that both objects have the same construct (e.g., entity).

Example:
V1: CUSTOMER (entity)
V2: BUYER (entity)
both describing the same real world customer object type.

Case 3: [Name:same; Meaning:same; Construct:different]

This case describes a situation where the same object is represented by different modelling constructs. This case will be referred to as construct mismatch.
Brodie (1984) refers to this difference in construct as "semantic relativism", e.g., when the same object is represented as an entity in one view and as a relationship in another view.

Example:
V1: MARRIAGE (entity)
V2: Marriage (relationship)

Both views describe marriage objects. Both views use the same name, but a different construct. For view 1, a marriage is an entity (probably with husband and wife attributes); for view 2, a marriage is a relationship (probably linking two person entities). The solution to this case is a change in one of the constructs, either making the entity a relationship or vice versa. At the end, each object should be represented by the same construct in all views. This example describes only one of many possible construct mismatch scenarios.

Case 4: [Name:different; Meaning:same; Construct:different]

This case is closely related to the previous one. Again, both objects have the same meaning, but this time they not only have different constructs, but also different names. Therefore, the identity of the objects is disguised even further, by name differences on top of construct differences.

Example:
V1: MARRIAGE (entity)
V2: Married_to (relationship)

While the two views use similar names, to a syntactic processor the names will be different.

Case 5: [Name:same; Meaning:different; Construct:same]

This case marks homonyms. The objects carry the same name, but have different meaning. The objects have the same construct (e.g., entity).

Example:
V1: SUPPLIER (entity)
V2: SUPPLIER (entity)

Here the same name SUPPLIER is used both for suppliers (currently supplying the product) and for manufacturers (who manufacture the product and may be potential suppliers).

Case 6: [Name:different; Meaning:different; Construct:same]

This case may refer to a trivial situation in which two objects are different in meaning and name, but have the same construct. On the other hand, it may refer to a number of more complex situations of non-identical but related (i.e., superset-subset relationship) objects.

Example 1: trivial situation
V1: EMPLOYEE (entity)
V2: DEPARTMENT (entity)

Example 2: related objects
V1: STUDENT (entity)
V2: UNDERGRAD (entity)

The entities in the first example refer to two different real world objects which are not related¹. The objects represented in the second example are related, namely through a superset-subset relationship. Whenever there exists such a connection between two items they cannot be treated as independent. The eight cases extracted from previous research provide solutions for such non-identical but related sets.

¹ "Related" is used here to express that two object classes are either overlapping or are contained by a common object class.

Case 7: [Name:same; Meaning:different; Construct:different]

This case captures homonyms. Again, the name of two objects is the same, but they differ both in meaning and in the construct used. Note that this case may also contain objects that have different meaning but are related to each other (as in Case 6).

Example:
V1: SUPPLIER (entity)
V2: Supplier (attribute)

The name supplier is used for both an entity and for an attribute, and the attribute does not refer to the same supplier object (i.e., it refers to a manufacturer object).

Case 8: [Name:different; Meaning:different; Construct:different]

This case describes objects which are different in every respect: meaning, name, and construct.

Example:
V1: SUPPLIER (entity)
V2: Department (attribute)

Supplier and department are different objects altogether, with no similarities between them. Again, this exemplifies the trivial form of the case. But again, objects may also be related.

The above eight cases fall into two main groups: objects that will ultimately be completely identical and objects that are different. Whether an object pair belongs to the first or the second group is determined by its meaning dimension. The first group consists of cases 1, 2, 3, and 4. The second group is represented by cases 5, 6, 7, and 8. In either group, certain cases describe stable states. In the first group, for example, case 3 (semantic relativism) becomes a case 1 (identical items) once different constructs are eliminated. Case 4 becomes a case 3 once objects are renamed. Within the group of different objects there exist two stable states. If objects are related (i.e., one is a subset of the other), they will ultimately belong to case 6, e.g., after renaming from case 5. If they are unrelated, they will belong to case 8 or case 6¹.

The complete pattern of transformations into stable states is shown in Figure 2. The figure depicts comparison cases and transformations from one case into another. The transformation arrows show the direction of transformation during the integration process.

2 -> 1  convert true synonyms into identical items through renaming.
3 -> 1  convert construct mismatch into identical items through change of different constructs.
4 -> 3  convert construct mismatch and synonym into just semantic relativism through renaming, or
4 -> 2  convert construct mismatch and synonym into synonym through construct change.
5 -> 6  convert homonyms into different items (possibly related) through renaming.
8 -> 6  convert different items with different constructs into different items with same constructs (only if items are different but related) through construct changes.
7 -> 5  convert homonymy with different construct into homonymy through construct change (only if objects are related).
7 -> 8  convert homonyms into different items through renaming.

¹ If the objects are unrelated, case 8 is a stable state, requiring no changes during conflict resolution. If objects are related, ultimately, the objects will be transformed into state 6.

[Figure 2: Case Transformations during View Integration]

The transformation sequences have three end points: Case 1, Case 6, and Case 8. Case 1 is the end point for all objects with the same meaning. It is captured by cases 1 and 5 extracted from previous research. Case 8 is the end point for all items which are different in all aspects and not related. Its solution is trivial. All these non-identical items will be included in the global schema. Case 6 is the end point for non-identical, unrelated items with the same construct (trivial solution) and for different but related objects. If objects are related, cases 2 to 4 and 6 to 8 from previous research (chapter 2) will apply.

The case transformations (Figure 2) are free of circularities.
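
The eight comparison cases and their transformations can be written down as a small directed graph. The following Python sketch is an illustration only and not the implementation used in this research. The names `classify`, `TRANSFORMATIONS`, `STABLE`, `is_acyclic`, and `stable_end_points` are hypothetical. The edge 8 -> 6 is listed unconditionally, although in the text it applies only to related objects.

```python
def classify(same_name: bool, same_meaning: bool, same_construct: bool) -> int:
    """Map identity/difference of name, meaning, and construct to Cases 1 to 8."""
    table = {
        (True, True, True): 1,     # identical
        (False, True, True): 2,    # synonym
        (True, True, False): 3,    # construct mismatch
        (False, True, False): 4,   # construct mismatch and synonym
        (True, False, True): 5,    # homonym
        (False, False, True): 6,   # different objects
        (True, False, False): 7,   # homonym, different construct
        (False, False, False): 8,  # different objects, different construct
    }
    return table[(same_name, same_meaning, same_construct)]


# Case transformations as listed in the text (case -> possible successor cases).
TRANSFORMATIONS = {1: [], 2: [1], 3: [1], 4: [3, 2], 5: [6], 6: [], 7: [5, 8], 8: [6]}
STABLE = {1, 6, 8}  # end points named in the text


def is_acyclic(graph: dict) -> bool:
    """Depth-first search; a node met again while still active signals a cycle."""
    state = {}

    def visit(node):
        if state.get(node) == "done":
            return True
        if state.get(node) == "active":
            return False
        state[node] = "active"
        ok = all(visit(successor) for successor in graph[node])
        state[node] = "done"
        return ok

    return all(visit(node) for node in graph)


def stable_end_points(graph: dict, start: int) -> set:
    """Stable cases that can be reached from a starting case."""
    reached, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for successor in graph[node]:
            if successor not in reached:
                reached.add(successor)
                frontier.append(successor)
    return (reached | {start}) & STABLE
```

Under these assumptions, `is_acyclic(TRANSFORMATIONS)` confirms the absence of circularities, Case 4 can only end in Case 1, and Case 7 can end in Case 6 or Case 8.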
This makes it possible to postulate an ordering of conflict recognition and resolution. Figure 3 illustrates one possible ordering. The operations to be carried out first are construct changes (4->2, 3->1, 7->5, 8->6) for identical and for related objects. This is followed by the change of names for synonyms (2->1) and homonyms (5->6 for related objects, 7->8 for unrelated objects). The termination points of the procedure are cases 1, 6, and 8. The other possible ordering would attend to name changes prior to construct changes. For now, both sequences are treated as equally valid, although the first one will turn out to be preferable, as will be explained later.

[Figure 3: Ordering of View Integration Steps]

3.2.3. Changes in the Integration Method when Necessary Information is not Directly Available

The integration method discussed so far is based on the assumption that the information necessary to carry out the integration process is directly available. For the required information to be available, it either has to be specified ex-ante, or has to be elicited during the view integration process. Since information specification requires designer effort and represents a cost, it is desirable to reduce information specification requirements for the database designers. Hence, while previously the focus was on the design of a complete method for integration, the focus will now be on a human-oriented complete method for view integration.
The new goal will be to determine object identity, difference, and relatedness with a small number of intelligent (i.e., non-redundant) questions. Obviously, the method should base future questions on answers to previous ones. This is a minimum requirement. The following list of questions outlines other areas in which the procedure can be improved.

1. How many objects shall be included in the object comparison?
2. Which objects should be compared?
3. What is the sequence of conflict diagnosis and therapy?
4. How shall identity or difference be decided?

How many objects?

The previously outlined procedure always compared object pairs, i.e., "is entity E1 identical in meaning to entity E2?" This type of question can always be answered with "yes" or "no", but for n objects in view 2 this form of questioning requires n questions. By asking, "is E1 identical to one of {E2, E3, ..., Em}", the number of questions can be reduced to 1. The question can be answered either with the object's identifier, or with "no". This form of questioning drastically reduces the questioning effort. The questioning format will always be 1:n instead of 1:1. An m:n format will not be used, since the answers become awkward (a list of tuples of identical objects).

Which objects?

The procedure would not behave intelligently if it included objects in the comparison that should not be included. For instance, if E21 from view V2 was found to be identical to E11 from view V1, the procedure should never again inquire whether E21 is identical to any other object from V1. Other rules, which are described in the results chapter, reduce the group of relevant objects even more.
Furthermore, heuristics (also rules, but not always true) were found to reduce the group of objects even further. For example, once two entities are found to be identical, and both participate in relationships, one may expect to find identical pairs of relationships within these smaller groups.

Which sequence?

So far, sequences of object modifications have been outlined which resulted in stable state cases: (Case 1) identical objects, (Case 6) different but related objects, and (Case 8) different and unrelated objects. For instance, a case of construct mismatch (Case 3) was transformed into Case 1 through a construct change. The question is whether the method should operate by actively searching for conflict cases such as Case 3 or Case 4. The answer is "no". A human-oriented integration procedure will alter the sequence of tests. Following the assumption that, in the absence of information to the contrary, two views are identical, the procedure will always attempt first to find matching objects, not object mismatches. For example, typically the assumption at the outset of the object comparison will be that for an object O1 in view V1 there exists an object O2 in view V2 with an identical construct, e.g., both are relationships. Figure 4 briefly outlines the basic sequence of tests.

[Figure 4: Conflict Recognition Procedure (abbreviated)]

For any object O1 from view V1 and any set of objects {O2} from view V2, the first test is a test for identity of meaning. If it fails, a test for construct mismatch follows. If there is no construct mismatch, an object is assumed to be missing.
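
The abbreviated test sequence, combined with the 1:n questioning format and the exclusion of already matched objects, can be sketched as follows. This Python fragment is an illustration under stated assumptions, not the procedure implemented in this research: objects are reduced to (name, construct) pairs, the function `ask` stands in for the designer who answers each question, and the ORDER entity is an invented example.

```python
def recognize(view1, view2, ask):
    """For each object of view 1, test identity of meaning first, then construct
    mismatch; if both fail, the counterpart is assumed to be missing.

    ask(o1, candidates) represents one 1:n question to the designer and returns
    the matching candidate or None.
    """
    matched = set()   # view 2 objects that already have a partner
    findings = []
    questions = 0
    for o1 in view1:
        remaining = [o2 for o2 in view2 if o2 not in matched]
        same_construct = [o2 for o2 in remaining if o2[1] == o1[1]]
        other_construct = [o2 for o2 in remaining if o2[1] != o1[1]]
        verdict, partner = "missing", None
        for label, candidates in (("identical meaning", same_construct),
                                  ("construct mismatch", other_construct)):
            if not candidates:
                continue  # nothing to ask about
            questions += 1
            answer = ask(o1, candidates)
            if answer is not None:
                verdict, partner = label, answer
                matched.add(answer)
                break
        findings.append((o1, verdict, partner))
    return findings, questions


view1 = [("CUSTOMER", "entity"), ("MARRIAGE", "entity"), ("ORDER", "entity")]
view2 = [("BUYER", "entity"), ("Married_to", "relationship")]

# Assumed designer knowledge, used here only to simulate the answers.
truth = {("CUSTOMER", "entity"): ("BUYER", "entity"),
         ("MARRIAGE", "entity"): ("Married_to", "relationship")}


def ask(o1, candidates):
    partner = truth.get(o1)
    return partner if partner in candidates else None


findings, questions = recognize(view1, view2, ask)
```

In this small example two questions suffice for three objects, because matched objects are removed from later comparisons and no question is asked when no candidates remain.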
Note that name and context difference or identity are ignored at first. The test for relatedness, which begins with the assumption of relatedness, is separated from the test for identity of objects. Tests for relatedness are postponed until all tests for identity are carried out.

How to decide on identity or difference?

For all object characteristics, identity or difference has to be asserted. While this is simple for construct and name, it is not for meaning and context. Only people can ultimately judge whether two objects have the same meaning, but an intelligent integration procedure should help as much as possible in making this decision. In short, the procedure will help by filtering out objects that are not identical to the object in question. Rules to filter out these non-corresponding objects are defined.

3.2.4. View Integration Conflict Cases

Previously, only 8 of the 16 general types of cases were discussed, when context was held constant. The case list below describes all possible cases for the comparison of two arbitrary objects from different views.

Cases are identified by name (N), construct (object type T), meaning (M), and context (C) of the involved objects. Object O1 is denoted through , object O2 through . For every case the equality or non-equality of parameters is stated. The overview table below shows for each case identity or difference along the four dimensions. For example, a '=' under N means that both objects have identical names, an 'x' means they are different. For the meaning dimension, 'r' means the meanings are different but related.

Case  N    T    M    C     Description
1     =    =    =    =     Identical objects
2     =    =    =    x     Identical objects with different context
3     x    =    =    =     Synonym
4     x    =    =    x     Synonym with different context
5     =    x    =    x     Construct mismatch (semantic relativism)
6     x    x    =    x     Construct mismatch and synonym
7     x    =    x    =/x   Different and unrelated objects
8     =    =    x    =/x   Homonym
9     x    x    x    x     Different objects with different constructs
10    =    x    x    x     Different objects with different constructs, but homonyms
11    x    =    r    =     Different but related objects
12    =    =    r    =     Different but related homonyms
13    x    =    r    x     Different but related objects with different context
14    =    =    r    x     Different but related homonyms with different context
15    x    x    r    x     Different but related objects of different type
16    =    x    r    x     Different but related homonyms of different type
17    -    -    -    -     Missing object O2

Note that if two objects are of different type, their context will be different, due to the definition of context. Note also that identity or difference of context is irrelevant for objects with different meaning.

A more detailed list of view conflicts can be found in the Appendix. The list in the Appendix breaks each general case down into subcases based on differentiation according to the constructs of the participating objects. For example, a construct mismatch can exist between an entity and a relationship as well as between an entity and an attribute. The extended list has been left out here for the purpose of readability.
The Appendix also provides a brief description of the solution for all case scenarios.

The general conflict resolution rule for object identity and difference is to have all other dimensions follow the meaning dimension. If two objects have identical meaning, all other dimensions will have to be made identical. If two objects have different meaning, the name dimension has to reflect this. Cases of object relatedness are solved through representation of the superset-subset relationships. Omitted from this solution description is the technique for re-allocation of attributes when relatedness is detected. The general rule is to allocate those attributes that belong to both the superset and the subset to the superset, and to allocate to the subset only the attributes that are unique to it (see for instance Navathe and Elmasri (1986)).

3.3. Expert System Methodology

An implemented solution for the view integration problem requires an adequate problem representation and solution mechanism. So far, cases of potential integration problems and a procedure have been identified, yet no implementation mechanism has been suggested. Before any further discussion of an adequate mechanism, a short reminder of the problem situation is in order.

Correcting the conflicts in a set of user views is clearly a problem solving task. Within this research, view integration is treated as a diagnosis and therapy task (note that Hayes-Roth et al. mention diagnosis and therapy ("repair") as generic tasks of knowledge engineering applications). Characteristic of a typical diagnosis task is the goal to find out "what's wrong" in the actual state.
Thus, the purpose of the diagnosis part of view integration is the identification of the discrepancy or mismatch between a pair of views. Once the conflict case has been identified, the therapy or "fixing" phase will adjust one or both views to remove an existing conflict. Therapy creates the new, desired structure.

Diagnosis and therapy tasks are prototypical tasks for expert systems or knowledge based systems. The integration method discussed here was not built by extracting diagnosis and therapy rules from expert designers. Hence it is not truly an expert system. However, it will represent conflict recognition and conflict resolution knowledge.

Database design rules for conflict recognition and resolution can be easily formulated as sets of production rules. In simplified form, one may want to think of each production rule as describing one of the cases. For each object comparison, the rule matching the conflict situation would fire and transform the case into another one, until one of the stable state cases was reached (for a description of the production system reasoning mechanism see for instance Barr and Feigenbaum, 1981).

The most appealing property of the production system mechanism is the modularity of the resulting systems. Rules can be added, deleted, or changed without directly affecting other rules. Figure 5 illustrates this fact. Figure 5 (taken from Vessey and Weber, 1986) depicts a decision table with cooking instructions for vegetables to exemplify the convenience of rule editing. Each instruction (column) corresponds to one production rule. The list can be easily expanded through addition of new columns.
By the same token, the deletion of a column does not affect any other column (or rule) in the table. Furthermore, each column can be changed, thereby affecting only the instructions for one particular dish. The cause for this simplicity of the rule based system lies in the design of the condition list. Each condition stub is specified in the utmost detail, not referring to conditions which are aggregates of more than one fact. That is, the decision table does not create intermediate results (aggregates of truth values) such as "juicy and crispy and leafy but not tall", which could appear later as a single condition in the condition list for both "fry" and "steam". In other words, condition items are decoupled as much as possible. Consequently, the rules (i.e., the dishes) are decoupled as well.

Condition   Rule columns
Juicy       Y  Y  Y  Y  Y  Y  N
Tall        Y  N  N  N  N  N  -
Crispy      -  Y  Y  Y  N  N  -
Leafy       -  Y  Y  N  -  -  -
Red         -  Y  N  -  -  -  -
Hard        -  -  -  -  Y  N  -

Action rows: Fry, Steam, Grill, Peel, Boil, Chop, Roast

Figure 5: Decision Table Illustration

The modularity of production rules makes their implementation very forgiving. If a case is left out in the beginning, or is specified incompletely at first, additions can be made with very little effect on the already existing rules.

One disadvantage is the inefficiency of the production system approach, due to duplication of identical condition elements. This inefficiency is the cost induced by complete decoupling of conditions.
Every condition list has to be created and tested in detail without being able to make use of established intermediate results. A more sensible design approach should compromise between complete decoupling of conditions and processing efficiency. A heuristic for aggregating conditions would group those conditions together that form a meaningful unit (are highly cohesive). Meaningful stands in contrast to purely accidental coincidence of conditions. For example, "juicy and crispy and leafy, but not tall" is not a particularly meaningful grouping, because it does not identify a certain well-known group of food items. Therefore, this aggregate should not be chosen as a grouping, even though it could simplify the decision table in the example.

A second disadvantage of production systems is the fact that they disguise the control flow. It is difficult for a designer to understand the control flow in the production system. In so called "procedural" programming languages, e.g., Pascal or Fortran, the control flow is determined by the ordering of the language statements, if branching statements are neglected for the moment. In production systems, the sequence of rules has much less importance. For example, the "chop" rule will not be applied first even though it is the first rule in the decision table in Figure 5, unless its conditions are true. If the last rule in the system is the one whose conditions become true first, it will be the first to fire. Hence, production systems in general require substantial re-thinking by systems designers who are used to procedural languages. In a Prolog implementation this problem is alleviated to some extent, since the language's interpreter still interprets rules in sequential order.
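
The production system idea discussed above can be made concrete with a minimal sketch. The following Python fragment is an illustration only and does not reproduce the Prolog implementation of this research. The rule names, the dictionary representation of a comparison case, and the function `run` are assumptions. Rules are tried in list order, loosely mirroring the sequential behaviour attributed to a Prolog interpreter, and the first rule whose condition holds fires.

```python
# Each rule: (name, condition on the case state, effect applied when it fires).
RULES = [
    ("change construct",
     lambda s: not s["same_construct"] and (s["same_meaning"] or s["related"]),
     {"same_construct": True}),
    ("rename synonym",
     lambda s: s["same_meaning"] and not s["same_name"],
     {"same_name": True}),
    ("rename homonym",
     lambda s: s["same_name"] and not s["same_meaning"],
     {"same_name": False}),
]


def run(state, rules=RULES):
    """Fire rules until none applies; return the stable state and the firing log."""
    state = dict(state)
    fired = []
    while True:
        for name, condition, effect in rules:
            if condition(state):
                state.update(effect)
                fired.append(name)
                break
        else:
            # No rule matched: a stable state has been reached.
            return state, fired


# Construct mismatch and synonym: same meaning, different name and construct.
case_4 = {"same_name": False, "same_meaning": True,
          "same_construct": False, "related": False}
# Related homonyms: same name and construct, different meaning.
case_5 = {"same_name": True, "same_meaning": False,
          "same_construct": True, "related": True}

final_4, fired_4 = run(case_4)
final_5, fired_5 = run(case_5)
```

For the homonym case only the last rule in the list has a true condition, so it is the first and only rule to fire. A further rule could be appended to `RULES` without editing the existing entries, which is the modularity property described above.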

In conclusion, even though it has some disadvantages, a production system seems to be a suitable representation mechanism for the implementation of this research. The case description already provides many guidelines for the definition of conflict resolution rules. Also, the maintainability advantage of production systems becomes important when heuristics subsequently have to be added to the integration method to improve its operation with insufficient information.

A different, apparently more elegant approach to view integration could perform the integration process as a theorem proving task. Similar to other theorem proving tasks (see for instance Nilsson, 1980), the program would be given a set of views and the question "does there exist a conflict free global schema which contains all the information of the individual conflicting views?" If the answer to that question were "yes", the global schema would be produced as a "by-product". Using Robinson's resolution principle (1965), the program would solve the problem by creating a new goal "there exists no global schema" and by falsifying this statement through a counter example. This approach is elegant because it is based on a very general problem solving mechanism, the theorem proving mechanism. However, definition of the integration rules, especially the procedural rules of conflict recognition and resolution, is more difficult than in the production system approach.

Two other reasonable representations for the task are frames and semantic networks (Waterman, 1986). They will be discussed below.

Frames (Minsky, 1975) are complex data structures containing both factual and procedural knowledge.
Frames have slots which can contain data concerning frame properties. Related to slots can be procedures which are invoked when a slot is filled. Slots that are not filled can take initially defined default values. This default capability is one of the advantageous features of frame based knowledge representations. Mylopoulos and Levesque (1984) for instance stress their ease of dealing with incomplete knowledge. Frames have been used as knowledge representations in a variety of expert systems (see Waterman, 1986 or Hayes-Roth et al., 1983). Barr and Feigenbaum (1981) state that frames "have problems", yet do not mention where these problems lie.

Semantic nets represent knowledge in a network in which properties are inherited from other objects along the arcs of the network. Waterman states that semantic nets have also been used in expert systems; in fact he argues that semantic nets and frames are similar. Mylopoulos and Levesque (1984) emphasize as qualities of semantic nets their data organization and the provision of good access methods. As a disadvantage they state the lack of formal semantics and standard terminology. The problem of formal semantics becomes clear when the interpretation mechanism for semantic nets is investigated.

All approaches are feasible. However, for its forgivingness in the maintenance of the knowledge base, the production system approach has been chosen for this research. The integration method has been implemented in Prolog. The program is called AVIS, for Automatic View Integration System.

4. RESULTS

4.1. Rules Guiding View Integration

View integration as a problem solving task is guided by a set of rules which allow the problem solver to define the problem environment, to identify the particular problems (conflicts), and to solve them.
In this section, the general rules underlying the process are presented, exemplified and justified. The rules can be divided into two major groups: base rules and heuristics. Base rules are believed to be always true. Heuristics are support rules. The beliefs expressed in them are known to be wrong sometimes but are expected to be true in most cases.

Especially in its conflict recognition part, this view integration method relies to a large extent on asking the right questions. If the method can ask the right questions, it can perform a large segment of the integration without user interaction. When user interaction cannot be avoided, a selection of the right questions can simplify the user's answering task. Furthermore, the method will not appear to be stupid if it can avoid asking trivial or redundant questions. To help in the question formulation process, heuristics were included which for instance change the content and sequence of questions.

Base rules are separated into four groups of rules. The first three groups are static modelling rules. The fourth group contains process rules:

1. General Modelling Rules
2. Rules of the Modelling Language
3. Rules of Database Design/View Integration
   3.1 General Database Design Rules
   3.2 Rules Concerning the Test for Identity of Objects (Conflict Recognition and Reconciliation Rules)
   3.3 Rules Concerning the Relatedness¹ of Objects (Rules for Recognition and Modelling of Inter-Schema Relationships)
4. Process Rules
   4.1 Process Rules for Conflict Recognition and Reconciliation
   4.2 Process Rules for the Recognition and Modelling of Inter-Schema Relationships

General modelling rules are valid not only in the database context.
For example, \"each relevant real world object² shall be represented by exactly one object in the model\" is such a rule. Rules of the modelling language, here the E-R modelling language, describe true statements about the E-R language that are relevant to the view integration task. Rules of database design are separated into rules to guide the database designer's (or the method's) test for the identity of objects and rules to guide the uncovering of inter-schema (superset-subset) relationships. Process rules describe the sequence in which tests (i.e., conflict recognition) and corrective measures (i.e., conflict resolution) shall be carried out. The discussion will begin with a description and explanation of the base rules, followed by an analysis of the heuristics.

¹ The term \"relatedness\" is used to signify superset-subset relationships such as all managers are employees, MANAGER--Isa--EMPLOYEE. The term \"relationship\", unless occurring in the form \"subset/superset/containment relationship\", is used to denote associations between entities.

² Throughout the chapter, the terms object and object type will be used interchangeably to describe object types. Particular instances are referred to as object instance or object occurrence.

Base Rules

General Modelling Rules:

1. Each relevant real world object type shall be represented by exactly one object type in the model (redundancy-free representation).

All model building tries to create a representation of the real world that contains all relevant information in the most concise form. Not all the information of the real world can be represented. Most of the detail may not even be required for the tasks at hand.
Hence, some real world object types will not find their way into the model. If a real world object type is represented more than once in the data world, update anomalies can occur. Each new object instance of the real world has to be inserted more than once into the data model. Should the real world object type itself cease to exist, more than one data model object type has to be removed. This creates extra processing effort and the possibility of inconsistency. One of the purposes of database design is to avoid exactly these problems.

2. An integration of multiple models shall not result in the loss of information from any of the models.

Any bottom-up modelling approach attempts to build a large global model through the combination of smaller models. Each of the small models represents the real world facts that one model-builder perceives as relevant. Omission of any of these facts from the global model would result in an incomplete global model. Hence, the rule demands that all individual models are correct and that the collection of models is in itself consistent (Biskup and Convent, 1986).

Rules of the Modelling Language:

3. Every object in a view is represented with exactly one of the following three constructs: Entity, Relationship, Attribute.

The view integration method models databases based on Chen's Entity-Relationship model in which only Entities, Relationships and Attributes exist. Categories which are represented in some extended forms of the E-R model will be depicted as special (Is-a) relationships.

4. Entities are autonomous objects. They can exist without the existence of Relationships and without the definition of Attributes.

Entities are things or individuals.
As things or individuals can exist even if they have no associations with other things or individuals, so can entities. For example, an entity SUPPLIER can exist without an association to another entity, such as BUYER.

5. A Relationship cannot exist without the existence of at least one Entity.

Relationships represent associations between entities. They map instances of one entity to instances of some other entity. In the most restricted case, one entity is associated with itself. For example, the entity PERSON is associated with itself through a Supervisor or a Parent relationship. Typically, more than one entity will be involved in a relationship, but never less than one.

6. An Attribute cannot exist without the existence of the Entity or Relationship it belongs to.

Attributes represent associations between an entity and a value set, or a relationship and a value set. For example, the Person_name attribute associates the PERSON entity with a value set of names which itself is a set of strings containing valid person names. The attribute cannot exist without the existence of the entity or relationship it refers to (value sets are not part of the E-R model).

General Database Design Rules:

7. Two types of Attributes exist: \"Property\" Attributes, which describe the object (Entity or Relationship) in more detail (e.g., color, name), and \"Interconnection\" Attributes, which describe the association of the object (Entity or Relationship) to some other object (Entity or Relationship).

Attributes are always associations between entities or relationships and value sets.
However, sometimes attributes are not used to describe an innate property of the entity or relationship they belong to, but instead to describe an association between the entity or relationship and some other object. For example, the attribute Person_name describes a property of a PERSON entity, their name. PERSON could also have an attribute Savings_acct_no. This attribute, even though associated with PERSON, is not a property of a person. In fact, the attribute implicitly states that things called savings accounts exist and that a person is or may be related to such a savings account. Thus, the attribute describes not a property, but an association: PERSON possesses SAVINGS_ACCT (PERSON is_associated_with SAVINGS_ACCT). While in the example the difference between a property attribute and an interconnection attribute was distinct, it will not be as clear in all cases.

8. Interconnection Attributes are shortened forms of Entities (if the Attribute is a Relationship-Attribute), or of Entity-Relationship constructs (if the Attribute is an Entity-Attribute).

In the above example, PERSON had a Savings_acct_no attribute which indicated the existence of savings accounts and a person's possession of such an account. Obviously, savings_account could become an entity, since it is a thing in the real world. In that case, a relationship such as Has_account would represent a person's possession of such an account. Also, being an Entity, a savings account could have attributes itself, such as Account_balance, or Date_opened. The model builder may not need all this extra information.
If the account number information is sufficient, there is no reason to describe savings accounts, or other real world objects, in more detail. After all, a model should contain only the relevant information about the system it is modelling.

In the example, an entity attribute (Savings_acct_no) which was an association between an entity (PERSON) and a value set (of account numbers) took the role of a relationship (Has_account) between PERSON and another entity SAVINGS_ACCT. The attribute thus represented both a relationship (Has_account) and an entity (SAVINGS_ACCT) through the account number value. All attributes of SAVINGS_ACCT other than its number, as well as any potential non-key attributes of Has_account, are not represented. Hence, interconnection attributes are a compressed form of information representation. This compression has the undesirable side effects of deletion and insertion anomalies. That is, savings accounts do not exist until people exist that possess the accounts. Accounts also cease to exist with the person owning them.

9. If Attributes are multi-valued, they are interconnection Attributes.

This rule helps in the detection of interconnection attributes. If a multi-valued attribute is found, it is considered to be an interconnection attribute. For example, if the Address attribute of an EMPLOYEE requires multiple entries, it should better be represented by a new entity RESIDENCE, related to EMPLOYEE through a relationship such as Resides_at. Storey (1988) deals with multi-valued attributes in this manner during view creation.

10. A Relationship is a less fundamental object than an Entity.
Since relationships cannot exist without the existence of at least one entity, their continuing existence is based on two factors. First, it is based on the existence of the objects underlying the entities, and second, on the existence of the association between those real world objects. Should either one not exist, then the relationship has to be removed. For entities, on the contrary, it is unimportant whether any formerly existing association between them is still in place. They will only disappear once the real world objects underlying them disappear.

The same is true for entity and relationship instances. For example, if a database contains the entities EMPLOYEE and DEPARTMENT as well as the relationship Employed_by, individual instances of Employed_by, such as [1005, Manufacturing], are only meaningful if employee 1005 still exists, the manufacturing department exists, and the employee in fact still works for the manufacturing department (referential integrity).

11. Each object has four relevant dimensions: Name, Construct (Entity/Relationship/Attribute), Meaning, and Context.

One of the basic assumptions underlying this view integration method is that there exist only four relevant differentiation criteria for objects in a view: name, which is the name of an object, such as SUPPLIER; construct or object type, such as relationship; meaning; and context. Meaning encompasses all the relevant knowledge conveyed by the object. For example, meaning includes all the information that is known once it is known that a particular entity is a SUPPLIER, that is, it supplies parts and will be paid for parts. Meaning is the most important of all four dimensions.
It will have absolute precedence over the other dimensions. If two objects have the same meaning, they refer to the same real world object and therefore all other dimensions will have to be adjusted accordingly. Context identifies the set of objects an object is associated with. An attribute's context is the entity or relationship it belongs to. A relationship's context consists of the entities associated by it. Entities are defined as having no context. Entities are the only objects able to exist without any other type of objects.

12. Along each dimension, any two objects can be either \"same\" or \"different\", i.e. same name, same construct.

Another major assumption of the view integration method refers to the variations in each dimension. It is more important to find out whether two objects are identical (same) or different in each of the relevant dimensions rather than to find out the actual values for each dimension. In order to merge two objects, they have to match; they have to be completely identical. If they are even slightly different, a change is required. The magnitude of dissimilarity does not matter, since a change is required nevertheless. For example, the entity names SUPPLIER and SUPPLIERS are only slightly different. Nevertheless, they are different and will require a name change if the entities are to be merged. The same is true for the other dimensions. Two relationships may have \"almost\" the same context, that is, most of the entities associated by them are the same. Despite this fact, these relationships have a different context and cannot be merged unless the context of one or both of them is changed.

13.
Two objects with different meanings can be related in meaning.

Meaning is the only dimension where identity or difference are not the only two relevant values. For example, the entities EMPLOYEE and PART_TIME_EMPLOYEE have obviously different meaning, yet they are not completely independent. EMPLOYEE refers to a type of individuals which includes the type of individuals referred to by PART_TIME_EMPLOYEE. Hence, when two objects are different in meaning, any superset-subset relationships between them are nevertheless relevant. Objects with such relationships will be called related in meaning.

14. Two related objects O1 and O2 will display one of the following set relationships between them:
1. O1 and O2 have a common subset (yes/no); and
2. O1 and O2 have a common superset (yes/no);
resulting in the following possible combinations:
(a) one object contains the other object;
(b) both objects have a (meaningful) common superset and a common subset, yet the superset is not one of O1 or O2;
(c) both objects have a common superset, but they do not overlap;
(d) both objects have no common superset and do not intersect; virtually no relatedness, no need for representation in a database.

Set relationships and their treatment within view integration have been discussed at different levels of completeness by all previously reviewed integration techniques, most completely by Navathe and colleagues, Elmasri and Navathe (1984), Navathe and Elmasri (1983). This rule lists all relevant relationships between two sets. The qualifier \"meaningful\" for supersets or subsets implies that any such superset or subset has to be a cohesive group from the point of view of the users.
For example, the entities EMPLOYEE and CUSTOMER have a common superset requiring implementation, the entity PERSON. Consequently, both EMPLOYEE and CUSTOMER would inherit the attributes of PERSON and all instances of EMPLOYEE and CUSTOMER would be instances of PERSON. Another, less meaningful superset would be an entity EMPLOYEE&CUSTOMER. The choice of an appropriate common superset, i.e., EMPLOYEE&CUSTOMER vs. PERSON, has to remain with the user¹. While there are no fixed rules to what constitutes a \"good\" entity, there are indicators for less good entity choices. For instance, if the user cannot provide a good name for the object, it may not be a (good) entity. For example, EMPLOYEE&CUSTOMER is not a good object name. Hence, the object is not expected to be very meaningful. Or, if the object's attributes are identical to an already existing object's attributes, the object may not be a (good) entity.

¹ Throughout the text, the term \"user\" refers to a database designer who employs the integration method. This \"designer user\" represents the interests of the end users of the database. The end users are assumed to have provided the original views.

Examples for the forms of relatedness are:
(a) EMPLOYEE contains PART_TIME_EMPLOYEE;
(b) PRODUCT_TEAM_MEMBER and PROJECT_TEAM_MEMBER are both subsets of EMPLOYEE, their intersect is PRODUCT&PROJECT_TEAM_MEMBER;
(c) PART_TIME_EMPLOYEE and FULL_TIME_EMPLOYEE are both subsets of EMPLOYEE, but they do not overlap;
(d) CUSTOMER and DEPARTMENT do not intersect.

The relatedness in (d) is so weak that it shall be ignored. Even though it represents some extra knowledge about the world, the knowledge is negative knowledge. Since negative
knowledge is so much more abundant than positive knowledge, its representation typically becomes infeasible.

15. Two unrelated objects O1 and O2 may share a common role.

Two entities, for example PERSON and COMPANY, can be different and unrelated, but they still can have a common role such as the role of shareholder. Neither view may contain a shareholder object, even though both may contain a STOCK entity. Goldstein and Storey (1988) discuss unrelated objects sharing a common role (\"W-relationship\") and the proper representation of this situation in a generalization lattice.

16. Two objects are identical, if they are identical in all dimensions.

Only the previously discussed four dimensions are relevant to judge whether objects are identical. Objects have to correspond in all dimensions. For example, an entity EMPLOYEE and an entity WORKER are known to mean the same. Thus they are identical in meaning, construct (entity), and context (empty). Nevertheless, the objects are identical only after their names have been made identical too.

17. Each object is related to itself (contains itself and is contained by itself). This relatedness shall not be represented in any view.

This rule guides and limits the search for between-view set relationships. For example, if an entity EMPLOYEE has been found to be identical to another object EMPLOYEE from some other view, each of the entities is also a superset of the other one. They also share a common subset, the entity set itself. The representation of this finding bears no extra information.
It would also result in an infinite expansion of the global database, since if every object is related to itself, also the object expressing this relatedness is related to itself, which has to be expressed through yet another object, and so on.

18. An object can be related to between 0 and n other objects.

It is important to remember that one object can be related to more than one other object. The search for related objects from another view is not completed after one related object has been found. However, it is also possible that no related objects can be found in another view.

19. Each object in one view can have a maximum of one identical object in another view (call this object also the \"corresponding\" object).

This rule follows from the general rule of modelling that no real world object shall be represented more than once in a model. A view is a model. Hence, if two objects of one view are identical to another entity from some other view, the two objects must be identical. This rule implies that once a pair of identical objects has been found, there is no need to search for further identical objects.

20. Two views are the same, if all their objects are identical.

The goal of the conflict recognition and resolution procedure is to correct omissions and conflicts so that at the end two previously different views are identical. Then they do not have to be merged; one of them can be removed, since all its information is also contained in the other view. This rule states when the identity condition is achieved.

21. Each individual view is complete and consistent and minimal.

A view is complete if it represents all the individuals, things, and associations between them, relevant to the user.
A view is consistent if none of the facts stated concerning the relatedness of sets are contradicted by others in the view. For example, if the view states that the entity PART_TIME_EMPLOYEE is a subset of the entity EMPLOYEE, no other fact in the view may present contrary information, such as PART_TIME_EMPLOYEE and EMPLOYEE have no members in common (see Casanova and Vidal (1982), Biskup and Convent (1983)). Minimality of a view entails that each real world object is only represented once in a view. For example, if one view contains two entities, SUPPLIER and DEALER, these entities have to be different; they have to refer to different objects in the real world.

The completeness assumption clarifies the role of the integration method as a method that finds omissions or conflicts in views based not on within-view (intra-view) analysis but based on between view (inter-view) comparison.

22. The collection of views before integration is consistent.

The view integration method assumes that not only views individually are consistent, but also that the collection of views is consistent as a whole. In other words, facts stated concerning relatedness of sets in one view cannot contradict facts stated in another view.

This rule clarifies the purpose of the conflict recognition and resolution method as a method that corrects omissions and conflicts (i.e., differences in opinion on name, context) but not contradictions. For instance, if view V1 states that all managers have to be full-time employees, while view V2 states that also part-time employees can be managers, the views contradict. Both statements cannot be true at the same time. The method assumes that such contradictions do not exist.
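
The consistency assumption of rules 21 and 22 can be illustrated with a minimal sketch. It is not taken from AVIS (which was implemented in Prolog); it is a Python illustration under assumed conventions. Each view is a list of hypothetical fact tuples, ('isa', subset, superset) or ('disjoint', x, y), and a contradiction is reported when one object is (transitively) a subset of another while the two are also stated to have no members in common. The function names and the fact encoding are assumptions introduced here for illustration only.

```python
# Minimal sketch of a consistency check over set-relatedness facts.
# Assumed (hypothetical) fact encoding:
#   ('isa', sub, sup)     every member of sub is a member of sup
#   ('disjoint', x, y)    x and y have no members in common


def isa_closure(isa_pairs):
    # Transitive closure of (subset, superset) pairs. No pair of an object
    # with itself is generated (compare rule 17).
    closure = set(isa_pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and a != d and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def find_contradictions(views):
    # views maps a view name to its list of facts. Facts from all views are
    # pooled, so contradictions within one view and between views are found.
    isa, disjoint = set(), set()
    for facts in views.values():
        for kind, x, y in facts:
            if kind == 'isa':
                isa.add((x, y))
            elif kind == 'disjoint':
                disjoint.add(frozenset((x, y)))
    return [(sub, sup) for sub, sup in sorted(isa_closure(isa))
            if frozenset((sub, sup)) in disjoint]


# Example taken from the text: a subset fact in one view, a statement that
# the two sets have no members in common in another view.
views = {
    'V1': [('isa', 'PART_TIME_EMPLOYEE', 'EMPLOYEE')],
    'V2': [('disjoint', 'EMPLOYEE', 'PART_TIME_EMPLOYEE')],
}
print(find_contradictions(views))
```

Applied to the PART_TIME_EMPLOYEE example, the sketch reports the pair (PART_TIME_EMPLOYEE, EMPLOYEE) as contradictory. Richer contradictions, such as the manager example, would need additional fact types (for instance a statement that two sets overlap) and are not covered by this sketch.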
Rules Concerning the Test for Identity of Objects: (Conflict Recognition and Resolution Rules)

23. If for an object O1 from view V1 an identical object O2 cannot be found in view V2, then O2 is either missing or represented through an object that has the same meaning but is different along its other dimensions.

Ideally, an identical object O2 from V2 exists for each object O1 from V1. Both objects are identical if they are identical in all relevant dimensions: name, construct, meaning, and context. The most crucial dimension is the meaning dimension. If two objects have the same meaning, they refer to the same object in the real world. Hence, if an object O2 with the same meaning as O1 exists, there may remain a name, construct or context conflict between O1 and O2 to be taken care of, but O2 is not missing. If no O2 exists that refers to the same real world object as O1 does, then that O2 is truly missing.

24. No change of a view during integration shall result in the loss of information.

This rule provides a guideline to the direction of change in cases of construct mismatch as described by one of the following alternatives:

Object in view 1:    Object in view 2:
Entity               Entity
Relationship         Relationship
Attribute            Attribute

Mismatches between an attribute on one hand and an entity or relationship on the other hand will result in a change of the object with the attribute construct. This adjustment rule follows from the rule on interconnection attributes. A mismatch between an entity and a relationship results in a change of the object with the relationship construct, based on the rule concerning object permanence. Relationships are less fundamental than entities.
Relationship instances cease to exist when the entity instances they refer to cease to exist (referential integrity), as illustrated below:

View 1: SUPPLIER--Sup_con--CONTRACT--Cus_con--CUSTOMER
View 2: SUPPLIER--Contract--CUSTOMER

Both views have suppliers in a contract situation with customers, yet in view 1 the contract itself is an entity, while in view 2 it is a relationship. In view 2, a disappearing customer (instance) destroys all records of a contractual agreement between him and the supplier. No historic data remains. In view 1, contracts have a life of their own and survive the disappearance of a customer instance. Hence, the less permanent character of a relationship potentially leads to information loss in the database extension. Consequently, a construct mismatch between an entity and a relationship should result in a change of the relationship construct into an entity construct.

25. If two unrelated objects share a common role, the common role object and specific role objects have to be represented as well as Isa relationships between the original objects and the specific role and between the specific roles and the common role.

This rule is based on Goldstein and Storey (1988). For example, in:

V1: PERSON--Holds--STOCK
V2: COMPANY--Holds--STOCK

PERSON and COMPANY have the same role. Therefore, a common role object SHAREHOLDER is needed to describe the situation. Furthermore, specific role objects, PERSON_SHAREHOLDER and COMPANY_SHAREHOLDER, are needed. Then, PERSON_SHAREHOLDER is a PERSON as well as a SHAREHOLDER.
SHAREHOLDER here will be the object associated with STOCK through the Holds relationship.

Rules Concerning the Test for Relatedness of Objects: (Recognition and Modelling of Inter-Schema Relationships)

26. Any Object O1 from V1 which is not an entity and which is related to an object O2 from V2 shall become an entity.

Any object O1 that is not an entity is either a relationship or an attribute. Neither of the two may be associated with other objects by means of a relationship. Relationships involved in relationships are not permitted, nor are relationships involving attributes. However, if two objects are related, they will have to be connected by an Isa relationship. Thus, this construct change is necessary. For example, an attribute Supplier belonging to entity PART in view 1 is related to entity DEALER from view 2. The relatedness is such that all suppliers are dealers but not all dealers are suppliers. In this case, the Supplier attribute in view 1 will become an entity, which will be associated with PART through a Supplies relationship. Supplier in view 1 was an interconnection attribute which is now more adequately represented through an entity. For a more detailed illustration of construct changes compare section 4.3 on conflict therapy.

27. If an object O1 contains an object O2, the containment shall be represented by an Isa relationship. If the Isa relationship does not exist, it must be added. The contained object will possess all attributes of the containing object.

This rule on the E-R representation of containment is taken from Elmasri and Navathe (1984).
For example, if EMPLOYEE contained PART_TIME_EMPLOYEE, the connection between the two would have to be represented by an Isa relationship, stating that PART_TIME_EMPLOYEE is an EMPLOYEE. PART_TIME_EMPLOYEE would inherit all attributes of EMPLOYEE.

28. If two objects O1 and O2 overlap, and neither object contains the other, the overlap shall be represented by an overlap object O3. If the overlap object does not exist, it must be added. The overlap object O3 will inherit the union of the attributes of O1 and O2. The connections O1-O3 and O2-O3 shall be represented by one Isa relationship each. If either of the Isa relationships does not exist, it must be added.

This rule states how the method handles relatedness of the form common subset (overlap). The following example will illustrate the rule:

View 1: PROJECT_EMPLOYEE[Emp#,Proj#,Yrs_experience,Title]
View 2: PRODUCT_EMPLOYEE[Emp#,Prodname,Function,Title]

The common subset PROJECT&PRODUCT_EMPLOYEE inherits the attributes Emp#, Proj#, Yrs_experience, Prodname, Function, Title and contains all instances of employee contained in PROJECT_EMPLOYEE and in PRODUCT_EMPLOYEE (intersect). Furthermore, the following relationships are added:

PROJECT&PRODUCT_EMPLOYEE--Isa--PROJECT_EMPLOYEE
PROJECT&PRODUCT_EMPLOYEE--Isa--PRODUCT_EMPLOYEE

The creation of overlap objects is explained in detail in Yao et al. (1982).

29. If two objects O1 and O2 have a common superset, and neither object contains the other, the superset shall be represented by a superset object O3. If the superset object does not exist, it must be added. The superset object O3 will possess the intersect of the attributes of O1 and O2.
If they are not identifier attributes, these attributes will have to be removed from O1 and O2. The connections O1-O3 and O2-O3 shall be represented by one Isa relationship each. If either of the Isa relationships does not exist, it must be added.

This rule states how the method handles relatedness of the form common superset. The following example will illustrate the rule:

View 1: PROJECT_EMPLOYEE[Emp#,Proj#,Yrs_experience,Title]
View 2: PRODUCT_EMPLOYEE[Emp#,Prodname,Function,Title]

The common superset EMPLOYEE receives the attributes Emp#, Title. The non-key attribute Title is removed from PROJECT_EMPLOYEE and from PRODUCT_EMPLOYEE:

EMPLOYEE[Emp#,Title]
PROJECT_EMPLOYEE[Emp#,Proj#,Yrs_experience]
PRODUCT_EMPLOYEE[Emp#,Prodname,Function]

EMPLOYEE contains all instances of employees included in PROJECT_EMPLOYEE or in PRODUCT_EMPLOYEE (union). Furthermore, the following relationships are added:

PRODUCT_EMPLOYEE--Isa--EMPLOYEE
PROJECT_EMPLOYEE--Isa--EMPLOYEE

The creation of overlap objects and attribute relocation is explained for instance in Navathe et al. (1986).

30. If two objects exclude each other, the exclusion shall be represented through an integrity constraint.

No new objects are added in the case of an exclusion. However, an integrity constraint can be added to prevent any object instance from accidental insertion into the non-overlapping sets. For example:

View 1: FULLTIME_EMPLOYEE
View 2: PARTTIME_EMPLOYEE

describe two non-overlapping sets.
An integrity constraint could be formulated to permit insertion of instances into either object only if, after the insertion, a join of both objects still returns the empty set. If the model (and the DBMS) can support integrity constraints, this restriction can improve the data quality. The representation of exclusion integrity constraints is suggested by [Casanova and Vidal, 1983] and [Biskup and Convent, 1986].

31. Containment is transitive. If A contains B and B contains C, then A contains C. The transitivity shall not be explicitly represented in any view. An Isa relationship between A and C is assumed to exist, if an Isa relationship exists between A and B and between B and C.

This rule prevents the generation of new redundant Isa relationships in multi-level hierarchies. If, for example, PERSON, EMPLOYEE, and PART_TIME_EMPLOYEE entities exist in a view, and EMPLOYEE--Isa--PERSON as well as PART_TIME_EMPLOYEE--Isa--EMPLOYEE have been expressed, there is no need to also express PART_TIME_EMPLOYEE--Isa--PERSON.

32. If an Isa relationship hierarchy implies another Isa relationship hierarchy because of transitivity, the implied Isa relationship shall be removed.

This rule assures the removal of already existing redundant Isa relationships in multi-level hierarchies.
If, for example, view 1 states that PART_TIME_EMPLOYEE--Isa--EMPLOYEE--Isa--PERSON, while view 2 expresses that PART_TIME_EMPLOYEE--Isa--PERSON, the transitive Isa in view 2 is already implied by the two Isa relationships in view 1 and is redundant. It has to be removed.

33. Creation of a new superset or subset object will result in relocation of relationships if these relationships were previously linked to entities at an incorrect level of generalization.

Whenever a new superset-subset relationship is introduced into a view, the possibility exists that existing relationships may have to be relocated. Consider the following example:

V1: DEPARTMENT--Employs--FULLTIME_EMPLOYEE,
V2: FULLTIME_EMPLOYEE--Isa--EMPLOYEE.

In V1, Employs refers to FULLTIME_EMPLOYEE, because no more general EMPLOYEE object exists. Once the new EMPLOYEE becomes part of V1, the Employs relationship will be relocated to associate DEPARTMENT with EMPLOYEE.

V1/V2: DEPARTMENT--Employs--EMPLOYEE--Isa--FULLTIME_EMPLOYEE.

Process Rules:

34. In view integration, the test for identity (conflict recognition and reconciliation) shall precede the test for relatedness.

The test for identity and the test for relatedness are two independent phases of view integration. The test for identity detects or creates identical pairs of objects in the involved views so that finally for each object in view V1 exactly one identical object exists in view V2.
The test for relatedness has the purpose of detecting currently missing forms of relatedness (set relationships) between views. Its purpose is not to detect within-view relatedness. All occurrences of within-view relatedness are supposed to be already represented in the individual views (completeness assumption). An example may illustrate this fact. V1 has employees working in departments, V2 assigns employees to projects.

View 1: EMPLOYEE--Works_in--DEPARTMENT
View 2: EMPLOYEE--Assigned_to--PROJECT

The completeness assumption postulates that no forms of relatedness exist within either of the views, because none are explicitly stated (no knowledge is interpreted as negative knowledge). For example, it is known that EMPLOYEE is not a subset of DEPARTMENT. Consequently, the search for inter-view relatedness has to focus only on those objects that originally exist in one view but not in the other. I.e., if EMPLOYEE were identical to EMPLOYEE, Works_in identical to Assigned_to, and DEPARTMENT identical to PROJECT, then no undetected inter-view relatedness could exist. In order to know which objects originally existed only in one view but not in the other, the test for identity has to be carried out first.

Thus, the sequence of the two independent view comparisons, for identity and for relatedness, is determined by the fact that a previous test for identity can reduce the number of comparisons for relatedness.

Process Rules for Conflict Recognition and Reconciliation:

35. For each object O1 from view V1, try to find an identical object O2 in view V2.

The purpose of the method is to either find that two views are identical, or to make them identical.
Once two views are identical, one of them can be eliminated because all its information is represented in the remaining view. As defined earlier, two views are identical if all their objects are identical. Hence, the test for identity begins with an attempt to find an identical object O2 in V2 for each object O1 from V1.

36. If no identical object O2 from V2 can be found for O1 from V1, try to find an object that has the same meaning as O1 and change the dissimilar dimensions of O1 and O2 so that they become identical.

Earlier, complete identity of objects was defined. This rule describes the action to be taken if two objects are only partially identical, that is, if they have the same meaning. The meaning dimension, as the most important dimension, determines the direction of change. If the entity SUPPLIER in view 1 has the same meaning (refers to the same real world object) as the attribute Dealer_no in view 2, both objects will finally have the same name and the same construct.

37. If no object O2 with same meaning can be found, add a new object O2 to V2 where O2 is identical to O1 from V1.

If no object O2 with same meaning as O1's can be found, then O1 has no corresponding object in V2. Hence an object identical to O1 has to be added to V2.

38. For each object O2 from V2 which is different in meaning to O1 from V1 but has the same name, change the name so that no two objects with different meaning carry the same name.

This rule forbids the existence of homonyms in the database. If a homonym is found, a name change is required based on this rule. Again, name follows the more important dimension, meaning. If meanings are different, names have to be different.
The other dimensions, construct and context, can remain as they are.

39. For each O2 in V2 that remains without an identical object from V1, after all objects in V1 have been matched with an identical object in V2, add a new object O1 to V1 which is identical to O2.

View V2 may contain objects that are not part of V1. Hence, after all of V1's objects have been assigned an identical object in V2, some of the objects in V2 may be left without an identical object in V1. Consequently, these objects have to be added to V1.

Process Rules for the Recognition and Modelling of Inter-Schema Relationships:

40. Compare each object O1 from V1 which was originally unique to V1 (before addition of missing objects during identity test) against all objects {O2} formerly unique to V2, to find out whether O1 contains O2, or O2 contains O1. Represent each identified containment.

Purpose of the analysis is only the addition of missing inter-view superset-subset relationships. Therefore, the containment test applies only to objects that were originally unique to one of the two views. For example:

View 1: PART--Last_ordered_from--SUPPLIER
View 2: PART--Carried_by--DEALER

Here PART is the same in both views and therefore is not unique. Hence, only Carried_by, Last_ordered_from, DEALER, and SUPPLIER are potentially related to objects from the other view. I.e., DEALER could be related to Last_ordered_from or to SUPPLIER, Last_ordered_from could be related to DEALER or to Carried_by. If, for instance, all SUPPLIERS are DEALERS but not all DEALERS are SUPPLIERS, then DEALER contains SUPPLIER. Consequently, an Isa relationship between SUPPLIER and DEALER would have to be created.
The comparison summarized in this rule is the first test for relatedness, because it is the most special case of relatedness and requires the least change in the existing views. The comparison A contains B is a special case of common containment (A contains A and A contains B), as well as a special case of common subset (B is a subset of A and B is a subset of B). In this special case, only an Isa relationship is added to the views. In the general case, the common superset and the common subset have to be added too. Thus, if this test is the first one, the subsequent steps are simplified.

41. For all pairs of originally unique objects O1, O2 in which neither object contains the other, investigate whether O1 and O2 are contained by a common object different from O1 and O2. Represent the common containment.

This rule summarizes the procedure for a common containment where the containing object is different from O1 or O2. Only those objects are compared that were originally represented in one view only. All object pairs in which one object contains the other are not considered.

42. For all pairs of originally unique objects O1, O2 in which neither object contains the other and which have a common superset, also investigate whether O1 and O2 intersect. Represent any existing common subsets. Represent the lack of a common subset through an integrity constraint.

This rule summarizes the procedure for a common subset where the intersect object is different from O1 or O2. Only those objects are compared that were originally represented in one view only. Also, only objects that have a common superset (different from O1 and O2) are compared. Objects without a meaningful common superset cannot have a meaningful common subset.

43. For all object pairs O1, O2 originally unique to one view, investigate the existence of a W-relationship (common role). Represent any existing W-relationships.

Even though the test for related objects may not find any relatedness among the objects themselves, objects can have a common role, which requires the addition of objects to represent the common role and the objects' special roles. For example, both a company and a person can be car owners. Even though company and person are not related (i.e., have no meaningful common superset in the database), their common role car owner requires representation, as do their special roles person-car-owner and company-car-owner.

Heuristics

Heuristics are rules that are generally true, but not true in all cases. The use of these rules during the view integration process will simplify the process for the user in cases where the heuristics are true and will slightly inconvenience or prolong the process when the heuristic fails. The use of incorrect heuristics will not result in an incorrect database design, but it may prolong the database design process.

Heuristics improve the integration process by helping the user to find objects with similar or related meaning. If object O1 is compared to a set of objects {O2} from view 2 and that set is large and diverse (large number of objects including entities, relationships and attributes), the selection problem may be difficult for the user. If the set {O2} is small, the selection problem becomes simple or even trivial.
Heuristics help to simplify the selection problem by including only those objects in the set that are likely to be identical or related to the object O1.

The list below shows only some heuristics; it cannot be complete. It is always possible to formulate further assumptions to simplify the search procedure. Furthermore, some of the heuristics shown may be too stringent for a particular design, others may be too loose. Heuristics that are too stringent are a particular problem, since they can result in decision errors which require lengthy recovery procedures. This problem is exemplified in the next section which shows alternative view integration procedures, one without any heuristics, one with only one heuristic implemented.

The following heuristics have been identified:

1. Two objects with identical or related meaning will have some common context.

This rule says that identical or related objects will be found in the vicinity of identical objects. For example, if it has been found that there exists an entity EMPLOYEE in views V1 and V2, and EMPLOYEE in V1 participates in a relationship Employment, then it is reasonable to assume that EMPLOYEE will participate in a similar association in V2, even though that association may not be called Employment in V2 and even though it may not be a relationship.

The heuristic is based on the assumption that people describing the same environment will have the same perception of the environment. Since both views have common elements, both views describe at least partially the same real world environment.
In the absence of information to the contrary, the method therefore assumes that all users regard the same real world objects and associations as relevant.

In the example, the heuristic fails if the Employment association is not relevant in V2 and therefore missing. Note, however, that the Employment association may not be missing, but be more difficult to find, if in V2 it is not represented as a relationship, but as an entity attribute or as an entity.

Even though entities are defined to have no context, it is useful to treat the relationships they are involved in as their context, to permit the application of this valuable heuristic.

2. Two objects with identical or related meaning will have the same construct.

This rule states that even before conflict resolution, two objects with identical or related meaning will be of the same type. Thus, the rule leads the integration method to look for a matching object only among those with the same construct. If EMPLOYEE is an entity in V1, the matching object in V2 will also be an entity.

This heuristic is based on the assumption that if two people describe the same object or association from the real world, they will agree in their assessment of the construct that the object or association should be represented with. Depending on the real world item, this assumption is more or less reasonable. One would assume that almost anyone considers an employee or a customer to be an individual, but a customer's order may be perceived as a thing (entity), or as an association (relationship) between a customer and a company.

The heuristic fails in all cases of construct mismatch (semantic relativism), i.e., where one real world object is represented as an entity in one view and as a relationship in the other view. For cases in which the rule fails, the integration procedure has to backtrack and look at objects with different constructs to find a match.

3. If no two objects with identical or related meaning and identical construct can be found, the construct mismatch will be of the following type:
- If O1 is an entity or a relationship, then O2 will be an entity attribute.

This heuristic suggests which construct mismatch to investigate first. Storey (1988) found that a very common error in database design was the representation of an entity-relationship construct as an interconnection attribute. Since this \"mistake\" is very frequently made, checking for its occurrence when an identical object was not found is useful. In combination with the common context heuristic, this heuristic is expected to reduce the set {O2} to a manageable size.

Some attributes can under no circumstance be interconnection attributes, while others are more likely to be interconnection attributes. Two support rules help in identifying these groups:
- A single attribute object key cannot be an interconnection attribute.
- Attributes in a multi-attribute object key (composite key) are assumed to be interconnection attributes.

For example, Employee# is the single attribute key of an employee. It does not represent the relationship between EMPLOYEE and some other object. In contrast, the key of an ORDER entity, Customerid+Product#, identifies links to two other objects, a customer object and a product object.
Both are potential interconnection attributes.

Since more forms of mismatches other than the interconnection attributes exist, the heuristic can fail. To recover from this failure, the system will then search according to the following rules:
- If O1 is an entity and O2 is not an entity attribute, then O2 will be a relationship attribute.
- If O1 is a relationship and O2 is not an entity attribute, then O2 will be an entity.

These are the only other alternatives for construct mismatch, aside from the interconnection attribute assumption. However, any of these rules may fail too, if an object is missing.

4. Objects with identical meaning will have identical names (consider a name in singular identical with its plural).

This heuristic assumes that a particular application uses a standardized language to label its objects. In absence of information to the contrary, members of the same organization are expected to use the same terms to label the same objects. For instance, terms such as \"department\" or \"job classification\" or \"account\" are expected to be used consistently.

If this were true, synonyms and homonyms would not exist. Hence, this assumption is expected to have very limited reliability. Nevertheless, it provides a good starting point in the search for matching pairs of objects at the outset of the integration procedure.

When this heuristic is applied, two objects are treated as having the same name even if one is in singular form while the other one is in the plural (e.g., employee vs. employees). If the heuristic fails, the search for a matching object has to continue among all objects with different names.
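The name heuristic and its fallback can be illustrated with a minimal Python sketch. It is not taken from the thesis or from AVIS; the function names are hypothetical, and the singular/plural folding (lower-casing and stripping one trailing 's') is a deliberately naive assumption that will miss irregular plurals.

```python
def normalize(name):
    # Naive folding assumed for this sketch only:
    # ignore case and strip a single trailing 's'.
    lowered = name.lower()
    if len(lowered) > 1 and lowered.endswith('s'):
        return lowered[:-1]
    return lowered


def names_match(name1, name2):
    # Heuristic 4: singular and plural forms count as the same name.
    return normalize(name1) == normalize(name2)


def candidates_by_name(o1_name, v2_names):
    # Try the heuristic first: objects from V2 whose names match O1's name.
    matches = [n for n in v2_names if names_match(o1_name, n)]
    # If the heuristic fails, the search continues among all objects,
    # which at this point all carry different names.
    return matches if matches else list(v2_names)
```

With these definitions, `candidates_by_name('DEPARTMENT', ['Departments', 'PROJECT'])` narrows the search to `['Departments']`, whereas a name without any match leaves the full list for the user to inspect.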
5. Objects with related meaning will have names with identical word stems.

In the search for related objects, the word stem can be a very strong filter to identify those objects that are likely unrelated. For example, FULLTIME_EMPLOYEE and EMPLOYEE have the same stem employee, GRADUATE_STUDENT and UNDERGRADUATE_STUDENT have the same student stem. Thus, they are likely to be related. An even stronger interpretation of the word stem phenomenon may conclude that if one object's name is the word stem, it will be the superset of the other object, while two objects with different prefixes have a common superset.

Again, since synonyms and homonyms are frequent, this rule will be of only limited use. Nevertheless, in a computerized procedure, it requires no user effort and is therefore a desirable feature, even if its benefits may be marginal.

6. Two objects with identical or related meaning will have some attributes with identical names (for entities and relationships only).

Especially in the search for identical objects, this rule can be used to eliminate those objects that are very unlikely candidates for identity. Two different views describing the same EMPLOYEE entity are expected to use at least some identical attributes to specify employee properties. In particular, identical or related objects are assumed to have the same key attributes (with the same key attribute names).

Obviously, homonymy is a problem in this context. Attributes may be identical, but attribute names may be not.

7. Objects with identical or related meaning will belong to the same pre-defined meaning category.
In a subsequent section, a hierarchy of object categories will be introduced which provides a structure for the categorization of database objects according to their meaning, e.g., as an \"animate object\". If each object's meaning is pre-defined, in terms of the category it belongs to, then two objects from different categories cannot be identical. Again, this heuristic provides a filter to eliminate non-identical objects.

4.2. Diagnosis Procedure

The conflict and omission recognition procedure consists of two parts: the test for identity of objects (object types), and the test for relatedness of objects. The test for identity is concerned with the identification of identical objects in the observed views; the test for relatedness is concerned with the detection of inter-view set relationships (object relatedness).

Even though an object from one view can have at most one corresponding object in any other view, more than one object of another view can be related to it. Relatedness means that there exists a set relationship between the objects. The relatedness question has to be approached independently. It is impossible to conclude the relatedness or non-relatedness of objects from the existence of a pair of identical objects, or vice versa.

The first question will refer to the identity of objects. In order to restrict the test for relatedness only to inter-view relatedness, the relatedness test has to be preceded by the test for identity. Inter-view relationships can only exist between objects that are originally unique to one view. To find out which objects have no corresponding objects in the other view, the test for object identity has to be performed.[1]
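The effect of running the identity test first can be sketched in Python. The sketch below is an illustration under stated assumptions, not the thesis's procedure: views are represented simply as lists of object names, the identical pairs are assumed to be already known from the identity test, and both function names are hypothetical.

```python
from itertools import product


def originally_unique(view, matched):
    # Objects of a view for which the identity test found no counterpart.
    return [obj for obj in view if obj not in matched]


def relatedness_candidates(v1, v2, identical_pairs):
    # identical_pairs holds (o1, o2) pairs established by the identity test.
    matched1 = {o1 for o1, _ in identical_pairs}
    matched2 = {o2 for _, o2 in identical_pairs}
    unique1 = originally_unique(v1, matched1)
    unique2 = originally_unique(v2, matched2)
    # Only pairs of originally unique objects are compared for relatedness.
    return list(product(unique1, unique2))


# Views from the PART / SUPPLIER / DEALER example used with rule 40.
v1 = ['PART', 'Last_ordered_from', 'SUPPLIER']
v2 = ['PART', 'Carried_by', 'DEALER']
pairs = relatedness_candidates(v1, v2, [('PART', 'PART')])
```

For the two example views, knowing that PART is identical in both views reduces the relatedness comparisons from nine object pairs to four.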
Test for Identity of Objects

The purpose of this test is to answer the question \"does there exist an object O2 in V2 which is identical to O1 from V1?\", e.g., if view 1 contains an entity SUPPLIER, does view 2 also contain an entity with same name and same meaning? Again, \"same meaning\" can be interpreted as \"both objects refer to the same object in the real world\".

Obviously, finding a perfect match will be the exception. It is more likely that objects will be found that are somewhat similar, but not identical. In such cases, adjustments have to be made. The general rule is to make objects completely identical if they refer to the same real world objects (have same meaning). In such cases, possible mismatches in name, construct or context will be adjusted. If objects refer to different real world objects, then a possible, but undesirable, match in their names (homonym) has to be corrected.

The test for identity is carried out incrementally, with a comparison of the involved objects along one dimension at a time. All tests compare one object from view 1 to a set of objects from view 2, to find the ones that fulfill the condition of the test.

[1] The test procedures will frequently mention therapy procedures to resolve conflicts or to reflect inter-view relationships, without going into much detail. Detailed solution descriptions will be given in the subsequent section.

Objects are identical if their four dimensions are identical. Since the meaning dimension is the most important one (other dimensions are adjusted accordingly), it presents a good starting point for the analysis.
The main problem with this approach is that an object O1 from view V1 is compared to all objects O2 from V2, independent of their name, construct or context, even though only one object from V2 can be identical to O1. This may require that the user check a long list of irrelevant objects. The heuristics introduced in the previous section can be used to alleviate the problem. Therefore, a second procedure will be shown which includes the heuristic \"objects with identical meaning will have identical constructs\", to exemplify the effect of heuristics. This second procedure begins with a search for objects with constructs identical to that of O1.

While it is important to begin with the meaning dimension in the first procedure, the analysis sequence for other dimensions may vary. The order chosen here is: construct, context, name. Construct analysis has to precede context analysis, because every test for identity may result in a change in that dimension. For example, a test for identity of construct will cause a construct change, if constructs are not identical. But a construct change will also result in a context change. In contrast, context changes do not affect the construct. Thus, no test for identity of context should be executed until constructs have become identical. Name identity analysis should follow construct analysis, because the user may decide to give objects different names, which are based on their construct.

The complete procedure is depicted in flowchart form in Figure 6 (with abbreviated notation). To illustrate the whole procedure with an example, it will be assumed that an object O1 from view V1 is selected at random, e.g., the entity type SUPPLIER which denotes the set of current suppliers of a company. With this object held fixed, the following tests are carried out:

The procedure begins with the goal to find an object O2 with identical meaning to O1. To find the object, the procedure generates the hypothesis H1 \"there exists an object O2 from V2 such that O2 is identical in meaning to O1\". Directed towards the user, it results in the question \"which object from view V2 is identical in meaning to O1?\" The user can then either identify an object, or reply with a \"none\". For example, view V2 may contain an entity MANUFACTURER which is used in V2 to describe all suppliers. If a matching object is found, the system state s'=s1 is reached. If not, s'=s5.

In contrast to the subsequent hypotheses H2-H4, this test compares O1 to a set {O2} from view V2 rather than to a single object. {O2} contains all objects from V2 which so far have not been matched up with an object from V1. As a result of H1, either one of these objects will find a matching object in V1, while the remaining n-1 objects will be in state s5, or all objects from {O2} will be in state s5. In other words, for most, if not all objects from V2, the result of this test will be state s5. Thus, in the flowchart in Figure 6, for most if not all objects in {O2}, the outcome of H1 will be the \"no\" path, while at most one object will follow the \"yes\" path.

Figure 6: Test for Object Identity, Procedure without Heuristics

If a matching object is found, the method continues with hypothesis H2 which states that O1 and O2 will have the same construct, e.g., that both are entities.
The method issues the question, \"do O1 and O2 have the same construct?\" In a computerized view integration system, the integration procedure will look up the information to answer this question from the view definitions. Should the objects have different constructs (s'=s6), a construct change has to occur. If the constructs are identical, state s'=s2 is reached. In the example, SUPPLIER and MANUFACTURER are both entities and thus have identical constructs. Subsequent to s2, the system checks for identical context: are O1 and O2 associated with identical objects? For entities, the answer to this question is always positive, since their context is the empty set. If O1 and O2 are relationships or attributes and not all of their context objects have yet been matched to objects in the other view, the identity test for O1 and O2 is suspended until the context objects are matched to objects in the other view. If the result of the context test is that O1 and O2 have different contexts (s'=s7), the contexts have to be made identical (s'=s3). In the example, both objects are entities. Thus, both have identical (empty) contexts. Once state s3 has been reached, the remaining test is the test for name identity of the objects. The method's hypothesis is that both objects have identical names. If they do not share the same name (s'=s8), their names are made identical (s'=s4) through a change of at least one of the names. The new name has to be different from the names of all other objects in V1 and V2 to avoid homonymy. In the example, at least one of the entities would require a name change. The name chosen should not be identical to the name of another object.
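The test sequence H1 to H4 described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the AVIS implementation: the ViewObject record, the function identity_test and the callback same_meaning (which stands in for the user's answer to H1) are hypothetical names introduced here. The sketch reports which repair is needed next rather than carrying it out.

```python
from dataclasses import dataclass


@dataclass
class ViewObject:
    # Hypothetical record for one E-R object; field names are illustrative.
    name: str
    construct: str                     # 'entity', 'relationship' or 'attribute'
    context: frozenset = frozenset()   # names of the associated objects


def identity_test(o1, candidates, same_meaning):
    '''Run H1-H4 for o1 against the still unmatched objects of V2.

    same_meaning(o1, candidates) simulates the user and returns the
    object with identical meaning, or None. The function returns
    (o2, repairs), where repairs lists the changes still required.
    '''
    o2 = same_meaning(o1, candidates)        # H1: identical meaning?
    if o2 is None:
        return None, ['add missing object']  # every candidate is in state s5
    if o1.construct != o2.construct:         # H2 fails: state s6
        # A construct change also alters the context, so H3 has to wait.
        return o2, ['change construct']
    repairs = []
    if o1.context != o2.context:             # H3 fails: state s7
        repairs.append('change context')
    if o1.name != o2.name:                   # H4 fails: state s8
        repairs.append('rename')
    return o2, repairs


supplier = ViewObject('SUPPLIER', 'entity')
view2 = [ViewObject('MANUFACTURER', 'entity'), ViewObject('PART', 'entity')]


def simulated_user(o1, candidates):
    # Assumed answer: MANUFACTURER means the same as SUPPLIER.
    return next((c for c in candidates if c.name == 'MANUFACTURER'), None)


match, repairs = identity_test(supplier, view2, simulated_user)
print(match.name, repairs)   # MANUFACTURER ['rename']
```

Returning early after a construct mismatch mirrors the rule that no context test should be executed until the constructs have become identical.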
Once the pair of objects is identical in all four dimensions, the identity test is completed for this pair. The method continues by selecting a new object O1 from view V1 and subjecting it to the same analysis. The procedure terminates when all objects have a matching object in the other view. The set of all objects {O2} from V2 that, as a result of H1, are known to be different in meaning from O1 (s'=s5) is subject to further analysis. H5 tests whether all of these objects have names different from O1's name. All objects with the same name (s10) require renaming to make their names unique (s9). In addition, if none of the objects in {O2} was identical in meaning to O1, a new object O2, completely identical to O1, has to be added to achieve state s4. The use of heuristics results in changes to the view integration procedure. To exemplify such changes, a procedure is discussed below that includes only one heuristic: \"objects with identical meaning will have identical constructs.\" This heuristic is in fact one of the heuristics implemented in the view integration program AVIS. Again, the procedure begins by picking one object O1 from view V1, and it again attempts to find an object in view V2 that is identical to O1. The procedure (see Figure 7) begins with the goal \"find the set of objects {O2} from V2 that have the same construct as object O1\". Since the procedure assumes that all objects with the same meaning have the same construct, it considers only those objects O2 for further identity testing that have the same construct as O1. A number of objects from V2 will qualify and thus be in state s0, while the objects of different type will be in state s5.
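The first step of the heuristic procedure, the split of V2 by construct, might look as follows. This is a minimal sketch under assumed data: objects are plain dictionaries, and the function name construct_filter and the sample contents of view2 are hypothetical.

```python
def construct_filter(o1, view2):
    '''Split V2 into the objects sharing o1's construct (state s0)
    and the objects of a different construct (state s5).'''
    same = [o for o in view2 if o['construct'] == o1['construct']]
    other = [o for o in view2 if o['construct'] != o1['construct']]
    return same, other


supplier = {'name': 'SUPPLIER', 'construct': 'entity'}
view2 = [
    {'name': 'MANUFACTURER', 'construct': 'entity'},
    {'name': 'Supply', 'construct': 'relationship'},
    {'name': 'PART', 'construct': 'entity'},
]

candidates, set_aside = construct_filter(supplier, view2)
print([o['name'] for o in candidates])   # ['MANUFACTURER', 'PART']
print([o['name'] for o in set_aside])    # ['Supply']
```

The objects that are set aside are not discarded: they have to be revisited if no candidate turns out to have the same meaning, because the heuristic may be wrong.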
Since in the example SUPPLIER is an entity, all entities from V2 would be considered for further identity testing. One may want to think of the use of construct as a \"filter\" which can reduce the number of objects to be considered, hopefully without being too stringent a condition. For those objects with the same construct, the procedure then investigates whether there exists an object O2 which has the same meaning as O1 from V1, i.e., it looks for an entity in V2 identical in meaning to SUPPLIER. Again, at most one object of V2 is allowed to fulfill this condition. That object will be in state s1. All objects with different meaning will be in state s6. If an object with the same meaning is found, the procedure continues with the context (H3) and name (H4) tests, similar to the tests above. However, if no object in V2 is found to have the same meaning as O1, the procedure continues differently, to verify one of two possible interpretations of the situation. The first possibility is that the heuristic is wrong, i.e., an object O2 with the same meaning but a different construct exists in V2. The second possibility is that no object with identical meaning exists in V2, regardless of construct. The procedure has to find out which alternative is true, to avoid the creation of a non-minimal global schema.
[Figure 7: Test for Identity with Heuristic]
Thus, after taking care of homonyms (H5), the procedure continues with a test to identify those objects with constructs different from O1's construct. In the figure, this test is shown in abbreviated notation as c2<>c1.
Its correct interpretation is \"are there any objects in V2 that have a different construct?\" This question may appear redundant for the objects in s5, because they failed the \"same construct\" test. However, the set of objects in state s5 may be the empty set. In that case the answer to question H6 is \"no\" (s13), requiring the addition of a new object. If there are objects in V2 with constructs different from O1's, the procedure checks whether any of them has the same meaning as O1 (H7). If an object with the same meaning is found (s11), its construct has to be changed. If no such object exists (s14), a test for homonymy follows (H8), resulting in a name change for all homonyms. Subsequently, the missing object is added. In this procedure variant, the main effect is a sequence change with respect to the tests for meaning identity and construct identity. It results in a prolongation of the procedure if the heuristic is wrong. The procedure could be varied further, for instance by a switch in the sequence of the meaning identity and context identity tests. The test for meaning identity would then follow the tests for construct and context identity. Consequently, only those objects with the same construct and the same context would initially be considered for the meaning identity test. This procedure change would reflect the heuristic \"identical objects are in the vicinity of identical objects.\" The procedure would look in the neighborhood of matching objects to find further matching objects. This heuristic is, in modified form, also implemented in AVIS.
AVIS requires only part of the context to be identical. The test for meaning identity could even be moved past the test for name identity to reflect the heuristic that objects with the same meaning will have the same names. Since this heuristic is expected to be frequently wrong, it has not been implemented in AVIS.

Test for Relatedness of Objects

The purpose of this test is to find out whether, aside from being identical, objects from one view are related to objects from another view through set relationships. For example, an entity type SUPPLIER in V1 may be a subset of an entity DEALER in V2. Such a case would exist in a situation where SUPPLIER refers to all current suppliers of the company, while DEALER refers to all present and all potential suppliers of the company. If those relationships are not made explicit, anomalies can occur. For example, if a member is dropped from the entity set DEALER, it should also be dropped automatically from the entity set SUPPLIER. Furthermore, attribute inheritance can be derived from set relationships. The procedure described below is a generic procedure without the use of heuristics (see Figure 8). It begins with a test for containment (H1 and H2). Subject of the test is whether one of the objects is contained by the other object, e.g., SUPPLIER is contained by DEALER. The procedure first determines the set {O2} of objects contained by O1, and then, for those objects not contained by O1, the set {O2'} containing O1. The questions put to the user are \"which of the objects (in V2) are contained by O1?\" and, vice versa, \"which of the objects (in V2) contain O1?\" It is possible that O1 contains some objects in V2 while being itself contained by others.
For example, SUPPLIER (V1) is contained by DEALER (V2) but may contain another object SMALL_QTY_SUPPLIER from V2. In such a situation an Isa relationship between DEALER and SMALL_QTY_SUPPLIER would have existed, which would now have to be removed because it is a transitive Isa.
[Figure 8: flowchart of the relatedness test; recoverable box labels: Change construct, Represent relationship]
The containment test is the first one issued, because it is the most specialized form of common containment and common superset, requiring the least amount of additions to the existing views. Only one Isa relationship has to be established between the objects. The insertion of an Isa between the objects requires, however, that both objects are entities. If they are not, those which are not entities have to be converted into entities. The test H6.1 is executed to determine whether both objects are entities. The entity test (H6) is issued for each pair of objects after their relatedness has been discovered. There is no need to test for object type earlier, since only related objects that are not entities will require construct changes. Unrelated objects will keep their original constructs. Since the object type test (H6) is identical for all forms of relatedness (H6.1 - H6.4), it will not be discussed further in the procedure. Should neither object contain the other one (s8), the procedure inquires whether both objects have a common superset (H3). If they do, the procedure further inquires whether a common subset exists between them (H4).
The common superset question precedes the common subset question, because objects that have a (meaningful) common subset and are themselves meaningful sets have to have a (meaningful) common superset. Although it is possible to construct sets such as the set of \"all green things\" and the set of \"all edible things\", which have a common subset in the set of \"all green edible things\" while having no meaningful superset other than \"all things\", the rule is nevertheless valid when only meaningful sets are considered. In the example, especially the set \"green things\" is not a meaningful set, as it has no clearly defined attributes (other than green color) of the kind we expect for an entity or relationship type. If objects have both a common superset and a common subset (s10), two new objects will be created to represent the superset and the subset. Also, new Isa relationships will be created to represent the relatedness. If the objects have a common superset but no common subset (s14), only a common superset entity and the corresponding Isa relationships will be added. In addition, an integrity constraint may be defined to state that the objects do not overlap. Objects without a common superset (s13) are tested for the existence of a W-relationship (Goldstein and Storey, 1988). If no common superset exists, the objects are in fact not related. Yet the objects may still require the creation of inter-view relationships if they have a common role. If the objects have a common role, e.g., both a PERSON and a COMPANY entity may be car owners, a new object describing the common role (CAR_OWNER), plus objects describing the special roles (PERSON_CAR_OWNER, COMPANY_CAR_OWNER), have to be created. Furthermore, Isa relationships have to be added to represent the associations between the objects.
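The order of the relatedness questions can be summarized as a small decision cascade. The sketch below is an illustration only: the function classify_relatedness and its boolean parameters are hypothetical, and each parameter stands in for an answer that the procedure would obtain from the user.

```python
def classify_relatedness(contains=False, contained_by=False,
                         common_superset=False, common_subset=False,
                         common_role=False):
    '''Return the form of relatedness for a pair of objects (O1, O2).

    The questions are evaluated in the order used in the text:
    containment first, then common superset, then common subset,
    and finally the common role (W-relationship).
    '''
    if contains or contained_by:          # H1 / H2
        return 'containment'
    if common_superset:                   # H3
        if common_subset:                 # H4
            return 'common superset with overlap'
        return 'common superset without overlap'
    if common_role:                       # W-relationship test
        return 'common role (W-relationship)'
    return 'unrelated'


# SUPPLIER (V1) is contained by DEALER (V2).
print(classify_relatedness(contained_by=True))
# PERSON and COMPANY share only the role of car owner.
print(classify_relatedness(common_role=True))
```

Because the common subset question is asked only after a common superset has been confirmed, an answer of common_subset=True without a common superset is never consulted, which matches the argument that meaningful overlapping sets have a meaningful common superset.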
If not even a W-relationship exists between the objects, they are unrelated and require no addition of inter-view relationship objects.

4.3. Conflict Therapy

As soon as a conflict is detected by the diagnosis procedure, the integration method will correct the problem. Thus, while there exists a diagnosis procedure to recognize conflicts, there exists no therapy procedure per se. Instead, for each conflict case, a case solution is defined. All case solutions are based on a set of 11 elementary solution operations which were formulated earlier as rules guiding view integration:
1. Relationship becomes an entity.
2. Relationship attribute becomes an entity.
3. Entity attribute becomes an E-R construct.
4. Association of an entity to a relationship.
5. Relocation of a relationship after creation of new superset or subset classes.
6. Representation of containment.
7. Representation of a common role (W-relationship).
8. Representation of common superset without overlap.
9. Representation of common superset with overlap.
10. Renaming of homonyms and synonyms.
11. Addition of missing objects.
One or more of these elementary therapy measures may have to be carried out during conflict reconciliation. Each of them will be described in detail. Appendix 2 will show which groups of elementary solutions will be applied to specific conflict cases and their sub-cases.

Relationship becomes an entity (S1)

Whenever necessary, a relationship is transformed into an entity. If a relationship becomes an entity, the linkages between the relationship and the entities it associated become relationships themselves (see Figure 9).
[Figure 9: Relationship Becomes an Entity]
The entity construct is the more fundamental one. Furthermore, an entity can be associated to other entities by means of a relationship, e.g., an Isa relationship. Consequently, for the newly created entity set, relationships to other objects can be represented within the E-R modelling language. In the example in the figure, the relationship Contract between DEALER and CUSTOMER becomes an entity itself, and two new relationships, Dealer_contract and Customer_contract, are created in addition.

Relationship attribute becomes an entity (S2)

When necessary, relationship attributes are converted into entities and a linkage is expressed between the relationship and the newly created entity (see Figure 10).
[Figure 10: Relationship Attribute Becomes an Entity]
Relationship attributes that have to be transformed into entities are interconnection attributes. Interconnection attributes represent entities (or E-R constructs) in shortened form. If the database requires that an interconnection attribute be associated with another object, it first has to be converted into an entity (or an E-R construct). In the illustration, SUPPLIER is associated with PART through the Supply relationship, which has an attribute Project. This attribute subsequently becomes an entity.

Entity attribute becomes an E-R construct (S3)

Similar to relationship attributes, entity attributes may have to be transformed if they require association with other objects, or if another view represents them differently.
An entity attribute which is an interconnection attribute represents an entity-relationship construct in shortened form. Therefore, it will be converted into an entity-relationship structure (see Figure 11). Typically, the newly created entity will refer to the same real world object that the original attribute referred to. However, the user may think of the newly created relationship as the object that corresponds to the original attribute. In fact, the attribute corresponds to both the entity and the relationship. In the example, the PART entity has an attribute Supplier which in fact represents a Supply relationship and a SUPPLIER entity in shortened form.
[Figure 11: Entity Attribute Becomes an Entity-Relationship Construct]

Association of an entity to a relationship (S4)

A conflict situation may require the association of an already existing entity with an already existing relationship. The new element added to the view is the association link (role) between the entity and the relationship (see Figure 12).
[Figure 12: Association of an Entity to a Relationship]
Such a situation arises when two relationships are similar, even though one involves only a subset of the entity types associated by the other relationship, e.g., one is a binary, the other a ternary relationship. The figure shows a Supply relationship involving only SUPPLIER and PART in the first relationship. Subsequently, the PROJECT entity is also tied into the relationship.
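Two of the construct changes described above, S1 and S3, can be sketched on a toy schema. Everything in this sketch is an assumption made for illustration: the dictionary layout of the schema, the function names, and the rule used to derive the names of new objects. The thesis does not prescribe a data structure.

```python
def relationship_to_entity(schema, rel):
    '''S1: the relationship becomes an entity; every former link between
    the relationship and a participating entity becomes a relationship.'''
    participants = schema['relationships'].pop(rel)
    new_entity = rel.upper()
    schema['entities'].add(new_entity)
    for p in participants:
        # Assumed naming rule, e.g. DEALER + Contract -> Dealer_contract.
        link = f'{p.capitalize()}_{rel.lower()}'
        schema['relationships'][link] = (p, new_entity)
    return schema


def attribute_to_er_construct(schema, owner, attr, rel_name):
    '''S3: an entity attribute becomes an entity plus a relationship
    that ties the new entity to the attribute's former owner.'''
    schema['attributes'][owner].remove(attr)
    new_entity = attr.upper()
    schema['entities'].add(new_entity)
    schema['relationships'][rel_name] = (owner, new_entity)
    return schema


# S1 example from the text: Contract between DEALER and CUSTOMER.
s1 = {'entities': {'DEALER', 'CUSTOMER'},
      'relationships': {'Contract': ('DEALER', 'CUSTOMER')},
      'attributes': {}}
relationship_to_entity(s1, 'Contract')
print(sorted(s1['relationships']))   # ['Customer_contract', 'Dealer_contract']

# S3 example from the text: PART has an attribute Supplier.
s3 = {'entities': {'PART'},
      'relationships': {},
      'attributes': {'PART': {'Supplier'}}}
attribute_to_er_construct(s3, 'PART', 'Supplier', 'Supply')
print(s3['relationships'])           # {'Supply': ('PART', 'SUPPLIER')}
```

The relationship name Supply is passed in explicitly because it cannot be derived from the attribute name; in practice it would have to come from the user or from the other view.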
Relocation of a relationship after creation of new superset or subset classes (S5)

Whenever a new superset-subset relationship is introduced into a view, the possibility exists that existing relationships may have to be relocated. Figure 13 shows such a case. In view V1, DEPARTMENT Employs FULLTIME_EMPLOYEE, while view V2 reveals that every FULLTIME_EMPLOYEE is an EMPLOYEE. Once the views are combined, it becomes evident that the Employs relationship should associate DEPARTMENT with EMPLOYEE rather than with FULLTIME_EMPLOYEE. Hence, the Employs relationship is relocated. Relocation becomes necessary whenever the original relationship, e.g., Employs, should have referred either to a more general object, e.g., EMPLOYEE instead of FULLTIME_EMPLOYEE, or to a more specific object.
[Figure 13: Relationship Relocation]

Representation of containment (S6)

Whenever one object (class) represents the superset of another object and this superset-subset relationship is meaningful for the database, it has to be represented by an Isa relationship between the two objects (see Figure 14).
[Figure 14: Representation of Containment]
The illustration in the figure shows the creation of an Isa relationship between an EMPLOYEE and a FULLTIME_EMPLOYEE entity.

Representation of a common role (W-relationship) (S7)

Two objects can be unrelated but nevertheless have some affinity to each other if they assume a common role. Goldstein and Storey (1988) identify this affinity as a W-relationship. Figure 15 depicts two entities, COMPANY and PERSON, as unrelated but both assuming the role of a car owner.
Both people and companies can be car owners. In such a situation, new objects have to be created to represent the common role, e.g., STOCKHOLDER, as well as to represent the specific roles, e.g., COMPANY_STOCKHOLDER and PERSON_STOCKHOLDER. Each object representing a specific role will be contained by one of the original objects, e.g., COMPANY or PERSON, as well as by the object representing the common role. Whenever a common role is represented, relocation of relationships may have to take place.

Representation of common superset without overlap (S8)

A superset without overlap describes objects that exclude each other, such as FULLTIME_EMPLOYEE and PARTTIME_EMPLOYEE. Figure 16 illustrates such a scenario and shows the creation of a new superset object EMPLOYEE, connected to the original objects through two Isa relationships. The example in Figure 16 is based on the assumption that the EMPLOYEE entity has not previously existed in either of the views. Whenever a common superset is represented, relocation of relationships may have to occur.
[Figure 16: Representation of a Common Superset without Common Subset]

Representation of common superset with overlap (S9)

In situations where two objects have not only a common superset but also a common subset (overlap), both the superset and the subset have to be represented by additional objects and by Isa relationships between the original objects and the superset and subset objects (see Figure 17).
[Figure 17: Representation of Common Superset and Common Subset]
Figure 17 depicts PROJECT_TEAM_MEMBER and PRODUCT_TEAM_MEMBER entities. Both have the common superset EMPLOYEE and the common subset PROJECT&PRODUCT_TEAM_MEMBER. The Isa relationships represent that all team members are employees and that the members of the project&product team belong to both the project and the product team. Again, any previously existing superset, subset or Isa relationships will not be duplicated. Whenever a common superset or a common subset is represented, relocation of relationships may have to occur.

Renaming of homonyms and synonyms (S10)

Renaming becomes necessary when otherwise identical objects carry different names (synonyms), or when different objects carry the same name (homonyms). Once synonyms are treated, the objects should have the same name. That name should also be different from the name of any other object in either view. Once homonyms are treated, the involved objects should carry names that are different from each other and different from the names of all objects they are not known to be identical to.

Addition of missing objects (S11)

Objects can be missing. Most views will overlap only partially. Hence, for any two views, all objects that exist in one view but not in the other have to be added to the other view in order to make the views identical. The addition of missing objects is part of the \"view completion\" strategy used in this integration method. During integration, both views that take part in the integration process are altered until finally they are identical. This strategy is different from those that create a third \"integrated\" view during the conflict resolution process.
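The combined effect of renaming (S10) and addition of missing objects (S11) on a pair of views can be illustrated with a short sketch. It rests on several assumptions that are not part of the original method description: each view is a dictionary from object name to construct, the pairs judged identical in meaning are supplied as a mapping from V1 names to V2 names, synonyms adopt the V1 name, and homonyms are made unique with a view suffix. The sample objects ORDER and PROJECT are invented for the example.

```python
def complete_views(view1, view2, identical):
    '''Apply S10 and S11 until both views contain the same objects.

    identical maps a V1 name to the V2 name with the same meaning.
    '''
    v1, v2 = dict(view1), dict(view2)

    # S10, synonyms: identical objects receive one shared name.
    for n1, n2 in identical.items():
        if n1 != n2:
            if n1 in v2:
                # The chosen name must not create a new homonym.
                raise ValueError(f'name {n1} is already used in view 2')
            v2[n1] = v2.pop(n2)

    # S10, homonyms: same name, but not known to be identical.
    for name in sorted((set(v1) & set(v2)) - set(identical)):
        v1[f'{name}_V1'] = v1.pop(name)
        v2[f'{name}_V2'] = v2.pop(name)

    # S11: add every object that is missing from the other view.
    for name, construct in list(v1.items()):
        v2.setdefault(name, construct)
    for name, construct in list(v2.items()):
        v1.setdefault(name, construct)
    return v1, v2


view1 = {'SUPPLIER': 'entity', 'PART': 'entity',
         'Supply': 'relationship', 'ORDER': 'entity'}
view2 = {'MANUFACTURER': 'entity', 'PART': 'entity',
         'ORDER': 'entity', 'PROJECT': 'entity'}
pairs = {'SUPPLIER': 'MANUFACTURER', 'PART': 'PART'}

done1, done2 = complete_views(view1, view2, pairs)
print(done1 == done2)    # True
print(sorted(done1))
```

Both views are altered and no third view is built, which corresponds to the view completion strategy described above.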
Many conflict cases require the combination of several elementary therapy procedures to correct a conflict. For instance, a case of construct mismatch paired with synonymy (Case 6) requires a name change and a construct change, i.e., therapy S10 and one of S1, S2, or S3. Appendix 2 presents the conflict cases and applicable therapy procedures. Case 6 is shown below for illustration.

CONSTRUCT MISMATCH AND SYNONYM
N1 <> N2; T1 <> T2; M1 = M2; C1 <> C2
6.1. Entity is Relationship. Solution: S10 and S1.
6.2. Entity Attribute is Entity-Relationship construct. Solution: S10 and S3.
6.2.1. Attribute is Entity.
6.2.2. Attribute is Relationship.
6.3. Relationship Attribute is Entity. Solution: S10 and S2.

4.4. The Impact of Heuristics

The main goal of this research is the development of a complete view integration method. The secondary goal is an adaptation of this method to operate with insufficient information. The integration method in the form described so far does not take into account the source of its information requirements. For example, if the method has to know whether EMPLOYEE in view 1 and DEALER in view 2 are of the same object type (construct), the method expects this information to be available. The source of the information is of no concern. Among the four relevant dimensions for each object (name, construct, meaning, and context), name and construct are the ones most easily assessed. Does EMPLOYEE have the same name as DEALER? Obviously not. The object type is also observable, because object types are explicitly stated in E-R models. The assessment of meaning identity, and therefore also of context identity, is a much more difficult problem. The question is whether two view objects refer to the same real world object.
Recognition or interpretation of real world objects is a task beyond most computer systems and not a concern of this research. Nevertheless, recognition of meaning identity or difference is the most crucial recognition task, since the other dimensions follow the meaning dimension. That is, if two objects have the same meaning, their names will ultimately be the same; if they have different meanings, their names will ultimately be different. The following alternatives exist to satisfy the meaning information requirement:
1. user interrogation;
2. advance meaning specification;
3. method \"guesses\".
The first alternative to satisfy the meaning information requirement is user interrogation. Every time two objects are compared, the system could ask the user \"are these two objects identical in meaning?\". This form of operation demands a substantial amount of question answering by the user, especially since for any object O1 in view 1 at most one object O2 in view 2 with the same meaning is allowed to exist. Advance meaning specification requires an ex-ante definition of the meaning of each object in a form that allows the method to compare it to other objects and to decide on identity or difference. This requirement results in two main problems. First, meaning descriptions may have to be very detailed to differentiate between objects that are quite similar, yet not completely identical. Thus the up-front effort required is very high. Secondly, meaning definitions have to be formulated in such a form that there can be no misinterpretations. The terms used to define meaning have to be consistent over all object definitions.
These two problems virtually rule out a prior complete definition of each object's meaning. Method \"guesses\" require that the integration method has strong evidence on which it can base its guesses. \"Guessing\" implies that whenever the method compares two objects, it makes a decision whether to believe that the objects are identical or not. This is the way in which humans operate. When we say \"I know\", we mean that we believe, based on evidence for the fact and no or little evidence against the fact. If evidence is not available, the method is bound to make mistakes. Unfortunately, ample opportunity for mistakes exists, since the amount of positive information (any O1 is identical to at most one O2) is so much smaller than the amount of negative information. Hence, reliance on guesses is not a desirable alternative. Apparently, none of the alternatives by itself provides a reasonable solution to the information requirement problem. The first alternative, interrogation, provides the information, yet at high cost to the user. The second alternative, up-front definition, does not necessarily provide all the information, and it requires a lot of user effort in addition to an unambiguous representation. The third alternative requires no user effort but does not guarantee that the information requirements are satisfied correctly. Consequently, the best strategy to satisfy the requirements is to combine the good aspects of the discussed alternatives.
User interrogation is the only method that satisfies the information requirements; therefore it is the dominant approach (if the user says that in his world two objects are identical, they are identical, unless this fact conflicts with a previous statement). The other two alternative approaches can be used to overcome or at least alleviate the weakness of direct user interrogation, because they can limit and prioritize the questions to be asked. Most of the questions of the type \"is object O1 identical to ...\" will result in the answer \"no\", or they will demand the comparison to a vast number of other objects at once. If O1 is compared to all objects O2 in one comparison, the user has to deal with a large amount of information, which may make it difficult to answer correctly. Consequently, an improved method should reduce the number of objects O1 has to be compared to. If object identity is the goal, only those O2s should be compared to O1 which could potentially be identical to O1. In other words, a filter would be used to reduce the number of objects in the comparison. Ex-ante meaning definitions of objects, if in unambiguous form, can be used in such a manner. If the purpose of ex-ante meaning definitions in this approach is to allow an automatic assessment of difference, meaning definitions can become much shorter. For example, the meaning definition of each object could contain just one fact, its value being either \"animate object\" or \"inanimate object\", to separate all E-R model objects describing living creatures from those describing things.
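A one-fact meaning definition of this kind can serve as a filter in front of the user dialogue. The following sketch assumes that every object in view 2 has already been classified by hand; the function name category_filter and the sample classification are hypothetical.

```python
def category_filter(o1_category, view2_categories):
    '''Keep only those V2 objects whose category equals that of O1.

    Objects in a different category are known to differ in meaning,
    so the user never has to be asked about them.
    '''
    return [name for name, category in view2_categories.items()
            if category == o1_category]


# Assumed classification of the objects in view 2.
view2_categories = {
    'DEPARTMENT': 'inanimate',
    'DEALER': 'animate',
    'PROJECT': 'inanimate',
}

# O1 is EMPLOYEE from view 1, classified as animate.
remaining = category_filter('animate', view2_categories)
print(remaining)   # ['DEALER']
```

The filter only establishes difference. Whether EMPLOYEE and DEALER are actually identical in meaning still has to be decided by the user.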
If all database objects were correctly classified, the method could automatically decide that EMPLOYEE and DEPARTMENT are different, because the former is a living object and the latter is not. A few general categories can be chosen which allow sufficient specification and differentiation of meaning without the need for an excessive up-front definition effort. Ein-Dor (1987) discusses the use of such "common sense knowledge" in reasoning. Grounded on such a common sense knowledge based classification, the integration method could quickly eliminate those objects O2 that are not identical to object O1. The user would only have to decide among the remaining objects.

A further reduction in the number of objects involved in the comparison can be initiated through the use of other available information, in combination with the use of heuristics, as discussed previously. Instead of guessing which objects are identical, the method could use any additional evidence to further reduce the number of objects under consideration. The following two views shall exemplify this approach, which utilizes context information:

View 1: EMPLOYEE--Employed_by--DEPARTMENT
View 2: EMPLOYEE--Works_in--XYZ--Engaged_in--PROJECT

Suppose it is already known that EMPLOYEE in view 1 and EMPLOYEE in view 2 are identical. Now, the next task would be to find out whether the relationship Employed_by is identical in meaning to any object in view 2. One reasonable assumption would be to expect that an object identical to Employed_by would also be a relationship in view 2.
This does not have to be the case but is quite likely (hence, a heuristic). This simple assumption reduces the number of contenders in view 2 to the objects Works_in and Engaged_in. Another reasonable assumption would be to expect that the object sought in view 2 is also associated with that view's EMPLOYEE entity. Again, this does not necessarily have to be the case, since information could be missing in view 2, yet it is an assumption likely to be true. The second assumption leaves only Works_in as a potential candidate to have the same meaning as Employed_by. Consequently, instead of asking the user "is the relationship Employed_by identical in meaning to one of the following: Works_in, XYZ, Engaged_in, PROJECT?", the method can more intelligently ask "is the relationship Employed_by identical in meaning to the relationship Works_in?", thus simplifying the decision task for the user.

Not only context and construct can be used to make assumptions about the identity of objects. Other available information, such as names, can be used too. Figure 18 provides an overview of potential sources of evidence for meaning identity. The first aspect, meaning representation, has already been discussed.

Figure 18: Sources of Evidence for Meaning Identity

The second aspect, context, is broken down into three observable facts: related objects, cardinalities, and roles of entities in a relationship. "Related objects" denotes the general definition of context. Cardinalities refers to the context of relationships.
If two relationships not only associate the same entities but also do so with the same mapping ratios, the evidence for the relationships' identity is even stronger. When a view contains multiple relationships associating the same set of entities, a differentiation by cardinalities can be useful. The use of roles applies only when roles are defined. If names are given to the associating link between an entity and a relationship, then these role names can be used for comparison.

Third, attributes can serve as an indicator for identity. The problem is that attributes are objects in themselves and therefore subject to the same difficulties with respect to identity assessment. One aspect of attributes, however, is easily found out: their names. Thus, two objects may be speculated to be identical if their attributes have identical names. As for all previous indicators, there has to be room for interpretation. The requirement should not be that all attributes have to be identical, yet at least some. Alternatively, the key attribute(s) could be the focus of attention. Identical objects are likely to have identical key attributes.

Fourth, identical domains can be an indicator for identical meaning, if domains can be defined unambiguously. For attributes, domains are the value sets from which the attribute values are drawn, e.g., "Social Security Number". For other objects, an object's superset defines its domain. For example, EMPLOYEE--Isa--PERSON specifies the domain of EMPLOYEE as being a person.
If the other view also contains the PERSON entity, then the EMPLOYEE entity could exist only among its subsets.

Finally, the name of an object as an indicator for its meaning can be another relevant piece of evidence, especially if name identity is not defined as strict identity of the character strings but also allows for singular/plural differentiation, as in EMPLOYEE vs. EMPLOYEES. Both objects could be expected to be the same, even though their names are, strictly interpreted, different. For the analysis of relatedness of objects, this interpretation flexibility could be widened, allowing for comparison of objects that differ only in their names' prefixes. For example, PART_TIME_EMPLOYEE, EMPLOYEE, and FULL_TIME_EMPLOYEE could be expected to be identical or at least related, since all their names contain the root word employee.

It is unlikely that for any given object all these aspects point in the same direction, that is, identify the same object. Often, it may not be known what the context of a particular object is, naming preferences will differ, and different tasks may require different object attributes. The approach to be taken is to use these indicators as a filter of variable density. At first, the filter should be tight, to suggest only the most likely candidate(s) for a meaning match, i.e., only the objects of the same type with the same context and of the same meaning category. Should this filter still be too wide, e.g.
, for a database with many entities of the people category, partial overlap of attribute names or identity of key attribute names can be used to restrict the number of objects. Upon failure, i.e., if none of the suggested objects resulted in a proper match, the technique could remove one or more of the earlier applied restrictions, e.g., look for all objects of the same meaning category, regardless of object type and context.

There exists no single best rule for the application of meaning indicators. The only indicator which is always applicable and correct in its prediction, should the information be available, is the meaning category indicator. By definition, two objects cannot be identical in meaning unless their meanings belong to the same category of meaning. For example, EMPLOYEE and DEPARTMENT cannot have the same meaning because one is an animate object and the other an inanimate object. Hence, this indicator is the only one that can eliminate objects with certainty. The other indicators can only suggest that an object may have different (or the same) meaning.

Only empirical data generated under a variety of conditions can provide stronger evidence on which meaning indicators work better than others. For instance, if the same systems analyst produces all views (based on different users' information requirements), one may expect that object type is a reasonable indicator (filter) for meaning identity, the underlying assumption being that a single database designer will be more consistent in what he models as a relationship vs. an entity or attribute than a multiplicity of designers.
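The filter of variable density described above can be sketched as follows, using the Employed_by example. This is a minimal Python illustration under stated assumptions, not the AVIS procedure: the tier ordering, the category labels ("association", "organization", and so on), the callback `confirm` standing in for the user, and all function names are hypothetical. Context is compared by object name, which presupposes that neighbouring objects such as EMPLOYEE have already been identified across the views.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    name: str
    kind: str                          # "entity", "relationship" or "attribute"
    category: str                      # coarse meaning category (labels assumed)
    context: frozenset = frozenset()   # names of directly associated objects


def same_category(a, b):
    return a.category == b.category


def same_kind(a, b):
    return a.kind == b.kind


def shared_context(a, b):
    return bool(a.context & b.context)


# Tiers ordered from the tightest filter to the loosest one (assumed ordering).
TIERS = [
    (same_category, same_kind, shared_context),
    (same_category, same_kind),
    (same_category,),
]


def find_match(target, other_view, confirm):
    """Ask the user about candidates tier by tier; relax the filter on failure.

    Returns the confirmed object (or None) and the number of questions asked.
    """
    asked = set()
    for tests in TIERS:
        for cand in other_view:
            if cand.name in asked:
                continue
            if all(test(target, cand) for test in tests):
                asked.add(cand.name)
                if confirm(target, cand):
                    return cand, len(asked)
    return None, len(asked)


employed_by = Obj("Employed_by", "relationship", "association",
                  frozenset({"EMPLOYEE", "DEPARTMENT"}))
view2 = [
    Obj("EMPLOYEE", "entity", "person", frozenset({"Works_in"})),
    Obj("Works_in", "relationship", "association", frozenset({"EMPLOYEE", "XYZ"})),
    Obj("XYZ", "entity", "organization", frozenset({"Works_in", "Engaged_in"})),
    Obj("Engaged_in", "relationship", "association", frozenset({"XYZ", "PROJECT"})),
    Obj("PROJECT", "entity", "abstract", frozenset({"Engaged_in"})),
]

# A simulated user who accepts Works_in as identical in meaning to Employed_by.
match, questions = find_match(employed_by, view2, lambda t, c: c.name == "Works_in")
print(match.name, questions)
```

With the tight filter the simulated user is asked a single question. If every suggestion is rejected, the sketch drops the context restriction, then the type restriction, and never leaves the meaning category, which is the only indicator that eliminates objects with certainty.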
If all view specifications and designs are done by the same person (user designer), one should expect names to be used consistently throughout the views. Hence, names could provide a good basis to judge meaning identity.

4.5. Generalization Hierarchy for Database Objects

The previous section introduced the idea of ex-ante meaning definitions according to predefined meaning categories. Here, the concept of a generalization hierarchy shall be introduced to facilitate the categorization. The difficulty in developing such a classification scheme is the fact that it has to be acceptable to all people involved in the database design process. In order to fulfill this goal, the generalization hierarchy should be:

1. complete;
2. consistent;
3. discriminative;
4. concise.

Criteria 1 and 2 are minimum criteria. First, a classification scheme that does not allow the user to classify all his objects in accordance with it is insufficient to capture that user's knowledge. Second, if the scheme induces the user to classify the same object under different categories, it violates the purpose of the scheme, namely to identify similarity or difference of object meanings.

Criteria 3 and 4 are based on Leibniz's Minimality Principle (Leibniz, 1956, pp. 198-199). This principle postulates that a representation is superior to another one if it requires a shorter explanation to explain the same phenomena. Correspondingly, a generalization hierarchy that can differentiate among a larger number of object classes than another one with the same number of differentiation criteria is superior.
What is undesirable is a classification scheme that is very fine-grained for a subset of object classes but very coarse for the remainder of object classes. Similar to an unbalanced binary tree, the too fine/too coarse generalization hierarchy would waste too many levels of specialization on too few phenomena.

Unfortunately, the choice of the "right" generalization hierarchy will depend on the knowledge domain and on the way in which the person who classifies objects differentiates among them. For example, a generalization hierarchy which contains only one class for all "people objects" will deal poorly with a database that stores only data for different people roles (e.g., employee, investor, saver, tax payer). Consequently, validation of the quality of a generalization hierarchy is possible only within the context of a particular knowledge domain and a specific person who classifies objects. Hence it is necessary to include the creation of such a generalization hierarchy in the requirements analysis effort. The database designer has to develop a hierarchy which can represent the application domain and has the above-mentioned desirable properties. If no such specialized categorization hierarchy exists, a domain-independent categorization hierarchy could be used.

The hierarchy created as part of this project is rather flat, incorporating only a few levels of specialization. A flat generalization hierarchy has the obvious disadvantage of limited discriminative ability. However, object classifications are used to identify difference in meaning, not meaning identity.
Object classification is only one of the identifiers used by the integration method, and the method will always interrogate the user if in doubt. Since the focus is on difference in meaning, even a flat generalization hierarchy has reasonable discriminative ability, as the following example may illustrate.

Consider a generalization hierarchy that can differentiate among 20 classes, such as Person, Animal, Organization. Object EMPLOYEE is classified as a Person. The question to be answered is "is object XYZ different in meaning from object EMPLOYEE?". Without further knowledge about XYZ, XYZ has an equal probability of belonging to any of the classes, and thus a .05 chance of belonging to the class Person. Thus there exists a .05 chance for the classification mechanism to suggest that EMPLOYEE and XYZ are not different in meaning. In this situation (1 out of 20 cases), the user would have to be consulted, if no other indicators were able to answer the question. An increase of the number of classes to 40 would reduce the probability to .025; an increase to 200 classes would result in a .005 probability, requiring user interrogation only in 1 out of 200 cases. The reductions in probability have to be weighed against the classification effort, which is an ex-ante investment.

A generalization hierarchy for the categorization of object classes shows similarities with the attempts to represent common sense knowledge in artificial intelligence.
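The class-count arithmetic of the 20, 40 and 200 class example can be checked with a few lines of Python. The sketch rests on the same assumption as the example, namely that an unclassified object is equally likely to belong to any class; the function name `consult_probability` is hypothetical.

```python
from fractions import Fraction


def consult_probability(n_classes: int) -> Fraction:
    """Chance that an unclassified object falls into one given class.

    Assumes every one of the n_classes classes is equally likely, so the
    classification cannot rule out identity in 1 out of n_classes cases.
    """
    if n_classes < 1:
        raise ValueError("n_classes must be at least 1")
    return Fraction(1, n_classes)


for n in (20, 40, 200):
    p = consult_probability(n)
    print(f"{n} classes: p = {float(p)} (1 out of {p.denominator} cases)")
```

The probability falls only in inverse proportion to the number of classes, which is why the gain from a finer hierarchy has to be weighed against the ex-ante classification effort.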
The classification hierarchy discussed here is, however, less ambitious, since the task, judging whether two objects are different in meaning, is simpler than the tasks presented in the artificial intelligence applications (e.g., Schank's and Rieger's restaurant scripts (1974) or Hayes' naive physics (1979)). Ein-Dor suggests concept clusters for common knowledge in the business environment (1987). His categories are:

1. exchange,
2. time,
3. location,
4. measurement,
5. media of exchange,
6. obligations and commitments,
7. types of businesses,
8. behaviors,
9. naive economics,
10. employment,
11. people who engage in business.

This classification clarifies the difference between a common knowledge representation and a generalization hierarchy. Ein-Dor's classes are not mutually exclusive. For example, the employment situation can be classified as group 10 as well as group 6. These classes represent areas in which a common sense computer program should have knowledge.

The categorization that can be used in the absence of any more domain-oriented hierarchies is structured as follows:

1. Objects
  1.1. Living objects (even if now dead)
    1.1.1. Plants (flora)
    1.1.2. Animals (fauna)
    1.1.3. Persons
      1.1.3.1. Person (generic, not person roles)
      1.1.3.2. Person roles
        1.1.3.2.1. Person roles in person-person interaction (e.g., parent)
        1.1.3.2.2. Person roles in person-thing association (e.g., car owner)
        1.1.3.2.3. Person roles in person-person-thing interactions (e.g., manager)
  1.2. Inanimate objects
    1.2.1. Abstract objects
      1.2.1.1. Abstract objects that are organized (have structure)
        1.2.1.1.1. Hierarchies (e.g., a business company)
        1.2.1.1.2. Markets (e.g., the real estate market)
        1.2.1.1.3. Other structures
      1.2.1.2. Heaps, lumps and atomic abstract objects (e.g., a dream, a theory)
    1.2.2. Concrete objects ("things")
2. Object characteristics (e.g., color, size)

According to this categorization scheme, each view object can have a meaning list containing up to 5 elements, such as [object,living,person,role,person-thing] for category 1.1.3.2.2. Objects classified as belonging to different categories cannot be identical in meaning. If the meaning list for an object is incompletely specified, e.g., category 1.1.3., it may not be different from an object classified as 1.1.3.2.3., and therefore user interrogation may be necessary. Objects belonging to different categories but belonging to the same higher category may be related in meaning. More domain-specific categorization schemes will have more and better fitting categories but will use the same reasoning mechanism to interpret the results of categorization.

4.6. Assessment of the Method

In an earlier chapter, the strengths and weaknesses of previous integration methods were assessed. The same evaluation criteria will now be used to highlight the capabilities and limitations of the method presented here.

Similar to previous semantic integration methods, the one introduced in this research requires designer interaction during the integration process. The designer has to be consulted to settle questions concerning identity or difference in meaning. However, the method employs heuristics to reduce the number of questions that must be asked.

View integration, as discussed here, covers a larger part of the integration problem than most other techniques.
It performs conflict resolution, view merging and addition of inter-set relationships. Batini et al. (1983) cover additional aspects of the conceptual design process, including correctness and completeness tests for individual views before the integration process (pre-integration). These tests, however, are not an essential part of the integration process; rather, they are elements of the view creation task.

This research exceeds all preceding approaches in the number of conflict cases covered. More important than the number of cases, however, is the fact that the conflict list is exhaustive, based on all relevant object differentiation criteria.

Similar to other semantic methods, this one reduces the complexity of the integration task by focussing on high level objects: entities and relationships. The method also separates the test for relatedness from the test for identity. Heuristics further reduce the task complexity. The question "is object O1 identical in meaning to one of the objects {O2}?" can be simplified through reduction of the size of the set {O2}. Heuristics are used to eliminate unlikely candidates from {O2}.

This research also investigated whether the integration problem could be described by an even smaller set of conflict categories than the 17 general cases identified in section 4.1. To simplify the description of conflicts, a graph notation was chosen which represents every object, whether entity, relationship, or attribute, as a node, and every association between objects (entity role, attribute association) as an edge.
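The graph notation just described can be sketched in a few lines of Python. This is an illustration only and not the representation used in AVIS: the classes `Node` and `Edge`, the helper `neighbours`, and the attribute Name attached to EMPLOYEE are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    ident: str
    kind: str   # "entity", "relationship" or "attribute"


@dataclass(frozen=True)
class Edge:
    a: str      # identifier of one connected node
    b: str      # identifier of the other connected node
    kind: str   # "role" or "attribute association"


# View 1 (EMPLOYEE--Employed_by--DEPARTMENT) plus one invented attribute.
nodes = {
    Node("EMPLOYEE", "entity"),
    Node("Employed_by", "relationship"),
    Node("DEPARTMENT", "entity"),
    Node("Name", "attribute"),
}
edges = {
    Edge("EMPLOYEE", "Employed_by", "role"),
    Edge("DEPARTMENT", "Employed_by", "role"),
    Edge("Name", "EMPLOYEE", "attribute association"),
}


def neighbours(ident, edge_set):
    """Return the identifiers of all nodes directly connected to ident."""
    found = set()
    for edge in edge_set:
        if edge.a == ident:
            found.add(edge.b)
        elif edge.b == ident:
            found.add(edge.a)
    return found


print(sorted(neighbours("EMPLOYEE", edges)))
```

Because entities, relationships and attributes are all plain nodes, the same `neighbours` function yields the context of any object, whatever its type.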
Based on this notation, view conflicts take the form of missing nodes or edges, or inconsistently labelled nodes (name mismatch). A mismatch between types of nodes, e.g., entity-attribute vs. entity-relationship construct, can be characterized as a graph contraction. A graph contraction is the removal of an edge which results in the merging of the two objects linked by the edge into one new object. For example, an E-R construct is merged into one new object, an entity attribute. Similarly, a relationship replaces a relationship-entity-relationship structure when two edges are contracted in the latter. Both types of contraction are depicted in Figure 19.

Figure 19: Construct Mismatch Shown as Graph Contraction (panels: "Entity attribute is E-R construct"; "Relationship represents E-R-E construct")

The examples illustrate that the graph notation is able to describe the construct mismatch conflict, in addition to the missing object conflict and the context mismatch conflict, based on only two criteria: missing nodes and missing edges. A missing object translates into a missing node, context mismatch translates into missing edges (plus potentially missing nodes), and construct mismatch translates into missing edges and graph contraction. Since the notation can describe the same conflict phenomena as the E-R model using fewer mechanisms, it is a more powerful description tool.

The AVIS view integration program developed as part of this research employs the graph approach. In AVIS, views are described in the form of nodes and edges. Nodes represent objects, and edges, roles. Each object (node) is defined by the same set of properties: type (e.g.
, attribute), view, object identifier, object name, and object meaning (plus one more property not relevant for this explanation). Each role (edge) contains the identifiers of the two objects it is connecting. Both are explained in more detail in the subsequent chapter.

Even though the graph notation is more powerful as a description tool than the E-R model, integration cases have been discussed within this research using E-R concepts. The E-R model is widely used as a conceptual modelling language in database design, while the above graph notation is not. Thus, conflict cases and solutions described by means of the E-R model are more easily understood and thus presumably more useful to the database designer than ones based on a graph notation. The differences between the internal graph representation in AVIS and the external E-R representation require that AVIS frequently translate between these two representation forms. Nevertheless, the internal representation in the form of graphs is very useful because it allows the system to easily compare objects of different types along their relevant dimensions. For instance, the question "do object O1 and object O2 have identical meaning?" can be easily phrased in the graph notation, shown in Figure 20 in its Prolog equivalent. This simple example illustrates that the integration method can compare objects of any type in the same manner. For example, T1 may be "attribute" while T2 is "entity" [1].

    identical_meaning(O1, O2) :-
        object(T1, V1, O1, N1, M),
        object(T2, V2, O2, N2, M).

Figure 20: Identical Meaning Query in Prolog Graph Notation

An additional strength of the method discussed in this research is the use of meaningful data objects.
The E-R model allows the description of objects that are meaningful to database users. The integration method further allows the representation of some data semantics.

[1] However, the example in the figure shows an oversimplification of the meaning comparison problem. AVIS does not use Prolog's pattern matching mechanism in this simple form to assess meaning identity. Meaning comparison is described in more detail in the subsequent implementation chapter.

Unlike other semantic integration methods, this one includes an algorithm for the identity and for the relatedness tests, which explicitly specifies the steps of the procedure. For example, the identity test without heuristics contains a four-step procedure in which identity or difference of the four relevant object criteria is assessed. Due to the form in which meaning identity and relatedness questions are stated, namely as a 1:N comparison ("Is object O1 identical to one of {O2}?"), the computational effort grows linearly with the number of objects. The procedure terminates when the initially different views have become identical. To be identical, both views have to contain the same objects. Objects are identical if they are identical in all four relevant dimensions (meaning, context, construct, and name).

To judge the value of the method, the questions of correctness and completeness of the resulting views have to be addressed. (The working prototype only demonstrates the workability of the method for specific cases.) Based on the earlier description of the integration algorithm, it is known that the procedure always terminates if the initial views contain a finite number of objects.
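The 1:N comparison over the four dimensions can be illustrated with a minimal Python sketch. It is not the AVIS algorithm: the field `meaning_id` is a hypothetical identifier that stands in for the designer's decision that two objects are identical in meaning, `context` is assumed to hold the meaning identifiers of neighbouring objects so that it is comparable across views, and the names `Obj` and `diagnose` are invented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    meaning_id: str       # stand-in for a confirmed meaning (assumption)
    name: str
    construct: str        # "entity", "relationship" or "attribute"
    context: frozenset    # meaning ids of directly associated objects


def diagnose(o1, other_view):
    """Compare o1 with each object of the other view exactly once.

    Meaning is tested first; only when it matches are the remaining three
    dimensions compared. The single pass makes the effort linear in the
    number of objects in other_view.
    """
    for o2 in other_view:
        if o2.meaning_id == o1.meaning_id:
            conflicts = [dim for dim in ("name", "construct", "context")
                         if getattr(o1, dim) != getattr(o2, dim)]
            return o2, conflicts
    return None, ["missing object"]


employee = Obj("m-employee", "EMPLOYEE", "entity", frozenset({"m-employment"}))
project = Obj("m-project", "PROJECT", "entity", frozenset())
view2 = [
    Obj("m-department", "DEPARTMENT", "entity", frozenset({"m-employment"})),
    Obj("m-employee", "EMPLOYEES", "entity", frozenset({"m-employment"})),
]

print(diagnose(employee, view2)[1])
print(diagnose(project, view2)[1])
```

An empty conflict list would mean that the two objects are identical in all four dimensions; under the assumptions above, the views have become identical once every object of either view yields an empty list against the other.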
The procedure performs the integration task through an adjustment of both initially different views. When the procedure terminates, for each object in one view, an identical object exists in the other view. Hence, the completeness question depends on whether objects can be "lost" during integration so that the final views do not contain all objects from the initial views. The correctness question concerns whether objects from the initial views may be misrepresented in the final view. Furthermore, it has to address whether the order in which views are integrated and/or the sequence in which objects within a view are considered have any impact on the outcome of the integration process.

In this integration method, objects cannot be lost. Every object represented in at least one initial view will also be represented in the global schema. This does not imply that each object will appear in its original form. The object meaning will be preserved, but the object representation in name, construct and context may change. A relationship may be relocated, a name may be changed, or an object's construct may be changed. After a construct change, an object will in most cases be represented through more than one new object, e.g., a relationship will become a relationship-entity-relationship group. The only exception is the change of a relationship attribute into an entity, where the construct change replaces one old object by one new object. Due to the direction of change in cases of construct mismatch, an old object is always replaced by at least one new object. Hence, objects cannot be lost during the integration process.
Although objects cannot be lost, the resulting view may still be incorrect if objects are misrepresented or objects are added arbitrarily. An object is misrepresented if the knowledge represented in its post-integration form contradicts the knowledge represented in the pre-integration form. This includes name changes that result in names which do not convey the meaning of the object, construct changes which compress the information content of an object, meaning changes which result in incorrect meaning descriptions, and context changes which connect objects to objects they should not be connected to.

The integration method performs none of these invalid operations, nor does it add objects arbitrarily. Objects are only added if this addition is suggested by one of the views, that is, if at least one of the views contains an object that is not part of other views. Name changes occur only when synonyms or homonyms are detected. The choice of suitable names to overcome these conflicts is a task for the designer who uses the method. Construct changes never result in the loss of information, since the construct chosen is always the one which is able to convey the most information. Meaning changes are never made by the system (database designer). Meaning is specified by the users of the system and can only be changed by the users of the system.

Context changes occur for three reasons. First, construct changes cause context changes, as depicted in Figure 10 in the conflict therapy section. Second, an association of an entity to a relationship results in a context change (exemplified in Figure 12). Third, relationship relocation results in context change (shown in Figure 13).
All of these changes make the representation of a data object in one view compatible with that of another view. In the first two of these cases, an object O1 will only be connected to an object O2 if at least one view states that the two objects should be connected. If all views are correct prior to integration, this operation cannot result in incorrect context. Relationship relocation takes place only if, during the integration process, the database designer identifies that the relationship is applicable to the superset object rather than to the subset object (Figure 13).

Finally, we must consider whether the same outcome, that is, the same global structure, will be achieved independent of the sequence in which views are integrated. In a two-view integration problem, sequence refers to the order in which objects are compared. For example, is O1 from V1 compared to all objects from V2 first, followed by O7 from V1, or does O7 precede O1? In a multi-view integration problem, sequence also addresses the order in which views are compared. For example, if three views, V1, V2, and V3, have to be integrated, will V1 be integrated first with V2 and the result of this integration be integrated with V3, or will the integration begin with V2 and V3?

In both the two-view and the multi-view integration problems, the following operations are performed: objects existing in all views become part of the global schema, objects existing in at least one view become part of the global schema, and objects represented differently in different views are adjusted and become part of the global schema. In addition, inter-view relationships are added to the global schema. Objects that exist in all views will not be affected by the sequence of the integration process.
They will appear in the same form in the global schema. Objects that originally did not exist in all views will also be added to the global schema, independent of the integration sequence. Inter-view set relationships are similarly missing objects, although they are missing in all views. They will also be added, independent of sequence. In fact, they are added after all tests for identity of objects are completed.

The critical element for this assessment of the view integration procedure is the adjustment of views when conflicts are detected. In the two-view situation, the sequence in which objects are compared may vary. Does this change affect the outcome of the integration? This question translates into two more basic questions: first, does the sequence in which objects are compared result in differences in the diagnosis of conflicts, and second, does a potentially different diagnosis result in a different global schema?

The conflict diagnosis procedure uses the meaning dimension as its most important criterion. Once objects with identical meaning are found, conflicts are detected based on differences in the remaining dimensions: name, construct, and context. For each object in each of the views, at most one object with identical meaning can exist in the other view. This is true independent of the sequence in which objects are compared. Furthermore, with the exception of name changes for homonyms, the remaining dimensions of an object are not changed before meaning identity with another object has been established.
Therefore, for any two objects from different views, the object comparison will yield the same result, independent of the sequence of comparisons, unless the database designer using the method is inconsistent in renaming objects when homonyms are found.

One other potential source of error exists, but it is also in the domain of the database designer. The designer may find it difficult in certain situations to decide whether two objects are identical in meaning. Therefore, if both objects O1 and O2 from view V1 appear to the designer as if they could match the meaning of object O3 from V2, then the order of comparison may bias the designer to decide for O1 in one situation and for O2 in some other situation. This is a particular problem in cases of construct mismatch, where, for instance, an entity attribute in one view corresponds to an entity-relationship construct in the other view (see Figure 11). In this example, the database designer has to decide whether the attribute Supplier corresponds to the entity Supplier or to the relationship Supply. But even though the designer may have some discretion in deciding which of the objects is the matching one (entity or relationship), the conflict will be resolved in exactly the same way. The attribute will be replaced by an E-R construct. The same is true for other forms of construct mismatch.

In summary, as long as the designer is consistent in his assessment of meaning identity of objects, the diagnosis will always be the same, independent of sequence.
If the designer is inconsistent in his assessment of meaning identity, the procedure will still produce identical outcomes for cases of construct mismatch.

In the multi-view situation, invariance of the outcome (global schema) to changes in the order of view comparisons is the concern. Can objects end up in the global schema with different names, different constructs, or different contexts, based on the order in which views are processed? Again, this is not the case. The integration method prevents those variations for all but naming decisions, which are in the designer's domain. For construct changes, there is only one direction of change, to avoid loss of information. For example, if out of n views, n-1 represent an object as a relationship attribute and only one view represents it as an entity, the object will still become an entity in the global structure. In all cases, the most information-rich object representation will be the one chosen for the global structure.

Context changes are dealt with in a similar manner. For example, if a relationship R has as its context the set of entities {E1} in view V1, the set {E1, E3} in view V2, and the set {E1, E2} in view V3, the global schema will show {E1, E2, E3} as R's context, independent of the sequence in which the views were integrated. The same is true for attributes. Entities and relationships in the global view have attribute sets which are the union of the attribute sets of the corresponding objects from the original views (except, of course, when an attribute is converted to another construct).
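The order-independence argument for constructs and contexts can be illustrated with a small Python sketch. It is not part of AVIS (which is written in Prolog); the dictionary encoding, the names `CONSTRUCT_RANK`, `merge`, and `integrate`, and the ranking attribute < relationship < entity are assumptions made for illustration, the ranking being inferred from the conversion directions listed in Appendix 1. The sketch also simplifies the therapy step: it records only the chosen construct and does not build the E-R construct that replaces an entity attribute.

```python
from itertools import permutations

# Assumed ranking of constructs by information content (hypothetical
# encoding; inferred from the conversion directions in Appendix 1).
CONSTRUCT_RANK = {'attribute': 0, 'relationship': 1, 'entity': 2}


def merge(a, b):
    # Merge two representations of one object (identical meaning assumed):
    # keep the more information-rich construct and take the union of contexts.
    construct = max(a['construct'], b['construct'], key=CONSTRUCT_RANK.get)
    return {'construct': construct, 'context': a['context'] | b['context']}


def integrate(views):
    # Fold the views pairwise, left to right, in the order given.
    result = views[0]
    for view in views[1:]:
        result = merge(result, view)
    return result


# Relationship R as represented in views V1, V2 and V3 (example from the text).
r_views = [
    {'construct': 'relationship', 'context': frozenset({'E1'})},
    {'construct': 'relationship', 'context': frozenset({'E1', 'E3'})},
    {'construct': 'relationship', 'context': frozenset({'E1', 'E2'})},
]

# Integrate the three views in every possible order and compare the results.
outcomes = [integrate(list(order)) for order in permutations(r_views)]
all_equal = all(outcome == outcomes[0] for outcome in outcomes)
```

Under these assumptions every ordering yields the same merged object, because both taking the maximum of a ranking and taking a set union are commutative and associative. Naming decisions are deliberately left out, since the text assigns them to the designer.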
In conclusion, even in a multi-view situation, the method will produce the same global schema, independent of sequence, if the designer is consistent in his decisions on meaning identity.

5. IMPLEMENTATION - THE AVIS PROGRAM

5.1. Overview

An implementation of the view integration method is available in the form of the AVIS (Automatic View Integration System) program. AVIS is written in Prolog. The purpose of the program is not to show correctness of the conflict resolution method. Correctness of the method should be judged based on its underlying assumptions, the rules guiding view integration, and the conclusions drawn from them concerning the diagnosis and therapy procedure. The program can only serve as a testbed to show mistakes or omissions in details of the resolution procedure. Furthermore, it can show the feasibility of an automated view integration procedure. Appendix 3 contains the screen displays of a view integration session with AVIS to illustrate the operation of the system and its role as a testbed.

5.2. Function and Structure of the AVIS Program

To fulfill its purpose as a testbed and an indicator of feasibility, the program is an implementation of the diagnosis and therapy procedure outlined in earlier sections. The program always operates on a set of two views which are to be integrated. Such a set of two views has to be loaded into the system at the outset of the integration session. The program proceeds by checking conflict hypotheses. For each hypothesis that is checked, one eligible object from view 1 is chosen and compared to all eligible objects from view 2. Hypothesis tests are carried out in the sequence established by the integration rules and heuristics.
Depending on the outcome of a test, an appropriate therapy activity is performed, followed by another test. A therapy can be \"do nothing\" if objects do not have to be changed, or any of the other therapy actions discussed previously. The program terminates when both views have become identical.

The program structure which achieves this function is depicted in Figure 21. Following the typical architecture of knowledge-based systems, the program is designed in highly decoupled form. For instance, the sequence in which hypotheses are tested is not fixed (programmed), but determined by the sequence in which they occur on the OBJECT COMPARISON AGENDA (box 8 in the figure). Therefore, an \"urgent\" hypothesis test (typically performed during a therapy operation consisting of more than one therapy action) can pre-empt tests that would normally have occurred next. Another form of decoupling separates the step which recognizes that an object is missing (box 4) from the step that actually adds the object to the view (box 5).

[Figure 27: AVIS Meaning Identity Indicators]

A more advanced form of meaning indicator is based on the meaning representation (meaning categorization) feature. Currently, meaning lists for objects have no form restrictions and therefore allow the use of any symbol to define the meaning of an object. Future meaning lists will be more restricted in the choice of terms. Terms will have to be elements of a categorization hierarchy and will therefore be unambiguous.

6. SUMMARY AND EXTENSIONS

The main contribution of this research is the development of a complete view integration procedure.
The research went beyond the problem of inter-view constraint representation (relatedness of objects). It systematically categorized inter-view conflicts into conflict types, based on an analysis of the sources of conflicts. The source of all conflicts is a mismatch between the meaning dimension on the one hand and the other relevant object dimensions (name, construct, and context) on the other. Whenever two objects are identical in meaning, they also have to be identical in their other dimensions. If not, a conflict arises. Similarly, if two objects have different meanings, they also have to differ in the name dimension to be conflict-free. The method presented in this research can diagnose all possible combinations of mismatches and has therapy rules for all of them.

In addition to rules for recognition and resolution of conflicts, an algorithmic view integration procedure was described. It specifies the sequence of tests for object identity and object relatedness. At the termination of this procedure, two initially different views have become identical and represent all relevant inter-view constraints. Thus, either of the views has become a global schema containing the two original views. The integration procedure developed here begins with a test for object identity. At the end of this step, both views contain the same objects. The subsequent test for relatedness determines all inter-view constraints for all originally unique objects (existing in only one view). The test for relatedness may result in the addition of entities to represent superset and subset objects and in the addition of Isa relationships.
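The diagnosis rule summarized in this section (identical meaning requires identical name, construct, and context; different meaning requires different names) can be sketched in Python as follows. This is an illustrative sketch only, not the AVIS implementation. The class `SchemaObject`, the function `diagnose`, and the sample objects are hypothetical, and meaning identity is reduced to string equality, whereas in the method it is a judgment made by the designer. Relatedness tests (superset, subset, common role) are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaObject:
    # Hypothetical record of the four object dimensions used in diagnosis.
    name: str
    construct: str      # 'entity', 'relationship' or 'attribute'
    meaning: str        # stands in for the user-supplied meaning list
    context: frozenset  # names of the objects this object is attached to


def diagnose(o1, o2):
    # Return the dimensions on which o1 and o2 are in conflict.
    conflicts = []
    if o1.meaning == o2.meaning:
        # Identical meaning: every other dimension has to agree as well.
        if o1.name != o2.name:
            conflicts.append('name (synonym)')
        if o1.construct != o2.construct:
            conflicts.append('construct mismatch')
        if o1.context != o2.context:
            conflicts.append('context')
    elif o1.name == o2.name:
        # Different meaning: the names have to differ too.
        conflicts.append('name (homonym)')
    return conflicts


# Invented sample objects, loosely modelled on the Supplier example.
supplier_attr = SchemaObject('Supplier', 'attribute', 'supplier of a part',
                             frozenset({'Part'}))
supplier_ent = SchemaObject('Supplier', 'entity', 'supplier of a part',
                            frozenset({'Supply'}))
worker = SchemaObject('Worker', 'entity', 'person employed', frozenset())
employee = SchemaObject('Employee', 'entity', 'person employed', frozenset())
order_doc = SchemaObject('Order', 'entity', 'purchase order', frozenset())
order_rank = SchemaObject('Order', 'entity', 'sort sequence', frozenset())

mismatch = diagnose(supplier_attr, supplier_ent)
synonym = diagnose(worker, employee)
homonym = diagnose(order_doc, order_rank)
```

In this sketch the Supplier pair yields a construct mismatch together with a context difference, which corresponds to case 5 of Appendix 1, while the other two pairs correspond to the synonym and homonym cases (3 and 8).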
Furthermore, the research provided heuristics to simplify the integration problem for the user. Heuristics were developed to ease the user's task of identifying object pairs with identical meaning. Assumptions such as \"(in absence of information to the contrary,) two objects with identical meaning will have identical constructs\" reduce the number of objects among which the user has to look for a matching object. In case of information to the contrary, i.e., if no pair of objects with the same meaning were found, the heuristic would fail and would require a more painstaking search for a match. The research exemplified how the introduction of heuristics alters the integration procedure.

The method was designed for use as a view integration tool, through implementation as a knowledge-based system (i.e., the AVIS system). Implementation in the form of a computer program assures adherence to the sequence of conflict analysis and resolution steps. It also eases the designer's task as much as possible. Nevertheless, the conflict recognition and resolution rules which form the core of the research are valid independent of any implementation. The rules have been developed based on rules of modelling, on the E-R model, and on database design principles, rather than through tracing of database design expert behavior.

Future extensions to the research will focus on at least two areas. First, more heuristics will be developed. This will not only simplify the user's task further, it will also shed more light on the question of how we can assess when two objects are identical in meaning. The assessment of meaning identity is the most difficult part of the integration process.
Currently, the integration method does not decide on the identity of two objects without user consultation. It would be desirable to have the method decide, at least in some cases, whether two objects have the same meaning. One possible approach to extend the method in this direction is the development of categorization hierarchies for particular application areas. In this research, a very coarse categorization hierarchy has been introduced, one which facilitates deciding whether two objects have different meanings. More elaborate, as well as more domain-specific, hierarchies would allow a sharper distinction between concepts and thus allow for better judgment on identity or difference in meaning. This measure would require that users be very precise and explicit in their choice of names for entities, relationships, and attributes in the pre-integration stage. Hence, use of a categorization hierarchy may be one good source of evidence, but may not be sufficient. Ultimately a procedure will have to use more sources of evidence and will have to be tolerant of user specification errors, in order to make judgments on meaning identity that are as good as human judgments.

A second area of extension to focus on is the detection of errors in user views. The integration method in its current form assumes that views are complete (all relevant objects included), consistent (no conflicting knowledge), and minimal (each object only represented once)¹. If views are incorrect, inconsistent or not minimal, the global schema will be incorrect, inconsistent or not minimal.
For example, if one view stated (incorrectly) that \"all EMPLOYEEs are FULLTIME_EMPLOYEEs\", while another view stated (correctly) that \"every FULLTIME_EMPLOYEE is an EMPLOYEE\", the method would represent both constraints in the global schema (inconsistency), not recognizing that the only logically correct interpretation of these two statements would require EMPLOYEE and FULLTIME_EMPLOYEE to be identical. Mistakes like this one could be detected and corrected during the integration process. To permit recognition of such errors, a set of error scenarios and correction rules would have to be developed.

¹ The constraints on input views may seem rather stringent. However, we can expect views to be in consistent and minimal form if they have been created with a view creation system such as Storey's (1988). Completeness has to be assumed, unless evidence to the contrary exists. All previously discussed integration approaches make similar demands on the inputs to their integration methods.

Another possible extension that goes substantially beyond the scope of this research is the translation of the findings for database integration to knowledge base integration. While databases contain facts, knowledge bases contain facts and rules and are therefore much more difficult to integrate. Nevertheless, with the increase in the development of knowledge-based systems and corresponding efforts to improve the knowledge acquisition effort, such a project may become a fruitful endeavour for the future.

7. REFERENCES

Al-Fedaghi, S. and P. Scheuermann. Mapping Considerations in the Design of Schemas for the Relational Model. IEEE Trans. Software Engineering, SE-7, No. 1, 1981.

Armstrong, W.W. Dependency Structures of Database Relationships. Proc. 1974 IFIP Congress, Amsterdam: North Holland, pp. 580-583.
Atzeni, P., C. Batini, M. Lenzerini, and F. Villanelli. INCOD: System for Conceptual Design of Data and Transactions in the Entity-Relationship Model. Proceedings of the Second Int'l Conference on the Entity-Relationship Approach, Washington, D.C., October 1981, pp. 379-414.

Bachman, Charles W. and Manilal Daya. The Role Concept in Data Models. VLDB 77, pp. 464-476.

Barr, A. and E. Feigenbaum. The Handbook of Artificial Intelligence. London: Pitman, 1981.

Batini, C., M. Lenzerini, S.B. Navathe. A Comparative Analysis of Methodologies for Database Schema Integration. ACM Computing Surveys, Vol. 18, No. 4, 1986, pp. 323-364.

Batini, C., V. De Antonellis, A. Di Leva. Database Design Activities within the DATAID Project. Quarterly Bulletin of the IEEE Computer Society Technical Committee on Database Engineering, Vol. 7, No. 4, 1984, pp. 16-21. (1984a)

Batini, C., B. Demo, A. Di Leva. A Methodology for Conceptual Design of Office Databases. Information Systems, Vol. 9, No. 4, 1984. (1984b)

Batini, C., M. Talamo, and R. Tamassia. Computer Aided Layout of Entity Relationship Diagrams. Journal of Software and Systems, 1984. (1984c)

Batini, C., M. Lenzerini. A Methodology for Data Schema Integration in the Entity Relationship Model. IEEE Transactions on Software Engineering, Vol. 10, No. 6, 1984, pp. 650-663.

Batini, C., M. Lenzerini, M. Moscarini. Views Integration. In: Methodology and Tools for Data Base Design by S. Ceri (ed.). Amsterdam: North-Holland, 1983.

Batini, C. and M. Lenzerini. A Conceptual Foundation for View Integration. Proceedings of IFIP Working Conference, Budapest, Hungary, May 1983.

Batini, C., M. Lenzerini, and G. Santucci. Computer-Aided Methodology for Conceptual Database Design. Information Systems, Volume 7, No. 3, 1982, pp. 265-280.

Beeri, C. and P.A. Bernstein. Computational Problems Related to the Design of Third Normal Form Schemas. ACM TODS, Vol. 4, No. 1, 1979, pp. 30-59.

Bernstein, P. Synthesizing Third Normal Form Relations from Functional Dependencies. ACM Transactions on Database Systems, Volume 1, No. 4, December 1976, pp. 277-298.

Bernstein, Philip A., J.R. Swenson, and D.C. Tsichritzis. A Unified Approach to Functional Dependencies and Relations. Proc. ACM 1975 SIGMOD Conf., San Jose, California, pp. 237-245.

Biskup, Joachim and Bernhard Convent. A Formal View Integration Method. Int'l ACM SIGMOD Conf. 1986, pp. 398-407.

Biskup, Joachim and Bernhard Convent. A Formal View Integration Method. Forschungsbericht 208, Universität Dortmund, 1985.

Brodie, Michael. On the Development of Data Models. In On Conceptual Modelling by Michael Brodie, John Mylopoulos, Joachim Schmidt (eds.). New York: Springer, 1984.

Brown, Robert. Logical Database Design Techniques. Mountain View, CA: The Database Design Group, 1982.

Casanova, Marco. Theory of Data Dependencies over Relational Expressions. Proc. ACM SIGACT/SIGMOD Symp. on DB Systems, 1982, pp. 189-198.

Casanova, Marco and Ronald Fagin. Inclusion Dependencies and their Interaction with Functional Dependencies. Proc. ACM SIGACT/SIGMOD Symp. on DB Systems, 1982, pp. 171-176.

Casanova, M. and V. Vidal. A Sound Approach to View Integration. Proceedings of the ACM Conference on Principles of Database Systems, March 1983, pp. 36-47.

Ceri, S. and G. Gottlob. Normalization of Relations and Prolog. Communications of the ACM, Vol. 29, No. 1, 1986, pp. 524-544.

Chen, Peter. The Entity-Relationship Model: Towards a Unified View of Data. ACM TODS, Volume 1, No. 1, 1976, pp. 9-36.

Curtice, Robert M. and Paul E. Jones, Jr. Logical Database Design. New York: Van Nostrand Reinhold Co., 1982.

Date, Chris. An Introduction to Database Systems. Reading: Addison-Wesley, 1981.

DeMarco, Tom. Structured Analysis and Systems Specification. Englewood Cliffs: Prentice-Hall, 1979.

Dyba, E. Principles of Data Element Identification. Auerbach Data Base Management Services, Portfolio No. 23-01-03, 1977.

Ein-Dor, Phillip. Commonsense Business Knowledge Representation: A Research Proposal. Working Paper, Tel-Aviv University, February, 1987.

Elmasri, R., J. Larson and S. Navathe. Schema Integration Algorithms for Federated Databases and Logical Database Design. Technical Report, Honeywell Corporate Research Center, 1987.

Elmasri, Ramez and Sham Navathe. Object Integration in Logical Database Design. IEEE International Conference on Data Engineering, Los Angeles, 1984, pp. 426-433.

Elmasri, Ramez, A. Hevner, and J. Weeldreyer. The Category Concept: An Extension to the Entity-Relationship Model. Data and Knowledge Engineering, Volume 1, No. 1, June 1985, pp. 75-116.

Elmasri, Ramez, James A. Larson, Sham Navathe, and T. Sashidar. Tools for View Integration. Quarterly Bulletin of the IEEE Computer Society Technical Committee on Database Engineering, Vol. 7, No. 4, 1984.

Elmasri, Ramez and G. Wiederhold. Properties of Relationships and their Representations. Proceedings of the National Computer Conference, AFIPS, Volume 49, 1980, pp. 319-326.

Fagin, R. The Decomposition versus the Synthetic Approach to Relational Database Design. Proceedings of the 3rd VLDB, 1977, pp. 441-446.

Goldstein, Robert C. and Veda Storey. Unravelling Is-A Networks in Database Design. Working Paper, University of British Columbia, November, 1988.

Hayes, Patrick. The Naive Physics Manifesto. In Expert Systems in the Micro Electronic Age by Donald Michie (ed.). Edinburgh: Edinburgh University Press, 1979, pp. 242-270.

Hayes-Roth, Frederick, Donald Waterman, Douglas Lenat. Building Expert Systems. Reading: Addison-Wesley, 1983.
Housel, Barron C., Vance E. Waddle, and S. Bing Yao. The Functional Model for Logical Database Design. Proceedings of the 5th VLDB, 1979, pp. 194-203.

Hubbard, G. and N. Raver. Automating Logical File Design. Proceedings 1st VLDB, 1975, pp. 227-253.

Leibniz, Gottfried Wilhelm. Philosophical Letters and Papers, Vol. 1 (English translation). Chicago: The University of Chicago Press, 1956.

Mannino, M. and W. Effelsberg. Matching Techniques in Global Schema Design. Proceedings IEEE COMPDEC, Los Angeles, 1984, pp. 418-425.

Martin, James. Managing the Database Environment. Englewood Cliffs: Prentice Hall, 1983.

McFadden, Fred and Jeffrey Hoffer. Data Base Management. Menlo Park: Benjamin Cummings, 1988.

Minsky, Marvin. A Framework for Representing Knowledge. In The Psychology of Computer Vision by P. Winston (ed.). New York: McGraw-Hill, 1975.

Mylopoulos, J. and H. Levesque. An Overview of Knowledge Representation. In On Conceptual Modelling by Brodie, Mylopoulos and Schmidt (eds.). New York: Springer, 1984, pp. 3-17.

Navathe, Shamkant, Ramez Elmasri, James Larson. Integrating User Views in Database Design. IEEE Computer, 1986, pp. 50-62.

Navathe, S. and S. Gadgil. A Methodology for View Integration in Logical Database Design. Proc. ACM SIGMOD, Austin, 1978.

Navathe, Shamkant, and Mario Schkolnick. View Representation in Logical Database Design. Proceedings Int'l ACM SIGMOD Conference, 1978, pp. 144-156.

New Orleans Database Design Workshop Report (Summary), VLDB, Rio (1979).

Nilsson, Nils. Principles of Artificial Intelligence. Palo Alto: Tioga Press, 1980.

Raver, N. and G.U. Hubbard. Automated Logical Database Design Methodology and Techniques. IBM Systems Journal, Vol. 16, No. 3, 1977.

Robinson, J. A Machine-oriented Logic Based on the Resolution Principle. JACM, Volume 12, No. 1, 1965, pp. 23-41.

Russell, Bertrand. A History of Western Philosophy. London: George Allen & Unwin, 1946.

Schank, Roger and Charles Rieger. Inference and the Computer Understanding of Natural Language. Artificial Intelligence, Volume 5, No. 4, 1974, pp. 373-412.

Sheppard, D. Principles of Data Structure Design. Auerbach Data Base Management Series, Portfolio No. 23-01-04, 1977.

Shipman, D. The Functional Data Model and Data Language DAPLEX. ACM TODS, Vol. 6, No. 1, March 1980, pp. 140-173.

Simon, Herbert and A. Ando. Aggregation of Variables in Dynamic Systems. In Essays on the Structure of Social Science Models by Ando, Fisher, and Simon. Cambridge: MIT Press, 1963.

Storey, Veda. View Creation: An Expert System for Database Design. Washington: ICIT Press, 1988.

Teorey, T.J. and J.P. Fry. Design of Database Structures. Englewood Cliffs: Prentice Hall, 1982.

Ullman, Jeffrey. Principles of Database Systems. Stanford: Computer Science Press, 1980.

Vessey, Iris and Ron Weber. Structure Tools and Conditional Logic: An Empirical Investigation. Communications of the ACM, Vol. 29, No. 1, January 1986, pp. 48-57.

Vetter, M. Database Design by Implied Data Synthesis. VLDB 77, pp. 428-440.

Waterman, Donald A. A Guide to Expert Systems. Reading: Addison-Wesley, 1986.

Weber, Ron. Data Models Research in Accounting: An Evaluation of Wholesale Distribution Software. The Accounting Review, Vol. 61, No. 3, July 1986, pp. 498-518.

Yao, S. Bing, Vance E. Waddle, Barron C. Housel. View Modeling and Integration Using the Functional Data Model. IEEE Transactions on Software Engineering, Volume SE-8, November 1982, pp. 544-553.

Yao, S. Bing, Vance E. Waddle, Barron C. Housel. An Interactive System for Database Design and Integration. In Principles of Database Design, Vol. 1, S. Bing Yao (ed.), Englewood Cliffs, N.J.: Prentice Hall, 1985.

APPENDIX

Appendix 1: Conflict Cases

1. IDENTICAL OBJECTS
   N1 = N2; T1 = T2; M1 = M2; C1 = C2;
   Solution: do nothing.
   1.1. Entity is Entity.
   1.2. Relationship is Relationship.
   1.3. Attribute is Attribute.

2. IDENTICAL OBJECTS WITH DIFFERENT CONTEXT
   N1 = N2; T1 = T2; M1 = M2; C1 <> C2;
   2.1. Relationship is Relationship of different degree or associating different entities.
        Solution: tie not yet associated entities to relationship(s). If entities cannot be found, test for construct mismatch (5.2.1. or 6.2.1.) and missing entity (17.1.).
   2.2. Attribute is Attribute of a different entity or relationship (both are possession attributes).
        Solution: convert both attributes into E-R constructs or entities, similar to 6.2. or 6.3.

3. TRUE SYNONYMS (SAME OBJECT TYPE)
   N1 <> N2; T1 = T2; M1 = M2; C1 = C2;
   Solution: rename at least one object so that N1 = N2.
   3.1. Entity/Entity.
   3.2. Relationship/Relationship.
   3.3. Attribute/Attribute.

4. TRUE SYNONYMS WITH DIFFERENT CONTEXT
   N1 <> N2; T1 = T2; M1 = M2; C1 <> C2;
   Solution: rename and make contexts identical (combine solutions 3. and 2.).
   4.1. Relationship/Relationship.
   4.2. Attribute/Attribute.

5. CONSTRUCT MISMATCH
   N1 = N2; T1 <> T2; M1 = M2; C1 <> C2;
   5.1. Entity is Relationship.
        Solution: convert the relationship into an entity. Create new relationships to associate the new entity with the entities it associated as a relationship.
   5.2. Entity Attribute is Entity-Relationship construct.
        Solution: convert the attribute into an E-R construct (entity and relationship).
        5.2.1. Attribute is Entity.
        5.2.2. Attribute is Relationship.
   5.3. Relationship Attribute is Entity.
        Solution: convert the attribute into an entity.

6. CONSTRUCT MISMATCH AND SYNONYM
   N1 <> N2; T1 <> T2; M1 = M2; C1 <> C2;
   Solution: rename objects to make names identical and deal with construct mismatches as in 5.
   6.1. Entity is Relationship.
   6.2. Entity Attribute is Entity-Relationship construct.
        6.2.1. Attribute is Entity.
        6.2.2. Attribute is Relationship.
   6.3. Relationship Attribute is Entity.

7. DIFFERENT AND UNRELATED OBJECTS
   N1 <> N2; T1 = T2; M1 <> M2; not (related(M1,M2)); C1 = C2 or C1 <> C2;
   7.1. Objects are different, unrelated and have no common role.
        Solution: do nothing.
        7.1.1. Entity/Entity.
        7.1.2. Relationship/Relationship.
        7.1.3. Attribute/Attribute.
   7.2. Object 1 and Object 2 in same role (W-relationship).
        Solution: create a common role object, special role objects, and Isa relationships between the role objects and objects O1 and O2. If objects are not entities, transform them into entities first.
        7.2.1. Entity/Entity.
        7.2.2. Relationship/Relationship.
        7.2.3. Attribute/Attribute.

8. TRUE HOMONYM
   N1 = N2; T1 = T2; M1 <> M2; C1 = C2 or C1 <> C2;
   Solution: rename at least one object, giving it a name that is not assigned to any other object in the view. Thereafter treat common role occurrences similar to 7.
   8.1. Objects are different, unrelated and have no common role.
        8.1.1. Entity/Entity.
        8.1.2. Relationship/Relationship.
        8.1.3. Attribute/Attribute.
   8.2. Object 1 and Object 2 in same role (W-relationship).
        8.2.1. Entity/Entity.
        8.2.2. Relationship/Relationship.
        8.2.3. Attribute/Attribute.

9. DIFFERENT OBJECTS WITH DIFFERENT CONSTRUCTS
   N1 <> N2; T1 <> T2; M1 <> M2; C1 <> C2;
   9.1. Objects are different, unrelated and have no common role.
        Solution: do nothing.
        9.1.1. Entity/Relationship.
        9.1.2. Relationship/Attribute.
        9.1.3. Entity/Attribute.
   9.2. Object 1 and Object 2 in same role (W-relationship).
        Solution: create a common role object, special role objects, and Isa relationships between the role objects and objects O1 and O2. If objects are not entities, transform them into entities first.
        9.2.1. Entity/Relationship.
        9.2.2. Relationship/Attribute.
        9.2.3. Entity/Attribute.

10. DIFFERENT OBJECTS WITH DIFFERENT CONSTRUCTS, BUT HOMONYMS
    N1 = N2; T1 <> T2; M1 <> M2; C1 <> C2;
    Solution: treat objects like true homonyms. Change the name of at least one object to make it different from all other object names in the same view. Treat common role objects as in 9.
    10.1. Objects are different, unrelated and have no common role.
          10.1.1. Entity/Relationship.
          10.1.2. Relationship/Attribute.
          10.1.3. Entity/Attribute.
    10.2. Object 1 and Object 2 in same role (W-relationship).
          10.2.1. Entity/Relationship.
          10.2.2. Relationship/Attribute.
          10.2.3. Entity/Attribute.

11. DIFFERENT BUT RELATED OBJECTS
    N1 <> N2; T1 = T2; M1 <> M2; related(M1,M2); C1 = C2;
    11.1. One object contains the other (Object 1 contains Object 2 or vice versa).
          Solution: create an Isa relationship between the two objects.
          11.1.1. Entity/Entity.
          11.1.2. Relationship/Relationship.
          11.1.3. Attribute/Attribute.
                  Solution: before creating an Isa relationship, convert attributes into entities (for relationship attributes) or into E-R constructs (for entity attributes).
    11.2. Object 1 and Object 2 have a common superset (but do not overlap).
          Solution: create a superset object and Isa relationships from objects O1 and O2 to the superset object.
          11.2.1. Entity/Entity.
          11.2.2. Relationship/Relationship.
          11.2.3. Attribute/Attribute.
                  Solution: precede general solution by transformation into entities or E-R constructs.
    11.3. Object 1 and Object 2 have a common superset and overlap.
          Solution: combine solutions for 11.2. and 11.3.
          11.3.1. Entity/Entity.
          11.3.2. Relationship/Relationship.
          11.3.3. Attribute/Attribute.

12. DIFFERENT BUT RELATED HOMONYMS
    N1 = N2; T1 = T2; M1 <> M2; related(M1,M2); C1 = C2;
    Solution: rename and solve similar to 11.
    12.1. One object contains the other (Object 1 contains Object 2 or vice versa).
          12.1.1. Entity/Entity.
          12.1.2. Relationship/Relationship.
          12.1.3. Attribute/Attribute.
    12.2. Object 1 and Object 2 have a common superset (but do not overlap).
          12.2.1. Entity/Entity.
          12.2.2. Relationship/Relationship.
          12.2.3. Attribute/Attribute.
    12.3. Object 1 and Object 2 have a common superset and overlap.
          12.3.1. Entity/Entity.
          12.3.2. Relationship/Relationship.
          12.3.3. Attribute/Attribute.

13. DIFFERENT BUT RELATED OBJECTS WITH DIFFERENT CONTEXT
    N1 <> N2; T1 = T2; M1 <> M2; related(M1,M2); C1 <> C2;
    13.1. Entity Attribute related to Entity Attribute of a different entity.
          Solution: transform attributes into E-R constructs and solve relatedness as in case 11.
          13.1.1. Attribute 1 contains Attribute 2 (or vice versa).
          13.1.2. Common superset.
          13.1.3. Common subset and superset.
    13.2. Entity Attribute related to Relationship Attribute.
          Solution: transform entity attribute into E-R construct, relationship attribute into entity and solve relatedness as in 11.
          13.2.1. Attribute 1 contains Attribute 2 (or vice versa).
          13.2.2. Common superset.
          13.2.3. Common subset and superset.
    13.3. Relationship Attribute related to Relationship Attribute.
          Solution: transform attributes into entities and solve relatedness as in 11.
          13.3.1. Attribute 1 contains Attribute 2 (or vice versa).
          13.3.2. Common superset.
          13.3.3. Common subset and superset.
    13.4. Relationship related to Relationship.
          Solution: transform relationships into entities and solve relatedness as in 11.
          13.4.1. Relationship 1 contains Relationship 2 (or vice versa).
          13.4.2. Common superset.
          13.4.3. Common subset and superset.

14. DIFFERENT BUT RELATED HOMONYMS WITH DIFFERENT CONTEXT
    N1 = N2; T1 = T2; M1 <> M2; related(M1,M2); C1 <> C2;
    Solution: rename to avoid homonym and solve similar to 13.
    14.1. Entity Attribute related to Entity Attribute of a different entity.
          14.1.1. Attribute 1 contains Attribute 2 (or vice versa).
          14.1.2. Common superset.
          14.1.3. Common subset and superset.
    14.2. Entity Attribute related to Relationship Attribute.
          14.2.1. Attribute 1 contains Attribute 2 (or vice versa).
          14.2.2. Common superset.
          14.2.3. Common subset and superset.
    14.3. Relationship Attribute related to Relationship Attribute.
          14.3.1. Attribute 1 contains Attribute 2 (or vice versa).
          14.3.2. Common superset.
          14.3.3. Common subset and superset.
    14.4. Relationship related to Relationship.
          14.4.1. Relationship 1 contains Relationship 2 (or vice versa).
          14.4.2. Common superset.
          14.4.3. Common subset and superset.

15.
DIFFERENT BUT RELATED OBJECTS OF DIFFERENT TYPE Nl <> N2; T l <> T2; Ml <> M2; related(Ml,M2); CI <> C2; 15.1. En t i t y A t t r i b u t e related to Entity-Relationship construct. Solution: transform e n t i t y a t t r i b u t e into E-R construct and solve relatedness s i m i l a r to 11. 15.1.1. En t i t y A ttribute related to En t i t y . 15.1.1.1. One object contains the other. 15.1.1.2. Common s u p e r s e t . 15.1.1.3. Common subset and superset. 15.1.2. En t i t y A t t r i b u t e related to Relationship. 15.1.2.1. One object contains the other. 15.1.2.2. Common superset. 15.1.2.3. Common subset and superset. 15.2. Relationship A t t r i b u t e related to En t i t y . 15.2.1. One object contains the other. 229 15.2.2. Common superset. 15.2.3. Common subset and superset. 15.3. E n t i t y related to Relationship. 15.3.1. One object contains the other. 15.3.2. . Common superset. 15.3.3. Common subset and superset. 16. DIFFERENT BUT RELATED HOMONYMS OF DIFFERENT TYPE Nl = N2; T l <> T2; Ml <> M2; related(Ml,M2); CI <> C2; S o l u t i o n : rename at l e a s t one o b j e c t to avoid homonym and solve s i m i l a r to 15. 16.1. En t i t y A t t r i b u t e related to Entity-Relationship construct 16.1.1. En t i t y A ttribute r e l a t e d to En t i t y . 16.1.1.1. One object contains the other. 16.1.1.2. Common superset. 16.1.1.3. Common subset and superset. 16.1.2. En t i t y A ttribute related to Relationship. 16.1.2.1. One object contains the other. 16.1.2.2. Common superset. 16.1.2.3. Common subset and superset. 16.2. Relationship A t t r i b u t e related to En t i t y . 16.2.1. One object contains the other. 16.2.2. Common superset. 16.2.3. Common subset and superset. 16.3. E n t i t y related to Relationship. 16.3.1 One object contains the other. 16.3.2. Common superset. 16.3.3. Common subset and superset. 17. MISSING OBJECT Object 2 does not e x i s t . Solution: add missing object. 17.1 17.2 17.3 Ent i t y missing. Relationship missing. 
At t r i b u t e missing. 230 Appendix 2: C o n f l i c t Solutions 1. IDENTICAL OBJECTS Nl = N2; T l = T2; Ml = M2; CI = C2; Solution: do nothing. 1.1. E n t i t y i s E n t i t y . 1.2. Relationship i s Relationship. 1.3. Att r i b u t e i s At t r i b u t e . 2. IDENTICAL OBJECTS WITH DIFFERENT CONTEXT Nl = N2; T l = T2; Ml = M2; CI <> C2; 2.1. R e l a t i o n s h i p i s R e l a t i o n s h i p of d i f f e r e n t degree or associating d i f f e r e n t e n t i t i e s . Solution: S4, possibly SI or S2 or Sll. 2.2. Att r i b u t e i s At t r i b u t e of a d i f f e r e n t e n t i t y o r r e l a t i o n s h i p (both a r e p o s s e s s i o n a t t r i b u t e s ) . Solution: S2 or S3. 3. TRUE SYNONYMS (SAME OBJECT TYPE) Nl <> N2; T l = T2; Ml = M2; CI = C2; Solution: S10. 3.1. E n t i t y / E n t i t y . 3.2. Relationship/Relationship. 3.3. At t r i b u t e / A t t r i b u t e . 4. TRUE SYNONYMS WITH DIFFERENT CONTEXT Nl <> N2; T l = T2; Ml = M2; CI <> C2; 4.1. Relationship/Relationship. Solution: S10 and S4, possibly SI, or S2, or Sll. 4.2. At t r i b u t e / A t t r i b u t e . Solution: 520 and S2 or S3. 5. CONSTRUCT MISMATCH Nl = N2; T l <> T2; Ml = M2; CI <> C2; 5.1. E n t i t y i s Relationship. Solution: SI. 5.2. E n t i t y A t t r i b u t e i s E n t i t y - R e l a t i o n s h i p construct. Solution: S3. 5.2.1. Att r i b u t e i s E n t i t y . 5.2.2. Att r i b u t e i s Relationship. 5.3. Relationship A t t r i b u t e i s E n t i t y . 231 Solution: S2. 6. CONSTRUCT MISMATCH AND SYNONYM Nl <> 2; T l <> T2; Ml = M2; CI <> C2; 6.1. E n t i t y i s Relationship. Solution: S10 and SI. 6.2. E n t i t y A t t r i b u t e i s E n t i t y - R e l a t i o n s h i p construct. Solution: S10 and S3. 6.2.1. Att r i b u t e i s En t i t y . 6.2.2. Att r i b u t e i s Relationship. 6.3. Relationship A t t r i b u t e i s E n t i t y . Solution: 10 and S2. 7. 
DIFFERENT AND UNRELATED OBJECTS Nl <> N2; T l = T2; Ml <> M2; not (related(Ml,M2) ) ; CI = C2 or CI <> C2; 7.1. Objects are d i f f e r e n t , unrelated and have no common r o l e . Solution: do nothing. 7.1.1. E n t i t y / E n t i t y . 7.1.2. Relationship/Relationship. 7.1.3. At t r i b u t e / A t t r i b u t e . 7.2. O b j e c t 1 and Object 2 i n same r o l e (W-r e l a t i o n s h i p ) . 7.2.1. E n t i t y / E n t i t y . Solution: S7. 7.2.2. Relationship/Relationship. Solution: SI and S7. 7.2.3. At t r i b u t e / A t t r i b u t e . Solution: S2 or S3 followed by S7. 8. TRUE HOMONYM Nl = N2; T l = T2; Ml <> M2; CI = C2 or CI <> C2; 8.1. Objects are d i f f e r e n t , unrelated and have no common r o l e . Solution: S10. 8.1.1. E n t i t y / E n t i t y . 8.1.2. Relationship/Relationship. 8.1.3. At t r i b u t e / A t t r i b u t e . 8.2. O b j e c t 1 and Object 2 i n same r o l e (W-r e l a t i o n s h i p ) . 8.2.1. E n t i t y / E n t i t y . Solution: S10 followed by S7. 8.2.2. Relationship/Relationship. 232 Solution: S10 and SI followed by S7. 8.2.3. At t r i b u t e / A t t r i b u t e . Solution: S10 and S2 or S3 followed by S7. 9. DIFFERENT OBJECTS WITH DIFFERENT CONSTRUCTS Nl <> N2; T l <> T2; Ml <> M2; CI <> C2; 9.1. Objects are d i f f e r e n t , unrelated and have no common ro l e . Solution: do nothing. 9.1.1. Entity/Relationship. 9.1.2. Relationship/Attribute. 9.1.3. E n t i t y / A t t r i b u t e . 9.2. O b j e c t 1 and Object 2 i n same r o l e (W-r e l a t i o n s h i p ) . 9.2.1. Entity/Relationship. Solution: SI followed by S7. 9.2.2. Relationship/Attribute. Solution: SI and S2 or S3 followed by S7. 9.2.3. E n t i t y / A t t r i b u t e . Solution: S2 or S3 followed by S7. 10. DIFFERENT OBJECTS WITH DIFFERENT CONSTRUCTS. BUT HOMONYMS Nl = N2; T l <> T2; Ml <> M2; CI <> C2; 10.1. Objects are d i f f e r e n t , unrelated and have no common ro l e . Solution: S10. 10.1.1. Entity/Relationship. 10.1.2. Relationship/Attribute. 10.1.3. 
E n t i t y / A t t r i b u t e . 10.2. O b j e c t 1 and Object 2 i n same r o l e (W-r e l a t i o n s h i p ) . 10.2.1. Entity/Relationship. Solution: S10 and SI followed by S7. 10.2.2. Relationship/Attribute. Solution: S10, SI and S2 or S3, followed by S7. 10.2.3. E n t i t y / A t t r i b u t e . Solution: S10 and S2 or S3, followed by S7. 11. DIFFERENT BUT RELATED OBJECTS Nl <> N2; T l = T2; Ml <> M2; related(Ml,M2); CI = C2; 233 11.1. One o b j e c t c o n t a i n s the other (Object 1 contains Object 2 or v i c e versa). 11.1.1. E n t i t y / E n t i t y . Solution: S6. 11.1.2. Relationship/Relationship. Solution: SI and S6. 11.1.3. At t r i b u t e / A t t r i b u t e . Solution: S2 or S3, followed by S6. 11.2. Object 1 and Object 2 have a common superset (but do not overlap). 11.2.1. E n t i t y / E n t i t y . Solution: S8. 11.2.2. Relationship/Relationship. Solution: SI and S8. 11.2.3. At t r i b u t e / A t t r i b u t e . Solution: S2 or S3, followed by S8. 11.3. Object 1 and Object 2 have a common superset and overlap 11.3.1. E n t i t y / E n t i t y . Solution: S9. 11.3.2. Relationship/Relationship. Solution: SI and S9. 11.3.3. At t r i b u t e / A t t r i b u t e . Solution: S2 or S3, followed by S9. 12. DIFFERENT BUT RELATED HOMONYMS Nl = N2; T l = T2; Ml <> M2; related(Ml,M2); CI = C2; 12.1. One o b j e c t c o n t a i n s the other (Object 1 contains Object 2 or vic e versa). 12.1.1. E n t i t y / E n t i t y . Solution: S10 and S6. 12.1.2. Relationship/Relationship. Solution: S10 and SI and S6. 12.1.3. At t r i b u t e / A t t r i b u t e . Solution: S10 and S2 or S3, followed by S6. 12.2. Object 1 and Object 2 have a common superset (but do not overlap). 12.2.1. E n t i t y / E n t i t y . Solution: S10 and S8. 12.2.2. Relationship/Relationship. Solution: S10 and SI and SB. 12.2.3. At t r i b u t e / A t t r i b u t e . Solution: S10 and S2 or S3, followed by S8. 234 12.3. Object 1 and Object 2 have a common superset and overlap. 12.3.1. 
E n t i t y / E n t i t y . Solution: S10 and S9. 12.3.2. Relationship/Relationship. Solution: S10 and SI and S9. 12.3.3. At t r i b u t e / A t t r i b u t e . Solution: S10 and S2 or S3, followed by S9. 13. DIFFERENT BUT RELATED OBJECTS WITH DIFFERENT CONTEXT Nl <> N2; T l = T2; Ml <> M2; related(Ml,M2); CI <> C2; 13.1. E n t i t y A t t r i b u t e related to E n t i t y Attribute of a d i f f e r e n t e n t i t y . 13.1.1. Att r i b u t e 1 contains A t t r i b u t e 2 (or v i c e versa). Solution: S3 and S6. 13.1.2. Common superset. Solution: S3 and S8. 13.1.3. Common subset and superset. Solution: S3 and S9. 13.2. E n t i t y A t t r i b u t e r e l a t e d to R e l a t i o n s h i p A t t r i b u t e 13.2.1. Attribute 1 contains A t t r i b u t e 2 (or v i c e versa). Solution: S2 and S3 and S6. 13.2.2. Common superset. Solution: S2 and S3 and S8. 13.2.3. Common subset and superset. Solution: S2 and S3 and S9. 13.3. Relationship Attribute related to Relationship A t t r i b u t e 13.3.1. Att r i b u t e 1 contains A t t r i b u t e 2 (or vice versa). Solution: S2 and S6. 13.3.2. Common superset. Solution: S2 and S8. 13.3.3. Common subset and superset. Solution: S2 and S9. 13.4. Relationship related to Relationship 13.4.1. Relationship 1 contains Relationship 2 (or v i c e versa). Solution: SI and S6. 13.4.2. Common superset. Solution: SI and S8. 13.4.3. Common subset and superset. Solution: SI and S9. 235 14. DIFFERENT BUT RELATED HOMONYMS WITH DIFFERENT CONTEXT Nl = N2; T l = T2; Ml <> M2; related(Ml,M2); CI <> C2; 14.1. En t i t y A t t r i b u t e related to E n t i t y Attribute of a d i f f e r e n t e n t i t y . 14.1.1. Attribute 1 contains A t t r i b u t e 2 (or vice versa). Solution: S10 and S3 and S6. 14.1.2. Common superset. Solution: S10 and S3 and S8. 14.1.3. Common subset and superset. Solution: S10 and S3 and S9. 14.2. E n t i t y A t t r i b u t e r e l a t e d t o R e l a t i o n s h i p A t t r i b u t e . 14.2.1. 
Attribute 1 contains A t t r i b u t e 2 (or vice versa). Solution: S10 and S2 and S3 and S6. 14.2.2. Common superset. Solution: S10 and S2 and S3 and S8. 14.2.3. Common subset and superset. Solution: S10 and S2 and S3 and S9. 14.3. Relationship Attribute related to Relationship Attribute. 14.3.1. Att r i b u t e 1 contains A t t r i b u t e 2 (or vice versa). Solution: S10 and S2 and S6. 14.3.2. Common superset. Solution: S10 and S2 and S8. 14.3.3. Common subset and superset. Solution: S10 and S2 and S9. 14.4. Relationship related to Relationship 14.4.1. Relationship 1 contains Relationship 2 (or v i c e versa). Solution: S10 and SI and S6. 14.4.2. Common superset. Solution: S10 and SI and S8. 14.4.3. Common subset and superset. Solution: S10 and SI and S9. 15. DIFFERENT BUT RELATED OBJECTS OF DIFFERENT TYPE Nl <> N2; T l <> T2; Ml <> M2; related(Ml,M2); CI <> C2; 15.1. E n t i t y A t t r i b u t e related to Entity-Relationship construct. 15.1.1. E n t i t y A t t r i b u t e related to En t i t y . 15.1.1.1. One object contains the other. Solution: S3 and S6. 15.1.1.2. Common superset. Solution: S3 and S8. 15.1.1.3. Common subset and superset. Solution: S3 and S9. 236 15.1.2. E n t i t y A t t r i b u t e related to Relationship. 15.1.2.1. One object contains the other. Solution: S3 and S6. 15.1.2.2. Common superset. Solution: S3 and S8. 15.1.2.3. Common subset and superset. Solution: S3 and S9. 15.2. Relationship A t t r i b u t e related to En t i t y . 15.2.1. One object contains the other. Solution: S2 and S6. 15.2.2. Common superset. Solution: S2 and S8. 15.2.3. Common subset and superset. Solution: S2 and S9. 15.3. E n t i t y related to Relationship. 15.3.1. One object contains the other. Solution: SI and S6. 15.3.2. Common superset. Solution: SI and S8. 15.3.3. Common subset and superset. Solution: SI and S9. 16. DIFFERENT BUT RELATED HOMONYMS OF DIFFERENT TYPE Nl = N2; T l <> T2; Ml <> M2; related(Ml,M2); CI <> C2; 16.1. 
En t i t y Attribute related to Entity-Relationship construct 16.1.1. Enti t y A t t r i b u t e r e l a t e d to En t i t y . 16.1.1.1. One object contains the other. Solution: S10 and S3 and S6. 16.1.1.2. Common superset. Solution: S10 and S3 and S8. 16.1.1.3. Common subset and superset. Solution: S10 and S3 and S9. 16.1.2. En t i t y A t t r i b u t e related to Relationship. 16.1.2.1. One object contains the other. Solution: S10 and S3 and S6. 16.1.2.2. Common superset. Solution: S10 and S3 and S8. 16.1.2.3. Common subset and superset. Solution: S10 and S3 and S9. 16.2. Relationship A t t r i b u t e r e l a t e d to En t i t y . 16.2.1. One object contains the other. Solution: S10 and S2 and S6. 16.2.2. Common superset. Solution: S10 and S2 and S8. 16.2.3. Common subset and superset. Solution: S10 and S2 and S9. 16.3. En t i t y related to Relationship. 237 16.3.1. One object contains the other. Solution: S10 and SI and S6. 16.3.2. Common superset. Solution: S10 and SI and S8. 16.3.3. Common subset and superset. Solution: S10 and SI and S9. 17. MISSING OBJECT Object 2 does not e x i s t . Solution: Sll. 17.1. E n t i t y missing. 17.2. Relationship missing. 17.3. Att r i b u t e missing. 238 Appendix 3: View Integration Session with AVIS A view integration session with AVIS i s i l l u s t r a t e d through a set of 22 screen displays. The problem \"c34\" consists of two s mall views which have to be integrated. Figure 28 depicts the structure of the views. View 1: BRANCH Contract View 2: DEALER CONTRACT Figure 28: View Integration Sample Problem The screens shown below exemplify questions asked by the AVIS system as w e l l as AVIS' support f u n c t i o n s . These support f u n c t i o n s f o r i n s t a n c e i n d i c a t e to the d e s i g n e r what the program a l r e a d y knows or what the current contents of each view are. Example screens which display system questions to the user w i l l not d e p i c t user r e p l i e s . 
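
As an aside to the classification in Appendices 1 and 2, the case conditions over name (N), type (T), meaning (M) and context (C) can be read as a decision table. The following Python sketch is an illustration only and is not part of AVIS. The function name classify_conflict and its boolean parameters are hypothetical, and the sketch assumes that cases 8 to 10, whose conditions do not state relatedness explicitly, apply to unrelated objects, as their sub-cases suggest. Combinations that the appendices do not list return None, and case 17 (missing object) is outside the table because no second object exists to compare.

```python
from typing import Optional


def classify_conflict(same_name: bool, same_type: bool, same_meaning: bool,
                      same_context: bool, related: bool = False) -> Optional[int]:
    '''Return the conflict case number (1-16) of Appendices 1 and 2.

    Each flag says whether N1 = N2, T1 = T2, M1 = M2 or C1 = C2 holds;
    related stands for related(M1,M2) and only matters when M1 <> M2.
    '''
    if same_meaning:
        if same_type:
            if same_name:
                return 1 if same_context else 2    # identical objects
            return 3 if same_context else 4        # true synonyms
        if not same_context:
            return 5 if same_name else 6           # construct mismatch
        return None                                # combination not listed
    # From here on the meanings differ (M1 <> M2).
    if same_type:
        if not related:
            return 8 if same_name else 7           # context may or may not match
        if same_context:
            return 12 if same_name else 11         # related, same context
        return 14 if same_name else 13             # related, different context
    if same_context:
        return None                                # combination not listed
    if not related:
        return 10 if same_name else 9              # assumption: unrelated
    return 16 if same_name else 15                 # related, different type


if __name__ == '__main__':
    # Two objects with the same name and type but different, unrelated meanings.
    print(classify_conflict(True, True, False, True))
```

Under these assumptions, a pair with equal names and types but different, unrelated meanings falls into case 8 (true homonym), for which Appendix 2 lists S10, with S7 added when both objects share a role.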
The short summary description of each screen shown below states the user answers and documents the purpose of each screen.

Screen  Purpose
1  AVIS title screen, asks user to choose a problem file. Chosen here: \"c34\".
2  First system question. User answers \"1003\".

The following screens 3 - 8 exemplify support functions which can be activated at any time during the integration session when the system is ready to accept input. Some of the screens may initially have little or no content, e.g., screen 4. They are shown here to demonstrate the system status at the beginning of an integration session and to allow a comparison with later system states. Screens 3 - 8 show the system status before the user's answer \"1003\". The user gave his answer after seeing screen 8.

3  Shows \"Agenda\", consisting of present and future object comparison tasks (preview).
4  Shows \"Old Agenda\", consisting of current and previous object comparison tasks (history log).
5 & 6  Show the contents of views 1 and 2 (at the outset of the integration session).
7  Shows list of \"facts\", knowledge about the set of views based on previous object comparisons. Here the list is still empty.
8  Meaning comparison screen. Shows what the system knows about the match between objects. Here, best match is with \"1003 - dealer\".
9  System question 2. User answers \"n\".
10  System question 3. User answers \"n\", but not until seeing screen 11.
11  \"Old Agenda\" now shows the previous four system questions. Note that the system never asked the user for Synonym (agenda item 2) because it can assess without user help whether objects carry different names (simple string comparison).
12  System question 4. User answers \"0\", but not until seeing screen 14.
13  \"Meaning match\" support function suggests no similarity.
14  Fact list shows the knowledge asserted at this point in time, i.e., objects 3 and 1003 are identical (same).
15  System question 5. User answers \"0\".
16  System question 6. User answers \"1005\". Note that the system reports in the lower right window that in the meantime a new object, 2013 - branch, has been added.
17  System question 7. User answers \"n\". The number 18 in the upper right-hand corner of the screen shows that the system has internally created 18 questions, but has asked the user only 7. The remaining ones were answered by the system.
18  \"THE AGENDA IS EMPTY\". The system has created two identical views, without further questions to the user. Note the internal count of 30 questions (upper right corner).
19  The \"Old Agenda\" shows the last 12 questions, answered by the system without user interaction.
20  First part of the Fact list.
21 & 22  The adjusted views 1 and 2. All newly created objects can be identified by their object identifiers (>2000).

SCREEN 1
  A V I S
  [ASCII-art AVIS banner]
  - Automated View Integration Systems -
  by Christian Wagner, 1988.
  Response: What is the name of the file containing the views? (TEST.PRO) c34

SCREEN 2
  Testing for hypothesis: SIMILAR ENTITY, involving the entity DEALER (3) and one of the following objects:
    1003 -- dealer
    1004 -- customer
    1005 -- contract
  Make Agenda:
  Hypo Test:
  New Objects:
  Response: Please type in the number of the applicable object (or 0 if none).
  Assert Objects:

SCREEN 3
  Testing for hypothesis: SIMILAR ENTITY, involving the entity DEALER (3) and one of the following objects:
  Agenda:
    Current Agenda Item H: 3
    Similar Entity - [3][1003,1004,1005]
  Response: Press (spacebar) to continue

SCREEN 4
  A V I S
  Testing for hypothesis: SIMILAR ENTITY, involving the entity DEALER (3) and one of the following objects:
  Old Agenda:
    -1 -> 3 - Similar Entity - [3][1003,1004,1005]
  Response: Press (spacebar) to continue

SCREEN 5
  Objects:
    VIEW 1 - RELATIONSHIPS
      502 - supply[\"dealer\",\"branch\"]
    VIEW 1 - ENTITIES
      3 - dealer[\"contract\"]
      4 - branch[\"branch_no\"]
  Response: Press (spacebar) to continue

SCREEN 6
  Objects:
    VIEW 2 - RELATIONSHIPS
      1502 - dealer_contract[\"dealer\",\"contract\"]
      1503 - customer_contract[\"customer\",\"contract\"]
    VIEW 2 - ENTITIES
      1003 - dealer[\"dealer_no\"]
      1004 - customer[\"customer_no\"]
      1005 - contract[\"contract_no\"]
  Response: Press (spacebar) to continue

SCREEN 7
  Testing for hypothesis: SIMILAR ENTITY, involving the entity DEALER (3) and one of the following objects:
    1003 -- dealer
    1004 -- customer
    1005 -- contract
  Make Agenda:
  Hypo Test:
  Facts:
  Response: Press (spacebar) to continue

SCREEN 8
  Testing for hypothesis: SIMILAR ENTITY, involving the entity DEALER (3) and one of the following objects:
  Meaning Match:
    [\"sells\",\"supplies\"]
    Match between entity DEALER (3) and objects below:
                      Match of:
    ID    NAME        NAME  MEANING  CONTEXT
    1003  dealer      y     y        unknown
    1004  customer    n     n        unknown
    1005  contract    n     n        unknown
  Response: Press (spacebar) to continue

SCREEN 9
  Testing for hypothesis: SIMILAR RELATIONSHIP, involving relationship SUPPLY (502) and relationship DEALER_CONTRACT (1502)
  Make Agenda: 3 -> agenda(similar_meaning,[3],[1003])
  Hypo Test: TO BE EXECUTED: similar_meaning([502],[1502])
  New Objects:
  Response: Please answer with y or n to indicate whether the hypothesis is true or false.
  Assert Objects: ao(4,3,1003,n) - test_hypo(7)

SCREEN 10
  Testing for hypothesis: RELATED RELATIONSHIP, involving relationship SUPPLY (502) and relationship DEALER_CONTRACT (1502)
  Make Agenda: 1 -> agenda(homonyms,[502],[1502])
  Hypo Test: TO BE EXECUTED: related([502],[1502])
  New Objects:
  Response: Please answer with y or n to indicate whether the hypothesis is true or false.
  Assert Objects: ao(1,502,1502,n) - test_hypo(7)

SCREEN 11
  Testing for hypothesis: RELATED RELATIONSHIP, involving relationship SUPPLY (502) and relationship DEALER_CONTRACT (1502)
  Old Agenda:
    1: -1 -> 3 - Similar Entity - [3][1003,1004,1005]
    2: 3 -> 4 - Synonym - [3][1003]
    3: 0 -> 1 - Similar Relationship - [502][1502]
    4: 1 -> 13
  Response: Press (spacebar) to continue - test_hypo(7)

SCREEN 12
  Testing for hypothesis: ENTITY ATTRIBUTE IS RELATIONSHIP CONSTRUCT, involving the attr DEALER_NO (2001) and one of the following objects:
    4 -- branch
    502 -- supply
  Make Agenda: 13 -> agenda(ea_is_rc,[502],[1502])
  Hypo Test: TO BE EXECUTED: ea_is_rc([2001],[4,502])
  New Objects:
  Response: Please type in the number of the applicable object (or 0 if none).
  Assert Objects: ao(13,502,1502,n) - test_hypo(7)

SCREEN 13
  Testing for hypothesis: ENTITY ATTRIBUTE IS RELATIONSHIP CONSTRUCT, involving the attr DEALER_NO (2001) and one of the following objects:
  Meaning Match:
    Match between attr DEALER_NO (2001) [\"key\"] and objects below:
                    Match of:
    ID   NAME       NAME  MEANING  CONTEXT
    4    branch     n     n        none
    502  supply     n     n        none
  Response: Press (spacebar) to continue - test_hypo(7)

SCREEN 14
  Testing for hypothesis: ENTITY ATTRIBUTE IS RELATIONSHIP CONSTRUCT, involving the attr DEALER_NO (2001) and one of the following objects:
    4 -- branch
    502 -- supply
  Make Agenda: 13 -> agenda(ea_is_rc,[502],[1502])
  Hypo Test: TO BE EXECUTED: ea_is_rc([2001],[4,502])
  Facts:
    similar_meaning(3,1003)
    same(3,1003)
    dissimilar_meaning(502,1502)
    unrelated(502,1502)
  Response: Press (spacebar) to continue - test_hypo(7)

SCREEN 15
  Testing for hypothesis: SIMILAR ENTITY, involving the entity BRANCH (4) and one of the following objects:
    1004 -- customer
    1005 -- contract
  Make Agenda:
  Hypo Test:
  New Objects:
  Response: Please type in the number of the applicable object (or 0 if none).
  Assert Objects:

SCREEN 16
  Testing for hypothesis: ENTITY ATTRIBUTE IS RELATIONSHIP CONSTRUCT, involving the attr CONTRACT (600) and one of the following objects:
    1005 -- contract
    1502 -- dealer_contract
  Make Agenda:
  Hypo Test: TO BE EXECUTED: ea_is_rc([600],[1005,1502])
  New Objects: H-similar_meaning added objects: 2013 -- branch
  Response: Please type in the number of the applicable object (or 0 if none).
  Assert Objects: ao(301,4,0,n) - test_hypo(7)

SCREEN 17
  [counter: 18]
  Testing for hypothesis: ENTITY ATTRIBUTE IS RELATIONSHIP CONSTRUCT, involving attr DEALER_NO (2001) and relationship SUPPLY (502)
  Make Agenda: 13 -> agenda(ea_is_rc,[1502],[502])
  Hypo Test: TO BE EXECUTED: ea_is_rc([2001],[502])
  New Objects: H-missing added objects: 2023 -- customer_contract
  Response: Please answer with y or n to indicate whether the hypothesis is true or false.
  Assert Objects: ao(13,1502,502,n) - test_hypo(7)

SCREEN 18
  A V I S  [counter: 30]
  Make Agenda: 13 -> agenda(ea_is_rc,[2027],[2017]) PRECONDITION FAILED: related([2027],[2017])
  Hypo Test:
  Response: THE AGENDA IS EMPTY
  New Objects: H-missing added objects: 2027 -- supply
  Assert Objects: ao(13,2027,2017,n) - asso(1301)

SCREEN 19
  A V I S  [counter: 30]
  Old Agenda:
    19: 0 -> 19 - Missing Relationship - [1502][]
    20: -5 -> 1 - Similar Relationship - [2023][1503,1502]
    21: -5 -> 1 - Similar Relationship - [502][2027]
    22: -5 -> 1 - Similar Relationship - [2017][2027]
    23: -6 -> 1 - Similar Relationship - [1502][2023]
    24: -6 -> 1 - Similar Relationship - [1503][2023]
    25: -6 -> 1 - Similar Relationship - [2027][2017,502]
    26: -7 -> 13 - Related Relationship - [2017][2027]
    27: -7 -> 13 - Related Relationship - [2023][1502]
    28: -8 -> 13 - Related Relationship - [1502][2023]
    29: -8 -> 13 - Related Relationship - [1503][2017]
    30: -8 -> 13 - Related Relationship - [2027][2017]
  Response: Press (spacebar) to continue - asso(1301)

SCREEN 20
  A V I S  [counter: 30]
  Make Agenda: 13 -> agenda(ea_is_rc,[2027],[2017]) PRECONDITION FAILED: related([2027],[2017])
  Hypo Test:
  Facts:
    similar_meaning(3,1003)
    same(3,1003)
    dissimilar_meaning(502,1502)
    unrelated(502,1502)
    ea_is_not_rc(2001,4)
    ea_is_not_rc(2001,502)
    dissimilar_meaning(4,1004)
    dissimilar_meaning(4,1005)
    similar_meaning(4,2013)
    same(4,2013)
    dup(4,2013)
    ... more >
  Response: THE AGENDA IS EMPTY - asso(1301)

SCREEN 21
  Objects:
    VIEW 1 - RELATIONSHIPS
      502 - supply[\"dealer\",\"branch\"]
      2017 - dealer_contract[\"dealer\",\"contract\"]
      2023 - customer_contract[\"customer\",\"contract\"]
    VIEW 1 - ENTITIES
      3 - dealer[\"contract\"]
      4 - branch[\"branch_no\"]
      2015 - contract[\"contract_no\"]
      2021 - customer[\"customer_no\"]
  Response: Press (spacebar) to continue - asso(1301)

SCREEN 22
  Objects:
    VIEW 2 - RELATIONSHIPS
      1502 - dealer_contract[\"dealer\",\"contract\"]
      1503 - customer_contract[\"customer\",\"contract\"]
      2027 - supply[\"dealer\",\"branch\"]
    VIEW 2 - ENTITIES
      1003 - dealer[\"dealer_no\"]
      1004 - customer[\"customer_no\"]
      1005 - contract[\"contract_no\"]
      2013 - branch[\"branch_no\"]
  Response: Press (spacebar) to continue - asso(1301) "@en . "Thesis/Dissertation"@en . "10.14288/1.0098369"@en . "eng"@en . "Business Administration"@en . "Vancouver : University of British Columbia Library"@en . "University of British Columbia"@en . "For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use."@en . "Graduate"@en . "View integration in database design"@en . "Text"@en . "http://hdl.handle.net/2429/29312"@en .