Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Performance considerations in relational and hierarchical data base management systems 1980

You don't seem to have a PDF reader installed, try download the pdf

Item Metadata


UBC_1980_A6_7 T64.pdf [ 4.67MB ]
JSON: 1.0051806.json
JSON-LD: 1.0051806+ld.json
RDF/XML (Pretty): 1.0051806.xml
RDF/JSON: 1.0051806+rdf.json
Turtle: 1.0051806+rdf-turtle.txt
N-Triples: 1.0051806+rdf-ntriples.txt

Full Text

PERFORMANCE CONSIDERATIONS IN RELATIONAL AND HIERARCHICAL DATA BASE MANAGEMENT SYSTEMS by Mary Kathleen Tod BSc, Queens University, 1971 A thesis submitted i n p a r t i a l fulfillment of the requirements for the degree of MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES (Department of.Computer Science) We accept this thesis as conforming to the required standard THE UNIVERSITY OF BRITISH COLUMBIA January, 1980 © Mary Kathleen Tod, 1980 In presenting th i s thes is in pa r t i a l fu l f i lment of the requirements for an advanced degree at the Univers i ty of B r i t i s h Columbia, I agree that I further agree that permission for extensive copying of th is thesis for scho lar ly purposes may be granted by the Head of my Department or by his representat ives. It is understood that copying or pub l i ca t ion of th is thesis for f inanc ia l gain sha l l not be allowed without my writ ten permission. the L ibrary sha l l make i t f ree ly ava i l ab le for reference and study. Depa rtment The Univers i ty of B r i t i s h Columbia 2 0 7 5 W e s b r o o k P l a c e V a n c o u v e r , C a n a d a V 6 T 1W5 i i ABSTRACT This paper w i l l examine two data base management systems; IMS, an example of a hierarchical data base management system and System R, a relational system. Each system w i l l be described in general terms followed by a discussion of their relative performance. A claim i s made that IMS, because of i t s structure and the procedural nature of i t s language can be more efficient than System R. Some examples are given to il l u s t r a t e this claim. There i s an increasing need today for readily available information;; thus more and more information i s being stored i n disk f i l e s to f a c i l i t a t e rapid retrieval. A relational approach to data base management permits a high-level non-procedural interface for the user, which i s important i f more people require access to information i n a flexible fashion and can not afford or wait for traditional application development. If however a relational data base management system i s not as efficient as hierarchical or network systems then new retrieval methods must be found. Associative processors are introduced as a possible solution . to this problem and three examples are discussed. RAP, a relational associative processor developed at the University of Toronto, has been used i n a performance study to demonstrate the dramatic performance improvements offered by associative processors . This is also discussed. i i i C O N T E N T S • V"i"5v? 1.0 Introduction p. 1 2.0 Evolution of Data Base Management Systems p.3 3.0 IMS -- A Hierarchical Data Base Management System p. 9 3.1 Data Base Access Commands p. 12 3.2 Logical Data Bases p. 16 3.3 IMS Storage Structures . p. 21 4.0 Structure has i t s Advantages p. 29 4.1 The Data Base Administrator affects Performance p. 30 4.2 Designer and Programmer affect Performance p. 39 4.3 Conclusions p. 49 5.0 Relational Data Base Management Systems p. 50 5.1 The Relational Model p. 53 5.2 SQL p. 59 5.3 Relational Storage System p. 65 5.4 The Optimizer p. 69 Cont'd. i v 6.0 Performance Considerations i n System R p. 72 6.1 Other Performance Considerations p. 80 6.2 Another Study p. 84 6.3 Summary p. 86 7.0 Associative Processors p. 87 7.1 Associative Hardware p. 89 7.2 CASSM p. 93 7.3 RARES p. 96 7.4 RAP — A Relational Associative Processor p. 98 7 . 5 RAP Performance p. 106 8.0 Conclusions p. 108 1 1.0 INTRODUCTION Today information processing is v i t a l to any large or small organization. Information processing includes the a b i l i t y to extract information, to maintain and update information, to analyse information and to rely on the accuracy and currency of this stored information. Information i s required to perform the daily business transactions, to plan for the future and to control the development towards the future. Information processing is essential to support management and decision making roles. As a result there are trends i n business towards large corporate data bases which are accessible to non-dp personnel using high-level query languages. Thus there i s a movement away from designing specialized applications to access data i n the most efficient way possible. Online banking (including automated funds transfer), automated manufacturing, inventory control, personnel and payroll applications a l l involve access to data bases. With this trend to integrated data base solutions to business systems, the size of data bases and hence the amount of data stored on machine readable media is increasing. At the same time computer hardware costs are decreasing and this w i l l lead eventually to an age where home terminals are a r e a l i t y . Such terminals w i l l provide numerous functions relying on large quantities of shared data. 2 With the growing number of areas accessing large data bases there i s a demand for efficient data retrieval and flexible high-level tools for data access. The needs for data have changed dramatically since the early days of computing. The next section w i l l discuss how data base management systems (DBMS) have changed over the years to support such new requirements. 3 2.0 EVOLUTION OF DATA BASE MANAGEMENT SYSTEMS The f i r s t systems that processed information on a large scale used punched cards, the ultimate i n sequential f i l e s . This medium was slow and unsuitable for large volumes of data. In addition, the fixed size of the card was often inappropriate although techniques were developed to overcome this restriction. In the early 1950's magnetic tape was introduced as a faster, more flexible storage medium. Problems of fixed length were solved since variable length records could be maintained on tape; however, f i l e s remained sequential. As more applications were written i t was recognized that certain operations recurred, for example: sort, delete a record, add a record, generate a report. It seemed logical and very useful to generalize these f i l e handling functions Also data sharing and integrated f i l e s were considered i n an attempt to reduce the overall programming burden. The new f i e l d of data management had begun. Having provided generalized functions for f i l e maintenance, sorting and report generation, the computing community began exploring the concept of data sharing. It was recognized that there was l i t t l e advantage to a business i f one department hoarded i t s own data forcing a second department to acquire and maintain i t s own set of overlapping data. This led to a need to define data i n a f i l e i n 4 some standard way so that two (or more) parties could access their relevant pieces of information. Hence data de"finitionslanguages were horn, examples of which are, COMPOOL, JOVIAL and COBOL. With these new generalized methods of defining data, report generation toolssbecame more sophisticated i n accessing the data. Some examples of these report generation tools are RPG, MARK I and MARK II, GIRLS and FFS. If we look at the parallel developments in storage technology we see that storage media had undergone a change from punched card technology to magnetic tape. Thus the access speed was significantly increased hut the access method remained sequential. Files were typically read from beginning to end, record by record, in order to extract the necessary information. Because pieces of information are inter-related i t became necessary to add f a c i l i t i e s for data structuring to the l i s t of data management f a c i l i t i e s . The f i r s t structuring technique that was used in DBMS was hierarchic. "One of the main advantages of these structures i s their inherent computer efficiency with regard to storage space and peripheral storage accesses. Another advantage i s a correspondence of hierarchic records to the report structures r 2 7 often required by commercial enterprises." -* A hierarchy i s a very natural structuring and, to a generation brought up thinking and using sequential access methods, the a b i l i t y to flatten a hierarchy into a sequential storage medium proved i t s 5 worth. For example, the f i r s t IMS system demanded requests for data i n a top to bottom, l e f t to right fashion corresponding with the way the hierarchy was flattened for storage on tape. Greater f l e x i b i l i t y i n data access was brought about with the widespread a v a i l a b i l i t y and increased r e l i a b i l i t y of direct access storage devices. Random access became truly feasible. This led to developments of network structures i n data base management systems; IDS developed by General Electric, TOTAL and ADABAS are examples. The Codasyl DBTG played a major role in the thrust toward network systems. Hierarchic systems also benefitted from direct access methods. A perfectly f l a t structure was no longer necessary when implementing a hierarchy. I Originally stored and accessed i n the order 1, .2, 3, h, 5? 6J this hierarchy could be accessed randomly i f direct access pointers are used. 6 It is interesting to note that I M S developed from a hierarchy implemented in a "flat" fashion to one with pointers and eventually became a network data base management system, thus taking advantage of the changing storage and access methods. Another development in data base management systems stemmed from the need to define multiple views of one set of data. Schemas or logical data bases allow one program to access one subset of data from the entire data base. In this example: T Y p £ I I T Y p t 2. TYPE 3 T V PG 4- a logical data base could be created consisting only of segments of type 1, 3, k. In this case the program need not know about type 2 records and in fact is not permitted to access type 2 records. Hence security measures are introduced. There are s t i l l problems to be solved. Improved data independence, high-level access languages, completeness of the data base model and improved data security and integrity are some areas of concern. 7 With data independence i t i s possible to change the physical storage structure of the data or the data access strategy without affecting any application programs. Current data base management systems provide some independence but as demonstrated i n later sections i t is quite limited. How easy to use and flexible are the current data base /~3 7 accessing languages? The CODASYL report describes host- language and self-contained systems. A host-language system permits data base access from a host-language such as PL/l and subroutine calls to perform a well defined set of operations against the data base (eg. retrieve, delete, insert). A self- contained system'provides an interface suitable for a non-programmer. This interface usually consists of a highly specialized language that i s as english-like as possible. Because of the increased demand for flexible data access to support business functions, better data base accessing languages must be developed. Does the data base model have the a b i l i t y to represent real-world relationships or i s the model too confining? The hierarchic and network data models impose constraints on the a b i l i t y to represent relations and thus on the a b i l i t y to model the real- world. As data sharing increases so does the importance of security and data integrity, and the size of any one data base. Hence efficient access becomes even more important. 8 Data independence, high-level languages, the a b i l i t y to represent relations, security and integrity and fast access are important concerns. Relational data base management systems may solve some of them. In the following sections a hierarchical DBMS and a relational DBMS w i l l be compared. A discussion of associative processors i n Section 7 w i l l show that they permit faster access than conventional random access disks. 3.0 CURRENT DATA BASE MANAGEMENT SYSTEMS — AN EXAMPLE In order to ill u s t r a t e the current state of commercially available DBMS IMS, an IBM product/,will be discussed. The concepts introduced can, I fee l , be generalized to other systems such as TOTAL, ADABAS and SYSTEM 2000. IMS i s a hierarchical data base management system, that i s , i t uses a hierarchical structure to represent whatever world i t models (eg. business organization, personnel information, inventory systems). If we look back to the days of tape f i l e s or card f i l e s we can see the beginnings of hierarchical data management. Suppose we have an employee with certain dependants, job experience and educational s k i l l s . In early days this information could have been represented as a series of records, each with a record type stored i n a predefined order: o SmlWv , 3" 0 0 o 1 Z 2 bav-id, "7, M 3 Job «, otonQ c c 0 4- O 0 1 0 10 In IMS this information would, be represented with the following diagram: £DUfiATiOM In this example, the "world" consists of any number of employees and for each employee, three relationships: 1. dependants of the employee. 2. job experience of the employee. 3. educational s k i l l s of the employee. Notice that these relationships are not ex p l i c i t l y stored i n the data base. Rather, they are implied by the structure of the data base. Also, i f the employee has no dependants this information is represented by having no dependant records i n the data base. The information e x p l i c i t l y stored i s the various data fields within each record — or segment to use IMS terminology. In our example the boxes labelled DEPENDANT, EXPERIENCE, EDUCATION and EMPLOYEE are 11 segments. The EMPLOYEE segment may contain such information as SEX, SALARY, BIRTHDATE, ADDRESS, etc. Similarily the segment EDUCATION may contain such information as GRADE, EDUCATIONAL INSTITUTION, COURSE DATES, COURSE COST and so on. In IMS the basic unit of retrieval, update, insertion and deletion i s the segment. IMS i s a sophisticated data management tool and i t i s definitely outside the scope of this discussion to f u l l y describe a l l of IMS's many features. However to prepare the way for further discussion i t i s necessary to describe the basic data manipulation commands and logical data bases, and then to discuss IMS storage structures i n some detail. Finally, several examples w i l l be given to demonstrate how the typical programmer uses IMS effi c i e n t l y . 12 3.1 DATA BASE ACCESS COMMANDS IMS provides f a c i l i t i e s to retrieve, update, insert and delete segments from a data base. Included are: IMS COMMAND FUNCTION GU (GET UNIQUE) GN (GET NEXT) ISRT (INSERT ) REPL (BEPLACE) DLET (DELETE) direct segment retrieval get next segment in sequence add a new segment update existing segment delete a specific segment These commands are best explained with some examples. In order to maintain some consistency within this paper I w i l l introduce an application with which I have been involved. This particular system is an insurance application containing information on policies, subscribers (those who have policies with the company) and claims. The exact details of the data bases w i l l not be given here but where relevant some of the structures w i l l be shown. 13 Le ve.L PoutS CLAIM Race FBC£lPT This structure represents an individual (SUBSCRIBER segment) who i s enrolled with the insurance company for a certain type of coverage (POLICY and RULE segments). This individual has a wife and children (DEPENDANT segment) and over the years has submitted certain claims to the company (CLAIM and RECEIPT segments). The level of this hierarchy i s three. Each data base must have a root segment and the root segment must have a key f i e l d . In this data base the SUBSCRIBER segment is the root segment and i t s key i s a social insurance number (SIN(). Non-root segments do not necessarily have keys. 1 4 GET UNIQUE In IMS a GU provides random access to a segment with a given key f i e l d value. When issued a GU ignores any current positioning information maintained by IMS. Instead, using either an index or a hashing algorithm IMS finds the specified segment directly. A GU i s normally used to access a new root with no relationship to the root previously accessed. Given a specific SIN i n the SUBSCRIBER data base IMS w i l l position i t s e l f for processing that root and a l l of i t s child segments. Child segments are DEPENDENT, POLICY, RULE, CLAIM and RECEIPT segments pertinent to that root. GET NEXT This command directs IMS to move forward i n the data base from it's current position. To understand i t s use one must under- stand the implied ordering of an IMS data base: top to bottom and l e f t to right in the hierarchy. For a given SUBSCRIBER, as established by a GET UNIQUE, successive GN commands would retrieve DEPENDANT segments, then the POLICY segment (assume only one policy per individual), a l l RULE segments, then the f i r s t CLAIM segment, a l l RECEIPT segments for that CLAIM, the second CLAIM segment and i t s RECEIPT segments, etc. GET NEXT commands may be qualified. For example i f an inquiry i s issued for a l l claims for a given subscriber, the following could be done: 1 5 1 . GU for the given SUBSCRIBER, i t s key f i e l d (SIN) must be supplied. 2. GN qualified to indicate that only CLAIM segments are to be retrieved. Note that step 2 would be repeated u n t i l IMS indicated no more claims for that subscriber exist. INSERT Whenever a new segment must be added to the data base the ISRT command i s used. If a dependent segment i s added i t s parent must f i r s t exist. IMS provides certain rules for insertion. For example, a last-in-first-out rule can be applied or segments may be inserted so that their key f i e l d i s i n a specific order. REPLACE REPL is used when updating fields i n a segment. The segment must have been previously accessed by a special type of retrieval command, GHU or GHN. These commands operate exactly as GU and GN but perform the extra function of marking the segment for subsequent replacement (or deletion). When a segment is marked i n this fashion i t i s unavailable for other access. DELETE DLET i s used to delete segments from the data base. When a segment is deleted a l l of i t s children are also deleted. Note that the GHU and GHN are also used with the delete function. 16 3.2 LOGICAL DATA BASES During the early design phases of the insurance application the data base design underwent a significant change. I n i t i a l l y there were thirteen unrelated physical data bases. A l l relational information was stored as data fields i n the physical data base or as logic within some particular program. For example: RULE Here we have two separate data bases, the SUBSCRIBER and the POLICY data bases. In the . ENROLMENTcsegment there i s a f i e l d called POLICY-NUMBER. If we attempt to find details on a l l policies i n which a subscriber i s enrolled we would construct the following program: 17 1. Issue a GU to the SUBSCRIBER segment giving the social insurance number of the particular subscriber. 2. Do u n t i l no more ENROLMENT segments. 3. Get the next ENROLMENT segment. h. Using the POLICY-NUMBER stored i n that segment issue a GU to the POLICY segment i n the POLICY data base. There are some problems. One is that duplicate information exists; the POLICY-NUMBER must be maintained i n two separate places. A second problem i s that the programmer must be aware of how the data i s physically separated i n order to access the correct information. The programmer must know: 1. that the policy-number i s the key in the policy data base. 2. that the different IMS calls are issued to two different physical data bases. This i s required i n the IMS command. 3. that an implied relationship exists between the SUBSCRIBER data base and the POLICY data base through the ENROLMENT segment. Now suppose information i s required about a specific policy and the various people who subscribe to that policy. The data bases 18 are not organized to suit this query, thus presenting a third problem. In fact to satisfy such a query the entire SUBSCRIBER data base would have to be scanned.' Because there were varied and conflicting requirements for access to information i n the insurance application, the data bases were redesigned using logical data bases instead of physical data bases. In IMS logical data bases are constructed through the use of pointers. Consider the previous example with SUBSCRIBERS and POLICIES. saesc/zieee. HoU>£& 19 The "arrowed" segments contain pointers from one physical data base to another. Contrast this with s t r i c t l y physical data bases i n which data (the POLICY-NUMBER)is used to represent a relationship. Here the SUBSCRIBER data base i s log i c a l l y connected via the ENROLMENT segment to the POLICY data base and the POLICY data base is l o g i c a l l y connected via the POLICY HOLDER segment to the SUBSCRIBER data base. Through the use of logical connections a programmer can now equally readily find a l l policies for a given subscriber or a l l subscribers for a given policy. This is possible because logi c a l data bases i n IMS provide more tailored views of the data. One programmer can work with view A and another with view B, without knowing that physical separation of data exists. POi. pauci /?Ui-£ 20 Someone, such as a data base administrator (DBA), w i l l describe the physical data bases to IMS and include in that description information about the different pointer segments (what physical data base they reside i n and what physical data base segment they point to for example). A DBA w i l l also describe the logical data bases as represented i n view A and view B so that IMS can handle the physical data separation and appropriate pointer updating. When a new subscriber is added with a given policy IMS manages the pointers and can automatically create a POLICY HOLDER segment once an ENROLMENT segment i s created. Thus improved data integrity i s provided. In implementing logical data bases one must be aware that there i s overhead involved on the part of IMS in maintaining pointers. Each proposed pointer segment must be examined with particular emphasis on how often i t is used and the mode of use, batch or online. Information that is required frequently for online use is a prime candidate for a pointer segment. 21 3.3 IMS STORAGE STRUCTURES When using a data base management system such as IMS the programmer need not be concerned with the physical storage structure and the access to such storage. Instead the interface i s through the data management language that has been provided. This language may consist of a host language such as PL/l and some predefined subroutine calls or i t may be a specialized self-contained language designed only to support data access (often these languages are as english-like as possible). The data management language must then interface with a specific access method which i n turn interfaces with the physical records. For IMS this may be described with the following diagram: Livterfiicei User / P/ryramivier- //WS / 2>Ll to Stored AZsi^oyivuittoti Hi&rarchic&l Acdess /nertoas /SAM , CSAAt l/SAM 22 The f i r s t interface was generally considered i n the explanation of the various IMS calls and the use of logical data bases. Through interfaces 2 and 3, the program and the data base language function without any knowledge of device dependent details and with a hierarchical picture of the stored segments. A program "does not know (a) anything about physical records (blocks); (b) how stored fields are associated to form stored records (c) how sequencing is performed (eg. i t may be by means of physical continguity, an index, or a pointer chain); or (d) how direct access fk 7 . . . i s performed". u -J Knowledge of such details i s provided at interface k. There are four hierarchical access methods: HSAM, HISAM, HDAM and HIDAM. HSAM and HISAM are strongly sequential i n nature, in fact HSAM can exist on tape. For direct access, most existing IMS data base implementations use either HDAM, hierarchical direct access method or HIDAM, hierarchical indexed direct access method. HDAM and HIDAM use hierarchical and/or child-twin pointers to link segments together. These pointers provide a mechanism for linking a l l segments within a root occurrence and for linking root occurrences. The subscriber data base w i l l be used to indicate how these two types of pointers di f f e r , and then HDAM and HIDAM are described in more detail. C/ii/d - V-t*)Ln. pointers i/t SV8S£&/8£/£. data.. 24 Hierarchical pointers "begin at the root and follow the segments from top to bottom and le f t to right through the hierarchy. Only one pointer exists i n the root segment pointing to the f i r s t occurrence of the left-most child segment. Each child segment points either to i t s f i r s t left-most child or to the next occurrence of that segment type. When there are no more segments of that type the pointer crosses to the next segment type i n sequence; for example, the last DEPENDANT segment points to the f i r s t POLICY segment. The second diagram illustrates child-twin pointers. In this example there are three pointers i n the root (SUBSCRIBER) segment, one for each type of child segment. If the SUBSCRIBER segment had more children there would be more pointers, always one pointer for each child type. A child segment then points to i t s next segment occurrence. If the child segment i t s e l f had children there would be more pointers, again, one for each child type. The hierarchical pointer scheme requires less storage while the child-twin pointer scheme provides faster access to the different points i n the hierarchy. 25 HDAM "HDAM provides direct access (by sequence f i e l d value) to the root segments, via a hashing and chaining technique, together with pointer access to the subordinate segments." In the customer account data base the key of the root segment i s the SIN, this key is supplied with a value, given to the hashing algorithm and an address is generated. If this i s the f i r s t key value to hash to the calculated address then no subsequent chaining is required. However i f the SIN collides with an existing SIN at the same address, the SIN i s chained to the f i r s t SIN via a pointer. A l l child segments of the root (and their child segments, etc.) are held together through the use of hierarchical or twin-child pointers.  2 7 So, as long as there are a small number of root segments that hash to the same location the access to a given root segment i s fast. Dependent segments are placed as physically close as possible to the root segment i n order to minimize disk seek time when looking at a root and a l l l i s • children. 28 HIDAM Instead of direct access to the roots, HIDAM provides indexed access via the sequence f i e l d of the root to the root segments and then pointer access to any dependent segments. "A HIDAM data base actually consists of two data bases: a "data" data base, which contains the. actual data, and an associated INDEX data base, which provides the (dense) index. There i s an entry i n the INDEX data base for each root segment i n the "data" data base." In the insurance example an index would be established for the key f i e l d of the SUBSCRIBER segment, social insurance number. When a social insurance number is provided i n an IMS c a l l , IMS uses that value to search the index data base. From the index data base IMS retrieves physical location information so that the actual data can be found. data, Aase 2 9 k.O STRUCTURE HAS ITS ADVANTAGES A data "base management system such as IMS can be c r i t i c i z e d for i t s lack of data independence or for the a r t i f i c i a l and confining nature of the hierarchical structures i t imposes on information. Another disadvantage is the host-language interface through which a programmer must work in order to communicate with data. Such a host-language requires a programmer to know too much about the system; how data i s stored, how i t is accessed, exact record layouts are examples. These criticisms are warranted i f we could ignore the incredible growth i n data bases and slow disk access times. As i t i s , the structure inherent i n IMS (and other hierarchic or network DBMS) is the mechanism that permits relatively fast access to such data bases. There are two major factors to be considered i n designing data base applications: one i s the amount of storage space required (eg. the size of the data base) and the second is the speed with which information can be assembled from the data base. The next section deals with methods or organizing an IMS data base and processing data from the data base that w i l l improve response. In general these methods are related to the data base structure (the hierarchy). We w i l l f i r s t examine what a data base administrator can do and go on to consider what each individual programmer can do. A l l of these techniques have been employed in the insurance application mentioned earlier. 30 4.1 THE DATA BASE ADMINISTRATOR AFFECTS PERFORMANCE Segment P o s i t i o n We begin with a general purpose statement about IMS that has bearing on segment p o s i t i o n i n g as w e l l as on several other areas i n our discussion. "IMS/VS manages the space within the records, and a l l o c a t e s and reclaims space as segments are inserted and deleted. In a l l o c a t i n g space, an attempt i s made to keep adjoining segments i n the h i e r a r c h i c sequence as close together i n storage as po s s i b l e . " *- - / ('Emphasis added). /Sao 7 When IMS po s i t i o n s segments f o r t h i s data base i n p h y s i c a l storage SEG3 i s " c l o s e s t " to the ROOT segment and SEG3 i s "f u r t h e s t " from the ROOT segment. When segments are close together i t i s l i k e l y that one p h y s i c a l access to the dis k w i l l b ring these segments i n t o core together. Thus a r e t r i e v a l of ROOT followed by one or more r e t r i e v a l s 3 1 of SEG.T segments w i l l involve only one access. I f however, access to t h i s data base record t y p i c a l l y involves ROOT and SEG3 then the hierarchy should be: /Zoo — As a general r u l e segments that are accessed frequently should be close to the t o p - l e f t i n the hierarchy. Segments that are accessed infre q u e n t l y and/or i n batch-only processing (where the time, constraints are not so severe) should be at the bottom-right. 3 2 HDAM vs HIDAM In the insurance application, because most functions were done online, direct access to root segments was required hut the choice had to he made between HDAM and HXHAM data bases. "HIDAM is useful when both sequential and direct access by root key are r-Q -7 required." -' HDAM on the other hand i s better for direct access than HIDAM because the index data base i s removed. HDAM was chosen for the insurance data bases to f a c i l i t a t e online processing since the majority of online processes required direct access. It was decided that although some batch programs required sequential processing of data, a sort step would provide the necessary ordering at a small additional cost. The human factor involved i n online systems demanded fast response. 3 3 ROOT A D D R E S S A B L E A R E A , S Y N O N Y M C H A I N S An HDAM data base consists of one primary area called the root addressable area (RAA) and an overflow area. A 1 l 1 1 S£& & , — 5£C C 2 Area, /OL a. OL C/ , cz 82 C3 „ Cr 34 Root segments are stored in the root addressable area their address being determined by a hashing algorithm chosen by the data base administrator (DBA). In addition a limited number of dependent segments are stored i n the RAA, this limit i s set by the DBA. The length of the RAA is also controlled by the DBA. Row can the DBA choose the most appropriate ( l ) size for the root addressable area (2) hashing algorithm and (3) number of dependent segments stored i n the RAA? In the insurance application statistics were gathered to predict the average, minimum and maximum number of child segments per parent segment for a l l data base segments. Armed with these figures the data base administrator could choose suitable parameters for each physical data.base. In fact these stati s t i c s led to s p l i t t i n g one data base into three data bases for better performance. Given the typical number of child segments per parent the number of dependent segments stored i n the RAA was determined. It is important to store as many segments as possible, preferably a l l , i n the RAA so as to minimize accesses. This i s valid given the statement that IMS keeps segments i n a data base record together i f at a l l possible. Hence one access to the root segment w i l l get a l l child segments most of the time. "At least root segments should be stored i n the root addressable area. In addition, active dependent segments should be placed i n the root addressable area 35 since this w i l l provide fast access to them because of their physical f~a 7 proximity to the root segment." u -7 The to t a l size of the root addressable area can be determined knowing the average size of a data base record and the number of roots. But because a data base i s not static an additional factor i s the amount of free space necessary so that insertions can be done within the RAA instead of the overflow area. MS = N x E NB x TB~ S MS + MS x F MS ( 1 + F) MS x F 1 where N = number of bytes i n the average data base record (including user and IMS maintained data). E = expected number of data base records. NB = number of bytes per block, multiplied by .8 to keep chaining at a reasonable level. MS = minimum size of RAA i n blocks. FjF 1 = a growth factor, F 1 > 1 S = optimum size of RAA. 1 Note that F , the growth factor must be a function of data base + . - + ZT107 activity. *- 3 6 In choosing a hashing algorithm i t is important to choose one that minimizes the number of synonyms produced and distributes the data f a i r l y evenly over the root addressable area. To do this an analysis of root segment keys must be done to determine their characteristics. Provided with IMS are several hashing modules and a u t i l i t y which takes as input the root segment keys and produces a map for each algorithm showing key distribution. 3 7 SECONDARY INDEXES If a segment i s frequently accessed through a non-key f i e l d (this i s particularly applicable to root segments), regardless of the root key f i e l d , then i t i s a candidate for a secondary index data base. In this situation an index data base is created, as i n HIDAM data bases, containing one record for each key of the secondary index data base. S ec / s e c z 7 38 A secondary index set up for SEG7 w i l l allow access directly to this segment without f i r s t qualifying the ROOT segment and SEG6. 3 9 4.2 DESIGNER AND PROGRAMMER AFFECT PERFORMANCE Having dealt with how a data base administrator can optimize an IMS application, we turn to the role of designers and programmers. BATCH vs ONLINE Programs within the insurance application were s p l i t into batch and online transactions. Batch transactions (report writers and statistics gatherers, for example) usually examine a l l roots of a particular data base and report on these in order. Hence using HDAM data bases a sort step is required. This requires additional time and thus such transactions are not scheduled online where response time is so important. Rather they are scheduled as overnight jobs and do not interfere with the online system. Some insurance programs originally designed to be online and yet work sequentially through one data base were modified so that only one root of the data base i s used. For example, an inquiry transaction designed originally to give information on a l l insurance claims for a particular subscriber now gives information on either the most recent insurance claim or a unique insurance claim, as specified by the user, for a particular subscriber. In another situation a program was s p l i t into an online portion and a batch portion to minimize data base a c t i v i t y during an online session. 4 0 Further, batch programs are not run concurrently with online programs since batch programs tend to make heavy use of buffers which would in turn slow online performance. If an online program requests a piece of information i t i s brought into the IMS bufffer pool, some actions are then performed and information i s displayed to the terminal. While awaiting the user's response , i f a batch program i s working away chances are that the buffers in the pool w i l l be completely changed before the online program is able to continue. When the online program does continue i t w i l l cause another access to be executed, probably for the same information. If the batch program runs at a different time additional disk accesses can be minimized. 4 1 LOCALITY OF IMFORMATIOM Suppose the following data base structure exists: Roar- Se6. A sec a se<s> 2> It i s always adviseable to retrieve a root segment and process a l l child segments of that root (two SegAAs, one SegB, one SegC and three SegD's) before continuing to examine another root segment either i n the same or a different data base. This w i l l minimize accesses since a root and i t s child segments can typically be retrieved with one access. 4 2 As another example of l o c a l i t y consider: 8ooT sec A Sad C Sec a S£6. H sec r sec r 43 There are too many segments to retrieve with one access in the above example. SEGG, SEGH, SEGI, SEGJ are stored i n a separate block. However, IMS tends to keep segments together according to structure as much as possible. A programmer should therefore try to process these segments in order (top to bottom and l e f t to right) and thus minimize disk access time. 4 4 HIERARCHICAL POINTERS A programmer need not be aware of hierarchical or child/twin pointers i n the data base. However, i f hierarchical pointers are in use, performance can be enhanced i f the program accesses each segment in order according to the hierarchical pointers. sea. / s e e . 3 In addition to extra disk accesses as discussed i n the section on l o c a l i t y , IMS w i l l work much harder following pointers i f a program asks for SEG1 to SEG8 i n any order other than that shown above. This is not so with child/twin pointers since the ROOT segment would then contain a separate pointer for each child segment type instead of one pointer pointing to SEG1. 45 LOGICAL DATA BASES 6 DA&L 3ase / 2kz&. Suse U Segment D i n DATA BASE 1 points to segment 2 i n DATA BASE 2 and segment Y points to segment A. Thus a logical data base could be constructed joining the two data bases together (data base 3 ) . _ A program that needs information from DATA BASE 1 and DATA BASE 2 could use this logical view however each time segment Z is accessed that program i s crossing from one physical data base to . another and i n doing so causes additional accesses. 4 6 8 c X Att fa. 'Base 3, A /iojtcaC 2)&fo Base. V To reduce this cost, one program i n the insurance application uses two separate physical data bases, extracts information from each then sorts and merges the resultant information to achieve the same effect. It i s important to know when physical boundaries are crossed. 47 MULTIPLE POSITIONING For each data base that an application program accesses IMS maintains a pointer to the current position within the data base. In most situations this i s sufficient. However, on occasion a programmer must take advantage of multiple positioning in IMS. For example: s e c z Here i t is necessary to access the f i r s t occurrence of S E G 1 , then the f i r s t occurrence of S E G 2 , the second occurrence of S E G 1 then the second occurrence of S E G 2 , etc. With multiple positioning IMS w i l l maintain a pointer to the current position i n segments of type S E G 1 and a second pointer for segments of type S E G 2 . The programmer must specify a request for multiple positioning i n a situation of this type. 48 OTHER CONSIDERATIONS A programmer must be aware that a GU c a l l i n IMS is more expensive than a GN c a l l . Hence once position is established via a get unique i t i s advantageous to use as much information as possible from that parent segment via GN calls. This encourages a programmer to follow the hierarchical structure whenever possible. Insert and delete operations are expensive i n IMS due to pointer and space management activities that take place. If at a l l possible a replace (REEL) should be done instead of a delete followed by an insert. In the insurance application one data base was extensively redesigned to use replace calls instead of delete/insert ca l l s . A l l calls to IMS should be as f u l l y qualified as possible thus eliminating much overhead in IMS and potentially reducing the search time. 49 4 . 3 CONCLUSIONS An attempt has been made to show that the structure of a system such as IMS i s one of the features allowing current DBMS to cope with increasing data base sizes and response demands of online systems. In the next section relational data base systems w i l l be examined and the question of whether these systems can indeed perform as well as hierarchic or network systems w i l l be addressed. 50 5.0 RELATIONAL DATA BASE MANAGEMENT SYSTEMS There are several criticisms of hierarchical and network, data hase management systems. One criticism is the lack of data independence that these systems provide, and a second concerns the cumbersome interface that a programmer must deal with. Relational datacbase management systems provide a much higher degree of data independence and a simpler interface. Because of the growth in data base size, an emphasis on integrated data and thus shared data, the need for readily available information and the dynamic nature of such information, i t has become increasingly important for application programs and query languages to be independent of changes in the stored representation of information. The term for this family is data independence. Systems like IMS have provided some data independence; logical data bases and schemas are examples. However as E.F. Codd has written, "Three of the principle kinds of data dependencies which s t i l l need to be removed are: ordering dependence, indexing dependence and access path dependence." IMS can be used to i l l u s t r a t e these different dependencies. ORDERING DEPENDENCY: The roots of an IMS data base have some order associated with them; for example an employee data base could be ordered by employee number. If the ordering of this f i l e were altered so that 51 a l l employees were ordered by social insurance number, then any programs depending on an ordering according to employee number would have to change. INDEXING DEPENDENCY: IMS provides indexing f a c i l i t i e s at the non-root level, these are called secondary indexes. Programs can then be written that w i l l access data based on the index. If the index is removed these programs would require modification. ACCESS PATH DEPENDENCY: IMS presents a hierarchic picture of data i n a data base. If this structure, which in effect defines the access path, changes then the programs using the structure may also have to change. In particular, i f the hierarchy is being used e f f i c i e n t l y as explained in the previous chapter the programs would have to change. Data manipulation languages designed to work with hierarchic or network DBMS depend heavily on the underlying structure of the data model. For example, IMS has the commands GN and GNP that are specifically designed to guide a programmer through a hierarchy from top to bottom and l e f t to right. In addition the languages that define structure, logical connections and access path sensitivity in IMS are not at- a l l consistent with the data manipulation language. Hence a user must learn several interfaces. To provide a comparison relational data base management systems w i l l now be discussed. System R, a prototype under development 52 at IBM is used as an example. Particular attention w i l l be paid to the underlying storage techniques and basic retrieval mechanisms that System R provides. 53 5.1 THE RELATIONAL MODEL A paper by E.F. Codd provides a d e f i n i t i o n of the terra r e l a t i o n : "Given s e t s , Ŝ ,, ... S^ (not n e c e s s a r i l y d i s t i n c t ) , R i s a r e l a t i o n on these n sets i f i t i s a set of n-tuples'each 'of which has i t s f i r s t element from S.̂ , i t s second element from S„, and so on. We s h a l l r e f e r t o S. as the j t h domain of R. As de f i n e d above, R i s s a i d t o have degree n." . '"•«.-' The most convenient way t o represent a r e l a t i o n i s w i t h a t a b l e . Suppose we consider an insurance a p p l i c a t i o n as mentioned i n the previous s e c t i o n , t h e f o l l o w i n g r e l a t i o n s could occur: SUBSCRIBER RELATION SIN STATUS NAME • LOCATION BD SEX S1 ACTIVE JONES VANCOUVER 31 01 50 M S2 ACTIVE FORD VICTORIA 15 06 he F S3 CANCELLED KING VANCOUVER 03 03 35 F ACTIVE BOOTH KAMLOOPS 27 11 hz M ENROLMENT RELATION. GROUP NO. SIN DATE STATUS COMPANY G1 Sk 1977 ACTIVE A G2 SI 1976 . ACTIVE B G2 S3 1976 UNPAID B G3 S2 1978 ACTIVE C Gk Sk 1975 CANCELLED D 54 CLAIM RELATION . CLAIM NO. SIN DATE STATUS AMOUNT CI SI 06 06 76 PAID $100 C2 S1 20 09 77 PAID $ 30 C3 S3 04 04 78 REJECTED $ 0 C4 . • S3 15 03 77 PAID • $ 12 C5 S3 31 01 78 PAID $ 39 C6 S4 12 05 75 PAID $ 52 The SUBSCRIBER r e l a t i o n shows a l l subscribers that have contracts with the insurance company. ENROLMENT states that a p a r t i c u l a r subscriber belongs to a p a r t i c u l a r group (GROUP NO.), was enrolled i n a given year (DATE) and has a c e r t a i n status on that group. A l l claims that have been received are l i s t e d i n the CLAIM r e l a t i o n . For those claims that are paid (STATUS) the amount of payment i s l i s t e d (AMOUNT). Consider the d e f i n i t i o n of a r e l a t i o n . This system contains three r e l a t i o n s (more r e l a t i o n s w i l l be added l a t e r ) , the domains of the SUBSCRIBER r e l a t i o n are SIN, STATUS, NAME, LOCATION, BD and SEX and the degree of the CLAIM r e l a t i o n i s f i v e . Furthermore, each row i n a t a b l e represents a tu p l e of the r e l a t i o n , the ordering of rows i s i n s i g n i f i c a n t , a l l rows are d i s t i n c t and the ordering of columns i s s i g n i f i c a n t . Each r e l a t i o n has a primary key, one or more domains which uniquely i d e n t i f y each tuple i n the r e l a t i o n . The keys i n the above r e l a t i o n s are SIN, GROUP NO. and SIN, and CLAIM NO r e s p e c t i v e l y . Notice that i n the enrolment r e l a t i o n the two domains 55 SIN and COMPANY could also be a primary key. It is also possible that a domain of one relation i s a primary key of another relation, for example SIN i n the CLAIM relation. There are two basic retrieval languages for relational data base management systems: relational algebra and relational calculus. The relational algebra provides basic operations that are performed on one or more relations to produce a relation as a result. These basic operations are permutation, projection and join. Permutation results i n the interchange of two or more columns of a relation and is necessary because ordering of columns i n a relation i s significant. Thus one permutation of the CLAIM relation would produce columns ordered as CLAIM NO., STATUS, DATE, SIN, AMOUNT. Note that the set of queries answerable by a relation is the same as the set answerable by any permutation of that relation. A projection consists of selecting certain columns from a relation, forming a second relation and removing any duplicates from that second relation. Using the notation of Date *~ - 1 : ENROLMENT £"GROUP NO., C0MPANY_7 results i n a relation GROUP NO. COMPANY G1 A G2 B G3 C G4 D 56 N o t e t h a t t h e t u p l e G 2 , B e x i s t s o n l y o n c e . C . J . D a t e d e s c r i b e s t h e j o i n o p e r a t i o n " T w o r e l a t i o n s w i t h a common d o m a i n , D , c a n b e j o i n e d o v e r t h a t d o m a i n . T h e r e s u l t i s a r e l a t i o n i n w h i c h e a c h t u p l e c o n s i s t s o f a t u p l e . f r o m t h e f i r s t r e l a t i o n c o n c a t e n a t e d w i t h a t u p l e f r o m t h e s e c o n d f i k 7 r e l a t i o n w h i c h c o n t a i n s t h e s a m e D - v a l u e " . ^" - ' T o i l l u s t r a t e a j o i n o p e r a t i o n a f o u r t h r e l a t i o n i s a d d e d , t h e C O N T R A C T r e l a t i o n . C O N T R A C T R E L A T I O N I GROUP N O . R U L E N O . G1 R1 G1 R5 G2 R 2 G3 R 2 G3 R 3 Gk Rk Gk R l A j o i n o f E N R O L M E N T a n d C O N T R A C T o v e r GROUP N O . , E N R O L M E N T * C O N T R A C T w o u l d g i v e : G R O U P N O . S I N D A T E S T A T U S COMPANY R U L E N O . G1 " Sk 1977 A C T I V E A R1 G1 Sk 1977 A C T I V E A R5 G2 SI 1976 A C T I V E B R2 G2 S 3 1976 U N P A I D B R2 G3 S 2 1978 A C T I V E C R2 G3 S2 1978 A C T I V E C R3 Gk Sk 1975 C A N C E L L E D D Rk Gk Sk 1975 C A N C E L L E D D R l 5 7 Operations i n the r e l a t i o n a l algebra can, of course, be combined: ENROLMENT £"SIN, C0MPANY_7 * CONTRACT would r e s u l t i n : SIN COMPANY RULE NO. Sk A R1 Sk A R5 SI B R2 S3 B R2 S2 C R2 S2 C R3 Sk D Rk Sk D R1 The r e l a t i o n a l algebra i s more complex and more procedural than r e l a t i o n a l c a l c u l u s . I t demands t h a t a user know p r e c i s e l y what t a b l e s are i n the system, what common domains are i n these t a b l e s and the order o f domains w i t h i n a t a b l e . Most implementations o f r e l a t i o n a l DBMS use a h i g h - l e v e l r e l a t i o n a l c a l c u l u s as a user i n t e r f a c e . R e l a t i o n a l c a l c u l u s a l l o w s a user t o de f i n e the d e s i r e d r e s u l t of a query and the u n d e r l y i n g data management system t r a n s l a t e s t h a t d e f i n i t i o n i n t o a s e r i e s of r e l a t i o n a l o p e r a t i o n s , such as p r o j e c t i o n and j o i n . S p e c i f y i n g a p r o j e c t i o n i n r e l a t i o n a l c a l c u l u s d i f f e r s l i t t l e from r e l a t i o n a l a l g e b r a : ^ (ENROLMENT. GROUP NO., ENROLMENT. COMPANY)J 58 this is equivalent to ENROLMENT £ GROUP NO., COMPANYJ A join operation is easier to specify using relational calculus: ^('ENROLMENT. SIN, ENROLMENT . COMPANY, CONTRACT. RULE NO.): ENROLMENT . GROUP NO. = CONTRACT . GROUP NO. ^ this i s equivalent to ENROLMENT £SIN, COMPANY J * CONTRACT . The relational calculus allows a user to specify what information i s required not how to get i t . It i s important to note that the two methods are equivalent in retrieval power and that an algorithon exists for converting any calculus expression into an algebra expression. "̂̂ J7 ^ n e x a m p i e Q f relational calculus w i l l be given when we deal with System R's retrieval language SQL. 59 5.2 SQL System R provides a language called SQL, formerly SEQUEL, which can be used as a stand alone user language or interfaced with PL/1. SQL is a high-level language used for data retrieval, data manipulation, data definition and data control. This is i n contrast with IMS since IMS provides separate and distinct f a c i l i t i e s for data definition and data control. SQL QUERY FACILITIES A l l queries that can be expressed i n SQL have the same basic format. SELECT ( some type of data ) FROM ( a table containing the data ) WHERE ( some condition describing the choice of data) Using the tables developed earlier we w i l l demonstrate the various query f a c i l i t i e s . SELECT SIN FROM SUBSCRIBER WHERE SEX = 1 F * This query might be 'read' as "Select a l l social insurance numbers for subscribers that are female". Note that the domain name SIN has been provided and the name of the SUBSCRIBER relation. Several domains may be selected: SELECT SIN, NAME FROM SUBSCRIBER WHERE SEX = ' F ' A l l domains of a given relation may be selected: SELECT * FROM SUBSCRIBER WHERE SEX = ' F ' A l l entries i n one domain may be selected: SELECT SIN FROM SUBSCRIBER Multiple conditions may be specified: SELECT SIN FROM SUBSCRIBER WHERE SEX = 1 F 1 AND STATUS = ' ACTIVE * The condition can contain an arithmetic expression: SELECT CLAIM NO. FROM CLAIM WHERE AMOUNT > 2 * DEDUCTIBLE (This assumes that a value i s assigned to DEDUCTIBLE.) Built-in functions are available: SELECT AVG (AMOUNT) FROM CLAIM This would give the average amount paid out for a claim. Multiple relations may be specified: SELECT CLAIM NO. FROM - ' CLAIM WHERE >\TJv •-, SIN = SELECT SIN FROM SUBSCRIBER WHERE SEX •=• ' F 1 61 Other f a c i l i t i e s include specifying the order in which queries are returned — ascending or descending hy some domain, a SELECT UNIQUE f a c i l i t y that w i l l eliminate duplicates, reordering of domains so that they are not i n the same order as in the table, and the a b i l i t y to retrieve more than one set of information and perform set operations on the result (UNION, INTERSECTION, MINUS). - r'VH>' , . /""l6'7 A more complete description of SQL is given by G.H. Denny. - 1 It- should be noted that i n order to form the queries i n SQL a user must know ( l ) the exact format of the tables, the domain names and (2) how information i s stored in the tables, for example the domain SEX is stored as 'F' for female and 'M' for male. 62 DATA MANIPULATION In SQL a user may UPDATE a t u p l e or t u p l e s i n a r e l a t i o n , DELETE t u p l e s from a r e l a t i o n and INSERT new t u p l e s i n a r e l a t i o n . The format of these commands i s very s i m i l a r t o the format of the SELECT command. DELETE CLAIM WHERE DATE < 01 01 77 'Delete a l l claims p r i o r t o January 1, 1977.1 UPDATE ENROLMENT SET STATUS = ACTIVE WHERE SIN = 'S3' 'Update the st a t u s o f S3 t o a c t i v e . 1 INSERT INTO CLAIM <C7, S4, 030077, PAID, $35> 'Insert a p a i d c l a i m i n t o the CLAIM r e l a t i o n . ' Note t h a t the values heing i n s e r t e d must be i n domain order. 6 3 DATA DEFINITION Definition of new relations is done via a CREATE TABLE command and relations can be destroyed through a DROP TABLE command. When creating a table an ordering may be specified (an ORDER BY clause is used). This ordering has a bearing on the physical proximity of table entries. A user who has need of specific tables and specific table ordering can use the DEFINE VIEW command. This i s similar to a yf 177 schema as discussed i n Date. *~ ' -' Another command, EXPAND TABLE, allows for dynamic modification of relations by adding new fi e l d s . 64 DATA CONTROL There are four areas concerned with data control: transactions, authorization, integrity and triggers. A transaction in System R consists of a series of SQL commands that perform a function. The importance of transactions i s their relationship to backup and recovery. The beginning of a transaction defines a point to which a user may backup. Authorization refers to the a b i l i t y of a user to read, insert, delete and update tables within the data base. Assertions help maintain data integrity. For example, ASSERT ON INSERT TO CLAIM: AMOUNT > $0 means that no claim can have an amount less than zero. Triggers allow a user to define other integrity information. For example a trigger could be defined so that whenever the STATUS in the ENROLMENT relation i s set to ACTIVE the STATUS for the same SIN in the SUBSCRIBER relation i s also set to ACTIVE. This brief discussion of data management f a c i l i t i e s using SQL leads to a description of how System R stores information and f i n a l l y to an outline of the optimizer that determines the optional way to satisfy a data request i n SQL. 65 5.3 RELATIONAL STORAGE SYSTEM — RSS A l l data necessary for data base management, including user data, access path information, catalog information and intermediate results, i s stored in a collection of logical address spaces called segments. (This is an unfortunate choice of terminology since IMS uses the term 'segment' for i t s record level of information.) The storage component of RSS manages physical disk space and to this end maps the segments described above to physical locations on disk. Each segment i n fact consists of a set of equal sized pages and the storage component deals with slots on disk the size of one page. Thus physical page slots are allocated and freed as required and RSS maintains a page map for each segment. Relations which consistsof a number of tuples are stored within segments. "Associated with every tuple of a relation i s a /"18 7 tuple identifier or TID." '•" -* The RSS is responsible for storing and accessing tuples and for maintaining the various pointer structures that link tuples of the same or different relations together. A tuple identifier i s implemented as shown i n this diagram: 66 A tuple identifier contains a page number and a slot number. The page number references a particular page within a segment while the slot number directs the RSS to an area always maintained at the bottom of the page. From that slot an additional pointer that has been stored via RSS points to the actual tuple. Since data i s accessed i n pages this scheme retrieves the requested information in a single page access, in most cases, once the TID i t s e l f i s accessed. 67 Tuples are associated with each other either through the use of images or links which are defined by the user and maintained through RSS. An image is a logical ordering of an n-ary relation, with respect to values in one or more sort f i e l d s . Images are maintained through the use of a multi-page index structure and the index information is stored on one or more pages within the segment containing the relation. While images are used to order tuples within a relation, links are used to connect tuples i n different relations. The links perform a similar function to child/twin or hierarchical pointers in IMS. RSS maintains these links and adds to them as new tuples are added to the data base. •St £2 °~- SS" S6 x : Su£SC£/8e£ Si £3 S4- fa lection Ci C 2 cr C8 do c// c/z £13 C/4- « T - 6LA/M 6 3 In this diagram an image exists for the SUBSCRIBER relation ordering tuples according to subscriber number. Note that the image is stored separately from SUBSCRIBER information. A binary link exists from SUBSCRIBER to claim associating a l l claims with the subscriber that issued them. A unary link could be added to order claims on date i f that was a requirement. (Not shown i n the diagram.) Thus a unary link i s used to order tuples within one relation without using an image. Note that an image requires extra storage space and additional 10 processing. Binary links always connect tuples from two different- relations. The following diagram illustrates how images are implemented using a VSAM-like tree which is based on the B-tree concept. \7&2 ,Tll>3 TiDii Tiiil, 7it>4- II 12. TidS /6 T.J>7 69 5.4 THE OPTIMIZER "The objective of the optimizer is to find a low cost means of executing a SEQUEL statement, given the data structures and access paths available." /J" 19 _7 T n e optimizer has two main cr i t e r i a i n determining the best access path: minimizing page fetches and minimizing CPU instructions. In addition the optimizer can weight CPU cost or l / o cost depending on whether the system i s compute bound or l / o bound. As discussed earlier, System R provides for physical clustering of tuples and relations based on images and links. It i s the function of the optimizer to take advantage of these properties whenever possible -in determining the execution of a given query. The optimizer goes through several steps i n determining the best access. (1) Classify the statement according to certain language features such as the user of a join operator or a simple restriction operator. (2) Check the system catalogues to determine what pertinent images and. links are available. For example: SELECT A, B FROM R WHERE- C IS LESS THAN X In this query an image on the f i e l d C would be pertinent to efficient execution. 70 (3) For each language feature mentioned in ( l ) above the optimizer can choose from a set of predetermined methods of execution. One or more methods w i l l be reasonable for the query. Two methods that would be reasonable i n the above situation are: Method A: Use a clustering image which matches a predicate whose comparison operator is not 1 = '. • • Method B: Use. a hon-clustering image which matches a predicate whose comparison operator is not " = ". (4) Given more than one reasonable method, calculate an expected cost and choose the minimum cost method. For method A, assuming half the tuples i n the relation satisfy the predicate, the expected cost i s R/ (2 x T) where R is the relation cardinality and T i s the average number of tuples per data page. For method B, the expected cost to retrieve a l l tuples i s R/2. If an image or link exists that w i l l directly f a c i l i t a t e the query, a method w i l l be chosen that uses i t . In addition, i f more than one link or image is available, the method which uses the link or image with a clustering property w i l l be chosen since i t w i l l reduce 10 significantly. As a further i l l u s t r a t i o n , consider another query. SELECT A,B FROM R WHERE C = X AND D = Y 71 A clustering image exists on C and a non-clustering image exists on D. The optimizer w i l l choose to use the clustering image existing on C. Note that in this case i t would he best to be able to use both images and compare selected tuple identifiers before accessing any tuples from the relation. However, the optimizer as described does not provide this f a c i l i t y . This section i s intended to b r i e f l y describe the optimizer f a c i l i t y of System R. A discussion of i t s effectiveness is outside the scope of this paper since l i t t l e detailed information i s available. 7 2 6.0 PERFORMANCE CONSIDERATIONS IN SYSTEM R Given a brief explanation of System R the insurance application w i l l be examined in more detail to compare the performance of some typical tasks using IMS or System R. The following relations exist: SUBSCRIBER SIN ' NAME ADDRESS STATUS AGE ENROLMENT GROUP NO. SIN DATE STATUS COMPANY CLAIM CLAIM NO. SIN DATE STATUS ' AMOUNT RECEIPTS CLAIM NO. DEP TYPE DATE AMOUNT DEPENDANTS DEP SIN STATUS AGE CONTRACT GROUP NO. RULE NO. RULES RULE NO. CATEGORY PERCENT LIMITS SUBSCRIBER: DEPENDANTS: CLAIM: describes a l l people who have contracts, a l l dependants of a given subscriber, a l l claims that have been processed. 7 3 ENROLMENT: the group/company to which a subscriber belongs, RECEIPTS: details of the claim. CONTRACT: the contract rules for a particular company. RULES: details of specific rules. An IMS version of this data base might be: sae&c&iseR. Z>6P£A/MIVTS / / (COA/THIACT ) 74 There are three data bases: one with subscriber related information, one with group/contract information and one describing a l l claims that.have been processed. Relationships between the •subscriber and contract information as well as the subscriber and claim information exist, thus various "logical views" of the data base can be constructed. SUBSCRIBER EXAMPLES OF LOGICAL VIEWS CLAIM SUBSCRIBER SUBSCRIBER DEPENDANTS 7 5 The f i r s t typical task is to evaluate a new claim for a given subscriber. Such a process consists of several steps (note that this i s a simple view of claim evaluation): 1. Does the claim already exist? 2. Does the subscriber exist? 3. For each receipt i n the claim a) does the dependant exist? b) is the item covered for the date? c) determine amount to be paid. k. Insert the claim i n the data base. 5. For each receipt, insert the receipt information. It i s assumed that a l l IMS data bases are HDAM, (See Section 3), which fac i l i t a t e s random processing at the expense of sequential processing. In System R, i t is assumed that links exist providing clustering so that the data bases' underlying structure closely resembles that in IMS. The c r i t i c a l element in this analysis is the number of disk accesses (10s) that are required to perform the task. 7 6 STEP IMS IOS •SYSTEM R IOs 2. 3a. 3b. 3c. 5.. GU (claim) GU (subscriber) GNP (dependant) GNP. (enrolment) GNP (rules) ISRT (claim) ISRT (receipts) 0 SELECT SELECT SELECT SELECT SELECT INSERT 'INSERT 1 (or more) for index page. 1 for data page. 1 (or more) for index page. 1 ,for data page. 0 i f link exists from subscriber to dependant. 0 i f link exists to • cluster information. 0 .for contract information i f link exists. T (or more) for index to .. rule. 1 for rule information. 1 (or more) for index. 1 for data. 0 i f link between claim and receipts exists. TOTAL 8 minimum In this example even when System R has a similar underlying structure to that of IMS, additional access time i s required : 8 accesses versus h accesses'. This analysis i s generous to System R since an index access w i l l t y p i c a l l y require as many accesses as the level of the index. If the index level were two for each relation then the number of accesses i n System R would become 12 and an index level of three would force the number of accesses to 1 6 . The prime reason for additional access time 77 required by System R is the indexing scheme used to implement images for relations, as discussed earlier. If such an index scheme were replaced by a hashing scheme as i n HDAM, System R would lose the f l e x i b i l i t y and relative speed of scanning a relation effectively which i t must be able to do for proper performance of join operators, a cornerstone of relational data base management. In an ar t i c l e called "Computing Joins of Relations", L.R. Gotlieb has analyzed different algorithms,? used i n computing relation joins. He has determined that i/O ac t i v i t y must be minimized and that ordered key l i s t s when kept on both domains being joined are the most effective way of minimizing i/Os. These ordered key l i s t s are similar to images in System R. The concept of relation joining is not necessary in IMS since relations are already joined through the hierarchical data base structure of IMS. In an IMS data base i t is important therefore to structure the hierarchy to join those relations that are in fact most frequently joined in the application. Thus for example DEPENDANTS i s a child segment of SUBSCRIBER since i t is often necessary to join the SUBSCRIBER and DEPENDANTS relations, similarly RECEIPTS i s a child segment of CLAIM. As a second example suppose a user asked the system to " l i s t a l l paid claims for a subscriber". In System R a SELECT statement is used: 78 SELECT CLAIM WO. FROM CLAIM WHERE STATUS = PAID AND SIN = S1 Assuming that a link i s provided between the SUBSCRIBER relation and a l l claims in the CLAIM relation, the above statement would require one access to index the subscriber and one access to retrieve the page containing that subscriber's information. In contrast an IMS programmer would code: CALL PLITDLI ( FOUR, GU SUBSCRIBER - POINTER, RETURN - AREA, ' ' KEY = S1 ) ; DO WHILE (SUBSCRIBER - STATUS = BLANK) ; CALL PLITDLI ( FOUR, GNP, SUBSCRIBER - POINTER, RETURN - AREA, ' ' NO'KEY ) ; END ; 79 While this i s obviously not as concise and straight forward as an equivalent System R statement i t does result in a single access, i f symbolic pointers are used to implement the logical relationship between SUBSCRIBER and' CLAIMS. Another typical task involves enrolling a new subscriber. To do this both systems w i l l have to: .1. create a SUBSCRIBER entry. 2. create dependants for this subscriber. 3. link subscriber to an appropriate .group. STEP IMS 1. ISRT (subscriber) 2. ISRT (dependants) 3. ISRT (enrolment) I/O SYSTEM R I/O INSERT 1 (or more) for index. 1 ..for data. INSERT 0 i f link is ' available. INSERT 0 i f link i s available. TOTAL 1 2 (or more) The above analysis assumes that no images exist for either SUBSCRIBER or DEPENDANTS relation. Further IOs are required i f images do exist. These few examples show situations where System R is not as efficient as IMS. Assuming for a moment that some other examples can be constructed showing System R to be more efficient that IMS there are other aspects of System R.that also degrade performance. These w i l l be discussed i n the next section. 8 0 6.1 OTHER PERFORMANCE CONSIDERATIONS USING SYSTEM R The functions of System R's relational storage system include concurrency control, f a c i l i t i e s for checkpoint and restart, transaction management, transaction recovery, physical storage management, maintenance of links and images as well asddata access. At another level within System R, the Relational Data System (RDS) "provides high level, data independent f a c i l i t i e s " " r?A 7 for data retrieval, manipulation, definition and control." *- What overhead Is involved in providing these functions? F.H. Lochovsky and D.C. Tsichritzis have examined relational, network and hierarchical data base management systems with user /~22 7 performance i n mind. *" m J Three factors were considered: 1 . proportion of correctly coded application programs. 2. coding time. 3. debugging time. In their study users were chosen and assigned to one of the three data base management systems and given time to familiarize themselves with the user interfaces. Following this, a set of programs were devised and the users implemented these programs. Results of this study showed that the relational DBMS provided a better user interface than the hierarchical DBMS in a l l three areas; correctness, coding time and debugging time. One of the major problems i n using the hierarchical DBMS was setting the data base position pointer before 81 "navigating" through the hierarchy, while another was using the get next within parent (GNP) command correctly. In addition, some problems were associated with the unnaturalness of hierarchical DBMS. The reason for introducing this study i s to emphasize that the advantages associated with using a relational data base management system result to a large degree from the high level user interface provided — for example SQL in System R. However queries expressed i n SQL must be analyzed by the optimizer. This involves the overhead of parsing the SQL statement, classifying the statement according to certain rules, examining various system catalogs to find suitable links and images and then selecting the best method of satisfying such a query. A l l this must be done before any relevant data can be retrieved. A possible solution would be to perform the above actions at compile time instead of execution time, however, this i s impossible i f System R i s to maintain the f l e x i b i l i t y of dynamic link and image definition, dynamic table definition and dynamic table extension. Arguments can be made to show that a relational data base management system, such as System R, requires additional storage space. Thus, over a wide range of accesses to the data base more physical IOs w i l l be required. Consider for example such features as backup page maps used to handle segment recovery. Extra storage is also required for intermediate relations that are created by various relational operators. Page maps are maintained for each 82 segment, a segment consists of physical pages and the page map show the physical location of each page of a segment on disk. For recoverable segments (intermediate relations are not recoverable segments) a backup page map i n i t i a l l y has entries identical to the current page map. As information in the segment changes the current page map also changes but the backup page map maintains an old version for possible recovery. Relations require more storage because the key information has to be stored i n both "parent" and " child" segments. Consider, for example, the various relations in the insurance application: SIN is maintained in SUBSCRIBER and DEPENDANTS relations, CLAIM NO. exists i n CLAIM and RECEIPTS whereas in ahhierarchical data base SIN occurs once only i n the SUBSCRIBER segment, CLAIM NO. occurs once only i n the CLAIM segment. As mentioned i n the previous discussion, a l l relations in System R are accessed through an indexing scheme. Thus access to any tuple or relation directly involves one or more accesses to an index page followed by one access to a data page. This i s a distinct disadvantage to System R. In System R there i s no well defined concept of lo c a l i t y of information, no equivalent to the "top - l e f t " in a hierarchy which i s known by a programmer to be close (i.e. : in the same physical page) to the root segment. A programmer can not therefore take advantage of segment ordering in order to minimize IOs. 8 3 Because links and images may be defined with a clustering property i n System R and this clustering property can be defined dynamically there i s some danger that these w i l l be used indiscriminately, resulting i n serious fragmentation of relations and clusters. These are some areas for concern i n System R and indeed in any relational DBMS., A summary is provided by an art i c l e i n Computing Surveys. "The user of a procedural DSL, like -the DBTG DML, can select for his particular interaction the most efficient access strategy, correspond- ing to a given schema definition. He can give the system directions for traversing the ... model to locate the desired information rather than letting the system choose a route of i t s own. In situations where efficiency i s c r i t i c a l , such a system might be able to out-perform higher level, less procedural interactions." 84 6.2 ANOTHER STUDY In a similar study M. Stonebraker and G. Held have compared network, hierarchical and relational data base management systems. This study concentrated on the different languagealevels; high level non-procedural languages and low level procedural languages. It was concluded i n the study that non-procedural languages result i n increased programmer efficiency, increased data independence and better protection and integrity at the expense of machine efficiency. To support the statement concerning machine efficiency three situations are discussed i n which a procedural system can be more effective. 1. In some situations i t i s possible to express a particular query in more than one fashion. It is argued that a programmer using a procedural language w i l l determine the most efficient way of implementing the request before carrying out the programming task. On the other hand, using a non-procedural language the user is unaware of the fastest method and may not choose appropriately. 2. Stonebraker and Held demonstrate that a non- procedural language may be inappropriate for certain requests. The example they use i s to find the member of a particular department who has the second highest salary. In a procedural 85 system such as IMS i f the segment containing salary is ordered on salary value then i t is simple to satisfy this request. Expressing such a request in SQL would require: SELECT X. SALARY FROM EMP X, EMP Y WHERE COUNT' (X. SALARY BY X. SALARY 'WHERE • X. DEPARTMENT = 12 AND Y. SALARY > X. SALARY ) = 2 This i s very d i f f i c u l t to formulate and there i s no guarantee that the optimizer could determine an efficient way to execute this request. 3. The third situation deals with an example that i s similar to those discussed in Section 6.0. 8 6 6.3 SUMMARY Despite the relative efficiencies of hierarchical systems, such as IMS, and relational data base management systems, such as SYSTEM R, there is s t i l l a basic bottleneck to overcome -- the cost in terms of performance associated with current mass storage hardware. A query such as "retrieve a l l claims less than one year old" should not require a DBMS to bring each set of claim information into core, check whether the condition is satisfied and reject or accept the claim accordingly. Nor should the DBMS require some specialized set of pointers based on the claim date in order to answer such a query. Instead "smart hardware" should be available that w i l l search the data based on i t s content rather than i t s physical address. Content addressable memories were proposed as early as the 1950's, however, due to the high cost associated with them, they have only been implemented on a very small scale. In the next section some trends in data base management research w i l l be discussed, associative memories w i l l be described along with some specific systems. A more detailed examination of RAP, a system developed at the University of Toronto w i l l also be given. 87 7.0 ASSOCIATIVE PROCESSORS The 1977 Conference on Very Large Data Bases included a session on directions i n data base research. Some quotes from that session are significant. "A number of storage techniques and search algorithms in use i n current data management systems are impractical for very large data bases. We are interested in new hardware, probably exploiting LSI technology, that would make those techniques and algorithms feasible for very large data base systems. An example would be a novel implementation of various 'key word' /~2k 7 algorithms for searching free text in a 'smart memory1. ** ~/ "In order to realize an operational data base system i t is substantial to achieve a reasonable response time. There are two kinds of very time consuming processes in the system. One is the problem solving process which appears i n [the} natural language understanding part and {the| deductive translation part. We need special purpose hardware such as a LISP machine or more sophisticated machine. The other-is the data base manipulation process. We need a very efficient data base machine, especially when we have a very large data base. The most important and 2 d i f f i c u l t target is to improve the performance of n - type operations such as join and projection in relational algebra. 88 "Areas of particular interest in very large data base systems include the following ... specialized hardware. In addition to back-end data base management systems, we are interested in associative memories, intelligent disks, intelligent /~26 7 terminals, and graphics systems." "Management of very large data bases will be heavily dependent on new hardware technologies supporting new storage and retrieval methods. Especially associative memories and /~27 7 parallel access algorithms seem to be a promising approach." These quotes highlight certain ideas. It is evident that there is concern over the ability of current data base management systems to handle very large data bases. Improved hardware seems to be the answer considering the comments on associative memories, LISP machines and data base machines. The join and projection operations of relational systems are specifically noted to be problems. 8 9 7.1 ASSOCIATIVE HARDWARE "Currently, the microprocessor/computer-on-a-chip revolution i s providing the potential for production of very cost-effective high-performance computers through u t i l i z a t i o n of a large number / * 2 8 7 of these processors in parallel or in a network." ^ An associative processor, as mentioned earlier, can retrieve stored data using the content of the data rather than i t s physical address. In addition, an operation such as retrieval can be performed i n an associative memory on many pieces of data at the same time. Using /~29 7 Flynn's terminology ^ - / $o classify computer architecture an associative processor f a l l s into the SIMD class (parallel processors also f a l l into this class). SIMD means that there i s a single instruction stream operation on a multiple data stream. This leads to the observation that due to this parallelism and the a b i l i t y to retrieve information based on content, an associative processor w i l l have a faster data processing rate than traditional devices. Given a faster data processing rate an associative processor w i l l be "more effective in handling many types of information processing problems such as information storage and retrieval of rapidly changing data bases, fast search of a large data base, arithmetic and l o s i c a l operation, on large sets o f data" f»-7 and otners. Associative processors are being used at present for some highly specialized data processing functions. Because of cost considerations these processors are small and are used for tasks 90 such as v i r t u a l memory management, resource allocation, interrupt processing and scheduling tasks. Use of associative processors w i l l grow in the future but they w i l l remain special purpose to be used i n applications having a large number of independent data sets, that 'can be processed in parallel, a need for fast response and a need for addressing based on content. It i s beyond the scope of this paper to describe details of the various hardware techniques used to implement associative memories and processors however, the main features w i l l be discussed. One way to understand an associative memory is to consider that the input consists of a search argument of x bits and the output- consists of a bit for each word in the memory indicating success or failure on matching the search argument. Search Argument- Associative Memory (n words) Search Results (n bits) Not a l l systems work in this fashion, some systems output the contents of those words in memory that matched the original search argument. On a more detailed level, an associative processor contains a data register which is loaded with the data to be compared with 91 the data i n memory, A mask register is. used to mask off those f i e l d s i n the memory that are not included i n the search, a word select register indicates which words are to be searched, a results register contains one b i t for each word and the bit i s turned on i f Its corresponding word matches the search c r i t e r i a . The number of matches i s contained i n the match indicator and the multiple match resolver points to the f i r s t word that matched. A control unit i s used to specify the operation (eg. equals, greater than, less than) that i s to be performed. . Consider the query introduced e a r l i e r : retrieve a l l claims less than one year old. In this example the data register contains the date, a l l f i e l d s except the date f i e l d are masked off via the •mask register. The word select register would indicate that a l l words are to be searched and the operation less than i s put into the control register. Data Register 0 0 01 06 77 0 0 Mask Register 0 0 1 0 0 Data bi * S1 06 06 7 6 PAID $ 1 0 0 1 0 C2 SI 2 0 0 9 77 PAID $ 30 1 1 C3 S 3 Ok ok 7 8 REJECTED $ .0 1 0 Ck S3 15 03 77 . '..PAID $ 12 1 0 C5 S3 31 01 7 8 PAID $ 3 9 1 1 C6 Sk 12 0 5 7 5 PAID $ 5 2 1 0 l '"word' . • search select results register register 92 As a result the match indicator would have a value of two and the multiple match resolver would point to word two. To be used effectively for data manipulation an .'associative processor consists of a number of identical memory ce l l s . In addition, the retrieval time should be largely independent of the number of cells and the memory should be modularly expandable. There are four classifications of associative processors; f u l l y p a r a l l e l , b i t s e r i a l , word s e r i a l and block oriented. Those that are f u l l y p a r a l l e l contain comparison logic within the associative memory for every b i t - c e l l of every word. These are very expensive to implement. In a b i t - s e r i a l associative processor one bit-column of a l l the words i n associative memory is operated on at a time. A word-serial associative processor Is., essentially a hardware implemention of a program loop to search for a special value. These machines have the advantage, over standard sequential processors, of reduced instruction decoding time since only a single instruction i s required to execute the search. However word-serial associative processors are slow i n comparison with the other classes. Block- oriented associative processors use a mass rotating storage device such as a disk that has some logic associated with each track. Thus tracks can be search i n par a l l e l . Some example of associative processors w i l l be discussed i n the next section. 9 3 7.2 CASSM — CONTEXT ADDRESSED SEGMENT SEQUENTIAL MEMORY The basic concept behind CASSM is that a l l operations to be performed on the data base are done directly i n disk memory. This eliminates the need to schedule paging of data between disk and main memory of the processor. In addition, "parallelism i s used to make the time to search the data base independent=of the data base size." *- -* Thus the entire data base can be searched by hardware for each search instruction. Most high level retrieval languages allow the user to express parallelism i n their queries, an architecture that provides parallelism i n retrieval eliminates the need to translate the query into a long complicated set of procedures (consider IMS for example). CASSM consists of identical c e l l s , connected through an 10 bus. In addition each c e l l can connect to two neighboring c e l l s . A c e l l consists of a segment of memory (for example, a track) and a large section. " A l l segments of memory circulate concurrently and in synchronization, while each logic section reads, searches, modifies and rewrites i t s segment of memory from one end to the other. Thus, a l l segments of memory are operated on in one Z~32 7 circulatftonodf memory." - / A query against a data base can be characterized as having a specification part and a qualification part. For example, SUBSCRIBER . SIN : SUBSCRIBER . SEX = M, specifies that a l l SINs for SUBSCRIBERS are to be retrieved, given the qualification 94 that the SEX of the SUBSCRIBER is male. This query implemention in CASSM involves searching and marking a l l occurrences that satisfy the above condition. There are two comparator registers within each c e l l , one for the specification and one for the qualification thus allowing the query to be satisfied in one sweep of memory. A RAM (one for each cel l ) that is one bit wide i s used to mark the data items satisfying a given query. One bit is maintained for each data item i n the c e l l and the association i s maintained by relative position. 5 eoMPfitZiroa. CcMPAfiiraR. /?AM Consider how the SUBSCRIBER relation might be represented in CASSM. 95 SET TYPE LEVEL WO. INFORMATION FIELDS A 1 SUBSCRIBER A ' 2 SIN STATUS NAME LOCATION • BD . "SEX V 2 S1 ACTIVE JONES VANCOUVER 31 01 50 M V 2 CS2 ACTIVE FORD VICTORIA 15 06 46 F V 2 S3 . CANCELLED KING t • VANCOUVER 03 03 35 F • V 2 Sk ACTIVE BOOTH ' KAMLOOPS' 27 11 42 M A set type of A indicates an attribute, whereas V indicates a value. The level numbers show how the rows of information f i t i n with their corresponding table. If a second table were included the level number would be reset to 1 and the attributes of that table shown before values would be given. The query SUBSCRIBER . SIN : SUBSCRIBER . SEX = M ,would,be answered i n three revolutions of the memory. SEG; REV. 1 . 2. 3. SET TYPE A A V LEVEL NO. 1 2 2 S INFO. SUBSCRIBER SIN ' DON'T CARE Q. INFO. SUBSCRIBER SEX M 96 7.3 RASES : ROTATING ASSOCIATIVE RELATIONAL STORE RARES can "be implemented on a computer system that contains a CPU, random-access high speed main memory, and head- per-fcrack rotating secondary memories. Data i s transferred from main to secondary memories via channels. An associative memory is constructed by adding content-addressing hardware to the secondary memories. Selection of tuples in response to a query is performed at the secondary storage device ; thus only correct tuples are sent via the channel to main memory. RARES searches a l l tracks on the storage device simultaneously. "The net result is that RARES can decrease the average u t i l i z a t i o n of CPU, main storage, channels and secondary storage devices by a query. In many cases this w i l l allow the interface to assume a heavier query load without degrading response time, or alternatively, to offer a /~33 7 reduced response time with the same query load." u -* RARES i s implemented i n much the same fashion of CASSM. However, the method of storing tuples on the secondary storage device i s different. Tuples can be read from storage concurrently, however, the channel can only receive information (tuples) sequentially. In CASSM tuples are output one at a time from the device thus requiring several revolutions before a l l marked tuples are output. This i s an inexpensive solution to the problem. RARES takes the approach of laying out tuples on the storage device across tracks rather than along tracks. When a search operation is complete tuples are then already in a form suitable for fast output. 97 Relations are often stored.'.imnsort order because this allows queries to be process more ef f i c i e n t l y in many cases. RARES because of i t s orthogonal layout can preserve this sort order and process against i t more eff i c i e n t l y than CASSM. In fact the only way for CASSM to preserve a sort order i s to search one track, at a time. Using this technique the output rate for CASSM would be much slower than RARES. 98 7.h RAP : A RELATIONAL ASSOCIATIVE PROCESSOR This particular system w i l l be described i n more detail than CASSM or RARES. A vi r t u a l memory system for RAP which allows large relational data bases w i l l be examined. Then some performance evaluation statistics comparing RAP to a conventional relational DBMS w i l l be shown. Certain features are essential to data base management and RAP has been implemented with these features in mind: " (a) A la^ge capacity and modular storage with low cost per b i t . (b) A b i l i t y to directly map logical data structures into physical data structures without using auxiliary storage structures. (c) Variable length data formats. (d) Fast retrieval and update suitable for on-line concurrent environment. (e) Context . . . search operations assisted by t o t a l associativity. (f) Simple in-place arithmetic computations and update. " 99 Consider the following overview diagram of RAP: tell 1 purpose. Cenfro/kr T set -fanei. leu Unit # « * 0 ce// RAP is a special purpose processor that communicates with a general purpose computer. Each c e l l within RAP, as i n the other architectures, contains a memory component and a logic component. The memory component i s a track of a disk and the logic component "is a micro processor which acts as a 'search machine1 on data, directs data manipulation, and performs limited numeric computations required by /~35 y data base processing."' -' The set function unit provides logic to combine results of a search, for example COUNT, SUM, MAXIMUM, MINIMUM, AVERAGE?, The controller co-ordinates c e l l searches and initiates the set function unit i f required. 100 As i n the other associative systems, each c e l l is searched simultaneously. Thus response time is largely independent of data base size. A typical query consists of one or more RAP instructions and each instruction i s executed with one rotation / "~36 7 ofrthe,RAP memory. A c e l l is organized as follows: L~ ceU i-l tcniro /ler £>&t — Unit ceil L-t-/ 101 Data i s stored i n the memory area and is read and written (R and W) via fixed heads. Within one revolution the entire contents of memory can be read. A buffer with a length of 1024 bits i s used to provide a time delay between reading and writing information to memory. This time delay allows the ISMU, information search and manipulation unit, to perform the necessary logic on the data. The ISMU is also responsible for i n t e r - c e l l communication, command decoding, i/O data transfers and ALU control. RAP instructions are provided for retrieval, update, insertion, deletion, data base create and destroy, control (eg. branching) and set functions such as SUM̂  COUNT, etc. These are implemented through the controller. Each RAP instruction i s executed in parallel by each c e l l within the processor. Thus a data base operation is executed against every piece of data i n the data base i n one revolution of the disk. A general purpose computer interfacing with RAP must provide data communication f a c i l i t i e s for users, compilation of user queries into basic RAP instructions, transfer of instructions to the RAP controller, support of a concurrent processing environment, and maintenance of data base structure information. "RAP accomplishes relational data base management without complex data structures and software aids such as inverted l i s t s and hashing for multi-key searching required i n conventional systems. This i s especially important for applications which have extensive /~37 7 update activity. " L" u 102 To understand more about RAP .it is useful to see how- relations are stored within memory. ~̂ TM relation domain tuple 1 tuple 2 • • • TKE name - names Jn contrast with RARES, RAP stores tuples i n a linear fashion along a track rather than across several tracks. TM marks the beginning of the track, following this are stored the relation name, the domain names and the tuples of that relation. One tuple, TKE, delimits the end of a relation. Relation and domain names are repeated once on.each track of a relation. Also, no c e l l can have tuples from more than one relation although a relation could easily spread over more than one track. Each tuple has a delete flag, and four mark bits A, B, C and D. A tuple with a delete flag on i s liable for garbage collection. The four mark bits are used to further qualify sets of tuples and to allow the results of one instruction to be used by another instruction. For example, i f a query i s given with multiple qualification phrases then the f i r s t qualification is implemented with one RAP instruction and appropriate mark bits are set, the second qualification tests for those mark bits as well as the second boolean condition of the query. 103 RAP with VIRTUAL MEMORY One of the problems with current associative memories i s their size, for example, a RAP processor that can be r e a l i s t i c a l l y implemented given current LSI technology would contain 10 to 10 bits of memory. Large data bases require much more than this. To solve this problem a vir t u a l memory system was designed for RAP. The overall architecture of a vi r t u a l memory for RAP i s shown below C^J. u s e 1*5 104 Bulk memory contains the data base and is divided into fixed size pages equal to the capacity of one c e l l . Each page contains information from one relation only so that when a page i s transferred to a c e l l , the c e l l then contains data from one relation only, as mentioned in the previous section. Data i s transferred from hulk memory to RAP under control of the general purpose computer. The basic concept is that enough data to answer a query is paged from bulk memory into RAP ce l l s . However, in order to make this as efficient as possible paging is overlapped with query processing, and to this end each c e l l contains two memories, JVLEL, buffer memory for c e l l ^ , that is being loaded with information and MP , processor memory for c e l l , which i s being used to execute the currently active query. When a query is complete buffer memory becomes processor memory and the next query can be executed. A controller, C, functions as previously explained, receiving programs from GPC and transferring query results to GPC. CIO, the 10 controller is connected to GPC via a separate channel. Overlapping of paging and query processing can be achieved relatively easily. Each page holds information for one relation only and each page has a unique identification. When a query is specified the relatidntmust also be specified thus i t i s a simple matter to know, based on a relation name, what pages are to be sent to RAP for processing. Note that there is an assumption that RAP 105 can contain a l l information required to process any given query although i t i s not able to contain the entire data base. A v i r t u a l memory system i s significant to the f e a s i b i l i t y of a design such as RAP especially when paging can be overlapped with query processing to improve efficiency. Results of a simulation study done for this v i r t u a l memory system "̂̂ -7 showed that response time for an average query is directly proportional to the size of relations. Considering that a l l pages of a relation must be i n RAP memory before a query is executed this result i s not surprising. Locality of information has some bearing on response time since i t i s assumed that the higher the l o c a l i t y the more chance there i s that relevant relations w i l l already be i n RAP memory. If the data base i s very large however, the effect of lo c a l i t y w i l l diminish. 106 7.5 RAP PERFORMANCE A study has been done to compare the performance of RAP with a conventional relational data base management system. Models developed for the conventional system assumed that inverted l i s t s were provided for selected attributes of each relation. (The selection would be made based on frequency of access via any particular attribute.) The models developed quantified the times required to perform basic operations such as simple retrieval, update, .qualified retrieval and implicit join. On retrieval "RAP's advantage grows in almost direct proportion to the number of records satisfying the qualification ^̂ *41 T when inverted l i s t s are used i n the conventional system." ~̂ The advantage ranges from 40$ to 90$ depending on the level of qualification. For select and update operations RAP also has an advantage. The operation modelled i n this case involves selecting a set of tuples based on some qualification and then updating the selected tuples. Use of inverted l i s t s in the conventional system causes problems because when an attribute that i s modified (on update) is an inverted attribute the appropriate inverted l i s t must also be modified. In contrast, RAP can perform selection and update (remember the read write heads) within one revolution of RAP memory i n most cases. Selection followed by deletion involves the same problem with inverted l i s t s i n a conventional system (remember links and images in System R). 107 Another operation studied consists of a select, computation of some value based on the tuples selected and then retrieval of other tuples based on that value. Because of the need to access data i n the conventional system through inverted l i s t s RAP has the advantage i n this operation as well. In fact, the advantage i s particularly significant when the number of tuples involved in the computation is large. As associative processor such as RAP has a significant performance advantage over conventional hardware i n supporting data base management. The characteristics of associative processors that lead to such an advantage are: 1. an a b i l i t y to search data fields i n parallel. 2. search logic imbedded i n the storage device which allows data comparison to take place without passing data back to the central processor. 3. retrieval based on content which means that complicated pointer schemes and index data bases and their maintenance are not required. k. reduced interface problems because the physical storage of data more closely resembles the users view of data. 10&V> 8.0 CONCLUSIONS Data bases are becoming increasing important to today's business environment. As companies become aware that data i s a resource to be developed, shared and ef f i c i e n t l y managed, data bases are increasing i n size and access to them i s more varied. It i s essential to find efficient ways of accessing data both i n terms of people time (eg. how long does i t take a person to get access to required information) and i n terms of machine time. This paper has described a hierarchical data base management system, IMS and a relational data base management system, System R. The focus has been on showing that IMS, because of i t s structured approach to modelling data, i s more efficient than System R as far as machine time i s concerned. System R, on the other hand, with i t s high level data management language SQL is more efficient in terms of people time. In discussing associative processors i t i s shown that implementing a relational data base management system using - associative processors allows both efficient machine time and people time. 109 REFERENCES 1. McGee, W.C., "Generalization: Key to Successful - ..Electronic Data Processing", JACM 6, January 1959? pp. 1 - 23 2 . Senko, M.E. et a l , "Data Structures and Accessing Database Systems", IBM Systems Journal, Vol. 1 2 , No. 1, ,1,973, page 2 1 7 3. "Feature Analysis of Generalized Data Base Management Systems", CODASYL Systems Committee, May 1971 ; . , , 4. Date, C.J., "An Introduction to Database Systems", Addison-Wesley Publishing Co., 1975, page 21 5. Ibid, page 180 6. Ibid, page 183 7. McGee, W.C., "The Information Management System .IMS/VS, Part I and Part II", IBM Systems Journal, Vol. 16, No. 2,.1977? page 109 8. Ibid, page 110 9. "IMS/VS Version I System/Application Design Guide", IBM Form No. SH20 - 9025-4, page 4 - 40 10. Ibid, page 4.51 11 Codd, E.F., "A Relational Model of Data for Large Shared Data Banks", CACM Vol.... 13, No. 6, June 1970, page 377 110 Ibid, page 379 Date, C.J., "An Introduction to Database.-.Systems", .. Addison-Wesley Publishing Co., 1975 Ibid, page 5k Codd, E.F., "Relational Completeness of Data Base . Sublanguages", Data Base Systems, Courant Computer Science Symposia Series, Vol. 6, Prentice-Hall, 1972 Denny, G.H., "Ah Introduction to SQL, A Structured Query Language", IBM Research Laboratory, San Jose, California, RA93, 1977 Date, C.J., "An Introduction to Database Systems", J Addison-Wesley Publishing Co., 1975 Blasgen, M.W. et a l , "System R : A Relational Approach to Data.,Base Management, Part 3 : The Relational Storage System", IBM Technical Report, 1977 Astrahan, M.M. et a l , "System R : Relational Approach to Database Management", ACM TODS, Vol. 1, No. .2, June 1976, page 110 Gotlieb, L.R., "Computing Joins of Relations", ACM SIGMOD Conference, San Jose, California, May 1975 Astrahan, M.M. et a l , "System R : Relational Approach to Database ...Management", ACM TODS, Vol. 1, No. 2, June 1976, page 100, Lochovsky, F.H. and Tsichr i t z i s , D.C, "User Performance . Considerations i n DBMS Selection", ACM SIGMOD International Conference on Management of Data, Toronto 1977 111 Michaels, A.S. et a l , "A Comparison of the Relational and CODASYL Approaches to Data-Base Management", Computing Surveys, Vol. 8 , No. 1, March 1976, page 146 Proceedings of the Third International Conference on Very Large Data.Bases, Tokyo, Japan, 1 9 7 7 ? page 1 9 5 Ibid, page 196 Ibid, page 1 9 7 Ibid, page 1 9 8 Thurber, K.J. and'Wald, L.D.^ "Associative and Parallel Processors"', Computing Surveys, Vol. 7, No. k, December 1 9 7 5 ? page 2 1 5 Flynn, M.J., "Some Computer Organizations and their .Effectiveness", IEEE Trans. Computers C - 2 1 , No. 9 , September 1 9 7 2 Yau, S.S. and Fung, H.S., "Associative Processor Architecture.-- A Survey", Computing Surveys, Vol. 9 , No. 1 , March 1 9 7 7 ? page 3 Copeland G.P. et a l , "The Architecture of CASSM : A Cellular System for Non-Numeric Processing", Proc. of the First Annual Symposium on Computer Architecture, 1 9 7 3 , page 1 2 1 Ibid, page 123 Lin, C.S. et a l , "The Design of a Rotating Associative Memory for Relational Database Applications", ACM TODS, Vol. 1, No. 1, March 1976, page.54 112 Ozkarahan, E.A. et a l , "A Data Base Processor", Computer Systems Research Group, University of Toronto, Technical Report CSRG - 43, page 8 Ibid, page 10 Ibid, page 14 Schuster, S.A. et a l , "A Virtual Memory System for a Relational Associative Processor", Computer Systems Research Group, University of Toronto Technical Report CSRG - 64, page 6 Ozkarahan, E.A. et a l , "A Data Base Processor", Computer Systems Research Group, University of Toronto, Technical Report CSRG - 43, page 28 Schuster, S.A. et a l , "A Virtual Memory System for a Relational Associative Processor", Computer Systems Research Group, University of Toronto Technical Report CSRG - 64, page 12 Ibid, general reference Ozkarahan, E.A. and Schuster, S.A., "A High-Level Machine Oriented Language for a Data Base Machine", Computer Systems Research Group, University of Toronto, Technical Report CSRG - 65, page 20 113 BIBLIOGRAPHY 1 . Aron, J.D., "Information Systems in Perspective", .Computing Surveys, Vol. 1, No. 4, . December 1 9 6 9 Astrahan, M.M. et a l , "System R: Relational Approach to Database Management"', ACM TODS, Vol. 1, No. 2 , June 1 9 7 6 Bachman, C.W., "The Evolution of Storage Structures", CACM, Vol. 1 5 , -No. 1, July 1972 k. Bachman, C.W., "The Programmer as Navigator", CACM, Vol. 16, No. 11, November 1 9 7 3 5 . Blasgen, M.W. et a l , "System R : A Relational Approach to Database Management, Part 3 : The Relational Storage System", IBM Research Laboratory, San Jose, California 6. Blasgen, M.W. and Eswaran, K.P., "Storage and Access in Relational.Data Bases", IBM Systems Journal, Vol. 16, No. k,_ 1 9 7 7 7 . Chamberlin, D.D. et a l , "SEQUEL 2 : A Unified Approach to data definition, Manipulation and Control", IBM Journal of Research and Development, Vol. 2 0 , No. 6 , , 1 9 7 7 8 . Chamberlin, D.D., "Relational Data Base Management . Systems", Computing Surveys, Vol. 8 , No. 1 , March 1 9 7 6 9 . Codd, E.F. "A Relational Model of Data for Large Shared Data Banks", CACM, Vol. 13, No. 6 , June 1 9 7 0 114 1 0 . Codd, E.F., "Relational Completeness of Data Base Sublanguages"j Data Base Systems, Courant Computer Science Sumposia Series, Vol. 6 , Prentice-Hall, 1972 1 1 . Copeland, G.P. et a l , "The Architecture of CASSM : A Cellular System for Non-numeric Processing", Proceedings of the f i r s t annual Sumposium on Computer Archietecture, 1 9 7 3 1 2 . Date, C.J., "An Introduction to Database Systems", Addison-Wesley Publishing Co., 1975 13. Denny, G.H., "An Introduction to SQL, A Structured Query Language", IBM Research Laboratory, San Jose, California, R A 9 3 , 1977 14. Dodd, G.C., "Elements of Data Management Systems", Computing Surveys, Vol. 1, No. 2 , June 1 9 6 9 1 5 . Flynn, M.J., "Some Computer Organization and their Effectiveness", IEEE Trans, Computers, C - 2 1 , No. 9 , .September 1 9 7 2 16. Fry, J.P. and Sibley, E.H., "Evolution of Data Base Management Systems", Computing Surveys, Vol. 8, No. 1 , March 1 9 7 6 1 7 . Gotlieb, L.R., "Computing Joins of Relations", ACM SIGMOD Conference, San Jose, California, May 1 9 7 5 18. Hollander, G.L., "Architecture for Large Computer Systems", .Proc. AFIPS, SJCC, 1 9 6 7 19. Lin, C.S. et a l , "The design of a Rotating Associative Memory for Relational Database Applications", ACM TODS, Vol. 1 , No. 1 , March 1 9 7 6 115 Lochovsky, F.H. and Tsichritzis, D..C-., "User Performance Considerations in DBMS Selection", ACM SIGMOD International Conference on Management of Data, Toronto, 1977 Lorie, R.A. and Wade, B.'W., "The Compilation of a Very High Level Data ..Language", IBM Research Laboratory, San Jose, California, R J 2 0 0 8 , 1977 Martin, J.A., "Computer Data Base Organization", Prentice- Hall Inc., New Jersey, 1975 McGee, W.C., "Generalization : Key to Successful Electronic Data Processing", JACM, Vol. 6 ; January 1959 - McGee, W.C. "The Information Management System IMS/VS". IBM Systems Journal, Vol. 16, No. 2 , 1.977 Michaels, A.S. et a l , "A Comparison of Relational and CODASYL Approaches to Data Base Management", Computing Surveys, Vol, 8 , No. 1 , March 1 9 7 6 Minker, J., "An Overview of Associative or Context- Addressable Memory Systems and a KWIC Index to the Literature : 1 9 5 6 - 1 9 7 0 " , Computing Reviews, October 1 9 7 1 Mowshowitz, A., "The Conquest of Will : Information Processing i n Human Affairs", Addison-Wesley Publishing Co., 1 9 7 6 Ozkarahan, E.A. and Sereik, K.C., "Analysis of Archetectural Features for Enhancing the Performance of a Database Machine", ACM TODS, Vol. 2 , No. k, Decefber 1977 Ozkarahan, E.A. et a l , "A Data Base Processor", University of Toronto, Technical Report, GSRG-43 116 Ozkarahan, E.A. and Schuster, S.A., "A High Level Machine - Oriented Assembler Language for ..a Data Base Machine", University of Toronto, Technical Report,. CSRG-74 Ozkarahan, E.A. et a l , "Performance Evaluation of a Relational.Associative Processor", University of Toronto, Technical Report, CSRG-65 Rosen, S., "Electronic Computers : A Historical Survey", Computing Surveys, Vol. 1, No. 1, March 1969 Schuster, S.A. et a l , "A Virtual Memory System for a Relational Associative Processor", University of Toronto, Technical Report, CSRG-64 Senko, M.E. et a l , "Data Structures and Accessing in Database Systems", IBM Systems Journal, Vol. 12, No. 1, 1973 Senko, M.E., "Data Structures and Data Accessing in Database Systems, Past, Present, Future", IBM Systems Journal, Vol. 16, No. 3,. 1977 Stonebraker, M. and Held, G., "Networks, Hierarchies and Relations i n Data Base Management Systems", Proceedings ACM Pacific Conference, San Francisco Ap r i l 1975 Thurber, K.J. and Wald, L.D., "Associative and Parallel . .. Processors",..Computing Surveys, Vol. 7> Wo. k, December 1975 Yau, S.S. and Fung, H.S., "Associative Processor Architecture A Survey", Computing Surveys, Vol. 9> Wo. 1, March 1977 "Feature Analysis of Generalized Data Base Management Systems", CODASYL Systems Committee, May 1971 117 40. "IMS/VS Version 1, System/Application Design Guides, IBM Form Wo. SH20-9025-4. 41. Third International Conference on Very Large Data Bases, Tokyo, Japan, 1977


Citation Scheme:


Usage Statistics

Country Views Downloads
United States 1 0
City Views Downloads
Unknown 1 0

{[{ mDataHeader[type] }]} {[{ month[type] }]} {[{ tData[type] }]}


Share to:


Related Items