UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

A lexical scanner generator for a modular compiler generation system Venema, Tjeerd 1975

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
831-UBC_1976_A6_7 V45.pdf [ 4.06MB ]
Metadata
JSON: 831-1.0051776.json
JSON-LD: 831-1.0051776-ld.json
RDF/XML (Pretty): 831-1.0051776-rdf.xml
RDF/JSON: 831-1.0051776-rdf.json
Turtle: 831-1.0051776-turtle.txt
N-Triples: 831-1.0051776-rdf-ntriples.txt
Original Record: 831-1.0051776-source.json
Full Text
831-1.0051776-fulltext.txt
Citation
831-1.0051776.ris

Full Text

A LEXICAL SCANNER GENERATOR FOR A MODULAR COMPILER GENERATION SYSTEM by TJEERD VENEMA B . S c , University of B r i t i s h Columbia, 1974. A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE i n the Department of Computer Science Be accept t h i s thesis as conforming to the required standard The University of B r i t i s h Columbia DECEMBER, 1975 (c) Tjeerd Venema, 1975 In p r e s e n t i n g t h i s t h e s i s i n p a r t i a l f u l f i l m e n t o f t h e r e q u i r e m e n t s f o r an a d v a n c e d d e g r e e a t t h e U n i v e r s i t y o f B r i t i s h C o l u m b i a , I a g r e e t h a t t h e L i b r a r y s h a l l make i t f r e e l y a v a i l a b l e f o r r e f e r e n c e a n d s t u d y . I f u r t h e r a g r e e t h a t p e r m i s s i o n f o r e x t e n s i v e c o p y i n g o f t h i s t h e s i s f o r s c h o l a r l y p u r p o s e s may be g r a n t e d by t h e H e a d o f my D e p a r t m e n t o r by h i s r e p r e s e n t a t i v e s . I t i s u n d e r s t o o d t h a t c o p y i n g o r p u b l i c a t i o n o f t h i s t h e s i s f o r f i n a n c i a l g a i n s h a l l n o t be a l l o w e d w i t h o u t my w r i t t e n p e r m i s s i o n . ' D e p a r t m e n t o f C.or>n'f> ^ C / f i c e The U n i v e r s i t y o f B r i t i s h C o l u m b i a 2075 W e s b r o o k P l a c e V a n c o u v e r , C a n a d a V6T 1W5 D a t e - A> / 7 5 " i ABSTRACT Huch work has been done i n the many asp e c t s of c o m p i l e r g e n e r a t i o n . He examine the problems a s s o c i a t e d with the ge n e r a t i o n of a f u l l compiler and present a method of modular c o n s t r u c t i o n which would s o l v e many of the problems which occur i n p r e v i o u s g e n e r a t i o n systems. As an example of t h i s modular c o n s t r u c t i o n , a l e x i c a l scanner generator i s designed t o produce l e x i c a l scanners which are e a s i l y i n t e r f a c e a b l e with the other components of a co m p i l e r . i i Chapter 1 1 1.1 The World of Compiler Generators ...................... 1 1.2 The Block Compiler Generator 2 1.3 The Buiding Block Compiler ............................ 8 Chapter 2 ................................................... 13 2.1 Design of a Lexeme E x t r a c t i o n Program 13 2.2 The L e x i c a l Machine 15 2.3 The Design of a L e x i c a l Scanner 26 Chapter 3 36 3.1 Designing the L e x i c a l Scanner Input ..................36 3.2 The Input Language 39 3.3 Implemntation of the L e x i c a l Scanner Generator 48 3.3.1 The D e s c r i p t i o n Language 49 3.3.2 The I n i t i a l Machine (NDFSMHET) 53 3.2.2 The Empty T r a n s i t i o n Graph ...................... 56 3.2.3 The T r a n s i t i v e C l o s u r e of the ETG ...............,56 3.2.4 Computation of the NDFSM 57 3.2.5 Computation of the DFSM ......................... 58 3.2.6 Checking the c o r r e c t n e s s of the DFSM 66 3.2.7 Punching the Machine ............................ 69 Chapter 4 70 4.1 C o n c l u s i o n 70 B i b l i o g r a p h y ................................................ 72 Appendix A - An Input D e s c r i p t i o n f o r ALGOL-W ............... 73 Appendix B - Using the L e x i c a l Scanner Generator ............ 76 i i i ACKNOWLEDGMENTS I would l i k e t o thank a l l those who, over a cup of c o f f e e , sere w i l l i n g to argue with me about many l i t t l e t h i n g s , some of which grew i n t o b i g g e r t h i n g s and found t h e i r way i n t o t h i s t h e s i s . I would e s p e c i a l l y l i k e to thank my s u p e r v i s o r s , Harvey Abramson and John Peck, without whose h e l p and guidance t h i s t h e s i s would not have been p o s s i b l e . l e x i c a l Scanner Generator 1 Chapter J 1.1 The World of Compiler Generators With the i n c r e a s e i n the use of computers f o r both b u s i n e s s and r e s e a r c h a p p l i c a t i o n s , the problems i n v o l v e d i n the e f f i c i e n t c o n s t r u c t i o n of a compiler f o r a given source computer language has become of i n t e r e s t to many areas of the computer i n d u s t r y . The d e d i c a t e d mini-computer system i s an example of t h i s i n that i n t e r a c t i o n with a d e d i c a t e d system i s b e s t accomplished v i a a language which has been designed to provide a u s e f u l i n t e r f a c e between the user and the system. T h i s " u s e f u l i n t e r f a c e " becomes i n c r e a s i n g l y e s s e n t i a l as the degree of the user's understanding of the computer system decreases, something which i s becoming more of a s i g n i f i c a n t f a c t o r i n today's world. For t h i s reason, s p e c i a l purpose languages f o r such s p e c i f i c t a s k s as data base management are beginning t o appear with i n c r e a s i n g frequency. In order to handle the i n c r e a s i n g demand f o r new compilers and t r a n s l a t o r s , the compiler w r i t e r r e q u i r e s i n c r e a s i n g l y s o p h i s t i c a t e d t o o l s t o a s s i s t i n the task of generating a compiler or t r a n s l a t o r f o r a new or modified computer language, The i d e a l t o o l would be a method f o r the automatic production of a c o m p i l e r from a design s p e c i f i c a t i o n of the language f o r which the compiler i s d e s i r e d . A computer program which r e a l i z e s t h i s i s the cojnjailer a e n e i a t o r , a program which takes as i n p u t a language s p e c i f i c a t i o n and produces as output a c o m p i l e r f o r the language. L e x i c a l Scanner Generator 2 1.2 The Block Compiler Generator The e a r l i e s t approach to compiler g e n e r a t i o n programs was t o w r i t e the compiler g e n e r a t i o n system as a s i n g l e program or " b l o c k " . In t h i s type of system, the language d e s i g n e r s p e c i f i e s c e r t a i n p r o p e r t i e s of the language f o r which the compiler i s d e s i r e d ; p r o p e r t i e s which the compiler g e n e r a t i o n system then uses to produce a compiler. The word " b l o c k " comes from the f a c t t h a t the only user i n t e r a c t i o n with the system i s through the s p e c i f i c a t i o n of the i n i t i a l i n p u t ; beyond t h i s , the user has no c o n t r o l over the g e n e r a t i o n process. Since a compiler i s i t s e l f a complex program, the i n p u t s p e c i f i c a t i o n s must be very long i f they are going t o d e s c r i b e the c o m p i l e r completely. The problem with t h i s i s t h a t long i n p u t s p e c i f i c a t i o n s are both an i n v i t a t i o n f o r user e r r o r s and r e s u l t i n l a r g e and unmodifiable compiler g e n e r a t o r s . A l l " b l o c k " compiler generators developed to date a v o i d t h i s problem by t a k i n g the q u e s t i o n a b l e approach of only a l l o w i n g c e r t a i n a spects of the f i n a l compiler t o be s p e c i f i e d . The Compiler-compiler developed by Jerome A. Feldman* l> i n 1966 i s an example of t h i s type of compiler g e n e r a t i o n system. T h i s system i s best viewed v i a the diagram i n F i g u r e 1. The Compiler-compiler f i r s t takes the formal syntax of a source language L, expressed i n a s y n t a c t i c meta-language (such as BHF) and feeds i t i n t o the syntax l o a d e r . T h i s program b u i l d s the t a b l e s T1 which w i l l c o n t r o l the p a r s i n g of programs i n the source language Then the semantics o f L, w r i t t e n i n a semantic metalanguage, i s fed i n t o the semantic l o a d e r . T h i s l e x i c a l Scanner Generator 3 T T > SYNTAX LOADER I H I I T | i in i I ! I r—' I I BASIC I I > SEMAKTIC LOADEfi I h-1 HT| I 12) I I I I I—1 i I COMPILES Fi g u r e 1 program b u i l d s another t a b l e T2, t h i s one c o n t a i n i n g a d e s c r i p t i o n of the meaning of statements i n L. When • p r o c e s s i n g i s completed, e v e r y t h i n g to the l e f t of the double l i n e i s d i s c a r d e d , l e a v i n g a b a s i c compiler f o r L. At f i r s t g l a n c e , t h i s system appears t o s o l v e the problem of compiler g e n e r a t i o n g u i t e n i c e l y . On c l o s e r o b s e r v a t i o n , however, a few problems begin to show. The Compiler-compiler does not allow the user t o supply a l e x i c a l s p e c i f i c a t i o n of L (a d e s c r i p t i o n of what sequences of i n p u t c h a r a c t e r s a r e to be r e c o g n i z e d as a s i n g l e u n i t ) , i t imposes upon the user the p r e - d e f i n e d l e x i c a l s p e c i f i c a t i o n the Basic Compiler i s capable of h a n d l i n g ; one which has been i n c l u d e d with the B a s i c Compiler by the implementor of the Compiler-compiler. A l s o , i f the user d i s c o v e r s that a newly developed type of parser i s more s u i t a b l e t o h i s needs, there i s no method whereby the parser produced by the Compiler-compiler can be r e p l a c e d by the n e s l y developed L e x i c a l Scanner Generator 4 p a r s e r . A t h i r d problem i s that the Basic Compiler i s not extendable, a B a s i c Compiler f o r a language c o n s i s t s of a p a r s i n g s e c t i o n and a semantic i n t e r p r e t a t i o n s e c t i o n ; but not a l l languages can be handled by a two phase compiler of t h i s s o r t . As an example of t h i s , c o n s i d e r the language ALG0L-68 < 2>. The ALGOL-68 language d e f i n i t i o n s p e c i f i e s t h a t the open (*(*) and c l o s e (')') symbols can be used i n many ways. They can be used to open and c l o s e c l o s e d - c l a u s e s ( r e p l a c e the symbols * BEGIN' and •END*), to open and c l o s e a c o n d i t i o n a l - c l a u s e (replace the symbols •IF* and •FI ,)» or t o open and c l o s e a d i s p l a y ; these being j u s t some of the ways the open and c l o s e symbols can be used. The open and c l o s e symbols can be used i n so many ways t h a t i n some cases i t i s impossible to determine the meaning of an open symbol u n t i l the corresponding c l o s e symbol has been encountered. The problem i s t h a t a c l o s e symbol can be found a r b i t r a r i l y f a r away from i t s corresponding open symbol. Thus, i n order to determine the semantics of an open symbol, i t may be necessary to look a r b i t r a r i l y f a r ahead i n the input stream, something which no c o n v e n t i o n a l p a r s i n g system can do. In order to s o l v e t h i s probem, i t i s necessary e i t h e r t o have a s o p h i s t i c a t e d , non-conventional parser or t o i n s e r t a s p e c i a l phase i n t o the compiler between i n p u t and p a r s i n g which r e s o l v e s any ambiguity surrounding the open and c l o s e symbols. C l e a r l y , with the Compiler-compiler n e i t h e r a l t e r n a t i v e i s v i a b l e . The f i r s t a l t e r n a t i v e i s not v i a b l e i n t h a t the B a s i c Compiler uses l e x i c a l Scanner Generator 5 only one type of p a r s e r , the one the system s u p p l i e s ; a p a r s e r which i n order to achieve g e n e r a l i t y must have a c o n v e n t i o n a l s t r u c t u r e . The second a l t e r n a t i v e i s a l s o not v i a b l e as the B a s i c Compiler i s a s i n g l e u n i t ; t h e r e i s no allowance made i n 6 the Compiler-compiler f o r the user to s p e c i f y t h a t a c t i o n s other than p a r s i n g or semantic i n t e r p r e t a t i o n are to occur as an i n t e r m e d i a t e phase. The Compiler-compiler i s capable of compiler g e n e r a t i o n , but o n l y i f the language designer i s w i l l i n g t o accept the r e s t r i c t i o n s the Compiler-compiler p l a c e s on him as to the type and s t r u c t u r e of the f i n a l c o mpiler. Another much more recent system i s the J o s s l e 4 3 > system developed at the U n i v e r s i t y of C a l i f o r n i a a t Santa Barbara i n 1973-74 which i s diagrammed i n F i g u r e 2. The syntax of the source language L, expressed i n BNF, i s fed i n t o the grammar a n a l y s i s program which produces t a b l e s to be used by an extended precedence p a r s e r . The semantics of expressed i n the semantics language J o s s l e , i s f e d i n t o the J o s s l e t r a n s l a t o r which produces semantic r o u t i n e s based on the r u l e r e d u c t i o n mechanism of the p a r s e r . A d e s c r i p t i o n of the l e x i c a l s t r u c t u r e of the tokens i s a l s o f e d i n t o the system and a l e x i c a l scanner i s generated as part of the p a r s e r . These two modules are then combined with two other modules, one f o r e r r o r h a n d l i n g and the other f o r code g e n e r a t i o n to form a compiler f o r the source language. The J o s s l e system, being more r e c e n t , has been i n f l u e n c e d by the theory of s t r u c t u r e d programming and as such the s i n g l e block approach has been d i s c a r d e d i n favour of a m u l t i - b l o c k L e x i c a l Scanner Generator 6 I I J SYNTAX OFJ-| L IN BNF | I I >l I GRAMMAR |-ANALYSIS| I L I PARSER H FOB L | I i I USEfi J I I I | ERROR J-| ROUTINES! i 1 i I | CODE f-|GENERATOR| I I J 1 | SEMANTICSH |IN JOSSLEI I I i i ->J JOSSLE J-|TRANSLATOR I I I I •>| SEMANTIC |-J ROUTINES| ! I i i I I i I LINKER | I ! I I I SOURCE IN L I I I -> J TRANSLATOJ-| FOR L | I I OBJECT PROGRAM Fig u r e 2 L e x i c a l Scanner Generator 7 system. The J o s s l e system r e c o g n i z e s the f a c t t h at a compiler i s not a s i n g l e u n i t and thus an e a s i e r say to generate a compiler i s t o handle each module s e p a r a t e l y and combine them as a l a s t s t e p . Should the s u p p l i e d e r r o r r o u t i n e s prove l a c k i n g f o r some language, i t i s p o s s i b l e , although r a t h e r d i f f i c u l t , t o modify t h a t part of the compiler generator which generates the e r r o r module without d i s t u r b i n g the i n t e g r i t y of the e n t i r e system. T h i s i s s t i l l a r a t h e r i n f l e x i b l e approach i n t h a t i t i s necessary t o modify the compiler generator p h y s i c a l l y i f a change i s d e s i r e d . In the J o s s l e system, there would be nothing automatic about the generation of the new e r r o r r o u t i n e s ; i t would be the replacement of one s e t of hand coded r o u t i n e s with another. The same o b s e r v a t i o n i s t r u e about the code g e n e r a t i o n r o u t i n e s . although the J o s s l e system has broken the compile r g e n e r a t i o n system down i n t o f o u r modules, the user only has c o n t r o l over two of them; the parser and the semantic r o u t i n e s . Even t h i s c o n t r o l i s l i m i t e d i n t h a t when s u p p l y i n g the l e x i c a l s p e c i f i c a t i o n f o r the l e x i c a l scanner which i s generated as pa r t of the parser module, i t i s only p o s s i b l e to d e f i n e the appearance of items which the system has determined ahead of time i t w i l l r e c o g n i z e . I t i s p o s s i b l e t o d e f i n e the appearance of a comment, a s t r i n g , an i d e n t i f i e r , a r e a l number, or an i n t e g e r but i f the source language c o n t a i n s two types of comments, more than one kind of s t r i n g ( i . e . one enc l o s e d i n apostrophes, the other i n quotes) or allows i d e n t i f i e r s with embedded blanks, then i t i s not p o s s i b l e to use the J e s s i e L e x i c a l Scanner Generator 8 system to generate a compiler f o r t h i s language. The J o s s l e system has advanced from p r e v i o u s systems i n t h a t i t r e c o g n i z e s the f a c t t h a t c o m p i l e r s c o n s i s t of s e p a r a t e modules, but i t f a i l s to r e c o g n i z e t h a t how many modules or which modules should form the generated compiler i s a d e c i s i o n which only the user can make. The modularity of a compiler should g i v e a g r e a t e r f l e x i b i l i t y t o the language d e s i g n e r , but he i s only given a very r e s t r i c t e d form of c o n t r o l over the s p e c i f i c a t i o n s of the compiler which i s produced. 1.3 The B u i l d i n g Block Compiler What the user r e a l l y wants from a compiler g e n e r a t i o n system i s t o t a l c o n t r o l . I f a compiler g e n e r a t i o n system which c o n s i s t s of one l a r g e mass of programs t a k i n g two, t h r e e , or f o u r i n p u t s chews v i c i o u s l y f o r a while and s p i t s out a compiler about which the user knows very l i t t l e due to changes and r e s t r i c t i o n s which have been p l a c e d on the compiler as the g e n e r a t i o n procedure progresses, then what i s achieved i s c e r t a i n l y not the compiler which was d e s i r e d . In c o n s t r u c t i n g a " b l o c k " system, the designer must make many d e c i s i o n s as he c o n s t r u c t s the system, d e c i s i o n s which although c o r r e c t f o r a l a r g e c l a s s of source languages can h a r d l y be expected to be a c c e p t a b l e f o r a l l languages. D e c i s i o n s such as the type of p a r s e r , type of e r r o r c o r r e c t i o n or method of semantic a n a l y s i s are b e s t l e f t up t o the user of the compiler generator. In order to allow the user to make these d e c i s i o n s , i t i s necessary to break the l a r g e block of the t r a d i t i o n a l c o m p i l e r l e x i c a l Scanner Generator 9 g e n e r a t i o n system down i n t o many compiler component genera t i o n systems. The components produced by each of these component generators should conform to the components which are a r r i v e d a t by breaking down th e design o f a compiler i n t o l o g i c a l components. For i n s t a n c e , a simple system which c o n v e r t s the stream i n p u t of a source language with r e s e r v e d words i n t o a t r e e s t r u c t u r e d r e p r e s e n t a t i o n could be d e s c r i b e d as i n F i g u r e 3. I LEXEME | SCANNER I 1 i j BES10BD | | LOOKUP | i i I f DRIVER | I — i | I I 1 u i ] PARSER | 1 ! I | TREE | 1 BUILDER | i i I I I ERROR } | ROUTINES| i i F i g u r e 3 In t h i s system, the d r i v e r program invokes the lexeme scanner which r e t u r n s the next lexeme i n the i n p u t stream. The d r i v e r may or may not choose - t o pass t h i s lexeme on to the r e s e r v e d word lookup f o r attempted i d e n t i f i c a t i o n as a r e s e r v e d word i n the source language. T h i s s t e p completed, the lexeme has been converted i n t o a token, a t e r m i n a l symbol f o r the L e x i c a l Scanner Generator 10 grammar on which the parser i s based and as such i s passed on t o the p a r s e r f o r s y n t a c t i c a n a l y s i s . The p a r s e r c o u l d r e p o r t t h a t the token i s i n e r r o r , i n which case t h e d r i v e r would invoke the e r r o r h a n d l i n g r o u t i n e s to attempt some form of e r r o r c o r r e c t i o n . On the other hand, the p a r s e r might r e t u r n an i n d i c a t i o n t h a t a r e d u c t i o n a c c o r d i n g t o some r u l e of the source language grammar has o c c u r r e d , i n which case the d r i v e r would pass t h i s i n f o r m a t i o n on to the t r e e b u i l d i n g r o u t i n e which would use t h i s i n f o r m a t i o n i n the c o n s t r u c t i o n of the t r e e r e p r e s e n t a t i o n of the i n p u t . With t h i s b a s i c s t r u c t u r e , i f the source language does not have reserved words, then the r e s e r v e d word i d e n t i f i c a t i o n module can be removed with minimal m o d i f i c a t i o n t o the d r i v e r program and no m o d i f i c a t i o n to any other module. Returning t o the language &LGQL-68, both of the p r e v i o u s l y proposed s o l u t i o n s t o the open and c l o s e symbol ambiguity c o u l d be a p p l i e d : e i t h e r the parser c o u l d be r e p l a c e d with a more s o p h i s t i c a t e d v e r s i o n or an e x t r a component c o u l d be i n s e r t e d between reserved word i d e n t i f i c a t i o n and the p a r s e r g i v i n g a system of the form s p e c i f i e d i n F i g u r e 4. I t i s obvious t h a t a " b l o c k " compiler g e n e r a t i o n system i s unable to achieve t h i s f l e x i b i l i t y ; a s o l u t i o n to t h i s i n f l e x i b i l i t y i s to have separate component generators f o r each of the l o g i c a l components which make up a compiler and to allow the user to write h i s own d r i v e r program, a simple program which handles the i n t e r a c t i o n between the v a r i o u s generated components. Each of the compiler component generators would L e x i c a l Scanner Generator 11 I I I DRIVES | I I I J LEXEME | I SCANNER j T T T —« I | BESHOBD | I LOOKUP J I I |AMBIGUITY| | SCANNER | j i l l -I i | PASSER | I i ! I I | TREE | | BUILDER I I I 1 ERROR | | ROUTINES J i 1 F i g u r e 4 generate a component which performs a s i n g l e l o g i c a l task and i s e a s i l y i n t e r f a c e a b l e , v i a the d r i v e r , with the r e s t of the system. Much i s known about the theory of one of the above components, the p a r s e r , and many parser g e n e r a t i o n systems e x i s t which, with s l i g h t m o d i f i c a t i o n , would conform to the standards given above. The theory surrounding t h e other components i s not as w e l l developed, both e r r o r a n a l y s i s and the h a n d l i n g of semantics are s t i l l matters o f debate. An example of a compiler component generator which can be used as part of a f u l l compiler g e n e r a t i o n system i s a l e x i c a l scanner generation system. Since some s o r t of l e x i c a l scan i s necessary i n any c o m p i l e r , t h i s component i s r e g u i r e d f o r a l l c o m p i l e r s and as such i t i s important t h a t the component generator be f l e x i b l e enough t o handle the l e x i c a l d e s c r i p t i o n L e x i c a l Scanner Generator 12 of any language. On the other hand, the component generated should not c o n t a i n code s e c t i o n s which are not r e g u i r e d , a problem which i s o f t e n encountered when a gen e r a t i o n system i s made " g e n e r a l " . As an example of the ge n e r a t i o n of a component of a modular compiler generation system, a co n c e n t r a t e d examination w i l l be made of a l e x i c a l scanner generator. L e x i c a l Scanner Generator 13 Chapter 2 2.1 Design of a Lexeme E x t r a c t i o n Program The f i r s t process which must occur i n any compiler i s the i d e n t i f i c a t i o n of symbols from the input stream. T h i s process i s b a s i c a l l y an i n t e r f a c i n g between the e x t e r n a l world and the compiler i t s e l f . Thus a l e x i c a l scanner i s a program which e x t r a c t s from the i n p u t stream a seguence of symbols, combining them i n t o lexemes. I t i s very important t h a t a s e t of symbols which i s to be recognized as a lexeme i s r e c o g n i z e d c o n s i s t e n t l y as t h a t same lexeme. D e f i n i t i o n _1 A lexeme i s a seguence of symbols which are to be c o n s i s t e n t l y r e c o g n i z e d as a s i n g l e u n i t from an input stream I over an alphabet A. A lexeme e x t r a c t i o n program i s a program which breaks an i n p u t stream I over an alphabet A i n t o lexemes. D e f i n i t i o n 3 A seguence of c h a r a c t e r s B over an alphabet A i s a p r e f i x of a seguence of c h a r a c t e r s S i f S=BR where S i s a {possibly empty) sequence of c h a r a c t e r s over A. Lexical Scanner Generator 14 Figure 5 shows some examples of lexemes which could be recognized from some input stream. In the case of the second lexeme 'THISISONE* i t should be noted that the input stream "THIS ISONE :=* could also begin with •THISISONE'. The extraction of a lexeme from the beginning of the input stream may be just a matter of recognizing n consecutive symbols and returning these n symbols, but i t could also require a more complicated approach as i s seen i n the deletion of embedded blanks from within i d e n t i f i e r s . A severe p i t f a l l which occurs i n many lexeme extraction programs i n t h i s regard i s that i f the lexeme extraction program i s simple (for instance, does not allow deletion of blanks within i d e n t i f i e r s ) then there i s a large class of languages for which t h i s type of lexeme extraction program w i l l not s u f f i c e . On the other hand, i f the program i s too complex (for instance, i s responsible f o r the recognition of reserved words) then such a program i s going to contain modules which are not always required. An examination of t h i s shows that i t i s just a minor form of the problem associated with the "block" compiler generator; trying to accomplish too much at one time. Reverting to the l o g i c a l Input Lexeme BEGIN X THIS IS ONE "XY""Z" BEGIN THISISONE XI "Z Figure 5 L e x i c a l Scanner Generator 15 design of a c o m p i l e r , an a n a l y s i s of the two o p e r a t i o n s -r e c o g n i z i n g a lexeme and i d e n t i f y i n g a r e s e r v e d word - y i e l d s the o b s e r v a t i o n t h a t they are l o g i c a l l y d i s t i n c t and are best d e a l t with as separate components of a compiler g e n e r a t i o n system. what i s d e s i r e d then i s a lexeme e x t r a c t i o n program which i s powerful enough to handle the l e x i c a l s p e c i f i c a t i o n s of almost a l l source languages and yet s p e c i f i c enough so t h a t i f the program i s a u t o m a t i c a l l y generated i t i s understandable by the user and does not c o n t a i n f e a t u r e s the user does not want. T h i s can be accomplished by the d e f i n i t i o n of a l e x i c a l machine. 2.2 The L e x i c a l Machine The b a s i c problem i n d e f i n i n g the o p e r a t i o n of a l e x i c a l e x t r a c t i o n program i s a q u e s t i o n of the amount of power which should be given to the program. Lexeme e x t r a c t i o n programs have been w r i t t e n which, on c l o s e examination, turn out to be p a r s e r s whose a c t i o n i s to parse the i n p u t c h a r a c t e r by c h a r a c t e r , thus p u t t i n g the lexeme t o g e t h e r . The problem with t h i s approach i s t h a t i f these p a r s e r s are a u t o m a t i c a l l y generated, they are u s u a l l y designed with lookahead i n mind. a parser generated by such a system may look up to n symbols ahead ( f o r some f i n i t e n ). In order to generate a parser with up to n symbols of lookahead, i t i s necessary to make the parser complex enough to handle such lookahead, the r e s u l t being t h a t the lexeme e x t r a c t i o n i s r a t h e r slow, a l s o , i t i s not d i f f i c u l t to see that i f a compiler uses one parser t o do lexeme e x t r a c t i o n and L e x i c a l Scanner Generator 16 another to do syntax a n a l y s i s , t h e r e i s a c e r t a i n d u p l i c a t i o n of e f f o r t . I f , i n order t o reco g n i z e a lexeme, the lexeme e x t r a c t i o n program needs to look f a r t h e r than one c h a r a c t e r ahead i n the i n p u t stream, then a m o d i f i c a t i o n should be made to the syntax s p e c i f i c a t i o n of the language so t h a t any lookahead of t h i s s o r t i s done by the syntax p a r s e r which, f o r any n o n - t r i v i a l language, i s designed e s p e c i a l l y with t h i s s o r t of lookahead i n mind. The problem with m u l t i - c h a r a c t e r lookahead i n a lexeme e x t r a c t i o n program can be seen by d e f i n i n g a simple (and no n - s e n s i c a l ) language LL as f o l l o w s : <PEOGRAM> :;= IF ( <NUMBEB> .EQ. <N0HBER> ) STOP where the s e t of lexemes i s d e f i n e d as: {IF,{,),.EQ.,STOP,<N0HBEB>} The lexeme <NDfiBEH>, i n c o n t r a s t t o the other lexemes i n the s e t , i s not the seguence o f c h a r a c t e r s •<NDMBEB>* but i s a more complicated lexeme which can be d e s c r i b e d as f o l l o w s : <HDMBEB> ::= <SIGN> <IHTEGER> | <SIGN> <INTEGER> . <INTEMPTY> <EXPONENT> <INTEGER> ::= <DIGIT> | <IHTEGER> <DIGIT> <DIGIT> ::= 0 | 1 | 2 J 3 | 4 | 5 | 6 l 7 | 8 | 9 L e x i c a l Scanner Generator 17 <INTEMPTY> ::= <INTEGER> | empty <EXPONENT> ::= E <SIGN> <INTEGEB> | empty <SIGN> ::= + I - | empty T h i s d e f i n i t i o n of the lexeme <U*IMBER> d e s c r i b e s the FORTRAN r e a l and i n t e g e r c o n s t a n t s . As the lexeme e x t r a c t i o n program scans over an i n p u t stream such as * 2.EQ.7)STOP 1, the f o l l o w i n g seguence of events occurs (the * | *. r e p r e s e n t s the point at which the the lexeme e x t r a c t i o n program i s scanning) : J2.EQ.7)STOP 2 | .EQ.7) STOP At t h i s p o i n t , the lexeme e x t r a c t i o n program i s unable to r e s o l v e i t s next a c t i o n w i t h i n one lookahead symbol ( i n t h i s case, the * . * ) . I f the program h a l t s r e t u r n i n g »2* as the next lexeme, then the parse w i l l c o n t i n u e c o r r e c t l y . On the other hand, i f the program con t i n u e s t o accept c h a r a c t e r s ( t r y i n g to b u i l d a r e a l constant) then the p r o c e s s i n g w i l l c o n t i n u e with: 2|.EQ.7)STOP 2. | EQ.7) STOP 2.E|Q.7)STOP At t h i s p o i n t the lexeme e x t r a c t i o n program would r e a l i z e t h a t too much of the i n p u t stream had been accepted s i n c e the s t r i n g '2.E» i s not a v a l i d lexeme and the next symbol of the L e x i c a l Scanner Generator 18 i n p u t stream ( i n t h i s case 'Q') cannot be accepted s i n c e no lexeme of the language has the s t r i n g •2.EQ1 as a p r e f i x . The program would then have to "unwind" i t s e l f t o d i s c o v e r which lexeme to r e t u r n . Going back to the c o n f i g u r a t i o n : ! 2.|EQ.7)ST0P would g i v e the v a l i d lexeme »2.*, But an e r r o r w i l l c e r t a i n l y occur when an attempt i s made to e x t r a c t the next lexeme s i n c e the s t r i n g *EQ* i s not a p r e f i x of any lexeme i n LL. The c o r r e c t a c t i o n would be to r e t u r n t o the c o n f i g u r a t i o n : 2|.EQ.7)STOP I t would r e q u i r e a complex lexeme e x t r a c t i o n program t o make these d e c i s i o n s c o r r e c t l y . What t h i s means i s that t o handle a language such as LL using the lexeme d e s c r i p t i o n s given r e q u i r e s the l e x i c a l e x t r a c t i o n program t o be a l a r g e , complicated system performing strange a c t i o n s i n order t o e x t r a c t the lexeme. T h i s problem can be removed simply by not a l l o w i n g lexemes such as <NOMBER> as p a r t of the l e x i c a l s p e c i f i c a t i o n o f the source language. In the case of LL, t h i s would mean modifying the syntax s p e c i f i c a t i o n t o : <PBOGBAH> ::= IF ( <N0HBEB> . EQ . <NUMBEB> ) STOP <HUJ9BER> : := <SIGN> <INTEGEB> | L e x i c a l Scanner Generator 19 <SIGN> <INTEGER> . <INTEMPTY> <EXPONENT> <INTEMPTY> :: = <IHTEGEH> | empty <EXPOHENT> ::= E <SIGN> <INTEGER> <SIGS> ::= + | - | empty and the s e t of lexemes t o : {IF,(,),.,EQ,STOP,<INTEGER>,E,+,-} where <INTEGER> i s d e f i n e d as: <INTEGER> ::= <DIGIT> } <INTEGER> <DIGZT> <DIGIT> : : = 0 | 1 | 2 | 3 | 4 | 5 | 6 - | 7 - | 8 ' | 9 The n o t i o n of a r e a l number as a s i n g l e lexeme has disappeared; a r e a l number i s now t r e a t e d as a seguence of lexemes. For example, the s t r i n g f3.14E-1* i s no l o n g e r one lexeme, but a sequence of the lexemes ' 3 1, M 4 1 , *E*, '-*, •1*. Looking once again at the a c t i o n of a lexeme e x t r a c t i o n program on the s t r i n g »2.EQ* the a c t i o n s taken a r e : | 2.EQ.7) STOP 2|.EQ.7)STOP At t h i s p o i n t , t h e r e i s no quest i o n as to the a c t i o n which must be taken. The program r e t u r n s •2* as a lexeme s i n c e t h i s i s the only t h i n g i t can do given the modified s e t of lexemes. The L e x i c a l Scanner Generator 20 lexeme e x t r a c t i o n program f o r t h i s modified v e r s i o n of LL i s o b v i o u s l y s i m p l e r than the program which handles the o r i g i n a l v e r s i o n of LL. I t should be noted t h a t the above a n a l y s i s does not mean t h a t i t i s never p o s s i b l e t o c o n s i d e r a r e a l number as a lexeme, but that the c h o i c e of whether or not t o r e p r e s e n t a r e a l number as one lexeme or s e v e r a l lexemes depends on the other lexemes being r e c o g n i z e d . In the case above, i f *.EQ.' was not a lexeme of LL, then the problem would not have oc c u r r e d . I t i s p o s s i b l e to modify the syntax and l e x i c a l s p e c i f i c a t i o n s of any language i n the above manner to the p o i n t where the lexeme e x t r a c t i o n program r e g u i r e s no lookahead, however t h i s i n t u r n w i l l mean that the syntax pars e r w i l l be doing a c h a r a c t e r by c h a r a c t e r parse of the i n p u t stream, a very i n e f f i c i e n t method f o r doing lexeme e x t r a c t i o n . C e r t a i n a b i l i t i e s l o g i c a l l y belong w i t h i n the lexeme e x t r a c t i o n program. For i n s t a n c e , many languages have wit h i n them f a c i l i t i e s f o r h a n d l i n g the data type s t r i n g , a s t r i n g being d e f i n e d as any seguence of symbols enclosed w i t h i n quotes (»"*). Since i t may be d e s i r a b l e to have the quote w i t h i n the s t r i n g i t s e l f , any quote wit h i n the s t r i n g must be marked to i n d i c a t e t h a t i t i s j u s t another c h a r a c t e r w i t h i n the s t r i n g and not the t e r m i n a t i n g quote. T h i s i s u s u a l l y done by s p e c i f y i n g t h a t a quote w i t h i n a s t r i n g must be represented by two quotes ( " M , , ) S any occurrence of two quotes being i n t e r p r e t e d as the occurrence o f one quote w i t h i n the s t r i n g . For example: L e x i c a l Scanner Generator 21 "XYZ" r e p r e s e n t s the s t r i n g XYZ i i XY««z« represents the s t r i n g XY"Z From the viewpoint of the l o g i c a l f u n c t i o n of a lexeme e x t r a c t i o n program, i t should be able to handle t h i s form of l e x i c a l a n a l y s i s . I t w i l l be shown t h a t such an a n a l y s i s can be done with one symbol lookahead, so that a lexeme e x t r a c t i o n program with the a b i l i t y to do one symbol lookahead w i l l handle completely the l o g i c a l n o t i o n of a lexeme. In order t o d e f i n e f o r m a l l y a system i n which a lexeme e x t r a c t i o n program has p r e c i s e l y the c o r r e c t amount of power, i t i s u s e f u l t o d e f i n e the n o t i o n of a l e x i c a l machine, a machine which i s d r i v e n by a l e x i c a l e x t r a c t i o n program. D e f i n i t i o n 4 A simple l e x i c a l machine & i s an ordered t r i p l e (H,A,B) where: together with the f o l l o w i n g machine i n s t r u c t i o n s (where 'a' r e p r e s e n t s a s i n g l e c h a r a c t e r ) : H - i s a one c h a r a c t e r hold A - i s the lexeme p a r t accumulated thus f a r B - i s the i n p u t stream not yet analyzed 1) ACCEPT - (empty,A,aB) > (empty,Aa,B) 2) IGNCBE - (empty,A,aB) > (empty,A,B) L e x i c a l Scanner Generator 22 3) HOLD - (empty,ft,aB) => (a,A,B) H) ftCCEPTHOLD - (h,A,B) => (empty,Ah,B) 5) IGNOBEHOLD - (h,A,B) => (empty,A,B) 6) RETURN n - (empty,A,B) => (empty,empty,B) and A i s r e t u r n e d together with n 7) ERBOB - machine executes a f a t a l h a l t The simple l e x i c a l machine i s a machine which e x t r a c t s the next lexeme from an i n p u t stream I. The machine has w i t h i n i t a •h o l d ' , a l o c a t i o n which can be used t o hold the l e a d i n g c h a r a c t e r of the i n p u t stream while l o o k i n g at the next c h a r a c t e r . Since the HOLD i n s t r u c t i o n w i l l only execute i f the hold i s empty, i t i s not p o s s i b l e t o hold more than one ch a r a c t e r of the i n p u t stream thus making lookahead of more than one symbol i m p o s s i b l e . A l s o , s i n c e the ACCEPT, and IGNORE i n s t r u c t i o n s w i l l o n l y execute i f the hold i s empty, i t f o l l o w s t h a t i f a c h a r a c t e r has been placed i n the h o l d , the next i n s t r u c t i o n must be one of ACCEPTHOLD or IGNOREHOLD. In t h i s way i t i s impossible to add another c h a r a c t e r to 'A*, the lexeme b u i l t thus f a r , without e i t h e r d i s c a r d i n g the c h a r a c t e r i n the ho l d or adding i t to »A'. In t h i s manner the l e x i c a l machine r e s t r i c t s the a c t i o n s which are p o s s i b l e at any given time. As an example, the sequence of i n s t r u c t i o n s given i n F i g u r e 6 w i l l e x t r a c t the lexeme ,XY ,»Z* from the i n p u t stream •"XY"«Z";'. D i a g r a m a t i c a l l y , the f o l l o w i n g happens with the i n p u t L e x i c a l Scanner Generator 23 IGNORE ACCEPT ACCEPT HOLD IGNOREHOLD ACCEPT ACCEPT HOLD IGNOREHOLD RETURN 1 F i g u r e 6 stream o f •"XY , , HZ";». Step Hold Lexeme Input Stream Number (H) (A) (B) 1 H= A= B=««XY"»'Z"; 2 H= A= B=XY""Z"; 3 H= A=X B=Y n MZ"; 4 H= A=XY B='" ,Z«; 5 H = « A=XY B="Z"; 6 H= A=XYn B=»Z"; 7 H= A=XY'» B=Z"; 8 H= A=XY"Z B="; 9 H=» A=XY"Z B=; 10 H= A= B=; and r e t u r n XY»Z together with 1 The i n s t r u c t i o n s e t of the simple l e x i c a l machine al l o w s sequences of i n s t r u c t i o n s which w i l l r e c o g n i z e many lexemes, however i t i s not a b l e to handle a l l t he lexemes which a r i s e i n v a r i o u s source languages. For example, c o n s i d e r again the language ALGOL-68 which has as pa r t of i t s l e x i c a l s p e c i f i c a t i o n the lexemes: An i n p u t t o a lexeme e x t r a c t i o n program f o r Algol-68 c o u l d be the s t r i n g •+:='. The best which can be done using the simple L e x i c a l Scanner Generator 24 l e x i c a l machine i s the i n s t r u c t i o n : ACCEPT which l e a v e s the machine i n the c o n f i g u r a t i o n : H= A=+ 8=:= Since the c u r r e n t v a l u e of A i s a v a l i d lexeme, the d e c i s i o n as to whether t o r e t u r n or to accept the next c h a r a c t e r of the i n p u t stream (':•) must be made. I f the next i n s t r u c t i o n executed i s a BETURN, then an i n c o r r e c t a n a l y s i s of the i n p u t stream w i l l have been made. On the other hand, i f the t h i r d c h a r a c t e r of the i n p u t stream i s any c h a r a c t e r other than *=» (e.g. the input i s *+=a*), then the c o r r e c t a c t i o n would be to r e t u r n '+» as the lexeme. Since the a c t i o n which the simple l e x i c a l machine must take a f t e r the f i r s t ACCEPT i s dependent on the second c h a r a c t e r of the input stream and the simple l e x i c a l machine i s designed to allow only s i n g l e c h a r a c t e r lookahead, t h i s s i t u a t i o n r e p r e s e n t s an ambiguity f o r the simple l e x i c a l machine. In t h i s case, the problem o c c u r r e d a t the second c h a r a c t e r of the i n p u t stream, but i t can occur a r b i t r a r i l y f a r ahead when s e v e r a l lexemes o v e r l a p as they do here. A s l i g h t m o d i f i c a t i o n to the simple l e x i c a l machine w i l l p r o v i d e a s o l u t i o n t o t h i s problem. L e x i c a l Scanner Generator 25 D e f i n i t i o n 5 a complex l e x i c a l machine CM i s an ordered quadruple (m,H,A,B) where H, A and B are d e f i n e d as i n a simple l e x i c a l machine and m i s a p r e f i x of A. The i n s t r u c t i o n s e t of CM i s tha t of M together w i t h : 7) MABKTOKEN - (m,h,A,B) => (A,h,A,B) 8) BACKUPRETDBN n - (m,h,mA,B) => (empty,empty rempty,AB) and m i s r e t u r n e d together with n A complex l e x i c a l machine i s a simple l e x i c a l machine with the a b i l i t y to mark the lexeme thus f a r b u i l t (the value of A) i n one spot at any given time. At any time the d e c i s i o n can be made to e i t h e r r e t u r n the value of A ( v i a a BETUBN) or to r e t u r n a p r e f i x of A, the remainder of A being pushed back i n t o the inp u t stream. For the two sample i n p u t s given above (*+:=• and *+:a'), the sequence of i n s t r u c t i o n s given i n Figure 7 d i s p l a y s the a c t i o n which w i l l occur. Input »+:=• Input »+:a» ACCEPT MABKTOKEN ACCEPT MABKTOKEN ACCEPT BACKUPBETUBN 2 ACCEPT ACCEPT BETUBN 1 F i g u r e 7 F i g u r e 7 a l s o d i s p l a y s a property of the l e x i c a l machine, a L e x i c a l Scanner Generator 26 l a c k of c o n t r o l , In order to d i s p l a y the a c t i o n of the complex l e x i c a l machine on two i n p u t s , i t i s necessary t o d i s p l a y two s e t s of i n s t r u c t i o n s . Although the complex l e x i c a l machine, i n c o n t r a s t t o the simple l e x i c a l machine, has an i n s t r u c t i o n s e t which r e s o l v e s o v e r l a p p i n g lexemes w i t h i n one symbol lookahead, n e i t h e r machine has the c a p a b i l i t y of brea k i n g the s e q u e n t i a l execution of i n s t r u c t i o n s . T h i s a b i l i t y i s provided by the d e f i n i t i o n of a c o n t r o l l e r . D e f i n i t i p n 6 A simple (complex) c o n t r o l l e r of a simple (complex) l e x i c a l machine i s a machine which feeds a sequence of simple (complex) l e x i c a l machine i n s t r u c t i o n s i n t o a simple (complex) l e x i c a l machine. The c o n t r o l l e r i s r e s p o n s i b l e f o r determining what i n s t r u c t i o n s are to be executed by e i t h e r the simple or complex l e x i c a l machines, In the case of the i n p u t s i n F i g u r e 7, i t would be the r e s p o n s i b i l i t y of the c o n t r o l l e r to i n s u r e t h a t the f o u r t h i n s t r u c t i o n executed by the simple l e x i c a l machine was e i t h e r an ACCEPT i f the next symbol i n the i n p u t stream was •=' or BACKUPRETUBM otherwise. 2.3 The Design o f a L e x i c a l Scanner There are many combinations of l e x i c a l machine i n s t r u c t i o n s which could be put together by a c o n t r o l l e r and executed on a L e x i c a l Scanner Generator 27 l e x i c a l machine, however on l y some of these combinations a c t u a l l y r epresent d e s i r a b l e combinations i n the c r e a t i o n cf a lexeme e x t r a c t i o n program. For i n s t a n c e , the seguence of simple l e x i c a l machine i n s t r u c t i o n s : w i l l , by the d e f i n i t i o n o f the simple l e x i c a l machine, cause a machine e r r o r when e x e c u t i o n of the 'ACCEPT * i n s t r u c t i o n occurs s i n c e the hold i s not empty. In a s i m i l a r manner, i t i s p o s s i b l e to c o n s t r u c t other sequences of i n s t r u c t i o n s which w i l l not execute on e i t h e r the simple or complex l e x i c a l machines. To i n s u r e t h a t t h i s does not happen a l e x i c a l scanner i s d e f i n e d . A l e x i c a l scanner i s a c o n t r o l l e r which i s d e f i n e d i n such a manner t h a t any sequence o f i n s t r u c t i o n s produced by a l e x i c a l scanner f o r the l e x i c a l machine has the pr o p e r t y t h a t i t w i l l execute s u c c e s s f u l l y when run on the l e x i c a l machine. D e f i n i t i o n 7 A simple l e x i c a l scanner L i s a c o n t r o l l e r d e f i n e d by the ordered s e x t u p l e (S,delta,F,E,E',s•) over an alphabet A where: HOLD ACCEPT S i s a s e t of s t a t e s F i s a subset of S, the s e t o f f i n a l s t a t e s E i s the s e t of subsets of simple l e x i c a l machine i n s t r u c t i o n s c o n s i s t i n g o f : {ACCEPT} L e x i c a l Scanner Generator 28 {IGNORE} {ACCEPTHOLD ACCEPT} {ACCEPTHOLD IGNORE} {IGNOREHOLD ACCEPT} {IGNOREHOLD IGNORE} {HOLD} {ERROR} E 1 - i s the s e t of subsets of simple l e x i c a l machine i n s t r u c t i o n s c o n s i s t i n g o f : {RETURN n} {IGNOREHOLD RETURN n} {ACCEPTHOLD RETURN n} d e l t a - i s the s t a t e t r a n s i t i o n f u n c t i o n , d e l t a = d e l t a 1 U d e l t a * where d e l t a 1 and d e l t a 2 are d e f i n e d as the f u n c t i o n a l mappings: d e l t a i : AxS => ExS d e l t a 2 : A'xF => E'xs* A»={a € A 1 f € F => (a,f) £ Domain (delta*)} s' - i s an element of S, the s t a r t s t a t e A simple l e x i c a l scanner i s r e a l l y a modified v e r s i o n of a f i n i t e s t a t e s e q u e n t i a l machine. The simple l e x i c a l scanner moves from one s t a t e to another under the c o n t r o l of the f u n c t i o n d e l t a which determines the new . s t a t e to e n t e r as a f u n c t i o n of the o l d s t a t e and the next i n p u t symbol. In a t r u e f i n i t e s t a t e machine, t h i s completely d e f i n e s the purpose of the s t a t e t r a n s i t i o n f u n c t i o n whereas i n the case of a simple L e x i c a l Scanner Generator 29 l e x i c a l scanner, the s t a t e t r a n s i t i o n f u n c t i o n i n a d d i t i o n to c o n t r o l l i n g the s t a t e causes simple l e x i c a l machine i n s t r u c t i o n s t o be sent to the simple l e x i c a l machine f o r e x e c u t i o n ; i n s t r u c t i o n s which may vary the contents of the i n p u t stream, the lexeme being b u i l t , or the hold of the simple l e x i c a l machine. T h i s i n t e r a c t i o n i s d i s p l a y e d i n F i g u r e 8. Fi g u r e 8 -> LEXEMES The main c o n t r o l of the l e x i c a l scanner i s giv e n by the f u n c t i o n d e l t a 1 which d e f i n e s t r a n s i t i o n s between v a r i o u s s t a t e s of the l e x i c a l scanner. In a d d i t i o n t o t h i s , i f S € F ( i s a f i n a l s t a t e ) , then d e l t a 2 (S,a) might a l s o be d e f i n e d f o r some i n p u t c h a r a c t e r a. I f S 6 F and d e l t a 1 ( S , a ) e x i s t s , then d e l t a 2 (S,a) cannot e x i s t ; but i f d e l t a 1 (S,a) does not e x i s t , then d e l t a 2 (S,a) may- e x i s t . In terms of a l e x i c a l r e c o g n i t i o n system, the f u n c t i o n L e x i c a l Scanner Generator 30 d e l t a 1 i s i n c o n t r o l at any time when the lexeme thus f a r b u i l t (the value of A i n the simple l e x i c a l machine) i s not a v a l i d lexeme. At any time when A i s a v a l i d lexeme (the l e x i c a l scanner being i n s t a t e S with the head of the input stream being the c h a r a c t e r ' a 1 ) , then d e l t a 1 (S,a) might be d e f i n e d i n which case t h i s t r a n s i t i o n must be taken ( i t i s the only one s i n c e d e l t a 2 (S,a) cannot be d e f i n e d i f d e l t a 1 (S,a) i s ) . I f d e l t a 1 (S,a) i s not d e f i n e d , then i t i s p o s s i b l e t h a t the t r a n s i t i o n d e l t a 2 (s,a) i s d e f i n e d i n which case i t i s taken. Once a t r a n s i t i o n i n v o l v i n g d e l t a 2 has been taken, the next s t a t e i s always the s t a r t s t a t e , and a lexeme i s output from the simple l e x i c a l machine s i n c e the e f f e c t of a d e l t a 2 t r a n s i s t i o n on the l e x i c a l machine i s to cause execution of a sequence of simple l e x i c a l machine i n s t r u c t i o n s , the l a s t o f which i s always a * RETURN n». A simple l e x i c a l scanner which would serve as a lexeme e x t r a c t i o n program f o r a s t r i n g constant as d e s c r i b e d p r e v i o u s l y can be d e f i n e d as f o l l o w s : S { s 1 , s 2 , s 3 , s * » , s 5 } F {s5} S« {s3} A {a,b,c , .. . , z,»} and d e l t a i s d e f i n e d as: L e x i c a l Scanner Generator 31 State * _ _ Input f A - ' l s1 | (H0LD,s5) (ACCEPT, s2) s2 j (HOLD,s5) (ACCEPT,s2) s3 | (IGNORE, s1) (ERROR,empty) s4 | (HOLD,s5) (ACCEPT,s2) s5 | (IGNOREHOLD ACCEPT,s5) (IGNOREHOLD RETURN Under c o n t r o l o f t h i s l e x i c a l scanner, the simple l e x i c a l machine w i l l undergo the f o l l o w i n g t r a n s i t i o n s when the in p u t i s H l X y i l M 2 « « ; H A B STATE empty empty "XY""Z"; s3 empty empty XY"»Z"; S1 empty X Y""Z"; s2 empty XY «"Z"; s2 11 XY «Z"; s5 empty XY" Z"; sH empty XY"Z •t • • s2 XY»Z » s5 empty empty • s3 together with 1 In a d d i t i o n t o a simple l e x i c a l scanner as a c o n t r o l l e r f o r a simple l e x i c a l machine, there i s a l s o a corresponding complex l e x i c a l c o n t r o l l e r f o r a complex l e x i c a l machine. L e x i c a l Scanner Generator 32 D e f i n i t i o n 8 A complex l e x i c a l scanner LC i s a c o n t r o l l e r d e f i n e d by the s e x t u p l e (S,delta,F,EC,E«,S») where S, d e l t a , F, E» and S» are d e f i n e d as i n a simple l e x i c a l scanner and EC i s d e f i n e d as the union of E and the f o l l o w i n g set of subsets of the i n s t r u c t i o n s of MC: {BACKUPBETORN n} {ACCEPT HARKTOKEN} {IGNORE MARKTOKEN} A complex l e x i c a l scanner i s very much l i k e a simple l e x i c a l scanner except t h a t the complex l e x i c a l scanner i s a b l e t o i s s u e i n s t r u c t i o n s which mark the lexeme being b u i l t and r e t u r n only the lexeme up to the p o i n t of the mark i f i t so chooses. The simple and complex l e x i c a l scanners are a b l e to r e c o g n i z e almost a l l p o s s i b l e lexeme d e s c r i p t i o n s . An example of a case where a lexeme cannot be r e c o g n i z e d by e i t h e r a simple or complex l e x i c a l scanner i s the FORTRAN H o l l e r i t h format code. T h i s format code i s s p e c i f i e d i n the form: nnHcccc....c where 'nn' r e p r e s e t s an i n t e g e r number and the 'nn' c h a r a c t e r s immediately f o l l o w i n g the ' H* are taken as a l i t e r a l s t r i n g to be p r i n t e d on the output d e v i c e . Such a lexeme i s not r e c o g n i z a b l e by a simple or complex l e x i c a l scanner as i t L e x i c a l Scanner Generator 33 r e q u i r e s the numerical value of Vnn* i n order t o e x t r a c t the next p a r t of the lexeme (or the next lexemes, i f Vnn' * H' and • c c c . c * are c o n s i d e r e d to be s e p a r a t e lexemes) from the i n p u t stream. In order to do t h i s , the lexeme e x t r a c t i o n program must r e c e i v e i n f o r m a t i o n as to the numeric value of * nn» and use t h i s i n f o r m a t i o n to modify i t s e l f so t h a t i t w i l l p i c k up Vnn* c h a r a c t e r s f o l l o w i n g the * H *. The next time a H o l l e r i t h format i s encountered i n the i n p u t , i t w i l l once again be necessary to e x t r a c t a new value of 'nn* and use t h i s new value i n p i c k i n g up the remainder of the format code. I n h e r e n t l y t h i s type of "dynamic" lexeme g i v e s r i s e t o s e v e r a l problems; d e s i g n i n g a s e t of lexemes which r e g u i r e any lexeme e x t r a c t i o n program t o be s e l f - m o d i f y i n g i s u n d e s i r a b l e and such lexemes can produce unusual r e s u l t s f o r the unwary user of such a program. Since the m o d i f i c a t i o n r e q u i r e d to c o r r e c t t h i s d i f f i c u l t y i s t r i v i a l ( i n s t e a d of nnHccc...c use ' ccc»•* c' and the r e s u l t , a seguence of c h a r a c t e r s between apostrophes, i s now l e x i c a l l y simple) i t seems t o be a l o g i c a l r e s t r i c t i o n not t o a l l o w the s p e c i f i c a t i o n of lexemes which invoke such strange a c t i o n s . I t should be noted t h a t the H o l l e r i t h format i s a l s o a source of problems to the users of the FOBTRAN language i n which i t occurs f o r the very f a c t that the *nn' c h a r a c t e r s a f t e r the L e x i c a l Scanner Generator 34 •H* are always taken as a l i t e r a l s t r i n g f o r output and i f the user has miscounted (the i n a b i l i t y t o count above 1 i s a property of those who d e a l with computers) strange r e s u l t s and e r r o r messages o f t e n appear. D e f i n i t i o n 9 A s e t of lexemes i s l e x i c a l l y simple i f i t can be accepted by a simple l e x i c a l scanner. D e f i n i t i o n JO A s e t of lexemes i s l e x i c a l l y , complex i f i t can be recognized by a complex l e x i c a l scanner, but cannot be recognized by a simple l e x i c a l scanner. VenemaVs P o s t u l a t e I f a s e t of lexemes i s n e i t h e r l e x i c a l l y simple nor l e x i c a l l y complex, then i t i s l e x i c a l l y r i d i c u l o u s . The p o s t u l a t e i s r a t h e r i l l d e f i n e d ; but i t does p o r t r a y the meaning which was intended. I f a s e t of lexemes i s c o n s t r u c t e d so as to be n e i t h e r l e x i c a l l y simple nor l e x i c a l l y complex, then as has been shown with the H o l l e r i t h format, such a s e t of lexemes i s l o g i c a l l y q u e s t i o n a b l e and i s c e r t a i n to cause problems t o the user of the language f o r which the lexemes were designed. In t h i s manner, r i d i c u l o u s means a r i d i c u l o u s c h o i ce of the r e p r e s e n t a t i o n o f the lexemes. I f the language designer has w i t h i n h i s or her s e t of lexemes one or more L e x i c a l Scanner Generator 35 lexemes which i s l e x i c a l l y r i d i c u l o u s , then the d e s i g n e r can expect the usage of the language i n q u e s t i o n t o drop a l i t t l e f o r each l e x i c a l l y r i d i c u l o u s lexeme used. L e x i c a l Scanner Generator 36 Chapter 3 3.1 Designing the L e x i c a l Scanner Inp,ut The simple and complex l e x i c a l scanners d e f i n e a s e t of c o n t r o l l e r s which are able t o act as lexeme e x t r a c t i o n programs f o r a source language. As l e x i c a l l y complex languages, with a s l i g h t m o d i f i c a t i o n to the lexeme s e t , can be made l e x i c a l l y s i m p l e ; the simple l e x i c a l scanner i s adeguate as a lexeme e x t r a c t i o n program f o r v i r t u a l l y any source language. For t h i s reason, the implementation of the l e x i c a l scanner g e n e r a t i o n system attempts only to generate simple l e x i c a l scanners f o r simple l e x i c a l machines. A simple l e x i c a l scanner can act as a t a r g e t f o r the l e x i c a l scanner g e n e r a t i o n program. T h i s program takes as i n p u t a d e s c r i p t i o n of the lexemes which are to be e x t r a c t e d and outputs a simple l e x i c a l scanner which w i l l do the e x t r a c t i o n . The a c t i o n of the simple l e x i c a l scanner on the simple l e x i c a l machine can then be si m u l a t e d , e f f e c t i n g a lexeme e x t r a c t i o n program. Although i n many cases the s i m u l a t i o n of a b s t r a c t machines i s a com p l i c a t e d process, the a c t i o n s i n v o l v e d i n both the simple l e x i c a l machine and a simple l e x i c a l scanner are so simple and s u i t e d t o the task at hand -t,hat they can be simulated with no overhead due to the s i m u l a t i o n . Because of t h i s , the generated l e x i c a l scanner i s able t o execute at the same speed (not j u s t the same order of magnitude) as a hand w r i t t e n lexeme e x t r a c t i o n program f o r the same s e t of lexemes. Since a simple l e x i c a l scanner i s c l o s e l y l i n k e d t o a L e x i c a l Scanner Generator 37 f i n i t e s t a t e machine, i t Mould seem d e s i r a b l e to allow the user t c s p e c i f y the i n p u t to the l e x i c a l scanner generator - a d e s c r i p t i o n of a set o f lexemes - i n terms of the f i n i t e s t a t e machine o p e r a t i o n s Kleene c l o s u r e , c a t e n a t i o n and union. T h i s approach was taken i n the development of the RWORD<*> system i n 1960. B i t h t h i s system, the user d e s c r i b e s each lexeme as an express i o n i n terms of the three f i n i t e s t a t e o p e r a t i o n s . For i n s t a n c e , "a d i g i t " would be expressed as: DIGIT = 0 D 1 U 2 0 3 U 4 0 5 0 6 0 7 0 8 0 9 and a non-empty sequence of d i g i t s would be expressed as: SEQDIGIT = DIGIT DIGIT Using t h i s method, the language designer i s able t o b u i l d any lexeme which can be d e s c r i b e d using r e g u l a r e x p r e s s i o n s . U n f o r t u n a t e l y , a lexeme e x t r a c t i o n program based on r e g u l a r e x p r e s s i o n s does not permit any form o f lookahead and so r e c o g n i t i o n of the s t r i n g »«tx¥",,Z"» as the lexeme •XY"Z' i s not p o s s i b l e . The other problem with t h i s approach i s purely one of n o t a t i o n which assumes that the user of the RWORD system i s f a m i l i a r with the n o t a t i o n o f r e g u l a r e x p r e s s i o n s ; which may not be t r u e . Even f o r the user f a m i l i a r with r e g u l a r e x p r e s s i o n s , the r e g u l a r e x p r e s s i o n s r e q u i r e d to b u i l d up c e r t a i n r e g u l a r lexemes are complicated and q u i c k l y become unreadable. The RHORD system, although i t does have the a b i l i t y t o igno r e an L e x i c a l Scanner Generator 38 i n p u t c h a r a c t e r , does not have a mechanism whereby the user can s p e c i f y t h a t a c h a r a c t e r i s to be ignored throughout the e x t r a c t i o n of a p a r t i c u l a r lexeme. For i n s t a n c e , i f a language has a lexeme f o r which i n t e r n a l blanks are d e l e t e d (•THIS IS OHE» becomes •THISISGNE*), then such a lexeme could not be expressed as a r e g u l a r e x p r e s s i o n . The l e x i c a l scanner i s able to handle such s i t u a t i o n s i n tha t i t i s not j u s t a f i n i t e s t a t e machine, but a program which operates under f i n i t e s t a t e c o n t r o l , i n tu r n c o n t r o l l i n g a simple l e x i c a l machine which does the a c t u a l lexeme e x t r a c t i o n . I t i s the simple l e x i c a l machine which g i v e s the e x t r a power needed to recog n i z e c o m p l i c a t e d , but w e l l d e f i n e d lexemes. The use of r e g u l a r e x p r e s s i o n s f o r d e s c r i b i n g the lexeme s e t to the l e x i c a l scanner gener a t i o n program i s not ge n e r a l enough t o make f u l l use of the power of a l e x i c a l machine. The type of d e s c r i p t i o n f o r the input t o the gen e r a t i o n system which i s wanted i s one i n which the user f e e l s c o mfortable, one i n which he i s not r e g u i r e d t o l e a r n e x t r a f a c t s , but i n which he can express h i m s e l f n a t u r a l l y . The guestion of what i n p u t s p e c i f i c a t i o n t o use then becomes one of determining what comes n a t u r a l l y t o the user (or f o r t h a t matter, t o anyone a t a l l ) . Taking as an example the n o t i o n o f' an i d e n t i f e r (a lexeme common t o most languages) , i t could be d e s c r i b e d completely by the f o l l o w i n g BNF syntax d e s c i p t i o n : <IDENTIFIEB> ::= <LETTEB> <TAIL> <TAIL> ;:= <LETTEB> <TAIL> | <DIGIT> <TAIL> 1 empty L e x i c a l Scanner Generator 39 <LETTER> A | B | C | D 1 E | F | G | H | I | J | K | L I U ] K | 0 ] P | Q I B - I S | T | U j V { w I X J ¥ | Z <DIGIT> ::= 1 J 2 | 3 | K | 5 | 6 | 7 | 8 | 9 f 0 T h i s d e s c r i p t i o n d e s c r i b e s completely an i d e n t i f i e r , but the e n t i r e syntax can be expressed i n E n g l i s h as "a l e t t e r f o l l o w e d by any sequence, p o s s i b l y empty, of l e t t e r s or d i g i t s " . Although E n g l i s h i s o f t e n ambiguous, i n t h i s case the E n g l i s h d e s c r i p t i o n p o r t r a y s the exact meaning of what i s wanted and i s more e a s i l y understood. The l e x i c a l scanner g e n e r a t i o n system attempts to take t h i s i n t o account, and uses a d e s c r i p t i o n of the lexemes which c l o s e l y , p a r a l l e l s the E n g l i s h form given above. 3.2 The Input Lancjuacje The input t o the l e x i c a l scanner gener a t i o n system i s a d e s c r i p t i o n of the.ways i n which s e v e r a l p r e - d e f i n e d b a s i c u n i t s are assembled. These b a s i c u n i t s are d e s c r i b e d as f o l l o w s : ONE OF "charseg" The value of 'charseg 1 i s any sequence of c h a r a c t e r s and the next c h a r a c t e r of the i n p u t stream must be a c h a r a c t e r i n 'charseq'. The next c h a r a c t e r of t h e i n p u t stream i s added to the lexeme being b u i l t . Lexical Scanner Generator 40 ANY OF "charseq" The value of 'charseg* i s any seguence of characters. As long as the next character of the input stream i s a character in •charseg* i t i s added to the lexeme being b u i l t . The statement ANY OF includes the p o s s i b i l i t y that there may be none at a l l . NONE OF "charseg" The value of 'charseg* i s any seguence of characters and the next character of the input stream must not be a character i n 'charseg*. The next character of the input stream i s added to the lexeme being b u i l t . NOTANY OF "charseg" The value of 'charseg' i s any seguence of characters. As long as the next character of the input stream i s not a character in 'charseq' i t i s added to the lexeme being b u i l t * The statement NOTANY OF includes the p o s s i b i l i t y that there may be none at a l l . IGNORE "charseg" The value of 'charseg' i s any seguence of characters and the next character of the input stream must be one of 'charseg'. The next character of the input stream i s discarded. "charseq , l l The value of 'charseg' i s any sequence of characters. I f B i s the input stream, then the value of 'charseq' must be a L e x i c a l Scanner Generator 41 p r e f i x o f B. 'charseg* i s added to the lexeme being b u i l t and de l e t e d from the head of the in p u t stream. These s i x b a s i c u n i t s form the b a s i s f o r b u i l d i n g more complex lexeme d e s c r i p t i o n s by c a t e n a t i o n , using the »,* as a c a t e n a t i o n o p e r a t o r . The d e s c r i p t i o n , u s i n g these p r i m i t i v e s , of the i d e n t i f i e r d e f i n e d above i s : ONE OF "ABCDEFGHIJKLMNOPQBSTUVHXYZ", ANY OF "ABCDEFGHIJKLMNOPQRSTUVHXY20123456789" which corresponds very c l o s e l y t o the E n g l i s h d e s c r i p t i o n which was H a l e t t e r f o l l o w e d by any seguence, p o s s i b l y empty, of l e t t e r s or d i g i t s " . A s e r i e s o f b a s i c u n i t s catenated by *,'s i s known as a sequence. There are two miner problems which a r i s e i n the use of the 'charseg* r e p r e s e n t a t i o n i n the above s t r u c t u r e s . The f i r s t i s th a t s i n c e 'charseg* can be any seguence o f c h a r a c t e r s , the d e l i m i t i n g c h a r a c t e r {the quote "") can be one of the c h a r a c t e r s of 'charseg*. The s o l u t i o n used by the l e x i c a l scanner generator i s the standard one f o r s t r i n g s with a s i n g l e d e l i m i t i n g c h a r a c t e r , i f one of the c h a r a c t e r s w i t h i n 'charseg' i s t o be a quote, then d o u b l i n g i t causes i t t o appear once w i t h i n the s t r i n g . The second problem i s what t o do about c h a r a c t e r s which are r e q u i r e d as part of a lexeme but which cannot be used s i n c e the de v i c e being used t o i n p u t the d e s c i p t i o n s does not have these c h a r a c t e r s . T h i s problem i s L e x i c a l Scanner Generator 42 s o l v e d by a l l o w i n g the user to s p e c i f y between apostrophes the decimal value of the c h a r a c t e r to be r e c o g n i z e d . Thus: "AB«1«C" used as a b a s i c u n i t s p e c i f i e s t h a t the i n p u t stream must c o n t a i n an 'A', f o l l o w e d by a 'B*, f o l l o w e d by the c h a r a c t e r which has an i n t e r n a l numeric value of one, f o l l o w e d by a 'C». Because of t h i s a b i l i t y , i f an apostrophe i s d e s i r e d as one of the c h a r a c t e r s i n 'charseg', i t i s necessary to use two apostrophes. Thus: "AB«»C" used as a b a s i c u n i t s p e c i f i e s t h a t the i n p u t stream must c o n s i s t of an VA* f o l l o w e d by a •B' f o l l o w e d by a *'* f o l l o w e d by a « C T h i s a b i l i t y i s a l s o u s e f u l f o r the d e t e c t i o n of c e r t a i n "event" lexemes. An event lexeme i s not a t r u e lexeme, but i s a method whereby a lexeme e x t r a c t i o n program i s a b l e t o r e p o r t the occurrence of events such as the end of a source l i n e or source f i l e . T h i s c a p a b i l i t y i s needed i n languages such as FOBTfiAN or BCPL< 5^ where knowledge of the end of source l i n e i s r e q u i r e d i n order to continue s y n t a c t i c a n a l y s i s . In the l e x i c a l scanner ge n e r a t i o n system, i t i s p o s s i b l e to w r i t e the r o u t i n e r e s p o n s i b l e f o r the r e a d i n g of source l i n e s i n such a manner t h a t the r o u t i n e appends to each l i n e a (machine dependent) L e x i c a l Scanner Generator 43 c h a r a c t e r which cannot (or i s h i g h l y u n l i k e l y to) occur anywhere i n the i n p u t stream (e.g. the c h a r a c t e r which has an i n t e r n a l numeric value of one on an IBM 370). T h i s s i n g l e c h a r a c t e r can then be d e f i n e d as a lexeme, and every time t h i s lexeme i s returned i s an i n d i c a t i o n t h a t the end of a source l i n e has been encountered. O c c a s i o n a l l y i t i s u s e f u l t o be abl e t o give two s p e c i f i c a t i o n s f o r the same lexeme. For i n s t a n c e , a language might have two forms of a comment, one of which i s the word 'COMMENT1 f o l l o w e d by any seguence o f symbols not i n c l u d i n g a semi-colon (;) and terminated by a semi-colon, the other a per-cent ('55') s i g n f o l l o w e d by any seguence of c h a r a c t e r s not i n c l u d i n g a per-cent s i g n and terminated by another per-cent s i g n . The f i r s t form of the comment can be expessed as: "COMMENT", NOTANY OF ";", ";" and the second form can be expressed as: NOTANY OF "%", "%« The user t h i n k s of a comment as being "the f i r s t d e s c r i p t i o n or the second d e s c r i p t i o n " , and the input n o t a t i o n f o r the l e x i c a l scanner generator r e f l e c t s t h i s t h i n k i n g by a l l o w i n g t h i s to be expressed as: "COMMENT", NOTANY OF ";", ";" OR »%", NOTANY OF »%", n % » L e x i c a l Scanner Generator 44 where the use of 'OB' means e i t h e r the f i r s t sequence or the second sequence. Since many users o f such a system f o r gene r a t i n g lexeme e x t r a c t i o n programs are a l s o f a m i l i a r with BNF n o t a t i o n , the word *0B* can be r e p l a c e d by the symbol '|' as i n : "COMMENT", NOTANY OF ";", ";" | "%», NOTANY OF "%", "%" I t i s p o s s i b l e t o add f u r t h e r a l t e r n a t i v e s by adding the word * OB• followed by another sequence as o f t e n as i s needed. A s e r i e s of sequences separated by 'OS's i s known as a s e c t i o n . The b a s i c u n i t s can a l s o be b u i l t up i n t o more complex s t r u c t u r e s by the use of the statement: i d IS s e c t i o n . where ' i d 1 i s an i d e n t i f i e r which i s to be a s s o c i a t e d with the name ' i d ' . Thus: subchar IS NOTANY OF »"«" OB IGNOBE «"»«, •""•«. a s s o c i a t e s with the name 'subchar* p a r t o f a lexeme d e s c r i p t i o n f o r a s t r i n g as d e f i n e d p r e v i o u s l y . Any s t r i n g which i s enc l o s e d w i t h i n quotes can have the c h a r a c t e r sequence between the e n c l o s i n g quotes broken down i n t o p i e c e s , each of which i s e i t h e r a sequence of c h a r a c t e r s without quotes or two quotes the f i r s t of which i s i g n o r e d . Once a name has been a s s o c i a t e d with a s e c t i o n , i t i s L e x i c a l Scanner Generator 45 p o s s i b l e to use t h a t name i n any one of the f o l l o w i n g statements. ONE OF i d T h i s i s e q u i v a l e n t t o • s e c t i o n 1 where • s e c t i o n 1 i s the s e c t i o n which has been a s s o c i a t e d with the name ' i d * . ANY OF i d T h i s i s e q u i v a l e n t to ' s e c t i o n , s e c t i o n ^ ...• sequenced as many times ( i n c l u d i n g zero) as i s r e q u i r e d to continue a c c e p t i n g c h a r a c t e r s from the i n p u t stream where ' s e c t i o n * i s the s e c t i o n a s s o c i a t e d with the name ' i d * . NOTONE OF i d I f • s e c t i o n 1 i s the s e c t i o n a s s o c i a t e d with the name ' i d 1 , then •NOTONE OF i d 1 i s e q u i v a l e n t t o ' n o t - s e c t i o n * where ' n o t - s e c t i o n * i s d e r i v e d from ' s e c t i o n * by redu c i n g ' s e c t i o n * to i t s b a s i c u n i t s and a l t e r i n g each occurrence o f 'charseq* to {A - *charseq*} where A i s the alphabet c o n s i s t i n g of those c h a r a c t e r s with an i n t e r n a l decimal r e p r e s e n t a t i o n n, 0<=n<=255. NOTANY OF i d I f * s e c t i o n ' i s the s e c t i o n a s s o c i a t e d with the name ' i d * , then 'NOTANY OF i d ' i s e q u i v a l e n t t o ' n o t - s e c t i o n , n o t - s e c t i o n , ...,' seguenced as many times ( i n c l u d i n g zero) as i s r e q u i r e d to continue a c c e p t i n g c h a r a c t e r s from the in p u t stream where ' n o t - s e c t i o n * i s d e r i v e d from ' s e c t i o n * by r e d u c i n g ' s e c t i o n ' to L e x i c a l Scanner Generator 46 i t s b a s i c u n i t s and a l t e r i n g each occurrence of *charseq» to {A - •charseg 1} where A i s the alphabet c o n s i s t i n g of those c h a r a c t e r s with an i n t e r n a l decimal r e p r e s e n t a t i o n n, 0<=n<=255. IGNOBE i d I f * s e c t i o n * i s the s e c t i o n a s s o c i a t e d with ' i d * , then •IGNORE i d * i s e q u i v a l e n t to * i g - s e c t i o n * where * i g - s e c t i o n ' i s d e r i v e d from ' s e c t i o n * by r e d u c i n g ' s e c t i o n * to i t s b a s i c u n i t s and causing a l l t r a n s i t i o n s of ' s e c t i o n * to d i s c a r d any c h a r a c t e r s normally accepted. Any occurrence of the above f i v e p r i m i t i v e s i s again a sequence, thus the d e f i n i t i o n system i s r e c u r s i v e . Since many p a r s e r s r e q u i r e t h a t a number be a s s o c i a t e d with each lexeme, i t i s u s e f u l to be able to s p e c i f y t o the l e x i c a l scanner generator t o a s s o c i a t e with each lexeme a number which the l e x i c a l scanner i s to r e t u r n i f that lexeme i s encountered. In t h i s manner, the l e x i c a l scanner can r e t u r n both the lexeme detected as w e l l as the number a s s o c i a t e d with t h a t lexeme. The l e x i c a l scanner generator al l o w s t h i s by p r o v i d i n g a statement of the form: LEXEME number IS s e c t i o n . where ' s e c t i o n * i s as s p e c i f i e d above. The value of *number' cou l d be an i n t e g e r , f o r example: L e x i c a l Scanner Generator 47 LEXEME 7 IS s e c t i o n . or i t c o u l d be an i d e n t i f i e r : LEXEHE t h i s o n e IS s e c t i o n . In which case the i d e n t i f i e r 'thisone* must be a s s o c i a t e d with an i n t e g e r value v i a a statement of the form: t h i s o n e := i n t e g e r . which must occur before the statement »LEXEME number IS s e c t i o n . ' i s encountered ('integer' being an i n t e g e r c o n s t a n t ) . In a d d i t i o n t o any of the above s p e c i f i c a t i o n s , i t i s p o s s i b l e t o i n c l u d e the f o l l o w i n g two p r i m i t i v e s anywhere i n a seguence as a b a s i c u n i t . HULL "charseg" The value of 'charseg' i s any seguence of c h a r a c t e r s which are to be t r e a t e d as n u l l - c h a r a c t e r s (hence ignored i f they occur at the head of the in p u t stream) from t h i s point t i l l the end of the lexeme d e s c r i p t i o n . For example: LEXEME 9 IS NULL " ", ONE OF "A", ANY OF "ABCDEF". w i l l accept any seguence of the c h a r a c t e r s "ABCDEF" which begins with an "A", i g n o r i n g a l l blanks encountered while b u i l d i n g the L e x i c a l Scanner Generator 48 c h a r a c t e r sequence. NOTNOLL "charseq" The value of *charseq* i s any sequence of c h a r a c t e r s . I f any of •charseq' i s being t r e a t e d as a n u l l c h a r a c t e r , from t h i s p o i n t t i l l the end of the lexeme d e s c r i p t i o n i t H i l l again be re c o g n i z e d as a r e g u l a r c h a r a c t e r . A sequence of d e s c i p t i o n s of t h i s form then make up the t o t a l i n p u t t o the l e x i c a l scanner generator. For s y n t a c t i c reasons, the sequence of d e s c r i p t i o n s must be en c l o s e d between the symbols 'BEGIN* and 'END'. Returning t o the problem of r e c o g n i z i n g the lexeme *XY"Z* from the i n p u t «»x¥ , , , ,Z"» as a s t r i n g , the r e p r e s e n t a t i o n of the lexeme • s t r i n g * would be given as: BEGIN SUBCHAR IS NOTANY OF """", IGNORE "»"•", n t i i i t i . STRING := 1. LEXEME STRING IS IGNORE """", ANY OF SUBCHAB, IGNORE ""»«". END 3.3 Implementation of the L e x i c a l Scanner Generator The program ( w r i t t e n i n PL360<*>) which does the l e x i c a l scanner gene r a t i o n t a k e s as input a d e s c r i p t i o n of a s e t of lexemes i n the form d e s c r i b e d above. As output, i t produces e i t h e r a program i n a " d e s c r i p t i o n language" d e s c r i b i n g the o p e r a t i o n of the l e x i c a l scanner or a program i n the language L e x i c a l Scanner Generator 49 PL360 which can be compiled i n t o an o b j e c t program which, when run, e f f e c t s the l e x i c a l scanner. The main purpose of using the d e s c r i p t i o n language i s to enable the user to check the o p e r a t i o n of the f i n a l l e x i c a l scanner t o v e r i f y t h a t the l e x i c a l scanner produced r e f l e c t s the e x p e c t a t i o n s as to what lexemes were s p e c i f i e d . 3.3.1 The D e s c r i p t i o n Language The d e s c r i p t i o n language i s a readable method of e x p r e s s i n g the s t a t e t r a n s i t i o n f u n c t i o n (delta) of the simple l e x i c a l scanner. There are f o u r forms of i n s t r u c t i o n s which a f f e c t c o n t r o l flow w i t h i n the d e s c r i p t i o n language (anything w i t h i n braces i s o p t i o n a l ) . {Sn} IF "charseg" THEN ( i n s t r - s e g - 1 ) {ELSE ( i n s t r - s e g - 2 )} The value of »n* i f VSn' i s present i s an i n t e g e r , i n which case »Sn* i s a s t a t e l a b e l . I f the next c h a r a c t e r of the i n p u t stream i s any one of 'charseg* then the seguence of i n s t r u c t i o n s • i n s t r - s e g - 1 ' i s t o be executed, otherwise the seguence of i n s t r u c t i o n s • i n s t r - s e g - 2 * i s to be executed. The values of • i n s t r - s e q - 1 * and *instr-seg-2» may be any element of the set E as d e f i n e d i n the simple l e x i c a l scanner (with the same meaning) together with the i n s t r u c t i o n 'GO Sn* where 'Sn» i s a s t a t e l a b e l which t r a n s f e r s c o n t r o l to the i n s t r u c t i o n l a b e l l e d 'Sn'. {Sn} IFNOT "charseg" THEN ( i n s t r - s e g - 1 ) {ELSE ( i n s t r - s e g - 2 )} The value of »n' i f 'Sn* i s present i s an i n t e g e r , i n which L e x i c a l Scanner Generator 50 case 'Sn' i s a s t a t e l a b e l . I f the next c h a r a c t e r of the i n p u t stream i s not any one of •charseq* then the sequence of i n s t r u c t i o n s • i n s t r - s e g - 1 • i s t o be executed, otherwise the sequence of i n s t r u c t i o n s * i n s t r - s e g - 2 * i s t o be executed. The values of ' i n s t r - s e g - 1 ' and * i n s t r - s e q - 2 f may be any element of the s e t E as d e f i n e d i n the simple l e x i c a l scanner (with the same meaning) together with the i n s t r u c t i o n *GO Sn' where *Sn* i s a s t a t e l a b e l which t r a n s f e r s c o n t r o l t o the i n s t r u c t i o n l a b e l l e d *Sn». {Sn} WHILE "charseg" DO ( i n s t r - s e q ) The value of *n* i f *Sn* i s present i s an i n t e g e r , i n which case 'Sn* i s a s t a t e l a b e l . As long as the next c h a r a c t e r of the i n p u t stream i s any one of 'charseq* then the sequence of i n s t r u c t i o n s • i n s t r - s e q * i s to be executed. The value of ' i n s t r - s e q * may be any element of the s e t E as d e f i n e d i n the simple l e x i c a l scanner (with the same meaning). {Sn} WHILENOT "charseg" DO ( i n s t r - s e q ) The value of *n* i f 'Sn' i s present i s an i n t e g e r , i n which case 'Sn' i s a s t a t e l a b e l . As long as the next c h a r a c t e r of the i n p u t stream i s not any one of 'charseg' then the sequence of i n s t r u c t i o n s * i n s t r - s e q ' i s t o be executed. The value o f * i n s t r - s e q ' may be any element of the s e t E as d e f i n e d i n the simple l e x i c a l scanner (with the same meaning). In a d d i t i o n to the fou r c o n t r o l i n s t r u c t i o n s as d e f i n e d L e x i c a l Scanner Generator 51 above, the f i f t h and f i n a l i n s t r u c t i o n i n the d e s c r i p t i o n language i s : ( i n s t r - s e q ) The seguence of i n s t r u c t i o n s • i n s t r - s e q * i s to be executed. The value of * i n s t r - s e q * may be any element of the s e t E as d e f i n e d i n the simple l e x i c a l scanner (with the same meaning) . The l e x i c a l scanner generator can a l s o produce a source program f o r the language PL360 to enable the user to get a working v e r s i o n of the l e x i c a l scanner. The PL360 source code generated i s machine dependent i n t h a t the language PL360 i s e s p e c i a l l y designed f o r the IBM 360 and 370 s e r i e s computers and so PL360 c o m p i l e r s do not e x i s t f o r most other computers. I f the r e s u l t i n g l e x i c a l scanner i s meant to run on a computer other than the IBM 360 or 370, then i t i s necessary to write a t r a n s l a t o r f o r the machine independent d e s c r i p t i o n language t o a language f o r the computer on which the l e x i c a l scanner i s to run. The purpose of the l e x i c a l scanner generator i s to t r a n s l a t e from the i n p u t language t o the d e s c r i p t i o n language. The a n a l y s i s done to accomplish t h i s i s s i m i l a r to t h a t done f o r f i n i t e s t a t e machines. For example, the f o l l o w i n g i s a l e g a l i n p u t t o the system. BEGIN LEXEME 1 IS ••:». LEXEME 2 IS ":=". L e x i c a l Scanner Generator 52 END which produces as i n s t r u c t i o n s i n the d e s c r i p t i o n language: 51 IF ":» THEN (ACCEPT GOTO S2) ELSE (EBROE) 52 IF «=" THEN (ACCEPT EETDBN 2) (BETUfiN 1) i • I f each lexeme i s regarded as a sequence of t r a n s i t i o n s , then t h e r e i s the problem of what to do when a * appears at the beginning of the i n p u t stream s i n c e a *:* co u l d be the s t a r t of e i t h e r lexeme 1 or lexeme 2; a non-determinism t h a t must be r e s o l v e d . T h i s i s s i m i l a r to the same s i t u a t i o n a r i s i n g i n the case of f i n i t e - s t a t e machines. Since the l e x i c a l scanner i s s i m i l a r to a f i n i t e s t a t e machine, the program i s broken down i n t o seven phases resembling o p e r a t i o n s on f i n i t e s t a t e machines: the c o n s t r u c t i o n of an i n i t i a l f i n i t e s t a t e machine with empty t r a n s i t i o n s (NDFSH8ET), the c o n s t r u c t i o n of an empty t r a n s i t i o n graph (ETG), the c o n s t r u c t i o n of the t r a n s i t i v e c l o s u r e of the empty t r a n s i t i o n graph (TCETG), the c o n s t r u c t i o n of the n o n - d e t e r m i n i s t i c f i n i t e s t a t e machine (NDFSH), the c o n s t r u c t i o n of the d e t e r m i n i s t i c f i n i t e s t a t e machine (DFSM), the a n a l y s i s t o check f o r c o r r e c t n e s s of the DFSfl, and the punching of the f i n a l machine i n the d e s c r i p t i o n language. Although each of these phases bears the name o f i t s corresponding t r a n s f o r m a t i o n with f i n i t e s t a t e machines, the phases are not completely the same as the L e x i c a l Scanner Generator 53 corresponding t r a n s f o r m a t i o n s s i n c e the e f f e c t s on the simple l e x i c a l machine, which must be c o n s i d e r e d i n each of the seven phases, do not enter i n t o the corresponding f i n i t e s t a t e machine t r a n s f o r m a t i o n s . 3.3.2 The I n i t i a l Machine JNDFSMWET1 The f i r s t phase i n the g e n e r a t i o n of a l e x i c a l scanner i s the t r a n s l a t i o n of the i n p u t to an i n t e r n a l r e p r e s e n t a t i o n . T h i s i s done by c r e a t i n g an i n t e r n a l r e p r e s e n t a t i o n of the f o l l o w i n g form f o r each of the i n p u t p r i m i t i v e s . ONE OF "charseg" * | charseg | I S1 |— >| S2 | I I I i i 1 ANY OF "charseg" charseg i 1 f I I I S I H 1 I L e x i c a l Scanner Generator 54 IGNORE "charseg" ignore , 1 charseg | | >l S2 | i I i j "charseg" (e.g. "abc") i— 1 i 1 i 1 i 1 I | a | I b | | c | | I S1 j >| S2 | >| S3 i >\ S4 \ i I I I ! I I I i 1 i 1 i i i j For the f i r s t two, the negation forms (* NOTONE* f o r 'ONE*, •NOTANY* f o r * ANY*) are c r e a t e d by c r e a t i n g a s t r u c t u r e f o r 'ONE OF {A - charseg }• f o r * NOTONE OF "charseg"' and 'ANY OF [k -"charseg"}* f o r * NOTANY OF "charseg"*; where A i s the alphabet c o n s i s t i n g of the c h a r a c t e r s whose i n t e r n a l r e p r e s e n t a t i o n s have the numerical values 0 to 255. In a d d i t i o n , i f the c u r r e n t s e t of NULLed c h a r a c t e r s (those c h a r a c t e r s being ignored as the r e s u l t of a * NULL "charseg"' having been encountered) i s not empty, than t o each a r c i n any of the above i s added the l a b e l l i n g i g n o r e n u l l c h a r s I I I S1 fr I I i i The " g l u e " which i s used t o b u i l d the lexeme d e s c r i p t i o n s L e x i c a l Scanner Generator 55 from these simple b u i l d i n g b l o c k s i s the empty t r a n s i t i o n . Thus, the lexeme d e s c r i p t i o n ONE OF "ABC", ANY OF "ABC_" i s put tog e t h e r as: ABC S1 |-ABC i I ->l S2 |-I I empty I I I ->| S3 i—• and the lexeme d e s c r i p t i o n : ONE OF "ABC" OB ANY OF "D" i s put together as: ABC i I ->l S3 | i I i i In a d d i t i o n t o t h i s , one s t a t e , the s t a r t s t a t e , i s b u i l t which has an empty t r a n s i t i o n to the f i r s t s t a t e of the i n t e r n a l L e x i c a l Scanner Generator 56 r e p r e s e n t a t i o n of each of the lexeme d e s c r i p t i o n s . With the excep t i o n of the notion of t r a n s i t i o n s being l a b e l l e d 'accept* ( r e t a i n the c h a r a c t e r at the head of the i n p u t stream) or 'ignore' ( d i s c a r d the c h a r a c t e r at the head of the i n p u t stream), the i n t e r n a l s t r u c t u r e b u i l t i s a f i n i t e s t a t e machine. 3.2.2 The Empty T r a n s i t i o n Graph Once the HDFSHWET has been c r e a t e d , i t i s necessary to remove the empty t r a n s i t i o n s b e f o r e attempting any form of al g o r i t h m f o r c o n v e r t i n g from a n o n - d e t e r m i n i s t i c t o a d e t e r m i n i s t i c form of the machine. In order t o achieve t h i s , i t i s necessary t o know, f o r any two s t a t e s s1 and s2 i f i t i s p o s s i b l e t o reach s2 from s1 v i a an empty t r a n s i t i o n path. T h i s i n f o r m a t i o n i s obtained by f i r s t c o n s t r u c t i n g an empty t r a n s i t i o n matrix H, a b i n a r y matrix f o r which H ( i , j ) i s 1 i f and only i f there i s an empty t r a n s i t i o n from s t a t e s i to s j and then f i n d i n g the t r a n s i t i v e c l o s u r e H' of M. The matrix H i s c o n s t r u c t e d by f i r s t s e t t i n g M t o zero and then s e t t i n g fl(i,j) t o 1 i f an empty t r a n s i t i o n e x i s t s between s t a t e s i and s j . 3.2.3 The T r a n s i t i v e C l o s u r e of the Empty T r a n s i t i o n Graph In order to determine not onl y i f s t a t e s1 has an empty t r a n s i t i o n to s t a t e s2, but i f s t a t e s2 can be reached from s1 by some seguence of empty t r a n s i t i o n s , i t i s necessary to compute the t r a n s i t i v e c l o s u r e o f the ETG. The a l g o r i t h m used to do t h i s c a l c u l a t i o n i s a m o d i f i c a t i o n of the Harshall< 7> a l g o r i t h m developed by Warren* 8>. In t h i s a l g o r i t h m , the f a c t L e x i c a l Scanner Generator 57 th a t the system i s implemented on an IBM 370 with both a T r a n s l a t e and Test and an Or Character i n s t r u c t i o n i s taken i n t o account. A l s o , t h i s a l g o r i t h m r e g u i r e s t h a t only one pass be made over the matrix. I f i t i s p o s s i b l e t o t e s t w e n t r i e s of M at a time f o r an a l l - z e r o value and t o " o r " v e n t r i e s at a time (both v and w are 255 i n t h i s implementation), and M i s an nxn matrix, then f o r l a r g e n, the l i m i t i n g cases give n 2 t o n 3 / v op e r a t i o n s f o r H a r s h a l l ' s a l g o r i t h m and n2/w to n 3/v f o r Barren's a l g o r i t h m . The f a c t t h a t M i s reasonably sparse means t h a t i f n i s the number of s t a t e s , the a l g o r i t h m r e g u i r e s approximately 0 ( n 2)/255 o p e r a t i o n s r a t h e r than the 0(n 2) op e r a t i o n s r e q u i r e d by the S a r s h a l l a l g o r i t h m u s i n g the T r a n s l a t e and Test and Or Character i n s t r u c t i o n s , or the 0(n 3) o p e r a t i o n s r e q u i r e d by the standard S a r s h a l l a l g o r i t h m T h i s i s an important f a c t s i n c e the number of s t a t e s c r e a t e d i n the t r a n s l a t i o n of the i n p u t c o u l d be q u i t e l a r g e ; i n i t i a l t i mings f o r an ALG0L-I< 9> l e x i c a l scanner have i n d i c a t e d t h a t the system spends 90 to 95 per-cent of i t s time i n t h i s phase i f the standard H a r s h a l l a l g o r i t h m i s used and o n l y 10 per-cent of i t s time i f the warren a l g o r i t h m i s used. 3.2.4 Computation of the NDFSM With the computation o f the t r a n s i t i v e c l o s u r e of the ETG comes the necessary i n f o r m a t i o n t o form the NDFSM. The method used i s the standard method f o r removing t r a n s i t i o n s from a NDFSMHET which can be o u t l i n e d as f o l l o w s : L e x i c a l Scanner Generator 58 For each s t a t e S i , i f M(i,j)=1 then add a l l the non-empty t r a n s i t i o n s of s t a t e S j to S i . I f S j i s f i n a l , then mark S i as f i n a l . A f t e r t h i s process has been c a r r i e d out on a l l the s t a t e s i n the system, those s t a t e s which can no longer be reached from the s t a r t s t a t e may be d i s c a r d e d . The proof of the c o r r e c t n e s s of t h i s a l g o r i t h m i s g i v e n by Aho, Hopcroft and O i l m a n * 1 0 * . The a l g o r i t h m as i t stands works f o r the NDFSWET even though the NDFSM i s not r e a l l y a f i n i t e s t a t e machine provided the i n f o r m a t i o n to accept o r ignore the i n p u t c h a r a c t e r i s c a r r i e d along with a t r a n s i t i o n when t h a t t r a n s i t i o n i s added t o another s t a t e . 3.2.5 Computation of the DFSM With the assurance t h a t the NDFSM has no empty t r a n s i t i o n s , i t i s p o s s i b l e t o remove n o n - d e t e r m i n i s t i c t r a n s i t i o n s . I t i s again necessary to c o n s i d e r the f a c t t h a t the NDFSM i s not g u i t e a f i n i t e s t a t e machine i n t h a t the t r a n s i t i o n s are l a b e l l e d e i t h e r ' s t r i n g 1 (accept the c h a r a c t e r of the t r a n s i t i o n ) or 'ignore' (ignore the c h a r a c t e r of the t r a n s i t i o n ) . In the removal of n o n - d e t e r m i n i s t i c t r a n s i t i o n s i n a f i n i t e s t a t e machine F, a new machine F* i s c r e a t e d whose s t a t e s are the subsets of the s t a t e s of F. I f F c o n t a i n s a s t a t e S which has a t r a n s i t i o n on a c h a r a c t e r a to each of n s t a t e s s1, s2, Sn then F* has a t r a n s i t i o n from {S} to {s1,s2,...,sn} on a. The e f f e c t of t h i s i s then to " c o l l a p s e " the n t r a n s i t i o n s of F to L e x i c a l Scanner Generator 59 one t r a n s i t i o n i n F*. In the NDFSM, t h i s i s not q u i t e so simple i n that i n a d d i t i o n to the t r a n s i t i o n s of the NDFSM being l a b e l l e d by the c h a r a c t e r of the t r a n s i t i o n , they are a l s o l a b e l l e d e i t h e r • s t r i n g ' or ' i g n o r e * . From the meaning of ' s t r i n g * and "ignore* given above, i t i s easy to see t h a t i f a s e t of n t r a n s i t i o n s from s i to s j on a c h a r a c t e r a are a l l l a b e l l e d ' s t r i n g * then the r e s u l t i n g t r a n s i t i o n i n the new machine should be l a b e l l e d ' s t r i n g * . The same i s true i f a l l the t r a n s i t i o n s are l a b e l l e d ' i gnore', the problem being In the case where some of the t r a n s i t i o n s are l a b e l l e d * s t r i n g * and some are l a b e l l e d ' i g nore*. In order t o handle t h i s , a new type of t r a n s i t i o n i s cr e a t e d , the »limbo' t r a n s i t i o n . I f a t r a n s i t i o n i s l a b e l l e d •limbo', then t h i s t r a n s i t i o n has been c r e a t e d by " c o l l a p s i n g " a number of t r a n s i t i o n s , some of which were o r i g i n a l l y l a b e l l e d ' s t r i n g ' and some of which were o r i g i n a l l y l a b e l l e d ' i gnore*. Aside from t h i s l a b e l l i n g change, the NDFSM can be t r e a t e d as i f i t were a n o n - d e t e r m i n i s t i c f i n i t e s t a t e machine and the standard power-setting algorithms can be used. In order t o determine which subset of the s t a t e s of the NDFSM i s to be the s t a t e of the DFSM t o which a t r a n s i t i o n i s t o go, i t i s necessary t o examine a l l p o s s i b l e combinations of the t r a n s i t i o n s of each p a r t i c u l a r s t a t e of the NDFSM t o determine i f the i n t e r s e c t i o n of the c h a r a c t e r s l a b e l l i n g each t r a n s i t i o n i n the combination i s empty or not. I f i t i s not empty, there i s a non-determinism between those t r a n s i t i o n s which make up the L e x i c a l Scanner Generator 60 combination being examined. although the combinations c o u l d be looked at i n any o r d e r , f o r reasons of e f f i c i e n c y i t i s b e t t e r to examine the l a r g e combinations before the s m a l l ones. For example, suppose s t a t e s1 has the f o l l o w i n g t r a n s i t i o n s : to s t a t e s2 on a,b,c to s t a t e s3 on a to s t a t e s4 .on - a/b I f the f i r s t combination examined was the combination c o n s i s t i n g of the f i r s t two t r a n s i t i o n s , then the non-determinism i n the c h a r a c t e r 'a' would be d e t e c t e d , g i v i n g a new s e t of t r a n s i t i o n s : to s t a t e {s2,s3} on a to s t a t e s2 on b,c to s t a t e s4 on a,b The non-determinism between s t a t e s {S2,S3} and SU would then be detected l a t e r , but i f the combinations had been examined i n a d i f f e r e n t o r d e r , from the l a r g e s t combination t o the s m a l l e s t , then i t i s obvious t h a t i t would never be necessary to i n c l u d e any newly c r e a t e d s t a t e s i n the powersetting process. In the above example, i f the f i r s t combination of t r a n s i t i o n s examined was the combination c o n s i s t i n g of a l l t h r e e t r a n s i t i o n s , then the r e s u l t i n g t r a n s i t i o n s a f t e r a n a l y s i s would be: L e x i c a l Scanner Generator 61 to s t a t e {s2,s3,s4} on a to s t a t e s2 on b,c to s t a t e s4 on b The t h r e e combinations i n v o l v i n g two t r a n s i t i o n s would then be examined g i v i n g the d e t e r m i n i s t i c t r a n s i t i o n s : to s t a t e {s2,s3,s4} on a to s t a t e £s2,s4} on b to s t a t e {s2} on c I f c i s the (approximately constant) number of o p e r a t i o n s r e q u i r e d to examine one combination of t r a n s i t i o n s , then i n order to examine a l l the combinations of the n t r a n s i t i o n s of a given s t a t e i t i s necessary to perform 0 (c2**n) o p e r a t i o n s . S i n c e the number of op e r a t i o n s i n v o l v e d i n keeping t r a c k of which t r a n s i t i o n s are c u r r e n t l y being examined tends t o make c r a t h e r l a r g e , the power-setting process becomes very slow f o r l a r g e n. A quick examination shows t h a t the process of c r e a t i n g the NDFSM from the NDFSMWET can only i n c r e a s e , never decrease, the number of t r a n s i t i o n s out of a given s t a t e . S i n c e , by the r u l e s of the c o n s t r u c t i o n of the NDFSMWET the s t a r t s t a t e of the system must i n i t i a l l y have m empty t r a n s i t i o n s where m i s the number of lexemes being d e f i n e d , i t f o l l o w s t h a t the same s t a t e i n the NDFSM must have at l e a s t m t r a n s i t i o n s . Since f o r many languages m i s i n the range of 25 t o 50, t h i s means t h a t the L e x i c a l Scanner Generator 62 i time r e q u i r e d t o power-set the s t a r t s t a t e i s 0 { c 2 3 0 ) or approximately 1,000,000,000. Thus i t becomes i m p r a c t i c a l ( f o r 30 lexemes, approximately 20 minutes on the IBM 370) t o power set the s t a r t s t a t e . The example mentioned above i s r e a l l y a worst case f o r power-setting i n t h a t the c h a r a c t e r 'a* occurs i n each t r a n s i t i o n . In many cases (and f o r the s t a r t s t a t e of the NDFSM, most cases) i t i s p o s s i b l e to break down the s e t of t r a n s i t i o n s T f o r a p a r t i c u l a r s t a t e i n t o subsets T1, T2, ...» Tn such t h a t : [{a j a l a b e l s Ti} i n t e r s e c t £a J a l a b e l s Tj}}=# V i , j D e f i n i t i o n 2.1 I f T i s the s e t of t r a n s i t i o n s of a s t a t e S, the s e t of t r a n s i t i o n s obtained a f t e r power-setting Theorem Let S be a s t a t e , T be the s e t of t r a n s i t i o n s {t1,t2,...,tn} out of S and A i be the s e t of c h a r a c t e r s l a b e l l i n g t i (1<=i<=n). I f there e x i s t T1, 12, ...» Tk subsets of T such t h a t T=T1 o 12 11 ... 0 Tk G * r ? s t i € T r , t j € Ts => {Ai i n t e r s e c t t hen P(T)=P(T1) 0 P(T2) U ... 0 P(Tk) then PJT1 i s T. L e x i c a l Scanner Generator 63 Proof I f s u f f i c e s t o show t h a t i f d e l t a i s the s t a t e t r a n s i t i o n f u n c t i o n d e f i n i n g T, d e l t a ' the s t a t e t r a n s i t i o n f u n c t i o n d e f i n i n g P (T) and d e l t a * ' the s t a t e t r a n s i t i o n f u n c t i o n d e f i n i n g P(T1) U P(T2) U ... D P(Tk) then d e l t a • = d e l t a " . Suppose the t r a n s i t i o n s out of S on a a r e : t t i s d e l t a (a,S) =S1 t2 i s d e l t a (a,S)=S2 tm i s d e l t a (a,S) =Sm then by d e f i n i t i o n of powersetting, P (T) y i e l d s : d e l t a ' (a,S) = {S1,S2,... ,Sm} and t h i s i s the only t r a n s i t i o n on (a,S) . By the d e f i n i t i o n of T i , {t1,t2,...tmj i s a subset of T i f o r onl y one i s i n c e i f {t1 , t 2 , . . . ,t.o} i s a subset of two t r a n s i t i o n s e t s T i and T j then a e A i , a € Aj (by d e f i n i t i o n of A i , Aj) and {Ai i n t e r s e c t hj}?p c o n t r a d i c t i n g the d e f i n i t i o n of T i and T j . T h e r e f o r e , {t1,t2,...,tm} i s a subset of T i f o r only one i . By the d e f i n i t i o n of power-setting, P(T) y i e l d s d e l t a ' (a,S)= {S1,S2,...,Sm} and s i n c e {t1,t2,...,tm} i s a subset of only one T i , i t f o l l o w s t h a t P ( T i) y i e l d s d e l t a ' • (a,S)= {S1,S2,...,Sm}, the unique t r a n s f o r m a t i o n on a. Thus d e l t a * = d e l t a " . The important p a r t o f t h i s theorem i s t h a t while the order L e x i c a l Scanner Generator 6H of magnitude of powersetting T i s 0(c2**|T|) (where |T| denotes the number of elements of T ) , the order of magnitude of powersetting T1, T2, ..., Tn i s : 0 ( c 2 * * | T l j ) + 0(c2**jT2J) + ... + 0(c2**|Tn|) which i s c o n s i d e r a b l y s m a l l e r s i n c e JTj = |T1| + |T2J + ... + |Tn|. In the case of the s t a r t s t a t e of the NDFSM, t h i s s aving can be g u i t e s i g n i f i c a n t . For example, c o n s i d e r the s t a r t s t a t e of the NDFSM f o r the lexeme s e t o f the language ALGOL-a gi v e n i n Appendix B. To s t a t e S1 to s t a t e S2 to s t a t e S3 to s t a t e S4 to s t a t e S5 to s t a t e S6 to s t a t e S7 to s t a t e S8 to s t a t e S9 to s t a t e S10 To s t a t e S11 to s t a t e S12 to s t a t e S13 to s t a t e S14 to s t a t e S15 to s t a t e S16 t o s t a t e S17 to s t a t e S18 to s t a t e S19 to s t a t e S20 to s t a t e S21 to s t a t e S22 to s t a t e S23 to s t a t e S24 to s t a t e S25 to s t a t e S26 to s t a t e S27 to s t a t e S28 to s t a t e S29 on on on on on on on on on on on on on on on on on on on on on on on on on on on on on ( s t a r t of the lexeme b l a n k s t r i n g ) A-Z ( s t a r t o f the lexeme i d e n t i f i e r ) C ( s t a r t of the lexeme *COMMENTcharseg;« 0-9 ( s t a r t of the lexeme i n t e g e r ) # ( s t a r t of the lexeme hexadecimal constant) " ( s t a r t of the lexeme s t r i n g ) ; ( s t a r t of the lexeme ';*) , ( s t a r t of the lexeme »,') : ( s t a r t o f the lexeme •: ( s t a r t of the lexeme ( ( s t a r t of the lexeme ) + < < > > ( s t a r t o f the lexeme ( s t a r t of the lexeme ( s t a r t of the lexeme ( s t a r t o f the lexeme ( s t a r t o f the lexeme ( s t a r t of the lexeme ( s t a r t of the lexeme ( s t a r t of the lexeme ( s t a r t of the lexeme ( s t a r t of the lexeme ( s t a r t o f the lexeme ( s t a r t of the lexeme ( s t a r t of the lexeme ( s t a r t of the lexeme ( s t a r t of the lexeme ( s t a r t of the lexeme ( s t a r t of the lexeme »01» ( s t a r t o f t h e lexeme e n d - o f - f i l e ) ') ,-') C ) ) ») + ') -') *«) /•) **«) I') = M -,= •) <•) <=») >') >=') ::») : = ») '») L e x i c a l Scanner Generator 65 For t h i s s t a t e , the s e t T i s : {S1,S2,...,S29} which s p l i t s up i n t o the subsets: T1 • =• {S1} T2 = {S2,S3} T3 = {S4} T4 = {S5} T5 = {S6} T6 = {S7} T7 = {S8} T8 = {S9,S26,S27} T9 = [S10> T10 = {S11} T11 = {S12) T12 = {S13} T13 = {S14} T14 = {S15,S17} T15 = {S16} T16 = {S18,S21} T17 = {SI 9} T18 = {S20} 119 = {S22,S23} T20 = {S24,S25) T21 = {S28} T22 = {S29} Since there i s no need t o powerset any T i which c o n t a i n s only one element, i t f o l l o w s t h a t the e x e c u t i o n time of the modified a l g o r i t h m i s : T I M E = 0(C2**|T2|) + 0(C2**|T8|) + 0(C2**|T14|) + 0(C2**|T16|) + 0{C2**1T19I) + O(c2**|T20|) = 0{4c) + 0(8c) + 0(4c) • 0(4c) + 0(4c) • 094c) = 0(28c) which i s much b e t t e r than the e x e c u t i o n time f o r the o r i g i n a l Lexical Scanner Generator 66 algorithm, 0(c22«) or 0 (c*500,000,000) . , 3.2.6 Checking the correctness of the DFSM although the modified algorithm f o r the construction of the DFSM does form limbo t r a n s i t i o n s , i t s t i l l does not take into account the fact that creating the DFSM can cause t r a n s i t i o n seguences to occur which are not possible to emulate i n the l e x i c a l scanner. The phase of the processing which checks t h i s occurs immediately a f t e r the DFSM has been created. The problem which occurs i s that a limbo t r a n s i t i o n i s effected i n the f i n a l l e x i c a l scanner by having the l e x i c a l scanner issue a HOLD i n s t r u c t i o n to the l e x i c a l machine. As was previously mentioned, once a character i s i n the hold of the l e x i c a l machine, i t must be removed before further processing of the input stream can occur. In terms of the DFSM, t h i s means that a seguence of two or more limbo ins t r u c t i o n s cannot be allowed. The phase which creates the DFSM does not check for t h i s , I f the above algorithm for the construction of the DFSM encounters a structure of the form: L e x i c a l Scanner Generator 67 IGNORE A J | i >l S2 h 1 i I ACCEPT B i 1 1 I I ' > | S4 t--ACCEPT A I 1 i i 1 i >l S3 | I I IGNORE B | | >, S 5 , I i i t m o d i f i e s i t t o be of the form: LIMBO A j " 1 -->JS2,S* I f -ACCEPT B | 1 , >1 S3 | I I I I I I • >| S5 | IGNORE B | | and then t o : I | LIMBO A | | LIMBO B | | j s i j >|S2,S T >1S3,S$ I I I I I I L e x i c a l Scanner Generator 68 The f i n a l s t r u c t u r e cannot be represented i n the l e x i c a l machine s i n c e i t r e q u i r e s two c h a r a c t e r s to be put i n t o the h o l d . I t should a l s o be noted that there would be a l u r k i n g ambiguity i f a s t r u c t u r e of t h i s form were to be allowed i n the system s i n c e once s t a t e {s3,s5} was reached there would be no way to t e l l whether to ignore a and accept b or ignore b and accept a; the problem coming from the f a c t t h a t i f the f i v e o r i g i n a l s t a t e s were l a b e l l e d : 51 t o S2 ignore a 52 to S3 accept b S1 to SU accept a S4 to S5 ig n o r e b then the computation of the DFSM would y i e l d the same f i n a l s t r u c t u r e as above. In a d d i t i o n to the a n a l y s i s o f the limbo t r a n s i t i o n s c r e a t e d during c r e a t i o n of the DFSM, t h i s phase a l s o checks f o r any a m b i g u i t i e s i n t h e simple l e x i c a l machine a c t i o n f o r a f i n a l s t a t e . Since the c r e a t i o n of the DFSM could j o i n two s t a t e s which are f i n a l and each f i n a l s t a t e i n the system stands f o r the r e t u r n of a p a r t i c u l a r lexeme, an ambiguity c o u l d a r i s e i n the lexeme which i s t o be returned from the combined s t a t e . As an example of t h i s , c o n s i d e r the d e s c r i p t i o n s : ONE OF "ABCDEFGHIHJKLMN", ANY OF "ABCDEFGHIJKLMN" and L e x i c a l Scanner Generator 69 "BEGIN" both of which w i l l accept the word "BEGIN", a f a c t which i s de t e c t e d by checking f o r the f i n a l s t a t e ambiguity d e s c r i b e d above. A f t e r t h i s phase has been completed, i t i s then p o s s i b l e to punch out e i t h e r a d e s c r i p t i o n language or PL360 source code r e p r e s e n t a t i o n of the l e x i c a l scanner. 3.2.7 Punching the Machine T h i s phase i s i n c l u d e d here f o r completeness r a t h e r than f o r i t s s i g n i f i c a n c e . The o n l y p o i n t of i n t e r e s t i s t h a t as p a r t of the punching phase* the t r a n s i t i o n s are s o r t e d so t h a t t r a n s i t i o n s from the s t a t e to i t s e l f (•SHILE* t r a n s i t i o n s ) are punched f i r s t . A l s o an attempt i s made to ensure, i f p o s s i b l e , t h a t i f a t r a n s i t i o n e x i s t s on every c h a r a c t e r out of a s t a t e , then the l a s t t r a n s i t i o n punched t r a n s f e r s c o n t r o l t o the next s t a t e by " f a l l i n g i n t o i t " i n the s e q u e n t i a l e x e c u t i o n of i n s t r u c t i o n s . In other words, a statement *G0 s2* i s not punched f o r the l a s t t r a n s i t i o n of a s t a t e s1 i f i t i s p o s s i b l e to punch s t a t e s2 immediately a f t e r punching s t a t e s1. The a n a l y s i s done i n t h i s phase i s independent of whether the i n t e n t i s to punch the machine i n the d e s c r i p t i o n language or i n PL360 source code. L e x i c a l Scanner Generator 70 Chapter 4 4.1 Conclusion The l e x i c a l scanner which i s produced v i a the above s e t of t r a n s f o r m a t i o n s provides the language designer with one of the t o o l s r e q u i r e d i n the c o n s t r u c t i o n of a co m p i l e r . The system a l l o w s f o r the ge n e r a t i o n of a very l a r g e c l a s s of l e x i c a l scanners and as such i s u s e f u l i n an e q u a l l y l a r g e number of a p p l i c a t i o n s . Since the in p u t language to the system i s s i m i l a r to the user's n o t i o n of what i t i s he i s working with, the system i s com f o r t a b l e to work with f o r even the i n e x p e r i e n c e d u s e r . There a r e , of course, many other components r e q u i r e d i n order to b u i l d a compiler: r e s e r v e d word r e c o g n i z e r s , p a r s e r s , tr e e b u i l d e r s and code generators are j u s t some of the components which c o u l d make up a co m p i l e r . Some of these components, such as p a r s e r s , have caused much i n t e r e s t and as such have been analyzed i n d e t a i l . On the other hand, the problems i n v o l v e d i n l i n k i n g a p a r s e r generator i n t o a compile r generator have r e c e i v e d l i t t l e a t t e n t i o n . Other components have been i n v e s t i g a t e d i n d i r e c t l y ; a ge n e r a t i o n program f o r the r e c o g n i t i o n of r e s e r v e d words does not e x i s t , but the theory of hashing f u n c t i o n s which would be u s e f u l to the des i g n e r of such a generator does e x i s t . S t i l l o ther components have not been i n v e s t i g a t e d a t a l l , such as ambiguity scanners ( f o r the s o l u t i o n of the open and c l o s e symbol problem i n ALGOL-68). F i n a l l y , t h e r e are those components f o r which a formal methodology i s s t i l l l a c k i n g , L e x i c a l Scanner Generator 71 these being i n the aspects o f the semantic i n t e r p r e t a t i o n of a language and how t o i n c l u d e t h i s i n t o a compiler g e n e r a t i o n system. Each of these components opens a door to an area i n which f u r t h e r r e s e a r c h i s needed. Once some of these areas have been entered and g e n e r a t i o n systems developed f o r each of these components, then and only then w i l l the t r u l y modular compil e r g e n e r a t i o n system have been achieved. Lexical Scanner Generator 72 BIBLIOGRAPHY [1] Feldman, J . A. A Formal Semantics for Computer Languages and i t s Application i n a Compiler-Compiler, Communications of the ACM, Volume 9, Number 1, 1966 £2] van Wijngaarden, A., Mailloux, B. J . , Peck, J. E. L., Koster, C. H. A., S i n t z o f f , M., Lindsay, C. H., Meertens, L. G. L. T. , Fisker, R. G. (editors) Revised Report on the Algorithmic Language ALGOL-68, Springer-Verlag, New York, 1974 [3] Beautier, D, A. PSAP: A Guide f o r Users, University of C a l i f o r n i a at Santa Barbara, 1973 [4] Johnson, H. L., Porter, J. H., Ackley, S. , 1 . , Boss, D. T. Automatic Generation of E f f i c i e n t L e x i c a l Processors Osin3 F i n i t e State Technigues, Communications of the ACM, Volume 11, Number 12, 1968 £5] Richards, M. BCPL: A Tool f o r Compiler Writing and System Programming, Spring Joint Computer Conference, AFIPS Press, 1969 [6] PL36Q Programming Manual, University Printing Service, Newcastle, 1972 [7] Marshall, S. A Theorem on Boolean Matrices. Communications of the ACM, Volume 9, Numer 1, 1962 [8] warren, H. S. J r . A Modification of Harshall^s Algorithm for the Transitive Closure of Binary, fiSiitions, Communications of the ACM, Volume 18, Number 4, 1975 £9] ALGOL-w Programming Manual, University P r i n t i n g Service, Newcastle, 1970 [10] Aho, A., Hopcroft, J. E., Ullman, J . D. The Design and Analysis of Computer Algorithms, Addison-Hesiey, Don Mills,"1974 L e x i c a l Scanner Generator 73 Appendix A An Input D e s c r i p t i o n f o r ALGOL-g The f o l l o w i n g i s an example of the type of i n p u t which i s accepted by the l e x i c a l scanner generation system. T h i s example i s f o r the language ALGOL-*. BEGIN LEXEME 1 IS " ", ANY OF " n . LEXEME 2 IS ONE OF «ABCDEFGHIJKLMNOPQRST0V8XYZ", ANY OF "ABCDEFGHIJKLMNOPQBSTUVHXYZ0123456789_". LEXEME 3 IS '•COMMENT", NOTANY OF ";", ";". LEXEME 4 IS ONE OF "0123456789", ANY OF "0123456789". LEXEME 5 IS "#", ONE OF "0123456789", ANY OF "0123456789". SUBCHAB IS NOTANY OF """" OR IGNORE «"«", «»»««. LEXEME 6 IS IGNORE ««»", ANY OF SUBCHAR, IGNORE »""". LEXEME 7 IS ";". LEXEME 8 IS ",«. LEXEME 9 IS ":". LEXEME 10 LEXEME 11 LEXEME 12 LEXEME 13 LEXEME 14 LEXEME 15 LEXEME 16 LEXEME 17 LEXEME 18 LEXEME 19 LEXEME 20 LEXEME 21 LEXEME 22 LEXEME 23 LEXEME 24 LEXEME 25 LEXEME 26 LEXEME 27 LEXEME 28 LEXEME 28 END IS ".". IS " {». IS " ) " . IS "+". IS IS f t * It IS "/". IS n $ * l l • IS l i _ i l IS "| ". IS n = t t IS l i - , — l l • IS "<". IS ••<=!! IS ">". IS II > = « • IS •1 . . II • IS II - =11 • IS t i 1 • It * IS It 1 ^ 1 II The f o l l o w i n g i s the l e x i c a l scanner (given i n the d e s c r i p t i o n language) f o r the ALGOL-W lexeme d e s c r i p t i o n given L e x i c a l Scanner Generator 74 above. S95 IF " " (ACCEPT GO S2) IF "C» (ACCEPT GO S97) IF "ABDEFGHIJKLMNOPQRSTUVSXYZ" (ACCEPT GO S5) IF "0123456789" (ACCEPT GO S19) IF "#" (ACCEPT GO S22) IF """ (IGNORE GO S33) IF ";« (ACCEPT RETURN 7) IF "," (ACCEPT RETURN 8) IF "." (ACCEPT RETURN 10) IF " (" (ACCEPT RETURN 11) IF ") " (ACCEPT RETURN 12) IF •»••" (ACCEPT RETURN 13) IF "-» (ACCEPT RETURN 14) IF »/" (ACCEPT RETURN 16) IF "*" (ACCEPT GO S98) IF "|" (ACCEPT RETURN 19) IF "=" (ACCEPT RETURN 20) IF "-*" (ACCEPT GO S99) IF "<" (ACCEPT GO S50) IF ">" (ACCEPT GO S51) IF ":" (ACCEPT GO S52) IF "«" (ACCEPT RETURN 28) IF " ? " (ACCEPT RETURN 29) ELSE ERROR 552 IF ":" (ACCEPT RETURN 26) IF "=" (ACCEPT RETURN 27) (RETURN 9) S51 IF "=» (ACCEPT RETURN 25) (RETURN 24) S50 IF "=" (ACCEPT RETURN 23) (RETURN 22) S99 IF "=" (ACCEPT RETURN 21) (RETURN 18) S98 IF "*" (ACCEPT RETURN 17) (RETURN 15) S33 IFNOT """ (ACCEPT GO S36) (IGNORE ) 553 IF """ (ACCEPT GO S40) (RETURN 6) S40 IFNOT »"" (ACCEPT GO S36) (IGNORE GO S53) S36 WHILENOT """ (ACCEPT ) (IGNORE GO S53) S22 IF "0123456789" (ACCEPT ) ELSE ERROR 524 IF "0123456789" (ACCEPT GO S25) (RETURN 5) 525 WHILE "0123456789" (ACCEPT ) (RETURN 5) 519 IF "0123456789" (ACCEPT GO S20) (RETURN 4) 520 WHILE "0123456789" (ACCEPT ) L e x i c a l Scanner Generator (RETURN 4) 55 IF "_ABCnEFGHIJKLMNOPQfiSTUVwXYZ0123456789" (ACCEPT GO S6) (RETURN 2) 56 WHILE "_ABCDEFGHIJKLHNOPQRSTUV8XYZ0123456789" (ACCEPT ) (RETURN 2) S97 IF "0" (ACCEPT GO S54) IF "_ABCDEFGHIJKLtlNPQRSTUVWXYZ0123456789" (ACCEPT GO S6) (RETURN 2) 554 IF "M" (ACCEPT GO S55) IF "_ABCDEFGHIJKLNOPQRSTUVWX¥Z0123456789" (ACCEPT GO S6) (RETURN 2) 555 IF "M" (ACCEPT GO S56) IF "_ABCDEFGHIJKLNOPQRSTUV8XYZ0123456789" (ACCEPT GO S6) (RETURN 2) 556 IF "E" (ACCEPT GO S57) IF "_ABCDFGHIJKLMNOPQRSTUVWXYZ0123456789" (ACCEPT GO S6) (RETURN 2) 557 IF "N" (ACCEPT GO S58) IF »_ABCDEFGHIJKLMOPQRSTUVWX¥Z0123456789" (ACCEPT GO S6) (RETURN 2) 558 IF "T" (ACCEPT GO S59) IF "_ABCDEFGHIJKLHNOPQRSUVHXYZ0123456789« (ACCEPT GO S6) (RETURN 2) 559 IF "_ABCDEFGHIJKLMNOPQRSTUViXYZ0123456789" (ACCEPT GO S60) IFNOT ";_ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789" (ACCEPT GO S15) (ACCEPT RETURN 3) (RETURN 2) S15 SHILENOT •";« (ACCEPT ) (ACCEPT RETURN 3) 560 WHILE "_ABCDEFGHIJKLMNOPQBSTUVHXYZ0123456789" (ACCEPT ) IFNOT ";_ABCDEFGHIJKLMNOPQRSTUVHXYZ0123456789" (ACCEPT GO S15) (ACCEPT RETURN 3) (RETURN 2) 52 IF " » (ACCEPT GO S3) (RETURN 1) 53 WHILE " " (ACCEPT ) (RETURN 1) L e x i c a l Scanner Generator 76 Appendix B Using the Lexical Scanner Generator The f o l l o w i n g i s a d e s c r i p t i o n of how to use the l e x i c a l scanner generator as i t i s implemented at OBC. The purpose of the l e x i c a l scanner g e n e r a t i o n system i s t o supply the compiler w r i t e r with an easy method f o r generating l e x i c a l scanners f o r v a r i o u s programming or command languages. A v a i l a b i l i t y The system can be invoked v i a the f o l l o w i n g MTS run command: $B0N TEBV:LEXGENEBATOB SCABDS=inputfile S P B I N T = l i s t i n g f i l e SPUHCH=outputfile {PAS=parfield} where each of the values expressed i n lower case l e t t e r s i s de f i n e d as f o l l o w s : i n p u t f i l e - T h i s i s the f i l e which c o n t a i n s the d e s c r i p t i o n of the lexemes i n accordance with the syntax o f the L e x i c a l Scanner Generation System. l i s t i n g f i l e - T h i s i s the f i l e onto which the L e x i c a l Scanner Generation System w i l l both echo . the L e x i c a l Scanner Generator 77 i n p u t l i n e s and p r i n t any i n f o r m a t i o n which was reguested v i a the PAR= f i e l d of the $RUN command. o u t p u t f i l e - T h i s i s the f i l e onto which the l e x i c a l Scanner Generation System w i l l punch e i t h e r the i d e a l machine or the PL360 coded v e r s i o n of the i d e a l machine. p a r f i e l d - The value of ' p a r f i e l d ' determines whether t o produce i d e a l machine code or PL360 code and a l s o determines whether any t r a c i n g i n f o r m a t i o n i s t o he produced. The ' p a r f i e l d * i s made up of a seguence of items separated by commas and i s ev a l u a t e d from l e f t to r i g h t . An item can be any of the f o l l o w i n g : TRACE(charseg) - The 'charseq' i s scanned from l e f t to r i g h t and any d i g i t n (0 <= n <= 6) which i s found causes t r a c e f l a g n to be turne d on. The a c t i o n s of the va r i o u s f l a g s a r e : 0 - A l l t r a c e f l a g s are turned o f f . 1 - A l l t r a c e f l a g s are turned on. L e x i c a l Scanner Generator 78 2 - The i n i t i a l machine before any t r a n s i t i o n s have been performed i s p r i n t e d out. 3 - The empty t r a n s i t i o n matrix i s p r i n t e d out. 4 - The t r a n s i t i v e c l o s u r e of the empty t r a n s i t i o n matrix i s p r i n t e d out. 5 - The n o n - d e t e r m i n i s t i c machine i s p r i n t e d out. 6 - The d e t e r m i n i s t i c machine i s p r i n t e d out. PL360 - Instead of punching out the i d e a l machine, the L e x i c a l Scanner Generation System w i l l punch out a PL360 source program which can then be compiled g i v i n g a F o r t r a n c a l l a b l e module. I f the l e x i c a l scanner g e n e r a t i o n system i s run with PL360 i n the PAR= f i e l d , then the r e s u l t i n g PL360 source program can be compiled using the f o l l o w i n g HTS $run command: L e x i c a l Scanner Generator 79 $BUN ALGW:PL360M SCABDS=outputfile SPRINT=listingfile SPUNCH=deckfile where each of the values expressed i n lower case l e t t e r s i s defined as follows: o u t p u t f i l e - This i s the ou t p u t f i l e as s p e c i f i e d from the SPUNCH=outputflie on the $RUN command of the Lexi c a l Scanner Generation system. l i s t i n g f i l e - This i s the f i l e on which the PL360 compiler w i l l produce a compilation l i s t i n g . d eckfile - This i s the f i l e onto which the PL360 compiler w i l l punch the object deck. The object deck which results from t h i s compilation i s a Fortran c a l l a b l e module c a l l e d GETLEX with the following c a l l i n g seguence: INTEGER LEXNUM,LENGTH L0GICAL*1 LEXEME(n) CALL GETLEX(LEXNUM,LENGTH,LEXEME) The GETLEX module sets the value of LEXNUM to the i n t e g r a l L e x i c a l Scanner Generator 80 value a s s o c i a t e d with the lexeme, LENGTH t o the l e n g t h of the symbolic s t r i n g which composes the lexeme and LEXEME to the symbolic r e p r e s e t a t i o n of the token. The user must use a value of *n* i n the d e c l a r a t i o n of LEXEME such t h a t n i s g r e a t e r than or equal t o the maximum l e n g t h of any LEXEME which c o u l d be returned. The module a l s o r e q u i r e s that the user supply two F o r t r a n c a l l a b l e modules with the f o l l o w i n g names and c a l l i n g seguences: SUBROUTINE NEWLIN(LENGTH,BUFFER) INTEGER LENGTH LOGICAL*1 BUFFER (256) SET THE VALUE OF BUFFER TO THE NEXT INPUT BUFFER AND LENGTH TO ITS LENGTH END SUBROUTINE LEXERR(CHAR) INTEGER CHAR CHAR HAS CAUSED AN ERROR IN GETLEX. IF UNCHANGED THE CHAR IN ERROR HILL BE DELETED, OTHEBHISE THE OLD VALUE HILL BE REPLACED BY. THE NEW VALUE END C C c c c c c 

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0051776/manifest

Comment

Related Items